Visually representing the content of a text document is one of the most important tasks in the field of text mining (also called exploratory text analysis). As data scientists or NLP specialists, we not only explore the content of documents from different aspects and at different levels of detail, but we also summarize a single document, show words and topics, detect events and create scenarios.
However, there are differences between visualizing unstructured (text) data and structured data. For example, many text visualizations do not represent the text directly; they represent an output of a language model (word counts, character lengths, word sequences, etc.).
In this article, we will use the Women's Clothing E-Commerce Reviews dataset and try to explore and visualize as much of it as possible, using Plotly's Python graphing library and the Bokeh visualization library. Not only will we explore the textual data, but we will also visualize the numerical and categorical features.
Exploratory analysis of texts: the data
After a brief inspection of the data, we found that there is a series of data pre-processing steps we need to perform (a code sketch follows this list).
Delete the "Title" function.
Delete the lines where "Review Text" was missing.
Clean up the "Revision Text" column.
Using TextBlob to calculate sentiment polarity which is in the range of [-1,1] where 1 means positive sentiment and -1 means negative sentiment.
Create a new feature for the duration of the exam.
Create a new feature for the exam word count.
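Here is a minimal sketch of these steps. The CSV file name and the simple quote-stripping cleanup are assumptions, and the polarity, review_len and word_count column names are the ones assumed in the rest of this article:

import pandas as pd
from textblob import TextBlob

# Assumed file name for the Women's Clothing E-Commerce Reviews dataset
df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')

# Drop the "Title" feature and the rows with missing "Review Text"
df.drop('Title', axis=1, inplace=True)
df = df[~df['Review Text'].isnull()]

# A very simple cleanup of "Review Text" (assumption: stripping stray quotes is enough here)
df['Review Text'] = df['Review Text'].astype(str).str.replace('"', '')

# Sentiment polarity in [-1, 1] computed with TextBlob
df['polarity'] = df['Review Text'].map(lambda text: TextBlob(text).sentiment.polarity)

# Review length (characters) and word count
df['review_len'] = df['Review Text'].astype(str).apply(len)
df['word_count'] = df['Review Text'].apply(lambda x: len(str(x).split()))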
To preview whether the sentiment polarity score is working, we randomly select 5 reviews with the highest sentiment polarity score (1):
print('5 random reviews with the highest positive sentiment polarity: \n')
cl = df.loc[df.polarity == 1, ['Review Text']].sample(5).values
for c in cl:
    print(c[0])
Then randomly select 5 reviews with the most neutral sentiment polarity score (zero):
print('5 random reviews with the most neutral sentiment(zero) polarity: \n')
cl = df.loc[df.polarity == 0, ['Review Text']].sample(5).values
for c in cl:
    print(c[0])
There were only 2 reviews with the most negative sentiment polarity score:
print('2 reviews with the most negative polarity: \n')
cl = df.loc[df.polarity == -0.97500000000000009, ['Review Text']].sample(2).values
for c in cl:
    print(c[0])
It seems to work.
Univariate visualization
Single-variable or univariate visualization is the simplest type of visualization: it consists of observations on a single feature or attribute. Univariate visualization includes histograms, bar charts, and line graphs.
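For example, here is a minimal sketch of the sentiment polarity histogram, assuming the df and polarity column created above and that cufflinks has been wired to pandas so that iplot is available (as in the other plots in this article):

# Assumption: cufflinks is set up for offline plotting, e.g.
# import cufflinks
# cufflinks.go_offline()
df['polarity'].iplot(
    kind='hist',
    bins=50,
    xTitle='polarity',
    linecolor='black',
    yTitle='count',
    title='Sentiment Polarity Distribution')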
The vast majority of sentiment polarity scores are above zero, which means that most of them are quite positive.
The ratings are in line with the polarity scores, i.e. most of the ratings are quite high, in the 4 or 5 range.
It is possible to do the same with the age of the reviewers, the number of characters per review and the number of words per review, but this is not the heart of this tutorial.
The n-grams
Now we come to the feature we are interested in. Before exploring it, we need to extract the n-gram features. N-grams describe the number of words used as an observation unit: a unigram is a single word, a bigram is a two-word sequence, and a trigram is a three-word sequence. To do this, we use scikit-learn's CountVectorizer.
To do unigram analysis, it is very important to clean the text of stopwords.
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=None):
    # Count word occurrences across the corpus, ignoring English stopwords
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['Review Text'], 20)
for word, frequency in common_words:
    print(word, frequency)

df2 = pd.DataFrame(common_words, columns=['ReviewText', 'count'])
df2.groupby('ReviewText').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 20 words in review after removing stop words')
def get_top_n_bigram(corpus, n=None):
    # Same as above, but counting two-word sequences (bigrams)
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_bigram(df['Review Text'], 20)
for word, frequency in common_words:
    print(word, frequency)

df4 = pd.DataFrame(common_words, columns=['ReviewText', 'count'])
df4.groupby('ReviewText').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams in review after removing stop words')
Part-of-Speech
Part-Of-Speech Tagging (POS) is a process of assigning parts of speech to each word, such as noun, verb, adjective, etc.
blob = TextBlob(str(df['Review Text']))
pos_df = pd.DataFrame(blob.tags, columns=['word', 'pos'])
pos_df = pos_df.pos.value_counts()[:20]
pos_df.iplot(
    kind='bar',
    xTitle='POS',
    yTitle='count',
    title='Top 20 Part-of-speech tagging for review corpus')
Analysis by class
Boxplots are used to compare the sentiment polarity score, rating, and review text length of each department or division of the e-commerce store.
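A minimal Plotly sketch of one such boxplot (assuming the df built above; the department column is assumed to be 'Department Name', as in the Kaggle dataset):

import plotly.graph_objs as go
from plotly.offline import iplot

# One box of sentiment polarity per department
data = [
    go.Box(
        y=df.loc[df['Department Name'] == dept, 'polarity'],
        name=dept)
    for dept in df['Department Name'].dropna().unique()
]
layout = go.Layout(
    title='Sentiment polarity boxplot by department',
    yaxis=dict(title='Polarity'))
iplot(go.Figure(data=data, layout=layout))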
The highest sentiment polarity score was achieved by all six departments except the Trend department, and the lowest sentiment polarity score was collected by the Tops department. The Trend department also has the lowest median polarity score. If you recall, the Trend department has the fewest reviews, which explains why its score distribution is not as wide as that of the other departments.
With the exception of the Trend department, the median rating of all the other departments was 5. Overall, the ratings are high and the sentiment is positive in this review dataset.
And so on.
Bivariate analysis
Bivariate visualization is a type of visualization that involves two features at once. It describes the association or relationship between the two features.
Let's look at the distribution of sentiment polarity depending on whether or not the reviewer recommends the product.
x1 = df.loc[df['Recommended IND'] == 1, 'polarity']
x0 = df.loc[df['Recommended IND'] == 0, 'polarity']

trace1 = go.Histogram(
    x=x0, name='Not recommended',
    opacity=0.75
)
trace2 = go.Histogram(
    x=x1, name='Recommended',
    opacity=0.75
)
data = [trace1, trace2]
layout = go.Layout(barmode='overlay', title='Distribution of Sentiment polarity of reviews based on Recommendation')
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='overlaid histogram')
We can also plot the sentiment polarity against the rating as a joint scatter and 2D density plot:

trace1 = go.Scatter(
    x=df['polarity'], y=df['Rating'], mode='markers', name='points',
    marker=dict(color='rgb(102,0,0)', size=2, opacity=0.4)
)
trace2 = go.Histogram2dContour(
    x=df['polarity'], y=df['Rating'], name='density', ncontours=20,
    colorscale='Hot', reversescale=True, showscale=False
)
trace3 = go.Histogram(
    x=df['polarity'], name='Sentiment polarity density',
    marker=dict(color='rgb(102,0,0)'),
    yaxis='y2'
)
trace4 = go.Histogram(
    y=df['Rating'], name='Rating density', marker=dict(color='rgb(102,0,0)'),
    xaxis='x2'
)
data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
    showlegend=False,
    autosize=False,
    width=600,
    height=550,
    xaxis=dict(
        domain=[0, 0.85],
        showgrid=False,
        zeroline=False
    ),
    yaxis=dict(
        domain=[0, 0.85],
        showgrid=False,
        zeroline=False
    ),
    margin=dict(
        t=50
    ),
    hovermode='closest',
    bargap=0,
    xaxis2=dict(
        domain=[0.85, 1],
        showgrid=False,
        zeroline=False
    ),
    yaxis2=dict(
        domain=[0.85, 1],
        showgrid=False,
        zeroline=False
    )
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='2dhistogram-2d-density-plot-subplots')
Topic modeling
Finally, we want to explore topic modeling algorithms on this dataset, to see whether they would provide any benefit and fit with what we are doing with our review text feature.
We will experiment with the Latent Semantic Analysis (LSA) technique in topic modeling.
- Generate our document-term matrix from the review text as a matrix of TF-IDF features.
- The LSA model replaces the raw counts in the document-term matrix with TF-IDF scores.
- Perform dimensionality reduction on the document-term matrix using truncated SVD.
- Since the number of departments is 6, we set n_topics=6.
- Taking the argmax of each review text in this topic matrix gives the predicted topic of each review text in the data. We can then sort these into counts of each topic.
- To better understand each topic, we will find the three most frequent words in each topic.
from collections import Counter

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

reindexed_data = df['Review Text']
tfidf_vectorizer = TfidfVectorizer(stop_words='english', use_idf=True, smooth_idf=True)
reindexed_data = reindexed_data.values
document_term_matrix = tfidf_vectorizer.fit_transform(reindexed_data)

n_topics = 6
lsa_model = TruncatedSVD(n_components=n_topics)
lsa_topic_matrix = lsa_model.fit_transform(document_term_matrix)

def get_keys(topic_matrix):
    '''
    Returns an integer list of predicted topic
    categories for a given topic matrix.
    '''
    keys = topic_matrix.argmax(axis=1).tolist()
    return keys

def keys_to_counts(keys):
    '''
    Returns a tuple of topic categories and their
    accompanying magnitudes for a given list of keys.
    '''
    count_pairs = Counter(keys).items()
    categories = [pair[0] for pair in count_pairs]
    counts = [pair[1] for pair in count_pairs]
    return (categories, counts)

lsa_keys = get_keys(lsa_topic_matrix)
lsa_categories, lsa_counts = keys_to_counts(lsa_keys)

# Note: this redefines the earlier get_top_n_words helper for the topic-modeling context.
def get_top_n_words(n, keys, document_term_matrix, tfidf_vectorizer):
    '''
    Returns a list of n_topics strings, where each string contains the n most common
    words in a predicted category, in order.
    '''
    top_word_indices = []
    for topic in range(n_topics):
        # Sum the TF-IDF vectors of all reviews assigned to this topic
        temp_vector_sum = 0
        for i in range(len(keys)):
            if keys[i] == topic:
                temp_vector_sum += document_term_matrix[i]
        temp_vector_sum = temp_vector_sum.toarray()
        top_n_word_indices = np.flip(np.argsort(temp_vector_sum)[0][-n:], 0)
        top_word_indices.append(top_n_word_indices)
    top_words = []
    for topic in top_word_indices:
        topic_words = []
        for index in topic:
            # Recover the word corresponding to this column of the document-term matrix
            temp_word_vector = np.zeros((1, document_term_matrix.shape[1]))
            temp_word_vector[:, index] = 1
            the_word = tfidf_vectorizer.inverse_transform(temp_word_vector)[0][0]
            topic_words.append(the_word.encode('ascii').decode('utf-8'))
        top_words.append(" ".join(topic_words))
    return top_words

top_n_words_lsa = get_top_n_words(3, lsa_keys, document_term_matrix, tfidf_vectorizer)
for i in range(len(top_n_words_lsa)):
    print("Topic {}: ".format(i + 1), top_n_words_lsa[i])
import matplotlib.pyplot as plt

top_3_words = get_top_n_words(3, lsa_keys, document_term_matrix, tfidf_vectorizer)
labels = ['Topic {}: \n'.format(i) + top_3_words[i] for i in lsa_categories]

fig, ax = plt.subplots(figsize=(16, 8))
ax.bar(lsa_categories, lsa_counts)
ax.set_xticks(lsa_categories)
ax.set_xticklabels(labels)
ax.set_ylabel('Number of review texts')
ax.set_title('LSA topic counts')
plt.show()
Looking at the most frequent words in each topic, we get the feeling that we may not achieve any clear separation between the topic categories. In other words, we could not separate the review texts by department using this topic modeling technique.
Topic modeling techniques have a number of important limitations. To begin with, the term "topic" is somewhat ambiguous, and it is perhaps now clear that topic models will not produce a highly nuanced text classification for our data.