I often get asked about the applications of Natural Language Processing. If you look at most tutorials online, a lot of NLP is based on books and chat bots, but NLP can be applied to many other areas. A big area where NLP can be applied successfully is consumerism and merchandising. It can help us understand trends, consumer sentiment, and where our products succeed or fail. The example I am working through today attempts exactly that: how can we use NLP to understand the ups and downs of women’s clothing items?

The data can be found here.

Because the code outputs can be big, I am only showing snippets here; if you want the full code and output, go here.

Data Inspection

Let’s start by setting up the ground work.

#Import Libraries
%matplotlib inline
import re
import datetime
from time import time
from collections import Counter
from itertools import cycle, islice

import keras
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

#Requires the NLTK stopword list: run nltk.download('stopwords') once if it is missing
stop = stopwords.words('english')

Now, let’s lay down some useful functions.

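def sanitize_text(df, col):
    #Replace anything outside the allowed character set with a space
    df[col+' Cleaned'] = df[col].str.replace(r"[^A-Za-z0-9(),!?@\'\`\"\_\n]", " ", regex=True)
    #Strip the remaining punctuation (Series.replace without regex=True only matches whole values)
    df[col+' Cleaned'] = df[col+' Cleaned'].str.replace(r'[^\w\s]', '', regex=True)
    df[col+' Cleaned'] = df[col+' Cleaned'].str.lower()
    return df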

def avg_word(sentence):
    words = sentence.split()
    #Guard against empty strings to avoid dividing by zero
    if not words:
        return 0
    return sum(len(word) for word in words)/len(words)

def rank_words(terms, feature_matrix):
    """Rank terms by their summed weight across all documents."""
    sums = feature_matrix.sum(axis=0)
    data = []
    for col, term in enumerate(terms):
        data.append( (term, sums[0,col]) )
    ranked = pd.DataFrame(data, columns=['term','rank']).sort_values('rank', ascending=False)
    return ranked
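
To sanity-check the cleaning and word-length helpers, here is a minimal sketch on two made-up reviews. The toy sentences and the names `toy` and `avg_word_len` are my own illustration, not part of the dataset or the original notebook, and the cleaning step is a simplified version of `sanitize_text`.

```python
import pandas as pd

# Two made-up reviews, for illustration only (not from the dataset)
toy = pd.DataFrame({"Review Text": ["Love it!! Fits great.", "Too big, going back."]})

# Same idea as sanitize_text: swap punctuation for spaces, then lowercase
toy["Review Text Cleaned"] = (
    toy["Review Text"]
    .str.replace(r"[^\w\s]", " ", regex=True)
    .str.lower()
)

def avg_word_len(sentence):
    """Mean number of characters per whitespace-separated word."""
    words = sentence.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

toy["avg_word"] = toy["Review Text Cleaned"].apply(avg_word_len)
print(toy[["Review Text Cleaned", "avg_word"]])
```

Running the helpers on a handful of rows like this makes it easy to spot a regex that silently does nothing before applying it to the whole column.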

Read in the data.

#Read data
df = pd.read_csv("data.csv")


Let’s inspect data types and drop that Unnamed column.

#Check data types and non-null counts
df.info()

#Drop columns we don't need.
df.drop('Unnamed: 0', axis=1, inplace=True)


Data Overview

From the section above, we can tell not all columns have the same number of values, which can be problematic, so let’s go ahead and see how many values are missing.

#Look at missing data
my_colors = (('#ED6A5A', '#5CA4A9'))
ax = pd.isnull(df).sum().plot(kind='bar', color=my_colors, width=0.8)
ax.set_ylabel('Number of Missing Values', size=12, alpha = 0.8)
#ax.yaxis.grid(color='#E6EBE0', linestyle='-', linewidth=1)


We will indeed need to remove these values later, but for now, since we want to visualize the data first, it is important we keep all values – even the missing values.
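
If you prefer numbers to bars, the same check can be sketched as a small table. The frame below is a made-up stand-in for the real data (the column values are invented), so treat it as an illustration of the pattern only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the review data, with a few gaps on purpose
toy = pd.DataFrame({
    "Title": ["Nice", np.nan, np.nan],
    "Review Text": ["ok", "good", np.nan],
    "Age": [30, 41, 52],
})

# Count of missing values per column, plus the share of rows affected
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
})
print(summary)
```

The percentage column is handy when deciding later whether dropping rows with nulls throws away too much data.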

Pivot Tables
tab = pd.crosstab(df['Division Name'], df["Department Name"])


Key Finding: The most common items are normal-sized tops, and the least common are most of the intimates and the trendy petite items.

#Look at ratings


The above shows our classes are not completely balanced, so if you want to do any ML later, you will need to deal with that (for example, by over- or under-sampling). Imbalance is not always a bad thing per se, but when your classes are very disproportionate, your model will naturally favor the majority class given that it has the highest prior probability.
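
As a rough sketch of what checking and fixing that imbalance could look like, the snippet below measures the class shares and then naively oversamples every rating up to the size of the largest one. The ratings are made up, and random oversampling with replacement is just one simple option I am assuming here, not something the analysis in this post relies on.

```python
import pandas as pd

# Made-up ratings, skewed toward 5 stars like the real data appears to be
toy = pd.DataFrame({"Rating": [5, 5, 5, 5, 4, 4, 3, 1]})

# Share of each rating: a quick numeric view of the imbalance
shares = toy["Rating"].value_counts(normalize=True).sort_index()
print(shares)

# Naive random oversampling: resample each class (with replacement)
# until it has as many rows as the majority class
max_n = toy["Rating"].value_counts().max()
balanced = pd.concat(
    [grp.sample(n=max_n, replace=True, random_state=0)
     for _, grp in toy.groupby("Rating")],
    ignore_index=True,
)
print(balanced["Rating"].value_counts())
```

If you go this route for a model, resample only the training split, otherwise duplicated rows leak into your test set.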

Key Finding: Most ratings are positive.

#Look at the types of items being reviewed
df.groupby("Class Name").count()


I am not too surprised to find that the most reviews come from dresses, knits and sweaters. At this point, we don’t know yet whether these are positive or negative reviews, but we know women are discussing them often. This is likely due to the fact that all three categories are staple items in most women’s closets. I am dropping ‘casual bottoms’ and ‘chemises’ as their sample size is very small.

#Look at the number of reviews per class
df = df[(df['Class Name'] != 'Casual bottoms') & (df['Class Name'] != 'Chemises')]
ax = sns.countplot(x='Class Name', data=df, palette = "GnBu_d", order= df["Class Name"].value_counts().index)
ax.set_ylabel('Number of Reviews per Class', size=13, alpha = 0.8)
ax.set_xlabel('Class Name', size=13, alpha = 0.8)


Let’s take a quick look at the positive feedback score. I assume this is the equivalent of a thumbs up or down on a review.

#Let's look at the mean positive feedback count and rating by class name
df.groupby("Class Name")['Positive Feedback Count'].mean()


The classes with the most positive feedback are trendy clothes and dresses. Trendy clothes, I believe, make sense given that most people follow trends religiously and once a handful of people like an item, many more will follow. The class with the least positive feedback is intimates.


Now, I want to see how the Age variable behaves.

#Look at the number of reviews per Age
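#distplot is deprecated in recent seaborn; histplot with stat='density' is the closest equivalent
ax = sns.histplot(df['Age'], color='#ED6A5A', alpha=1, bins=15, stat='density', kde=True)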
ax.set_ylabel('Density', size=13, alpha = 0.8)


Key Finding: It seems most reviews come from women between the ages of 35 and 50.

#Age vs. Rating
my_colors = ['#ED6A5A', '#5CA4A9', '#F77F00', '#9BC1BC', '#F4F1BB']
#Keep the returned axes so the labels below apply to this plot, not the previous one
ax = sns.boxplot(x = 'Rating', y = 'Age', data = df, palette = my_colors)
ax.set_ylabel('Age', size=13, alpha = 0.8)
ax.set_xlabel('Rating', size=13, alpha = 0.8)


Key Finding: While the earlier plot showed that women between the ages of 35 and 50 wrote more reviews, when it comes to ratings (the plot above), it seems age does not have an effect.
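
A boxplot is easy to misread by eye, so a quick numeric companion can help. This is a minimal sketch on invented ages and ratings; on the real data you would run the same `groupby` on `df`.

```python
import pandas as pd

# Invented ages and ratings, for illustration only
toy = pd.DataFrame({
    "Age":    [25, 40, 38, 52, 45, 33, 41, 60],
    "Rating": [5,  5,  4,  4,  3,  3,  1,  5],
})

# Count, median and mean age for each rating value
age_by_rating = toy.groupby("Rating")["Age"].agg(["count", "median", "mean"])
print(age_by_rating)
```

If the medians sit close together across ratings, that supports the reading that age has little effect on the score given.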

Recommended vs. Not Recommended Products

Let’s explore the recommended field. This is a boolean value that indicates whether or not the reviewer would recommend the product.

ax = sns.countplot(y='Class Name', data=df[df['Recommended IND']==1] ,color='#ED6A5A', label = "Recommended", order= df["Class Name"].value_counts().index)
ax = sns.countplot(y='Class Name', data=df[df['Recommended IND']==0] ,color='#5CA4A9', label = "Not Recommended", order= df["Class Name"].value_counts().index)
ax.set_ylabel('Class Name', size=13, alpha = 0.8)
ax.set_xlabel('Count', size=13, alpha = 0.8)
plt.legend()


Key Insight: It seems that dresses have the most recommended and the most not recommended items. Do keep in mind this is the biggest class; hence, we can’t make a strong case on this one.


Let’s do a quick correlation through Pandas and then decide whether a regression is necessary.

#Initial exploration


A quick glance at this shows a correlation between Recommended IND and Rating, which makes sense: if you’re willing to recommend the product, you’re also willing to rate it highly. There also seem to be correlations between Age and both Recommended IND and Rating.
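
For reference, a quick correlation through Pandas can be sketched like this. The numbers are made up so that the block runs on its own, and selecting numeric columns first is my assumption about how to keep text columns such as Class Name out of the calculation.

```python
import pandas as pd

# Made-up rows that mimic the numeric columns of the review data
toy = pd.DataFrame({
    "Age":             [25, 40, 38, 52, 45, 33, 41, 60],
    "Rating":          [5,  4,  5,  2,  1,  3,  5,  2],
    "Recommended IND": [1,  1,  1,  0,  0,  1,  1,  0],
    "Class Name":      ["Dresses", "Knits", "Dresses", "Pants",
                        "Knits", "Dresses", "Pants", "Knits"],
})

# Pearson correlation between every pair of numeric columns
corr = toy.select_dtypes("number").corr()
print(corr.round(2))
```

Keep in mind that Recommended IND is binary and Rating is ordinal, so Pearson correlation is only a rough first look.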

Let’s do a couple of regressions: Recommended Likelihood vs. Age Mean, and Recommended Likelihood vs. Rating Mean.




We can certainly see proportional relationships among these variables. A word of caution, though: the data itself is disproportionate, so we have to be careful when making these claims. Still, it seems that recommended likelihood is related to both age and ratings.
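
Here is one way such a regression could be set up, as a hedged sketch rather than the exact code behind the plots. I am assuming the points are per-class averages (mean rating and share of recommended reviews per Class Name) and fitting a straight line with `np.polyfit`; the rows are invented.

```python
import numpy as np
import pandas as pd

# Invented reviews for three hypothetical classes
toy = pd.DataFrame({
    "Class Name":      ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Rating":          [5,   5,   4,   3,   4,   2,   1,   2,   3],
    "Recommended IND": [1,   1,   1,   1,   1,   0,   0,   0,   1],
})

# One point per class: mean rating vs. share of recommended reviews
by_class = toy.groupby("Class Name").agg(
    rating_mean=("Rating", "mean"),
    rec_rate=("Recommended IND", "mean"),
)

# Least-squares straight line through those points
slope, intercept = np.polyfit(by_class["rating_mean"], by_class["rec_rate"], deg=1)
print(by_class)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
```

With only a few groups the fitted line is very sensitive to a single class, which is one more reason to read these relationships cautiously.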


Okay, so far, what do we know? We know that there are proportional relationships between product recommendations, age and ratings. We also know that Dresses and Knits are the most popular classes. We know that 35-50 year olds write the most reviews. Let’s see how NLP can help us understand these relationships better.

Let’s start by cleaning the text and dropping nulls and unnecessary columns.

#Clean text
df = sanitize_text(df, "Review Text")
df['Review Text Cleaned'].head()

#Drop nulls
df = df[df['Review Text'].notnull()].reset_index(drop=True)
df = df[df['Department Name'].notnull()].reset_index(drop=True)

Now, let’s start computing text features that will help us create the story.

#Compute features
#Number of words (split() without arguments ignores repeated spaces)
df['word_count'] = df['Review Text Cleaned'].apply(lambda x: len(str(x).split()))
df[['Review Text','word_count']].head()

#Average word length
df['avg_word'] = df['Review Text'].apply(avg_word)
df[['Review Text','avg_word']].head()

#Character Count
df['char_count'] = df['Review Text'].str.len() ## this also includes spaces
df[['Review Text','char_count']].head()

#Number of stopwords
df['stopwords'] = df['Review Text'].apply(lambda x: len([w for w in x.split() if w in stop]))
df[['Review Text','stopwords']].head()

#Visualize features
f, axes = plt.subplots(4,4, figsize=(40,35), sharex=False)
for ii, xvar in enumerate(['word_count', "char_count", "stopwords", "avg_word"]):
    for i,y in enumerate(["Rating","Department Name","Recommended IND"]):
        for x in set(df[y][df[y].notnull()]):
            sns.kdeplot(df[xvar][df[y]==x], label=x, fill=False, ax=axes[ii,i])
        axes[ii,i].legend()
        if ii == 0:
            axes[ii,i].set_title('{} Distribution (X)\nby {}'.format(xvar, y))
        else:
            axes[ii,i].set_title('For {} (X)'.format(xvar))
    # Plot 4: overall distribution of the feature
    # (the plot call is missing from this snippet; a plain KDE is assumed here)
    sns.kdeplot(df[xvar], ax=axes[ii,3])
    if ii == 0:
        axes[ii,3].set_title('{} Distribution (X)\n'.format(xvar))
    else:
        axes[ii,3].set_title('For {} (X)'.format(xvar))

df[["word_count","char_count", "avg_word", "stopwords"]].describe().T


Okay, this is some good stuff.

A lot of interesting information can be seen here. The first thing I notice is that in-between scores like 3 and 4 have the most words, the longest words, and the fewest stopwords, which makes me think there’s more substance to those reviews. Perhaps in-between items need longer explanations for the given rating.

I also notice that the longest reviews come from the Dresses category.
Recommended items also have the longest reviews, though not by a lot. As we saw earlier, this field may be due to a popularity effect instead of an unbiased review.

Okay, now, the final thing I want to look at is the actual words. Can we tell age or class apart by the words reviewers use to describe clothing?

I will start by removing stopwords. Then, I’ll take a look at the most frequent and infrequent terms and decide whether to keep them in the data or not.

#remove stopwords
df['Review Text Cleaned'] = df['Review Text Cleaned'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

#most frequent terms
freq = pd.Series(' '.join(df['Review Text Cleaned']).split()).value_counts()[:10]


#least frequent terms
freq = pd.Series(' '.join(df['Review Text Cleaned']).split()).value_counts()[-10:]


I’m actually ok not removing any of these words.

Now, we can determine the key terms.

#Important words by Recommended IND
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(3,5))
labels = [0, 1]
for i in range(len(labels)):
    print("Recommended IND: "+str(labels[i]))
    temp = df[df['Recommended IND'] == labels[i]]
    tfidf = vectorizer.fit_transform(temp['Review Text Cleaned'])
    ranked = rank_words(terms=vectorizer.get_feature_names(), feature_matrix=tfidf)

I repeated the same chunk of code and filtered by class and age as well.
You can see the full output in the link shared at the beginning of the post.
Below is a summary.

Recommended Terms:

  • love
  • true to size
  • fit
  • received many compliments
  • unique
  • perfect
  • looks great

Not Recommended Terms:

  • like maternity top
  • too much fabric
  • ordered usual size
  • going back
  • sadly
  • really wanted to love


Recommended Terms By Class

(see remaining classes in link at the beginning of post):

  • Intimates
    • fits like a glove
    • wear around house
    • comfy
  • Dresses
    • got compliments
    • fits
    • absolutely love dress
  • Bottoms
    • worth every penny
    • right amount of stretch
    • fits like a glove

Not Recommended Terms By Class

(see remaining classes in link at the beginning of post):

  • Intimates
    • itchy
    • hole
    • hips
  • Dresses
    • like maternity dress
    • looked like sack
    • never write reviews
  • Bottoms
    • waist weird line
    • way distressed picture
    • ordered usual size

From this I gather that good reviews mostly contain positive, superlative adjectives and describe the fit and whether or not compliments were received. When it is a bad review, it seems most of it has to do with fit and discrepancies with the picture shown on the website.

We can definitely go further and use other techniques to dig deep into the terms.  We can also use classifier models like a logistic regression or neural network and try to determine the exact terms that go into these reviews.

Posted by:Aisha Pectyo

Astrophysicist turned data rockstar who speaks code and has enough yarn and mod podge to survive a zombie apocalypse.
