I am obsessed with skincare…like…easily 25% of my paycheck goes to trying out new products. I am always on Sephora.com or Dermstore.com, you name it. I am also always watching skincare content on YouTube and I am pretty much an RSS feed for Reddit’s r/SkincareAddiction. In fact, I recently made a post doing a deep dive of the skincare subreddit, mainly looking at the brands and products being talked about or recommended by skin type according to the subscribers of the sub. Click here!

This time, I wanted to go to the source directly, Sephora.com, y’all. I wanted to scrape skincare product data and see how we can find dupes (La Mer, SK-II *wink* *wink*) or similar products based on product ingredients. It is well accepted that just because a product costs $200+, it doesn’t necessarily have to be an amazing product: it comes down to the ingredients! So, can we measure the relationship between products using any of the data available throughout the site? Yes, I think we can. We can use several unsupervised learning techniques to figure out the relationship between product ingredients and, hence, make the appropriate recommendations. Mainly, we will calculate the cosine similarity between products using a tf-idf matrix, a count vectorizer and KNN.

the data

I won’t go into the weeds about scraping the data from the site, but in summary, I scraped the product name, brand, price, rating, ingredients, skin type of the product, reviews, and skin type per reviewer for products under the following categories:

    • moisturizers
    • face oils
    • face mists
    • cleansers
    • facial treatments (e.g. serums)
    • facial masks
    • eye cream
    • sunscreen

I got a total of 1653 products to work with.

data exploration

Before tackling the recommendation, I wanted to take a closer look at the numbers we were working with. Plus, it’s not every day you get to look at descriptive stats for skincare, *bursting*. Listen, I will have us both glowing.

The first thing I was curious about was price. Skincare can be expensive, to an absurd degree at times.  I wanted to know if there were any major differences between prices amongst the major skincare categories.


I am not too surprised that Treatments are pricey. I am no dermatologist, but from what I understand, treatments such as serums have a higher concentration of active ingredients, which is why they are more expensive. Treatments are followed by moisturizers. The least expensive category is cleansers.

I then wanted to look at ratings. Personally, I have a tough time finding good sunscreens (I have brown skin and normal to dry skin so most sunscreens are either drying or leave a white cast) so I was curious to find out if there are any categories that are problematic.


Interestingly, nearly every category averages 4 stars or above. Eye creams had the lowest average rating at 3.8 stars.

A big thing in skincare is ingredients. I know that alcohols will affect dry skin and fragrances can also irritate sensitive skin, so I tend to avoid products that have either of those ingredients. So, I wanted to look at ingredient distributions overall and per skin type.


Glycerin is popping. This makes sense given that glycerin is a humectant. There are a handful of alcohols up there, but these are the “good” alcohols, i.e. alcohols that moisturize your skin. Lastly, there are silicones (dimethicone).
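For the curious, here is a rough sketch of how an ingredient count like this could be put together. It is not the code I used for the chart; the three products, the field names and the helper functions are all made up for illustration, and it assumes ingredient lists are comma-separated strings.

```python
from collections import Counter

# Hypothetical mini-dataset standing in for the scraped products.
products = [
    {"name": "Toner A", "skin_type": "dry",
     "ingredients": "Water, Glycerin, Rosa Damascena Flower Water"},
    {"name": "Cream B", "skin_type": "dry",
     "ingredients": "Water, Glycerin, Dimethicone, Cetyl Alcohol"},
    {"name": "Gel C", "skin_type": "oily",
     "ingredients": "Water, Salicylic Acid, Glycerin"},
]


def split_ingredients(text):
    """Split a comma-separated ingredient list into lowercase names."""
    return [part.strip().lower() for part in text.split(",") if part.strip()]


def ingredient_counts(rows):
    """Count how many products each ingredient appears in."""
    counts = Counter()
    for row in rows:
        # set() so an ingredient listed twice in one product counts once
        counts.update(set(split_ingredients(row["ingredients"])))
    return counts


overall = ingredient_counts(products)
dry_only = ingredient_counts(p for p in products if p["skin_type"] == "dry")

print(overall["glycerin"], dry_only["dimethicone"])
```

Filtering the rows before counting, as with `dry_only`, is what gives you the per-skin-type view.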

When I split the data by skin type, all skin types shared many common ingredients such as water, glycerin, etc. Below, I made a quick illustration showing the key difference in ingredients amongst Sephora products for targeted skin types.

[Illustration: key ingredient differences by targeted skin type]

Interestingly, I found that most products in Sephora are targeted to ALL SKIN TYPES. I am skeptical of items that claim to work for all skin types given that skin can be so different from one person to the next.


recommendation algorithm

There are different types of recommendation algorithms. Content-based algorithms, which will be the focus of our Sephora tool, use the content of an item (e.g. ingredients) as input and then find items with similar content. You could also do collaborative filtering, where you use how users have interacted with items (e.g. ratings, user ids, views) to recommend what similar users liked.

The recommendation algorithm we will code up is fairly straightforward. We will use the ingredients of the products as our main feature. We will apply a few natural language processing techniques to get the data into numerical form. Then, we will compute the cosine similarity, which will tell us how closely related the products are based on their ingredients. Let’s look at each of these steps more carefully.

After we’ve loaded our data, let’s strip out stray symbols and punctuation and lowercase everything. This will ensure our text data is uniform before we vectorize it (turn it into numbers).

def sanitize_text(df, col):
    # swap anything outside the allowed character set for a space
    df[col+' Cleaned'] = df[col].str.replace(r"[^A-Za-z0-9(),!?@™\'\`\"\_\n]", " ", regex=True)
    # drop the remaining punctuation (.str and regex=True are needed for the pattern to apply)
    df[col+' Cleaned'] = df[col+' Cleaned'].str.replace(r'[^\w\s]', '', regex=True)
    df[col+' Cleaned'] = df[col+' Cleaned'].str.lower()
    return df
df = sanitize_text(df, "ingredients")

Now, it’s time to vectorize! Here, we have multiple options: we could use a neural net and go the word2vec route, or we could use a tf-idf vectorizer or a count vectorizer. I will focus on the last two.

The TF-IDF algorithm weighs a term by how often it appears in a document, scaled down by how many documents in the collection contain it. So an ingredient that shows up in nearly every product counts for less than a rare one. The count vectorizer, on the other hand, is a straightforward frequency count. If we had more data, I wouldn’t suggest using tf-idf, given that it down-weights the terms that are most frequent across products, and if we’re trying to make a recommendation we want to keep it all.
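To make the difference concrete, here is a tiny hand-rolled comparison. It is only a sketch: the three "products" are made up, and the `tfidf` helper uses the plain textbook formula tf * log(N / df), whereas scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each row, so the actual numbers will differ.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned ingredient strings.
docs = {
    "toner": "water glycerin rose",
    "cream": "water glycerin dimethicone",
    "mask": "water clay rose",
}


def term_counts(text):
    """Count vectorizer idea: how many times each term appears."""
    return Counter(text.split())


def tfidf(text, all_texts):
    """Textbook tf-idf: term count times log(N / document frequency)."""
    n_docs = len(all_texts)
    weights = {}
    for term, count in term_counts(text).items():
        doc_freq = sum(1 for other in all_texts if term in other.split())
        weights[term] = count * math.log(n_docs / doc_freq)
    return weights


counts = term_counts(docs["toner"])
weights = tfidf(docs["toner"], list(docs.values()))

print(counts)
print(weights)
```

With raw counts, water, glycerin and rose all look equally important for the toner. With this version of tf-idf, water drops to zero because every product contains it, while glycerin and rose keep a positive weight.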

After vectorizing our text, what we’ll want to do is compute the cosine similarity. In math terms, it is the cosine of the angle between two vectors in a multi-dimensional space: 1 means they point in the same direction, 0 means they share nothing. Keep in mind, we could have used other measures, but cosine similarity ignores the length of the vectors, so a product isn’t penalized just for having a longer ingredient list.
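Here is a quick worked example of that calculation, written by hand rather than with scikit-learn so the arithmetic is visible. The four-ingredient vocabulary and the three vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)


# columns: water, glycerin, rose, clay
toner = [1, 1, 1, 0]
mask = [1, 0, 1, 1]
double_toner = [2, 2, 2, 0]  # same mix as toner, every count doubled

print(cosine(toner, mask))
print(cosine(toner, double_toner))
```

The toner and the mask share two of their three ingredients, so they land at 2/3. The doubled toner points in exactly the same direction as the original, so the similarity is 1 (up to floating-point rounding) even though the vector is twice as long.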

To test our recommendations, I chose my favorite toner of all time: FRESH Rose Floral Toner. The OG. I also created an index based on the product name to keep track of it as we compute the distances.

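import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

indices = pd.Series(df.name) #to track the name of the product within the matrix

# option 1: raw ingredient counts
count = CountVectorizer()
count_matrix = count.fit_transform(df['ingredients Cleaned'])

# get_recommendations is not shown in this post; it is part of the full code
cosine_sim = cosine_similarity(count_matrix, count_matrix)
get_recommendations('Rose Floral Toner')

# option 2: tf-idf weights
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df['ingredients Cleaned'])

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
get_recommendations('Rose Floral Toner')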
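Since get_recommendations isn't shown here, this is a minimal sketch of what a helper like it might look like. It is a guess, not the actual function: the signature (passing in the names and the similarity matrix explicitly), the `top_n` default and the toy 3x3 matrix are all assumptions made for illustration.

```python
def get_recommendations(name, names, sim, top_n=8):
    """Return the top_n product names most similar to `name`.

    names: list of product names, in the same order as the rows of `sim`
    sim:   square similarity matrix (list of lists or a NumPy array)
    """
    idx = list(names).index(name)  # raises ValueError for an unknown product
    scores = list(sim[idx])
    # rank every product by its similarity to the chosen one, highest first
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    # leave out the product itself, which always has similarity 1 with itself
    return [names[j] for j in ranked if j != idx][:top_n]


# Toy data standing in for the real product names and cosine_sim matrix.
names = ["Rose Floral Toner", "Rose Face Mask", "Clay Mask"]
sim = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]

print(get_recommendations("Rose Floral Toner", names, sim, top_n=2))
# prints ['Rose Face Mask', 'Clay Mask']
```

The idea is just a lookup plus a sort: find the product's row in the similarity matrix, order the other products by their score in that row, and return the names at the top.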


Notice we’re calculating the cosine similarity of the matrix with itself, so every product is only ever compared against other products in our Sephora data. If we wanted to bring in new products from outside of Sephora, this approach wouldn’t work as is, since those products aren’t in the matrix.

The output from both of these approaches turned out to be the same (and pretty spot on as well):

  • Rose Deep Hydration Facial Toner
  • Crème Ancienne®
  • Rose Face Mask
  • Crème Ancienne® Eye Cream
  • Umbrian Clay Pore Purifying Face Mask
  • Instant Eye Makeup Remover
  • Crème Ancienne® Ultimate Nourishing Honey Mask
  • Rose Jelly Mask

The results make sense…most of these products are actually from the same brand so they’re bound to have similar compositions.

The 2020 Sephora sale is this Friday and I will most certainly be using this to buy my goodies (full code here).


Posted by: Aisha Pectyo

Astrophysicist turned data rockstar who speaks code and has enough yarn and mod podge to survive a zombie apocalypse.
