I am not sure what it is about text data that I find so appealing, but it is by far one of my favorite fields of machine learning. If I hadn’t studied astrophysics, I would have probably done something in linguistics. Ever since I can recall, I’ve been interested in languages; my parents both taught me their native tongues growing up, so maybe that’s part of it. I remember when I was a freshman in college and taking Spanish grammar (for native speakers) and thinking, “omg, what is you doing?” I had spoken Spanish my entire life and still, I was baffled at how complicated Spanish could be. After that class, I learned a few more languages. I was determined to learn them ALL, but, alas, it didn’t happen. I currently speak 5 languages, and while I have not learned them all, I can say it has really opened my eyes to different perspectives and points of view. I think you can’t truly understand a group of people unless you can understand their language.

Jump forward 5 years: text mining is a hot topic and I am tasked with making a computer smart enough to analyze Spanish and Portuguese text, like “umm, excuse me? what?” I felt like I was in heaven, languages and computers?! However, I also felt that if it’s hard for a human being to understand another human being without speaking the same language, how on earth would a computer do it? Also, text data can be incredibly subjective and complex, especially if, say, we are looking at customer opinions or social media data. Well, a lot of smart mathematicians came up with really cool models we can use to tackle some of these problems. Here, I am going to attempt to cover most of the bases of what text mining is like and what sorts of models you should have in your arsenal.

Okay, here’s what you will need to follow along with this guide:

  • Knowledge of statistics (I won’t go into the mathematical explanations, but you should know them)
  • Python proficiency (pandas, numpy, nltk, scikit-learn)
  • Python editor (I am using a Jupyter notebook)
  • This dataset (download the 2 csv files containing lyrics)

I – What is text mining and why do people care?

Text mining is the process of extracting insights from text data. The complexity in text mining mostly lies in the fact that text data is unstructured, a.k.a. data that doesn’t have a fixed format and isn’t organized in any apparent way. By extracting these insights from messy data, we can learn a lot about a variety of topics such as customer or employee turnover, social media patterns and trends, historical texts, etc.

II – Text Pre-Processing 

Before we can even think about models, we have to get our data in the right format and extract the correct features. In fact, I spend 80% of my time on this step and only 20% of my time running the actual model. Having your data in the right format will ensure you feed your model the proper parameters and get a reasonable output from it.

import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
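
Quick note: if this is your first time using NLTK, the stopword lists and the punkt tokenizer models we’ll rely on later may need to be downloaded once. Something like this should take care of it:

#One-time downloads of the NLTK resources used in this guide
nltk.download('stopwords')
nltk.download('punkt')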

Let’s read our data:

lyrics1 = pd.read_csv("Lyrics1.csv")
lyrics2 = pd.read_csv("Lyrics2.csv")
#stack the two files into one dataframe (concat keeps all rows from both)
data = pd.concat([lyrics1, lyrics2], ignore_index=True)

Immediately, let’s take a look at our dataframe and what data types we’re working with.

data.head()


Right away we can see we’ll have some text cleaning to do, and we also know we’ll need a few extra steps because of characters like “2x” and “\r” that came in from a bad import.

Let’s look at the data types to make sure we’re working with strings.

data.dtypes


Good.

Now, as far as the next steps go, it varies depending on the purpose of your project and on personal preference. I like to start pre-cleaning my data ASAP.

Pre-processing, or filtering, is when we normalize the text by making everything lowercase and removing whitespace, stopwords, punctuation and any other issues we may have with the data. We do this so we can feed our model clean, useful data and get useful results back.

First thing, let’s lowercase all of our text.

data['Lyrics'] = data['Lyrics'].str.lower()
data["Lyrics"].head()


Okay, looks good.

Now, let’s remove the punctuation.

#Keep only word characters and whitespace
data['Lyrics'] = data['Lyrics'].str.replace(r'[^\w\s]', '', regex=True)
data['Lyrics'].head()


As you can see, we still have some funky characters that slipped in during the import, so let’s take care of those too.

#Remove other leftover symbols the punctuation regex above wouldn't catch ("\r\n" and "\n" count as whitespace and "2x" as word characters)
data['Lyrics'] = [x.replace("\r\n","") for x in data['Lyrics']]
data['Lyrics'] = [x.replace("\n","") for x in data['Lyrics']]
data['Lyrics'] = [x.replace("2x","") for x in data['Lyrics']]
data["Lyrics"].head()


Now, I am satisfied with how this looks.

My last step in this pre-processing stage is removing stopwords. Stopwords are a list of words we’ve decided to ignore, like “and”, “the”, etc., because they add little substance to the text. In order to remove stopwords, we first have to tokenize our text data.

Tokenization is the process of breaking down sentences into words or “tokens.”

I noticed words in Spanish too, so we will use stopwords in both Spanish and English.

stop = stopwords.words('english')
stopS = stopwords.words('spanish')

#Tokenize data
data['tokenized_songs'] = data.apply(lambda row: nltk.word_tokenize(row['Lyrics']), axis=1)

#Remove stopwords
data['tokenized_songs'] = data['tokenized_songs'].apply(lambda x: [item for item in x if item not in stop])
data['tokenized_songs'] = data['tokenized_songs'].apply(lambda x: [item for item in x if item not in stopS])

Let’s now compute the frequency of the terms in our pool and plot it out. If you print out data['tokenized_songs'] you’ll see that we have a column where the rows are arrays. To compute the frequency and plot it out, we’ll have to flatten the column into one massive array.

#---Put all words in one big array
all_words = data['tokenized_songs'].sum()

#---Get frequency and plot
freq = nltk.FreqDist(all_words)
#Print and plot the most common words
print(freq.most_common(20))
freq.plot(10)


From the plot, we can see most songs talk about “love” and use simple words like “like” and such. This is not conclusive, but it gives a good idea of where we are headed.

There are two more steps left to finish the pre-processing of the data: (1) lemmatization and (2) stemming.

Lemmatization, in simple terms, uses the morphology of a word to reduce it to its dictionary form (its lemma); stemming is a lot simpler and usually accomplishes similar results.
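
If you do want to try lemmatization, NLTK ships a WordNet-based lemmatizer; here is a minimal sketch (assuming you’ve downloaded the wordnet corpus with nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

#Lemmatize a couple of example words; pos tells WordNet how to treat the word
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("feet"))             # -> foot
print(lemmatizer.lemmatize("loving", pos="v"))  # -> love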

Stemming is the process of getting the root of a word, e.g., loving -> love. Let’s look at how we’d do it.

#---Stemming
stemmer = PorterStemmer()
stemmed_words = []
for w in all_words:
    stemmed_words.append(stemmer.stem(w))
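
To sanity-check what the stemmer did, we can reuse nltk.FreqDist from earlier and peek at the most common stemmed tokens (just a quick check, not a required step):

#---Most common stemmed tokens
stemmed_freq = nltk.FreqDist(stemmed_words)
print(stemmed_freq.most_common(10))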

With this part done, we’re ready for some modeling. In the next couple of posts in this series we will cover structured and unstructured approaches to text mining.

Happy Coding!

