This is the second part of this text mining series. For part I, click here. In this post, we’ll go over unsupervised ways to explore our unstructured text data. A good way to learn about what’s inside our text data is to find topics or themes. In this example we will use Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). I will do it with two different packages to see if we get any significant differences.

In a nutshell, an LDA model treats each document as a mixture of topics and each topic as a probability distribution over words, so it can estimate which words belong to which topics. Similarly, NMF factorizes the document-term matrix into weighted sets of recurring terms that tend to appear together across documents. Let’s set this up!

So, in the last post we ended with tokenizing and stemming, but for an LDA, we will actually go back and use the data frame with the untokenized cleaned data.

Let’s start from the beginning…

import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from collections import Counter
from nltk.stem import PorterStemmer
import gensim
from gensim import corpora
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

I will speed through these steps since we’ve seen all these in the last post.

file1 = pd.read_csv("Lyrics1.csv")
file2 = pd.read_csv("Lyrics2.csv")
data = pd.concat([file1, file2], ignore_index=True) #stack the two lyric files into one data frame
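
Quick sanity check (this assumes both files have the same columns, including Lyrics) to make sure the rows stacked the way we expect:

#how many songs did we end up with?
print(data.shape)
print(data['Lyrics'].head())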

Clean it up!

data['Lyrics'] = data['Lyrics'].str.lower()
data['Lyrics'] = data['Lyrics'].str.replace(r'[^\w\s]', '', regex=True) #strip punctuation
data['Lyrics'] = data['Lyrics'].str.replace(r'\d+', '', regex=True) #strip numbers
data['Lyrics'] = [x.replace("\r\n", " ") for x in data['Lyrics']] #keep a space so words on different lines don't get glued together
data['Lyrics'] = [x.replace("\n", " ") for x in data['Lyrics']]
data['Lyrics'] = [x.replace("2x","") for x in data['Lyrics']]

stop = stopwords.words('english')
stopS = stopwords.words('spanish')
#Change your dataframe column to a list
docs = data['Lyrics'].tolist()

#Remove stopwords
for j in range(len(docs)):
    docs[j] = " ".join([i for i in docs[j].split() if i not in stop])
#Spanish
for j in range(len(docs)):
    docs[j] = " ".join([i for i in docs[j].split() if i not in stopS])

Let’s start with the gensim package first.

#Optional: set up logging so gensim reports progress while it trains
import logging
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO

#tokenize your sentences
from nltk.tokenize import word_tokenize
tokenized_sents = [word_tokenize(i) for i in docs]

#define the dictionary you will use to train the model
id2word = corpora.Dictionary(tokenized_sents)

#turn it into a dtm
corpus = [id2word.doc2bow(doc) for doc in tokenized_sents]
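
If you’re curious what doc2bow actually gives you, each document is now a list of (word id, count) pairs. A quick peek at the first song, just for illustration:

#each document is a list of (token id, count) pairs
print(corpus[0][:5])
#map the ids back to the actual words
print([(id2word[term_id], count) for term_id, count in corpus[0][:5]])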

#Run the model and print
lda_model = gensim.models.ldamodel.LdaModel
result_lda_model = lda_model(corpus, num_topics=6, id2word=id2word, passes=50)

#the output from this is kind of messy, so we'll use pprint
from pprint import pprint
pprint(result_lda_model.print_topics(num_topics=6, num_words=5))

One of the things I REALLY like about gensim is that, as you can see below, it gives you the actual weights of each word per topic.

[(0, '0.016*"love" + 0.014*"know" + 0.013*"im" + 0.012*"dont" + 0.009*"got"'),
 (1, '0.017*"che" + 0.017*"di" + 0.011*"il" + 0.009*"non" + 0.006*"per"'),
 (2, '0.012*"amor" + 0.010*"si" + 0.009*"quiero" + 0.006*"ms" + 0.005*"oh"'),
 (3, '0.007*"oh" + 0.005*"come" + 0.004*"one" + 0.003*"day" + 0.002*"little"'),
 (4, '0.013*"ich" + 0.008*"du" + 0.008*"die" + 0.007*"und" + 0.005*"nicht"'),
 (5, '0.013*"get" + 0.013*"got" + 0.013*"like" + 0.012*"im" + 0.009*"aint"')]
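
Another handy thing you can do with the trained gensim model is ask how the topics mix within a single song. A minimal sketch using the corpus we already built:

#topic mixture for the first song: a list of (topic id, probability) pairs
print(result_lda_model.get_document_topics(corpus[0]))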

Now, let’s look at the NMF and LDA implementations in scikit-learn. To run these, you will need term-frequency matrices, which we compute below.

number_of_features = 500 #you can play with this number

#NMF will use a tf-idf, refer to previous post for explanation
tfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=5, max_features=number_of_features)
tfidf = tfidf_vectorizer.fit_transform(docs)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

#LDA will need a term frequency in order to get the probabilities
tf_vectorizer = CountVectorizer(max_df=0.90, min_df=5, max_features=number_of_features)
tf = tf_vectorizer.fit_transform(docs)
tf_feature_names = tf_vectorizer.get_feature_names_out()

#Run the models
no_topics = 4 #you can play with this number; the results below use four topics
#in newer scikit-learn versions the regularization is set via alpha_W/alpha_H instead of alpha
nmf = NMF(n_components=no_topics, random_state=1, alpha_W=0.1, alpha_H='same', l1_ratio=.5, init='nndsvd').fit(tfidf)

lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0).fit(tf)
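
Neither scikit-learn model prints topics on its own, so to show the top words per topic I used a small helper along these lines (display_topics is just a name I made up; both models expose their topic-word weights through components_):

def display_topics(model, feature_names, no_top_words):
    #each row of components_ holds the weight of every feature for one topic
    for topic_idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
        print("Topic %d:" % topic_idx)
        print(" ".join(top_words))

display_topics(nmf, tfidf_feature_names, 10)
display_topics(lda, tf_feature_names, 10)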

If we print the results, this is what we get:
NMF:
Topic 0:
know never youre one could cant im dont see go
Topic 1:
amor quiero si ms vida nunca ser voy solo siempre
Topic 2:
che di il non per si io sono ci da
Topic 3:
love heart baby need youi dont never know give im

LDA:
Topic 0:
got get like im dont aint know wanna want make
Topic 1:
know could never im right said say better think would
Topic 2:
love im dont cant oh know youre want baby like
Topic 3:
che di si il non amor da ich quiero per

So, in summary, what we can gather is that no matter where you are in the world, most songs out there talk about love, or some derivative of love. We’re all suckers :P!

Notes for the future: we could improve this in a number of ways, like stemming, removing misspelled words, changing the model parameters, etc.
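
For example, since PorterStemmer is already imported, stemming could look something like this (a rough sketch; you’d run it on docs before building the vectorizers or the gensim dictionary):

stemmer = PorterStemmer()
#collapse inflected forms, e.g. "loving"/"loved" -> "love"
stemmed_docs = [" ".join(stemmer.stem(word) for word in doc.split()) for doc in docs]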

Happy Coding!

Posted by: Aisha Pectyo

Astrophysicist turned data rockstar who speaks code and has enough yarn and mod podge to survive a zombie apocalypse.
