This is the second part of this text mining series. For part I click here. In this post, we'll go over unsupervised approaches to exploring our text data. A good way to learn what's inside our text data is to find topics or themes. In this example we will use Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). I will do it with two different packages to see if there are any significant differences.
In a nutshell, an LDA model uses probability distributions to determine which words belong to which topics, and which topics show up in which documents. Similarly, NMF generates weighted sets of terms that recur together across the documents. Let's set this up!
So, in the last post we ended with tokenizing and stemming, but for an LDA, we will actually go back and use the data frame with the untokenized cleaned data.
Let’s start from the beginning…
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from collections import Counter
from nltk.stem import PorterStemmer
import gensim
from gensim import corpora
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
I will speed through these steps since we've already seen them in the last post.
file1 = pd.read_csv("Lyrics1.csv")
file2 = pd.read_csv("Lyrics2.csv")
data = pd.merge(file1, file2)
Clean it up!
data['Lyrics'] = data['Lyrics'].str.lower()
data['Lyrics'] = data['Lyrics'].str.replace(r'[^\w\s]', '', regex=True)
data['Lyrics'] = data['Lyrics'].str.replace(r'\d+', '', regex=True)
data['Lyrics'] = [x.replace("\r\n", "") for x in data['Lyrics']]
data['Lyrics'] = [x.replace("\n", "") for x in data['Lyrics']]
data['Lyrics'] = [x.replace("2x", "") for x in data['Lyrics']]

stop = stopwords.words('english')
stopS = stopwords.words('spanish')

#Change your dataframe column to a list
docs = data['Lyrics'].tolist()

#Remove English stopwords
for j in range(len(docs)):
    docs[j] = " ".join([i for i in docs[j].split() if i not in stop])

#Remove Spanish stopwords
for j in range(len(docs)):
    docs[j] = " ".join([i for i in docs[j].split() if i not in stopS])
Let’s start by using the gensim package first.
#Set up logging so gensim reports its progress while training
import logging
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO

#tokenize your sentences
from nltk.tokenize import word_tokenize
tokenized_sents = [word_tokenize(i) for i in docs]

#define the dictionary you will use to train the model
id2word = corpora.Dictionary(tokenized_sents)

#turn it into a bag-of-words corpus (document-term matrix)
corpus = [id2word.doc2bow(doc) for doc in tokenized_sents]

#Run the model
result_lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=6, id2word=id2word, passes=50)

#the output from this is kind of messy, so we'll use pprint
from pprint import pprint
pprint(result_lda_model.print_topics(num_topics=6, num_words=5))
One of the things I REALLY like about gensim is that, as you can see below, it gives you the actual weights of each word per topic.
[(0, '0.016*"love" + 0.014*"know" + 0.013*"im" + 0.012*"dont" + 0.009*"got"'),
 (1, '0.017*"che" + 0.017*"di" + 0.011*"il" + 0.009*"non" + 0.006*"per"'),
 (2, '0.012*"amor" + 0.010*"si" + 0.009*"quiero" + 0.006*"ms" + 0.005*"oh"'),
 (3, '0.007*"oh" + 0.005*"come" + 0.004*"one" + 0.003*"day" + 0.002*"little"'),
 (4, '0.013*"ich" + 0.008*"du" + 0.008*"die" + 0.007*"und" + 0.005*"nicht"'),
 (5, '0.013*"get" + 0.013*"got" + 0.013*"like" + 0.012*"im" + 0.009*"aint"')]
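Gensim also makes it easy to ask which topics a single song falls under. As a quick extra (this snippet is my own illustration, not part of the original write-up), get_document_topics returns the topic mixture for any bag-of-words document in the corpus:

#Topic mixture for the first song in the corpus (illustrative extra)
first_song_bow = corpus[0]
#Each entry is a (topic_id, probability) pair; minimum_probability hides tiny weights
pprint(result_lda_model.get_document_topics(first_song_bow, minimum_probability=0.05))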
Now, let's look at the NMF and LDA models in scikit-learn. To run these, you will need term frequencies, which we compute below.
number_of_features = 500 #you can play with this number

#NMF will use a tf-idf matrix, refer to the previous post for an explanation
tfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=5, max_features=number_of_features)
tfidf = tfidf_vectorizer.fit_transform(docs)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

#LDA needs raw term counts in order to estimate the probabilities
tf_vectorizer = CountVectorizer(max_df=0.90, min_df=5, max_features=number_of_features)
tf = tf_vectorizer.fit_transform(docs)
tf_feature_names = tf_vectorizer.get_feature_names_out()

#Run the models (four topics, matching the results shown below)
no_topics = 4
nmf = NMF(n_components=no_topics, random_state=1, alpha_W=0.1, l1_ratio=0.5, init='nndsvd').fit(tfidf)
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0).fit(tf)
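Scikit-learn doesn't give you a ready-made topic printer, so here is a minimal helper to pull the top words out of each model (my own sketch; the function name display_topics and the choice of ten words per topic are assumptions, not from the original post):

#Minimal helper to print the top words per topic (illustrative sketch)
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % topic_idx)
        #argsort gives indices sorted by weight; take the no_top_words largest
        print(" ".join(feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]))

display_topics(nmf, tfidf_feature_names, 10)
display_topics(lda, tf_feature_names, 10)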
If we print the results, this is what we get:
NMF:
Topic 0:
know never youre one could cant im dont see go
Topic 1:
amor quiero si ms vida nunca ser voy solo siempre
Topic 2:
che di il non per si io sono ci da
Topic 3:
love heart baby need youi dont never know give im
LDA:
Topic 0:
got get like im dont aint know wanna want make
Topic 1:
know could never im right said say better think would
Topic 2:
love im dont cant oh know youre want baby like
Topic 3:
che di si il non amor da ich quiero per
So, in summary, what we can gather is that no matter where you are in the world, most songs out there talk about love, or some derivative of love. We're all suckers :P!
Notes for the future: we could improve this in a number of ways, like stemming, removing misspelled words, or changing the model parameters. A rough sketch of the stemming idea is below.
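For example, reusing the PorterStemmer we imported at the top, stemming would collapse variants like "loved" and "loving" into one term before vectorizing (this is my own illustration, not code from the post):

#Rough sketch: stem every word before vectorizing (illustrative)
stemmer = PorterStemmer()
stemmed_docs = [" ".join(stemmer.stem(word) for word in doc.split()) for doc in docs]
#Then refit the vectorizers on stemmed_docs instead of docs, e.g.:
#tfidf = tfidf_vectorizer.fit_transform(stemmed_docs)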
Happy Coding!