Often we will be dealing with text, short or long, and we only want to find the important keywords in it. As with most things, there are several algorithms out there that you could use for this. A common one that you’ve likely used before is TF-IDF, but there are others, like the RAKE algorithm, that can perform very well. The crucial step in any of these keyword algorithms is to spend time carefully cleaning and preparing your data so that you pull out the keywords that really matter to that text.
For this tutorial, we will use TF-IDF. Sure, there are many more complicated algorithms out there, but I always find that understanding and working through the simplest algorithm first gives us a wealth of knowledge before diving into more complicated adventures. TF-IDF will help us pull out the terms that are distinctive to a document, a.k.a. the important keywords.
In a nutshell, these are the steps you’d need to take in order to perform a keyword extraction:
- Clean your text (remove punctuation and numbers)
- Remove stop words
- Remove common and uncommon words
- Tokenize the text
- Stem the tokens (see the sketch after this list)
- Apply algorithm of choice
- Rank terms
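The code later in this tutorial handles the cleaning, stop-word removal, and frequency-filtering steps explicitly, but it never shows tokenizing or stemming again, so here is a minimal sketch of those two steps using NLTK. The PorterStemmer is simply my choice of stemmer, and the sample sentence is made up:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('punkt')  # tokenizer models, only needed once

stemmer = PorterStemmer()

sample = "Proposing an amendment regarding the election of senators"
tokens = word_tokenize(sample.lower())     # ['proposing', 'an', 'amendment', ...]
stems = [stemmer.stem(t) for t in tokens]  # roughly ['propos', 'an', 'amend', ...]
print(stems)
```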
For this tutorial we will use a dataset containing roughly 11,000 proposed amendments to the U.S. Constitution between 1788 and 2014. The data lives here. Let’s now look at these steps in more detail.
First, let’s try to understand what TF-IDF is and how it works. In short, tf-idf is the product of the term frequency and the inverse document frequency. The term frequency measures how often a word or term occurs in a given document. Because documents come in different lengths, a term can rack up many appearances simply because the document is long, which biases us toward a word just because it shows up 20 times in one document. To deal with this, we treat each document as a bag of words, where word order does not matter, and we bring in the inverse document frequency to penalize words that appear across many documents. The point is that a term appearing 30 times does not make it 30 times more relevant. If you take anything from this post, let it be that keyword relevancy is not proportional to frequency.
Mathematically, tf-idf looks like this:

TF: $\mathrm{tf}(t, d) = \dfrac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d}$

IDF: $\mathrm{idf}(t, D) = \log\dfrac{N}{|\{d \in D : t \in d\}|}$, where $N$ is the number of documents in the corpus $D$

Combining them together: $\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$
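To make that concrete, here is a tiny worked example on a made-up three-document corpus, computing the textbook formulas by hand:

```python
import math

docs = [
    "equal rights for women",
    "equal representation for the states",
    "balanced budget amendment",
]
tokenized = [d.split() for d in docs]
N = len(docs)

def tf(term, doc_tokens):
    # how often the term appears in this document, normalized by document length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    # log of (total documents / documents containing the term)
    df_count = sum(1 for d in tokenized if term in d)
    return math.log(N / df_count)

def tf_idf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

# "equal" shows up in 2 of the 3 documents, so its idf drags its score down;
# "budget" shows up in only 1, so it scores much higher within its document.
print(tf_idf("equal", tokenized[0]))   # 0.25 * log(3/2) ≈ 0.10
print(tf_idf("budget", tokenized[2]))  # 0.33 * log(3/1) ≈ 0.37
```

Note that scikit-learn’s TfidfVectorizer, which we use below, applies a smoothed version of idf and normalizes each document vector, so its exact numbers will differ from this textbook version, but the intuition is the same.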
As you can see, tf-idf increases with the rarity of a word. With that out of the way, let’s start implementing this baby! These are the libraries you will need!
import os
import re
import datetime
from time import time
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.util import ngrams
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import NMF
from sklearn.pipeline import Pipeline
Now, let’s define a few functions that will come in handy in cleaning our data and finding the ranked terms later on.
#---User-defined functions
stop = stopwords.words('english')

def rank_words(terms, feature_matrix):
    """ Ranks terms by their summed tf-idf weight across all documents. """
    sums = feature_matrix.sum(axis=0)
    data = []
    for col, term in enumerate(terms):
        data.append((term, sums[0, col]))
    ranked = pd.DataFrame(data, columns=['term', 'rank']).sort_values('rank', ascending=False)
    return ranked

def removeHTMLtags(text):
    """ Removes HTML tags, i.e. anything between < and >. Takes in a single string. """
    cleanCondition = re.compile('<.*?>')
    cleanText = re.sub(cleanCondition, '', text)
    return cleanText

def sanitize(text):
    """ Removes any leftover HTML entities and whitespace artifacts. Takes in a single string. """
    newText = text.replace("&nbsp;", "")
    newText = newText.replace("&amp;", "")
    newText = newText.replace("\xa0", "")
    newText = newText.replace("\t", "")
    return newText
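A quick sanity check of the two cleaning helpers on a made-up string:

```python
sample = "Proposing an amendment <b>relative to</b> equal rights &amp; suffrage.\t"
print(sanitize(removeHTMLtags(sample)))
# Proposing an amendment relative to equal rights  suffrage.
```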
Now, let’s read in the data and make sure our columns are of the right data type.
df = pd.read_csv("us-nara-amending-america-dataset-raw-2016-02-25.csv")
df.dtypes

#change to string
df['title_or_description_from_source'] = df['title_or_description_from_source'].astype(str)
Great!
Then, let’s start cleaning our data with the functions we defined above. Do note that this will DEPEND ON YOUR DATA. Don’t blindly remove stop words or clean a certain way just because I am doing it here. You will need to decide what is best for your dataset.
df['cleanedText'] = df['title_or_description_from_source'].apply(removeHTMLtags)
df['cleanedText'] = df['cleanedText'].apply(sanitize)
df['cleanedText'] = df['cleanedText'].str.replace(r'[^\w\s]', '', regex=True)  #remove punctuation
df['cleanedText'] = df['cleanedText'].str.replace(r'\d+', '', regex=True)      #remove numbers
df['cleanedText'] = df['cleanedText'].str.lower()

#remove stopwords
df['cleanedText'] = df['cleanedText'].apply(lambda x: " ".join(w for w in x.split() if w not in stop))
df['cleanedText'].head()
Now, we will inspect and decide whether we want to remove common and uncommon words.
#---Remove common/rare words

#word frequency for the most common terms
freq_com = pd.Series(' '.join(df['cleanedText']).split()).value_counts()[:10]
freq_com
president       2629
states          2346
united          1840
amendment       1484
congress        1418
constitution    1380
rights          1334
election        1279
equal           1183
vice            1175
Let’s do the same for uncommon terms.
#word frequency for the least common terms
freq_uncom = pd.Series(' '.join(df['cleanedText']).split()).value_counts()[-10:]
freq_uncom
representationin    1
whoever             1
revoke              1
banks               1
profits             1
mines               1
intrastate          1
prosecutor          1
loyal               1
infringement        1
Cool, so by looking at this I feel pretty confident about removing these words from the corpus.
freq_com = list(freq_com.index)
df['cleanedText'] = df['cleanedText'].apply(lambda x: " ".join(w for w in x.split() if w not in freq_com))

freq_uncom = list(freq_uncom.index)
df['cleanedText'] = df['cleanedText'].apply(lambda x: " ".join(w for w in x.split() if w not in freq_uncom))
Now, we are ready for some keyword extraction! I will set the ngram_range parameter to (2, 4), since unigrams rarely spit out anything useful here.
#---Set up TF-IDF for NMF and ranking words
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(2, 4))
tfidf = vectorizer.fit_transform(df['cleanedText'])

ranked = rank_words(terms=vectorizer.get_feature_names(), feature_matrix=tfidf)
ranked[0:20]
Wut. Y’all, we did it! Let’s try to visualize this initial output.
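One way to draw that chart with the seaborn and matplotlib imports from earlier (the styling here is just my own choice):

```python
fig, ax = plt.subplots(figsize=(8, 6))
sns.barplot(x='rank', y='term', data=ranked[0:20], ax=ax, color='steelblue')
ax.set_xlabel('summed tf-idf weight')
ax.set_ylabel('')
ax.set_title('Top 20 n-grams by tf-idf')
plt.tight_layout()
plt.show()
```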
This is really interesting because we can clearly see important events in our history through these keywords!
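Ranking terms over the whole corpus is one view; if you instead want the top keywords for a single amendment, you can read them off that document’s row of the tf-idf matrix. A small sketch (the top_keywords helper is my own, not part of the original recipe):

```python
terms = np.array(vectorizer.get_feature_names())

def top_keywords(doc_index, n=5):
    # pull one document's row out of the sparse tf-idf matrix
    row = tfidf[doc_index].toarray().ravel()
    top = row.argsort()[::-1][:n]
    # keep only non-zero weights in case the document has fewer than n n-grams
    return [(terms[i], row[i]) for i in top if row[i] > 0]

print(top_keywords(0))
```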
Now that we have that done, just for fun, let’s see what sort of themes we can find. I will use NMF for this. If you need a refresher, please refer to my previous posts. Here’s the quick recipe:
#---Set up NMF
n_topics = 12
nmf = NMF(n_components=n_topics, random_state=0)
topics = nmf.fit_transform(tfidf)

top_n_words = 5
t_words, word_strengths = {}, {}
for t_id, t in enumerate(nmf.components_):
    t_words[t_id] = [vectorizer.get_feature_names()[i] for i in t.argsort()[:-top_n_words - 1:-1]]
    word_strengths[t_id] = t[t.argsort()[:-top_n_words - 1:-1]]
t_words
| Topic | Top terms |
| --- | --- |
| 0 | men women, relative men women, relative men, proposal men, proposal men women |
| 1 | regardless sex, right regardless, right regardless sex, except military, regardless sex except military |
| 2 | balancing budget, debt reduction, balancing budget debt reduction, balancing budget debt, budget debt |
| 3 | prayer public schools, public schools, prayer public, schools institutions, prayer public schools institutions |
| 4 | right vote, age older, years age older, right vote citizens, vote citizens |
| 5 | apportionment state, apportionment state legislatures, state legislatures, appointment state, appointment state legislatures |
| 6 | right life, abortion right life, abortion right, prohibit abortion right, prohibit abortion right life |
| 7 | prayer public buildings, public buildings, prayer public, public buildings places, buildings places |
| 8 | popular senators, providing popular senators, providing popular, popular senators popular, popular senators popular senators |
| 9 | upon god, reliance upon, reliance upon god, upon god governmental, reliance upon god governmental |
| 10 | representation district columbia, representation district, district columbia, people district columbia, people district |
| 11 | direct popular, direct popular vote, woman suffrage, popular vote, provide direct |
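We also collected word_strengths above but never looked at it; if you want to see how strongly each term loads on a topic, a quick peek (my own addition) looks like this:

```python
# terms and their NMF weights for a single topic, e.g. the budget/debt topic
topic_id = 2
pd.DataFrame({'term': t_words[topic_id], 'weight': word_strengths[topic_id]})
```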
From this we can see that women’s suffrage, the popular vote, religion, abortion, and debt reduction are all big themes here, and it seems they have been for a long time.
To satisfy my curiosity, I applied the same logic as above, but split the data into 30-year buckets.
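A rough sketch of how that bucketing might look; the year_proposed column is hypothetical, so derive the year from whichever date field your copy of the dataset has:

```python
# hypothetical 'year_proposed' column: the year each amendment was proposed
df['bucket'] = 1788 + ((df['year_proposed'] - 1788) // 30) * 30

# rerun the vectorizer and ranking within each 30-year bucket
for bucket, group in df.groupby('bucket'):
    vec = TfidfVectorizer(analyzer='word', ngram_range=(2, 4))
    mat = vec.fit_transform(group['cleanedText'])
    top = rank_words(terms=vec.get_feature_names(), feature_matrix=mat)
    print(bucket, list(top['term'][:5]))
```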
In the photo above, we can see the number of amendments throughout our history and the important keywords in each 30-year bucket. I’ve also highlighted the terms that appeared in more than one bucket. And there you have it – US history in keywords!