I’ve been meaning to dig into this data set for a while. There is just so much we can learn from it with text analytics, and so much it can teach us about how to be better as a society when we read fake news. This data set came from Kaggle.com, so if you want to follow along, go ahead and download it here.
In this post, we will see how to clean text data, explore the text, set up an LDA topic model, compute N-grams, and measure the readability of the text. Yay!
tl;dr: The complexity and readability of fake news articles is at the level of a 3rd grader. #SAD!
Let’s import the libraries we will use.
%matplotlib inline

#---Import libraries
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import nltk
from nltk.corpus import stopwords
from collections import Counter
from nltk.stem import PorterStemmer
import gensim
import logging
from gensim import corpora
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from readability_score.calculators.fleschkincaid import *
from readability_score.calculators.dalechall import *
import datetime as dt
import unicodedata
import seaborn as sns
sns.set(style="darkgrid")
import warnings
warnings.filterwarnings('ignore')
Then, let’s load the stopwords and define a function to take care of weird characters we are likely to run into later on.
#---Set up stopwords and functions
stop = stopwords.words('english')

def strip_accents(s):
    "Remove those little naughty, pain-giving characters!"
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
Read the data in and take a look.
#---Read data
file_ = "fake.csv"
df = pd.read_csv(file_)
df.head()
Let’s begin with the data processing!
#I want to remove the timestamp from the date because it will be easier when we
#are grouping by date.
df['published'] = pd.to_datetime(df['published']).apply(lambda x: x.date())

#--- Initial text cleaning
df['title'] = df['title'].str.lower()                              #lower case
df['title'] = df['title'].str.replace(r'[^\w\s]', '', regex=True)  #remove punctuation and...
df['title'] = df['title'].str.replace(r'\d+', '', regex=True)      #...numbers
df['text'] = df['text'].str.lower()
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)    #remove punctuation and...
df['text'] = df['text'].str.replace(r'\d+', '', regex=True)        #...numbers
df['text'] = df['text'].astype(str)
df['text'] = [x.replace("\r\n", "") for x in df['text']]
df['text'] = [x.replace("\n", "") for x in df['text']]
df['text'] = [x.replace("2x", "") for x in df['text']]
Now, I think we’re in a good place to start exploring our data. Let’s start by plotting the number of publications per day, from October 25th to November 25th.
#--- Plot publications per day
publicationsPerDate = df.groupby('published')['title'].count().reset_index()

plt.figure(figsize=(13, 14))
ax = plt.subplot(111)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)

dateFmt = mdates.DateFormatter('%b %d %Y')
ax.xaxis_date()
ax.xaxis.set_major_formatter(dateFmt)
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()
plt.tick_params(axis="both", which="both",
                bottom=False, top=False, labelbottom=True,
                left=False, right=False, labelleft=True)
plt.yticks(fontsize=14)
plt.xticks(fontsize=14)

plt.plot(publicationsPerDate['published'], publicationsPerDate['title'],
         color='#ff8d00', marker='.', markersize=12)
plt.text(dt.date(2016, 11, 7), 1600,
         "Number of Published Fake News Per Day" + "\n" + "from October to November 2016",
         fontsize=16, fontweight="bold")
plt.text(dt.date(2016, 10, 26), 10,
         "Data source: https://www.kaggle.com/mrisdal/fake-news"
         "\nAuthor: Aisha Pectyo (thingsgrow.me / @things_grow)",
         fontsize=11)
plt.savefig("1.png", bbox_inches="tight");
Hmm... that is VERY interesting. It seems most of the fake news was published in the weeks leading up to the election, and then it started dying down.
Now, let’s look at the publications by source and by country.
#---Publications by source, where are the top fake news coming from?
publicationsPerSource = df.groupby(['author', 'site_url'])['title'].count().reset_index()
publicationsPerSource.sort_values('title', axis=0, ascending=True, inplace=True)
publicationsPerSource.tail(10)

#---Publications by country, where are the top fake news coming from?
publicationsPerCountry = df.groupby('country')['title'].count().reset_index()
publicationsPerCountry.sort_values('title', axis=0, ascending=True, inplace=True)
publicationsPerCountry.tail(10)
Here’s a graphic of the results.
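If you want to reproduce a chart like that yourself, here is a minimal sketch, assuming the publicationsPerSource dataframe from above (the original figure may have been made differently):

#--- A simple horizontal bar chart of the ten sources with the most articles
topSources = publicationsPerSource.tail(10)
plt.figure(figsize=(10, 6))
plt.barh(topSources['site_url'], topSources['title'], color='#ff8d00')
plt.xlabel("Number of articles")
plt.ylabel("Source")
plt.title("Top 10 fake news sources")
plt.savefig("sources.png", bbox_inches="tight")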
Let’s now build our LDA models. We will do it for both the titles and the actual text of the articles.
#---Explore titles - What do the titles say?
df['title'] = df['title'].astype(str)
docs = df['title'].tolist()

#Remove stopwords
for j in range(len(docs)):
    docs[j] = " ".join([i for i in docs[j].lower().split() if i not in stop])

#---Set up LDA
#You need the next two lines to be able to use the gensim package
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO

#tokenize your sentences
from nltk.tokenize import word_tokenize
tokenized_sents = [word_tokenize(i) for i in docs]

#define the dictionary you will use to train the model
id2word = corpora.Dictionary(tokenized_sents)

#turn it into a dtm
corpus = [id2word.doc2bow(doc) for doc in tokenized_sents]

#Run the model and print
lda_model = gensim.models.ldamodel.LdaModel
result_lda_model = lda_model(corpus, num_topics=6, id2word=id2word, passes=50)

#the output from this is kind of messy, so we'll use pprint
from pprint import pprint
pprint(result_lda_model.print_topics(num_topics=6, num_words=5))

#We have to clean the article data a bit more
#--- Further clean text to avoid encoding issues
docs_uncleaned = df['text'].tolist()
docs = []
for doc in docs_uncleaned:
    cleaned = strip_accents(doc)
    docs.append(cleaned)

#--- Remove stopwords
for j in range(len(docs)):
    docs[j] = " ".join([i for i in docs[j].lower().split() if i not in stop])

#---Set up LDA (same steps as above, now on the article text)
tokenized_sents = [word_tokenize(i) for i in docs]
id2word = corpora.Dictionary(tokenized_sents)
corpus = [id2word.doc2bow(doc) for doc in tokenized_sents]

#Run the model and print
result_lda_model = lda_model(corpus, num_topics=6, id2word=id2word, passes=10)
pprint(result_lda_model.print_topics(num_topics=6, num_words=5))
Here are two graphic representations of the results.
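If you would like to recreate something similar, here is a sketch that charts the top words per topic using gensim's show_topic on the result_lda_model from above (the charts in the original post may have been produced differently):

#--- Bar charts of the top words in each of the six topics
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
for topic_id, ax in enumerate(axes.flatten()):
    #show_topic returns (word, probability) pairs for one topic
    top_words = result_lda_model.show_topic(topic_id, topn=5)
    words = [w for w, p in top_words]
    probs = [p for w, p in top_words]
    ax.barh(words, probs, color='#ff8d00')
    ax.set_title("Topic {}".format(topic_id))
plt.tight_layout()
plt.savefig("topics.png", bbox_inches="tight")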
So… as expected, and as we’ve seen in the news, the topics revolve around Hillary, emails, Russia, war, etc. These keywords seem to be used as clickbait.
Let’s keep digging and look at N-grams. N-grams are contiguous sequences of N words in a given text. N-grams of size N=1 are called unigrams, N=2 bigrams, and N=3 trigrams; anything larger is referred to by the value of N, so four-grams, five-grams, and so on.
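Before applying this to the articles, here is a tiny, self-contained illustration on a made-up sentence, just to show what the different N-grams look like:

#--- A quick illustration of N-grams on a made-up sentence
sentence = "say no to fake news"
words = sentence.split()

#zip the word list against shifted copies of itself to get N-grams
unigrams = [(w,) for w in words]
bigrams = list(zip(words, words[1:]))
trigrams = list(zip(words, words[1:], words[2:]))

print(bigrams)   #[('say', 'no'), ('no', 'to'), ('to', 'fake'), ('fake', 'news')]
print(trigrams)  #[('say', 'no', 'to'), ('no', 'to', 'fake'), ('to', 'fake', 'news')]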
#--- Find N-grams
#I want to know the n-grams in all the stories combined, 501430
docs_merge = ' '.join(docs)  #let's make this all into one massive string (joined with spaces so words don't run together)

def ngrams(text, n):
    "Return every n-word sequence in the text as a list of words."
    words = text.split(' ')
    output = []
    for i in range(len(words) - n + 1):
        output.append(words[i:i + n])
    return output

n_grams_result = ngrams(docs_merge, 3)
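The ngrams function returns every trigram as a list of words. Since Counter was already imported at the top, one way (a sketch, not necessarily how the short list below was compiled) to pull out the most frequent trigrams is:

#Tally the trigrams; lists aren't hashable, so convert each one to a tuple first
trigram_counts = Counter(tuple(gram) for gram in n_grams_result)

#Ten most common trigrams across all the articles
for gram, count in trigram_counts.most_common(10):
    print(" ".join(gram), count)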
The output was big because the data is big, but here’s a short list.
- federal, investigations, often
- interest, money, family
- deported, asap, take
- stealing, government, tax payers
Words like federal or interest are not necessarily negative, but as you can see above, these articles are using them in a negative context, as in “too many federal investigations” or “deport ASAP.”
Finally, let’s look at the readability of fake news. Here I am using the Flesch Reading Ease score (the code below relies on the readability_score package’s FleschKincaid calculator, which reports a minimum reading age based on the related Flesch-Kincaid grade level). The reading ease test was created in 1948 and tells you roughly what level of education someone needs in order to easily read a piece of text. Scores usually fall between 0 and 100, but negative values and numbers over 100 are possible. A sketch of the underlying formulas follows below.
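For reference, here is a minimal, self-contained sketch of the two classic formulas. The syllable counter below is a crude vowel-group heuristic added purely for illustration; the readability_score package does this properly with hyphenation dictionaries.

import re

def count_syllables(word):
    #Crude heuristic: count groups of vowels; real libraries use hyphenation dictionaries
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def flesch_reading_ease(text):
    #Flesch reading ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    words = text.split()
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

def flesch_kincaid_grade(text):
    #Flesch-Kincaid grade level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    words = text.split()
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

print(flesch_reading_ease("Say no to fake news. Read carefully."))
print(flesch_kincaid_grade("Say no to fake news. Read carefully."))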
#---What's the average age of the person who is reading these stories, according to the grammar of the text?
#---Check average number of sentences
dfFake = df[df['spam_score'] >= 0.60]
dfFake.reset_index(inplace=True)

ages = []
for doc in dfFake['text']:
    #the locale is used for hyphenation/syllable counting; since the articles are in English,
    #an English locale (e.g. 'en_GB') is probably a better fit than 'nl_NL'
    #(also note that sentence punctuation was stripped from 'text' earlier, which can affect sentence counts)
    fk = FleschKincaid(doc, locale='nl_NL')
    ages.append(fk.min_age)
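A single number gets reported next, so presumably the per-article values in ages were aggregated somehow; a simple sketch of that step (my assumption, not code from the original) would be:

#Average of the per-article readability values collected above
print(np.mean(ages))

#Quick look at the distribution as well
plt.figure(figsize=(8, 5))
plt.hist(ages, bins=30, color='#ff8d00')
plt.xlabel("Per-article readability value")
plt.ylabel("Number of articles")
plt.savefig("readability_hist.png", bbox_inches="tight")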
The result is 211. 211, guys. So this text is VERY easy to read, with no complexity at all; if you only have a 3rd grade education, you can read this. What does that tell us about fake news and the people it is targeting?
Happy Coding and say no to fake news!