I’ve been meaning to dig into this data set for a while. There is so much text analytics we can learn from it, and so much it can teach us about how to be better readers when we run into fake news. The data set comes from Kaggle.com, so if you want to follow along, go ahead and download it here.

In this post, we will see how to clean text data, do text exploration, set up an LDA, compute N-grams and compute the readability of the text. Yay!

tl;dr: The complexity and readability of fake news articles is at about a 3rd-grade level. #SAD!

Let’s import the libraries we will use.

%matplotlib inline
#---Import libraries
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import nltk
from nltk.corpus import stopwords
from collections import Counter
from nltk.stem import PorterStemmer
import gensim
import logging
from gensim import corpora
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from readability_score.calculators.fleschkincaid import *
from readability_score.calculators.dalechall import *
import datetime as dt
import unicodedata
import seaborn as sns
sns.set(style="darkgrid")
import warnings
warnings.filterwarnings('ignore')

Then, let’s load the stopwords and define a function to take care of weird characters we are likely to run into later on.

#---Set up stopwords and functions
stop = stopwords.words('english')

def strip_accents(s):
    "Remove those little naughty, pain-giving characters!"
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')
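As a quick sanity check (my own example, not from the original post), the function strips accents and other combining marks:

print(strip_accents(u"café con leche"))  # -> cafe con leche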

Read your data in and take a look:

#---Read data
file_ = "fake.csv"
df = pd.read_csv(file_)
df.head()

Let’s begin with the data processing!

#I want to remove the timestamp from the date because it will be easier when we
# are grouping by date.
df['published']  = pd.to_datetime(df['published']).apply(lambda x: x.date())

#--- Initial text cleaning
df['title'] = df['title'].str.lower() #lower case
df['title'] = df['title'].str.replace(r'[^\w\s]', '', regex=True) #remove punctuation and...
df['title'] = df['title'].str.replace(r'\d+', '', regex=True) #...numbers

df['text'] = df['text'].astype(str)
df['text'] = df['text'].str.lower()
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True) #remove punctuation and...
df['text'] = df['text'].str.replace(r'\d+', '', regex=True) #...numbers
df['text'] = [x.replace("\r\n", "") for x in df['text']]
df['text'] = [x.replace("\n", "") for x in df['text']]
df['text'] = [x.replace("2x", "") for x in df['text']]

Now, I think we’re in a good place to start exploring our data. Let’s start by plotting the number of publications per day from October 25th to November 25th.

#--- Plot publications per day
publicationsPerDate = df.groupby('published')['title'].count().reset_index()

#fontweight="bold"
plt.figure(figsize=(13, 14))
ax = plt.subplot(111)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
dateFmt = mdates.DateFormatter('%b %d %Y')
ax.xaxis_date()
ax.xaxis.set_major_formatter(dateFmt)
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()
plt.tick_params(axis="both", which="both", bottom=False, top=False, labelbottom=True, left=False, right=False, labelleft=True)
plt.yticks(fontsize=14)
plt.xticks(fontsize=14)
plt.plot(publicationsPerDate['published'], publicationsPerDate['title'], color='#ff8d00', marker = '.', markersize=12)
plt.text(dt.date(2016, 11, 7), 1600, "Number of Published Fake News Per Day"+"\n"+"from October to November 2016", fontsize=16, fontweight="bold")
plt.text(dt.date(2016, 10, 26), 10, "Data source: https://www.kaggle.com/mrisdal/fake-news"
       "\nAuthor: Aisha Pectyo (thingsgrow.me / @things_grow)", fontsize=11)
plt.savefig("1.png", bbox_inches="tight");

[Figure: number of published fake news articles per day, October to November 2016]

Hmm… that is VERY interesting. It seems the bulk of the fake news was published in the weeks leading up to the election, and then it started dying down.

Now, let’s look at the publications by source and by country.

#---Publications by source, where are the top fake news coming from?
publicationsPerSource = df.groupby(['author', 'site_url'])['title'].count().reset_index()
publicationsPerSource.sort_values('title', axis=0, ascending=True, inplace=True)
(publicationsPerSource).tail(10)

#---Publications by country, where are the top fake news coming from?
publicationsPerCountry = df.groupby('country')['title'].count().reset_index()
publicationsPerCountry.sort_values('title', axis=0, ascending=True, inplace=True)
(publicationsPerCountry).tail(10)

Here’s a graphic of the results (“Fake news writers: who are they and from where?”).
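If you’d like to reproduce something similar yourself, here’s a minimal sketch of a bar chart of the top sources (my own addition; it only uses the publicationsPerSource frame computed above, and the styling choices are arbitrary):

#--- Optional: bar chart of the top 10 (author, site) pairs by article count (a sketch, not the original figure)
topSources = publicationsPerSource.tail(10)
labels = topSources['author'].astype(str) + " (" + topSources['site_url'] + ")"
plt.figure(figsize=(10, 6))
plt.barh(range(len(topSources)), topSources['title'], color='#ff8d00')
plt.yticks(range(len(topSources)), labels, fontsize=11)
plt.xlabel("Number of articles")
plt.title("Top 10 fake news sources by article count")
plt.tight_layout()
plt.savefig("sources.png", bbox_inches="tight");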

Let’s now build up our LDAs. We will do it for both the titles and the actual text in the articles.

#---Explore titles- What do the titles say?
df['title'] = df['title'].astype(str)
docs = df['title'].tolist()
#Remove stopwords
for j in range(len(docs)):
    docs[j] = " ".join([i for i in docs[j].lower().split() if i not in stop])

#---Set up LDA
#You need the next two lines to be able to use the gensim package
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO

#tokenize your sentences
from nltk.tokenize import word_tokenize
tokenized_sents = [word_tokenize(i) for i in docs]

#define the dictionary you will use to train the model
id2word = corpora.Dictionary(tokenized_sents)

#turn it into a dtm
corpus = [id2word.doc2bow(doc) for doc in tokenized_sents]

#Run the model and print
lda_model = gensim.models.ldamodel.LdaModel
result_lda_model = lda_model(corpus, num_topics=6, id2word=id2word, passes=50) #reuse the dictionary built above

#the output from this is kind of messy, so we'll use pprint
from pprint import pprint
pprint(result_lda_model.print_topics(num_topics=6, num_words=5))

#--- We have to clean the article text a bit more to avoid encoding issues
docs_uncleaned = df['text'].tolist()
docs = []
for doc in docs_uncleaned:
    cleaned = strip_accents(doc) #strip accents and other combining characters
    docs.append(cleaned)

#--- Remove stopwords
for j in range(len(docs)):
    docs[j] = " ".join([i for i in docs[j].lower().split() if i not in stop])

#---Set up LDA
#You need the next two lines to be able to use the gensim package
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO

#tokenize your sentences
from nltk.tokenize import word_tokenize
tokenized_sents = [word_tokenize(i) for i in docs]

#define the dictionary you will use to train the model
id2word = corpora.Dictionary(tokenized_sents)

#turn it into a dtm
corpus = [id2word.doc2bow(doc) for doc in tokenized_sents]

#Run the model and print
lda_model = gensim.models.ldamodel.LdaModel
result_lda_model = lda_model(corpus, num_topics=6, id2word=id2word, passes=10) #reuse the dictionary built above

#the output from this is kind of messy, so we'll use pprint
from pprint import pprint
pprint(result_lda_model.print_topics(num_topics=6, num_words=5))

Here are two graphic representations of the results.

So… as expected, and as we’ve seen in the news, the topics revolve around Hillary, emails, Russia, war, etc. These keywords seem to be used as clickbait.
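If you want to turn the printed topics into something more visual yourself, here is a minimal sketch (my own addition; it only assumes the result_lda_model object fitted above and its six topics):

#--- Optional: plot the top words per topic as horizontal bars (a sketch, not the original graphics)
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for topic_id, ax in enumerate(axes.flatten()):
    top_words = result_lda_model.show_topic(topic_id, topn=5) #list of (word, weight) pairs
    words = [w for w, _ in top_words]
    weights = [wt for _, wt in top_words]
    ax.barh(range(len(words)), weights, color='#ff8d00')
    ax.set_yticks(range(len(words)))
    ax.set_yticklabels(words)
    ax.invert_yaxis()
    ax.set_title("Topic %d" % topic_id)
plt.tight_layout()
plt.savefig("topics.png", bbox_inches="tight");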

Let’s keep digging and look at N-grams. An N-gram is a contiguous sequence of N words in a given text. N-grams of size N=1 are called unigrams, N=2 bigrams, N=3 trigrams, and anything larger is simply referred to by the value of N: four-grams, five-grams, and so on.

#--- Find N-grams
#I want to know the n-grams in all the stories combined, 501430
docs_merge = ' '.join(docs) #let's make this all into one massive string
def ngrams(text, n):
    words = text.split(' ')
    output = []
    for i in range(len(words) - n + 1):
        output.append(' '.join(words[i:i+n])) #join each window of n words into one string
    return output
n_grams_result = ngrams(docs_merge, 3)

The output was big because the data is big, but here’s a short list.

  • federal, investigations, often
  • interest, money, family
  • deported, asap, take
  • stealing, government, tax payers

Words like federal or interest are not necessarily negative, but as you can see above, these articles use them in a negative context, as in “too many federal investigations” or “deport ASAP.”
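If you’d rather not scan the full output, a quick way to surface the most frequent trigrams (my own addition; it assumes ngrams returns each trigram as a joined string, as above) is the Counter we imported at the start:

#--- Optional: most common trigrams (a sketch)
trigram_counts = Counter(n_grams_result)
for gram, count in trigram_counts.most_common(10):
    print(count, gram)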

Finally, let’s look at the readability of fake news. Here I am using the Flesch Reading Ease score. The test was created in 1948 as a readability measure, and it roughly tells you what level of education someone needs in order to read a piece of text easily. Scores usually fall between 0 and 100, but negative values and values over 100 are possible. (The readability_score package's Flesch-Kincaid calculator used below converts this kind of measure into a minimum reading age.)

#---What's the average age of the person who could read these stories, according to the grammar of the text?
#---Collect the minimum reading age for each article
dfFake = df[df['spam_score'] >= 0.60]
dfFake.reset_index(inplace=True)
ages = []
for doc in dfFake['text']:
    fk = FleschKincaid(doc, locale='nl_NL')
    ages.append(fk.min_age)
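For completeness, here’s a minimal way (my own addition) to summarize the minimum reading ages we just collected:

#--- Optional: summarize the minimum reading ages (a sketch)
ages = np.array(ages)
print("mean minimum reading age:", ages.mean())
print("median minimum reading age:", np.median(ages))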

The result is 211. 211, guys. So this text is VERY easy to read, with no complexity at all; with only a 3rd-grade education you could read this. What does that tell us about fake news and the people it is targeting?

Happy Coding and say no to fake news!

Posted by: Aisha Pectyo

Astrophysicist turned data rockstar who speaks code and has enough yarn and mod podge to survive a zombie apocalypse.
