Instead of a lengthy tutorial, I hope this post serves as a quick reference for those who work with text data on an almost daily basis. Check back for updates, as I will be adding new recipes here as I develop them!

Strip Accents:

On many occasions I have had to work with data in Spanish or Portuguese, and given the accented characters in these languages, further manipulation may be needed. Below is a quick function to remove accents from text. It decomposes the string into Unicode NFD form and then drops the combining marks.

#---Import libraries
import unicodedata

def remove_accents(row):
    # Decompose to NFD, then keep every character that is not a combining mark ("Mn")
    return "".join(c for c in unicodedata.normalize("NFD", row) if unicodedata.category(c) != "Mn")
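A quick sanity check of the recipe above (the function is repeated so the snippet runs on its own):

```python
import unicodedata

def remove_accents(row):
    # Decompose to NFD, then drop the combining marks ("Mn")
    return "".join(c for c in unicodedata.normalize("NFD", row) if unicodedata.category(c) != "Mn")

print(remove_accents("São Paulo, coração"))  # Sao Paulo, coracao
```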



Remove HTML Tags:

I often have to pull online data, and it is not rare, actually quite the opposite, for it to come with undesired characters, like HTML tags. We can easily remove these using regular expressions. This function takes a string and, using a regular expression, removes anything between angle brackets.

#---Import libraries
import re

def removeHTMLtags(row):
    encodings_to_remove = re.compile(r"<.*?>")  # non-greedy: match each tag separately
    clean_row = re.sub(encodings_to_remove, "", row)
    return clean_row




Remove Funky Characters:

Sometimes I have run into situations where removing HTML tags or punctuation won't work, and this is where this little function comes in. It simply replaces these funky characters, things like leftover HTML entities, non-breaking spaces, and tabs, with blanks. You can add more replacements here; these are just the common characters I've run into.

#---Import libraries
import re

def sanitize(row):
    # Replace leftover HTML entities and whitespace oddities with plain blanks
    sanitizedText = row.replace("&nbsp;", " ")
    sanitizedText = sanitizedText.replace("&amp;", " ")
    sanitizedText = sanitizedText.replace("\xa0", " ")
    sanitizedText = sanitizedText.replace("\t", " ")
    return sanitizedText
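If the funky characters are really HTML entities ("&amp;", "&nbsp;", and friends), the standard library can decode them all at once instead of listing each one; this is an alternative sketch, not part of the original recipe:

```python
import html

def unescape_entities(row):
    # Decode every HTML entity (e.g. &amp; -> &, &nbsp; -> non-breaking space)
    text = html.unescape(row)
    # Collapse the whitespace (including non-breaking spaces) left behind
    return " ".join(text.split())

print(unescape_entities("fish &amp; chips&nbsp;&nbsp;to go"))  # fish & chips to go
```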



Remove Stopwords:

You can easily use stopword lists from NLTK, gensim, etc., but what is really crucial when cleaning your text data is to fit the list to your needs; mainly, removing words that you know add nothing in your specific dataset's context. For example, the word "computer" is not a stopword by definition, but perhaps it is common in YOUR dataset, with a meaning particular to YOUR dataset, and it would make sense to remove it. Here's a recipe to do that.

#---Import libraries
#Import stopwords with nltk.
from nltk.corpus import stopwords
stop = stopwords.words('english')
#Add your own domain-specific words, e.g. stop.append('computer')

def stop_words(row):
   #Keep only the words that are not in the stop list
   return " ".join(word for word in row.split() if word not in stop)

df['no_stopwords'] = df['column'].apply(stop_words)
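The same recipe works without NLTK if you hand-roll the stop list; the set below, with "computer" standing in as the domain-specific word from the example, is purely illustrative:

```python
# Hand-rolled stop list; "computer" stands in for a domain-specific word
stop = {"the", "a", "is", "computer"}

def stop_words(row):
    # Drop any token found in the stop list (case-insensitive)
    return " ".join(word for word in row.split() if word.lower() not in stop)

print(stop_words("the computer is a fast machine"))  # fast machine
```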


Numbers and Punctuation: 

In order to prepare your text for most text modeling approaches, you'll need to remove punctuation and numbers from your data set. Check out the function below.

#---Import libraries
import re

def removePunctNum(row):
   cleanedColumn = re.sub(r'[^\w\s]', '', row)   #drop punctuation
   cleanedColumn = re.sub(r'\d+', '', cleanedColumn)   #drop numbers
   return cleanedColumn

df['cleaned_text'] = df['column'].apply(removePunctNum)


Happy Coding!
PS. All these functions can be chained together into one big cleaning function.
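A minimal sketch of such a combined cleaner (stopword removal is left out so the snippet has no NLTK dependency):

```python
import re
import unicodedata

def clean_text(row):
    row = re.sub(r"<.*?>", "", row)  # strip HTML tags
    row = "".join(c for c in unicodedata.normalize("NFD", row)
                  if unicodedata.category(c) != "Mn")  # strip accents
    row = re.sub(r"[^\w\s]", "", row)  # strip punctuation
    row = re.sub(r"\d+", "", row)  # strip numbers
    return row.strip()

print(clean_text("<b>Olá,</b> mundo 2023!"))  # Ola mundo
```

Applied to a DataFrame, that would be df['clean'] = df['column'].apply(clean_text).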

Posted by: Aisha Pectyo

Astrophysicist turned data rockstar who speaks code and has enough yarn and mod podge to survive a zombie apocalypse.
