Instead of a lengthy tutorial, I hope this post serves as a quick reference for those who work with text data on an almost daily basis. Check back for updates, as I will keep adding recipes here as I develop them!
Remove Accents:
On many occasions, I have had to work with data in Spanish or Portuguese, and given the accented characters of these languages, further manipulation may be needed. Below is a quick function to remove accents from text. It takes in a string, normalizes it to its decomposed Unicode form (NFD), and drops the combining marks.
#---Import libraries
import unicodedata

def remove_accents(row):
    # Decompose each character (NFD), then drop the combining marks ("Mn")
    return "".join(
        c for c in unicodedata.normalize("NFD", row)
        if unicodedata.category(c) != "Mn"
    )

#example
df['column'].apply(remove_accents)
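For example, here is the same approach applied to a couple of standalone Spanish and Portuguese strings:

```python
import unicodedata

def remove_accents(row):
    # Decompose each character (NFD), then drop the combining marks ("Mn")
    return "".join(
        c for c in unicodedata.normalize("NFD", row)
        if unicodedata.category(c) != "Mn"
    )

remove_accents("canción")    # 'cancion'
remove_accents("São Paulo")  # 'Sao Paulo'
```

Note that this also maps ñ to n and ç to c, which is usually what you want for matching and deduplication, but keep it in mind if those distinctions matter in your dataset.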
Remove HTML Tags:
I often have to pull online data, and it is not rare, actually quite the opposite, that it comes with undesired characters, like leftover HTML tags. We can easily remove these using regular expressions. This function takes a string and strips anything between < and >.
#---Import libraries
import re

def removeHTMLtags(row):
    # Non-greedy pattern: matches anything between < and >, one tag at a time
    encodings_to_remove = re.compile("<.*?>")
    clean_row = re.sub(encodings_to_remove, "", row)
    return clean_row

#example
df['column'].apply(removeHTMLtags)
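The non-greedy pattern <.*?> is one common choice (an assumption here, since the exact pattern is a matter of taste); the non-greedy qualifier matters, because a greedy <.*> would swallow everything from the first < to the last >:

```python
import re

tag_pattern = re.compile(r"<.*?>")  # non-greedy: stop at the first >

raw = "<p>Titular de <b>noticias</b></p>"
clean = re.sub(tag_pattern, "", raw)  # 'Titular de noticias'
```

For heavily nested or malformed HTML, a real parser such as BeautifulSoup is more robust than a regex, but for quick cleanup this does the job.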
Funky Characters:
Sometimes I have run into situations where removing HTML tags or punctuation, etc., won't work, and this is where this little function came in. It simply uses string replacements to find these funky characters and substitute them with blanks. You can add more replacements here; these are just the common characters I've run into.
#---No imports needed, just plain string replacements
def sanitize(row):
    sanitizedText = row.replace("&nbsp;", "")   # non-breaking-space entity
    sanitizedText = sanitizedText.replace("&", "")
    sanitizedText = sanitizedText.replace("\xa0", "")  # non-breaking space
    sanitizedText = sanitizedText.replace("\t", "")    # tabs
    return sanitizedText

#example
df['column'].apply(sanitize)
Stop Words:
You can easily use stop word lists from NLTK, gensim, etc., but what is really crucial when cleaning your text data is to fit the list to your needs; mainly, removing words that you know add nothing in your specific dataset's context. For example, the word "computer" is not a stop word by definition, but perhaps it is so common in YOUR dataset, or carries a different meaning there, that it would make sense to remove it. Here's a recipe to do that.
#---Import libraries
#Import stopwords with nltk.
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
#Add your own domain-specific words here, e.g. stop.add('computer')

def stop_words(row):
    # Keep only the tokens that are not in the stop list
    return " ".join(word for word in row.split() if word not in stop)

#example
df['no_stopwords'] = df['column'].apply(stop_words)
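To see the domain-specific idea in isolation, here is a minimal sketch with a tiny hand-rolled stop list (no NLTK download needed), extended with the hypothetical "computer" example from above:

```python
# A toy stop list; in practice you would start from NLTK's list
custom_stop = {"the", "a", "is", "and"}
custom_stop.add("computer")  # domain-specific word with no signal in this dataset

def drop_custom(text):
    # Lowercase each token before the membership test so "The" matches "the"
    return " ".join(w for w in text.split() if w.lower() not in custom_stop)

drop_custom("The computer is a fast machine")  # 'fast machine'
```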
Numbers and Punctuation:
In order to prepare your text for most text modeling approaches, you'll need to remove punctuation and numbers from your text data set. Check out the function below.
#---Import libraries
import re

def removePunctNum(row):
    # Strip anything that is not a word character or whitespace
    cleanedColumn = re.sub(r'[^\w\s]', '', row)
    # Then strip the digits
    cleanedColumn = re.sub(r'\d+', '', cleanedColumn)
    return cleanedColumn

#example
df['cleaned_text'] = df['column'].apply(removePunctNum)
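A quick standalone check of those two patterns on a made-up string; note that removing digits can leave double spaces behind, which a split/join pass cleans up:

```python
import re

text = "Hello, world! Call me at 555."
no_punct = re.sub(r"[^\w\s]", "", text)   # drop punctuation
no_nums = re.sub(r"\d+", "", no_punct)    # drop digits
result = " ".join(no_nums.split())        # collapse leftover whitespace
# result == 'Hello world Call me at'
```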
P.S. All of these functions can be grouped into one big cleaning function and applied to a column in a single pass.
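As a sketch of that idea, here is one possible all-in-one cleaner chaining the steps from this post (order matters: strip tags before punctuation, since < and > are punctuation too; stop word removal is left out to keep it NLTK-free, but it would chain in the same way):

```python
import re
import unicodedata

def clean_text(row):
    row = re.sub(r"<.*?>", " ", row)                   # HTML tags
    row = "".join(c for c in unicodedata.normalize("NFD", row)
                  if unicodedata.category(c) != "Mn")  # accents
    row = re.sub(r"[^\w\s]", "", row)                  # punctuation
    row = re.sub(r"\d+", "", row)                      # numbers
    return " ".join(row.split())                       # tidy whitespace

clean_text("<p>¡Hola, José! 42</p>")  # 'Hola Jose'

#example
#df['clean'] = df['column'].apply(clean_text)
```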