Unpopular opinion: LDA models can be problematic.
tl;dr: LDA models are often overused and misused.
Latent Dirichlet Allocation (LDA) is a very common model used in natural language processing to cluster text data into topics. LDA has become the staple natural language processing model, with hundreds of tutorials on personal blogs and data science websites. We run LDAs on literally every text data set out there. Moreover, it has become incredibly easy to run an LDA with packages such as gensim in Python and tidytext in R. Now, having tools that make these models incredibly easy to implement is not the problem. I, myself, use these packages very often, both for work and for my blog’s pet projects. The issue lies in running these models without understanding why you are running them. LDA does not apply to every text data set, and it should not be blindly applied to every text data set, no matter how easy it is to run.
I think part of the problem also lies in the current obsession with machine learning. “Topic modeling” and “text analytics” are hot buzzwords right now; even “LDA” itself has become one. Take a quick look at how many Medium articles attempt to show you some basic tutorial on how to implement an LDA model. This often leads to a huge misunderstanding of how these models work, and to the wrong idea that we should apply this model to answer every question regarding text data. Consequently, we end up with bad insights that then lead to bad strategy. An LDA model is absolutely not a one-size-fits-all solution.
One of the classic problems I have encountered is applying an LDA model to short text data. By short text data, I mean text from sources like Twitter, Reddit, or Google queries, e.g. when you type something like “restaurant near me” into Google, or when you reply to a Reddit post with “lol, so true.” This is a long-standing open problem in natural language processing: how can we study and get proper results from short text data? I still don’t have a good answer, but an LDA model is definitely not it. How can you apply a model that depends strongly on language structure to data that does not follow any language structure? It is imperative that we learn to differentiate when a model is appropriate to use and when it is not.
In the simplest terms, an LDA model is a probabilistic model that can help us detect topics, or clusters, in text data. LDA uses statistical inference to reveal latent patterns in your data; essentially, it infers model parameters from observations.
It is a generative model: it assumes the topics are specified before any data is generated, models how the data might have been generated, and learns from that. The generative process works by randomly choosing a distribution over topics. Then, for each word in the document:
- a random topic is selected from the distribution over topics
- a random word is selected from that topic, i.e. from the topic’s distribution over the vocabulary.
Hence, each document is a distribution over topics, and each topic is a distribution over words.
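To make the generative story concrete, here is a minimal sketch of that process in Python with numpy. Everything in it is a toy assumption on my part: the vocabulary, the number of topics, and the prior values. A fitted LDA works in the opposite direction, starting from real documents and inferring these quantities.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy setup (all made up): 2 topics over a 6-word vocabulary.
vocab = ["love", "heart", "night", "dance", "party", "baby"]
n_topics = 2
alpha = 0.5  # prior on the per-document topic mixture
beta = 0.1   # prior on the per-topic word distribution

# Each topic is a distribution over the whole vocabulary.
topic_word = rng.dirichlet([beta] * len(vocab), size=n_topics)

def generate_document(n_words):
    # Each document gets its own distribution over topics.
    doc_topics = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=doc_topics)       # pick a topic
        w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from it
        words.append(vocab[w])
    return words

print(generate_document(8))
```

The two Dirichlet draws, one per document and one per topic, correspond to the alpha and beta parameters discussed next.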
An LDA depends on several parameters. Your two main parameters are alpha and beta.
- alpha is a prior on the topic probabilities, i.e. the probability that each topic occurs within a given document. In other words, it controls the mixture of topics per document.
- beta is a prior on the per-topic word distribution; it controls how words are distributed within each topic. You can adjust how concentrated topics are, e.g. a smaller beta will yield topics dominated by fewer words.
Your main goal with this model is then to fine-tune these two parameters so that the generative process best fits your actual corpus. It is clear, then, that LDA depends heavily on how much text your documents contain. If you have really short documents, like tweets, it’s really hard to break them into topics.
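In practice, these priors map directly onto arguments of the common implementations. Here is a hedged sketch using gensim’s LdaModel in Python; note that gensim calls the beta prior eta, and the documents below are invented purely for illustration.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Invented, pre-tokenized documents; a real corpus needs proper cleaning.
docs = [
    ["guitar", "solo", "rock", "stage", "amplifier"],
    ["beat", "bass", "club", "dance", "dj"],
    ["guitar", "rock", "band", "drums", "stage"],
    ["dance", "club", "beat", "night", "dj"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# alpha shapes the per-document topic mixture; eta (beta) shapes the
# per-topic word distribution. Lower values mean sparser mixtures and
# topics dominated by fewer words.
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    alpha=0.1,
    eta=0.01,
    passes=10,
    random_state=42,
)

for topic_id, words in lda.print_topics():
    print(topic_id, words)
```

Tuning usually means re-running with different alpha and eta values (or letting gensim learn them with alpha="auto") and comparing the resulting topics.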
You can think of your LDA as pulling multiple different items from a magician’s hat and then inferring the distribution of the types of items it pulled. If you’re working with short data and you attempt to apply an LDA, your model WILL run. The LDA will cluster your data, as that is what it’s meant to do; however, you’ve limited the amount and diversity of the items your LDA can pull from the magician’s hat. You simply won’t have nearly enough observations for your model to generate reasonable clusters. In addition, when you run these models, you traditionally drop things like stop words, which are likely to make up the bulk of short text data.
Here’s an example of an LDA I ran on 5,000 song titles. How long can a song title be? Three words? Ideal for the short text test!
Some themes, I guess, could be considered actual clusters, but most of them are just the literal words in the song titles.
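To see the stop-word point in action, here is a toy sketch; both the titles and the tiny stop-word list are made up, and a real pipeline would use a full stop-word list.

```python
# Made-up song titles standing in for the 5,000 used above.
titles = [
    "Love Me Do",
    "Dancing in the Dark",
    "All of Me",
    "Let It Be",
    "I Want It That Way",
]

# A tiny illustrative stop-word list.
stop_words = {"me", "do", "in", "the", "of", "it", "i", "all", "be", "that"}

tokenized = [
    [word for word in title.lower().split() if word not in stop_words]
    for title in titles
]
print(tokenized)
# [['love'], ['dancing', 'dark'], [], ['let'], ['want', 'way']]
```

After preprocessing, most “documents” are down to one or two tokens, or nothing at all, which leaves the model almost no co-occurrence signal to infer topics from.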
Now, this doesn’t mean you can’t study short text. You just have to be careful with how and why you use these models. There are alternatives for working with short text data: you can focus instead on n-grams, word correlations, or keyword extraction. It is worth mentioning, though, that researchers have in some cases been successful using variants of this model, but it is far from an easy task.
Here’s a quick-reference checklist for what to do if you’re asked, or tempted, to run an LDA.
Do not use an LDA if:
- Your text data does not follow reasonable language structures.
- Your documents only consist of a handful of words.
- You can’t apply basic text processing (e.g. removing stop words) to your data without losing context.
Instead use:
- n-grams
- word correlations
- keyword extraction, e.g. RAKE or TF-IDF (a quick sketch follows below)
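For instance, here is a minimal TF-IDF keyword-extraction sketch with scikit-learn. The documents are invented, and in a real use case you would tune the vectorizer (n-gram range, stop words, tokenization) to your corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented short documents; any corpus of titles, queries, or tweets works.
docs = [
    "restaurant near me",
    "best pizza restaurant downtown",
    "late night pizza delivery",
    "coffee shop near me open now",
]

# Score unigrams and bigrams together so short phrases can surface.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Top-scoring terms for the second document.
terms = vectorizer.get_feature_names_out()
scores = tfidf[1].toarray().ravel()
top = scores.argsort()[::-1][:3]
print([(terms[i], round(scores[i], 3)) for i in top])
```

The same idea scales up: rank every document’s terms by TF-IDF and keep the top few as its keywords, no topic model required.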