Latent Dirichlet Allocation (LDA)

Khulood Nasher
Feb 13, 2021

In a previous blog, I introduced one of the important NLP techniques, text clustering, which deals with unlabeled data and tries to group it into clusters based on text similarity. To read about text clustering, please visit my blog on text clustering and clustering by K-means. Latent Dirichlet Allocation (LDA) is an unsupervised machine learning algorithm that clusters unlabeled texts. It is a topic modeling technique, and it should not be confused with the other LDA, Linear Discriminant Analysis, which is a supervised machine learning algorithm.

What is Latent Allocation?

You might have heard about latent energy or latent heat in high school when you studied physics. Latent is an adjective that means hidden or not yet visible (it comes from the Latin word ‘latere’, to lie hidden). Think of a big rock on the edge of a high mountain that is about to fall and roll down: the rock stores hidden (potential) energy, which changes to kinetic energy when the rock rolls down the mountain. The same concept of latent is used in NLP here. But what could be latent (hidden) in NLP?

Well, since NLP deals with texts, we can guess that what is latent (hidden) is something inside the text: its topics.

So we now know that there are hidden topics in a corpus, but we still want to know how these latent topics are allocated to the documents.

Latent Dirichlet Allocation is one of the topic modeling techniques in NLP that aim to cluster unlabeled texts or documents into specific topics. For example, take an article from the NY Times about President Biden’s plan to overcome COVID-19, which suggests a timely strategy of distributing vaccines, opening up schools and businesses, supporting the unemployed, issuing stimulus checks, etc. Using the unsupervised Latent Dirichlet Allocation, this article might be assigned to several topics such as unemployment, taxes, health, etc. The article includes a large volume of text, and topic modeling clusters this text into topics that share some common features. Once the documents are clustered into topics, labels can be derived from those topics, so the text data can then be vectorized, labeled, and analyzed by supervised machine learning algorithms.

The ‘Dirichlet’ part of Latent Dirichlet Allocation is named after the German mathematician Johann Peter Gustav Lejeune Dirichlet, who lived in the 1800s. The NLP technique itself only borrows the probability distribution that carries his name. LDA was first published as a topic modeling technique in 2003, where topics are determined based on common features.

LDA in topic modeling rests on the idea that documents that share the same topics use the same groups of words. The abstract topics (latent topics) can be determined by searching for the groups of words that appear together many times in the documents across the corpus. The Dirichlet probability distribution is used twice: each document has a distribution over topics, and each topic in turn has a distribution over words.
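To make the two levels of distributions more concrete, here is a minimal sketch of how a document could be generated under this idea, using only the Python standard library. The vocabulary, the number of topics, and the Dirichlet parameters (0.5) are assumptions I picked for illustration; they are not values from a real corpus.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(vocab, topic_word, doc_alpha, n_words, rng):
    """Pick a topic mixture for the document, then a topic and a word per position."""
    theta = sample_dirichlet(doc_alpha, rng)  # document-over-topics distribution
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
# Hypothetical vocabulary and settings, chosen only for this illustration.
vocab = ["cat", "dog", "vaccine", "school", "tax", "stimulus"]
n_topics = 3
# One topic-over-words distribution per topic.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(n_topics)]
theta, words = generate_document(vocab, topic_word, [0.5] * n_topics, 8, rng)

print([round(p, 2) for p in theta])
print(words)
```

LDA works in the opposite direction: it only sees the words and tries to recover the hidden topic mixtures and word distributions that could have produced them.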

As we can see in the above picture, the latent topics follow a probability distribution over Doc1. For example, here we see that Topic2 has the highest probability. Therefore we can tell that Doc1 is mainly about Topic2.

As we saw above, each document in the corpus has a probability distribution over topics. The same concept holds inside the topics, where the words follow a probability distribution. As we can see in the above picture, Topic#1, for example, gives higher probability to the words cat and dog than to other words, which gives us a hint that Topic#1 might be about pets.
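Reading these two kinds of distributions can be sketched in a few lines of Python. The probabilities below are hypothetical numbers that I made up to mirror the description; they are not the values from the picture.

```python
# Hypothetical document-over-topics distribution for Doc1.
doc1_topics = {"Topic1": 0.15, "Topic2": 0.60, "Topic3": 0.25}

# Hypothetical topic-over-words distribution for Topic#1.
topic1_words = {"cat": 0.35, "dog": 0.30, "fish": 0.15, "tax": 0.10, "school": 0.10}

def dominant_topic(topic_probs):
    """Return the topic with the highest probability in a document."""
    return max(topic_probs, key=topic_probs.get)

def top_words(word_probs, n):
    """Return the n most probable words of a topic, most probable first."""
    return sorted(word_probs, key=word_probs.get, reverse=True)[:n]

print(dominant_topic(doc1_topics))  # Topic2
print(top_words(topic1_words, 2))   # ['cat', 'dog']
```

The model only returns the probabilities. Naming Topic#1 "pets" is an interpretation that we make by looking at its most probable words.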

However, the NLP researcher must choose the number of topics at the beginning and then use the multinomial distribution to find the most probable topic in each document. The researcher then uses the multinomial distribution again to find the most probable words in each topic. This process is iterated many times until further repetition no longer gives any new findings, and we can decide which topics we have and which group of words accompanies each topic. In other words, the researcher decides what each topic is about based on the probability values of the words in that topic.
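One common way to carry out this kind of iteration is collapsed Gibbs sampling. The sketch below is my own minimal illustration in plain Python, not necessarily the procedure a library would use. The function name fit_lda_gibbs, the toy documents, the values of alpha and beta, and the fixed number of iterations (instead of a real convergence check) are all assumptions.

```python
import random

def fit_lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns vocab, theta, and phi."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]              # words of doc d in topic k
    topic_word = [[0] * n_vocab for _ in range(n_topics)]   # times word v is in topic k
    topic_total = [0] * n_topics                            # words assigned to topic k
    assignments = []
    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    # Repeatedly resample the topic of each word given all other assignments.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Turn the counts into smoothed probability distributions.
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
             for d, doc in enumerate(docs)]
    phi = [[(c + beta) / (topic_total[k] + n_vocab * beta) for c in topic_word[k]]
           for k in range(n_topics)]
    return vocab, theta, phi

# Hypothetical toy corpus: two documents about pets, two about health.
docs = [
    "cat dog pet cat dog".split(),
    "dog pet cat pet".split(),
    "vaccine school health vaccine".split(),
    "health vaccine school health".split(),
]
vocab, theta, phi = fit_lda_gibbs(docs, n_topics=2)
for k, row in enumerate(phi):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print("Topic", k, [w for _, w in top])
```

Here theta holds the topic probabilities of each document and phi holds the word probabilities of each topic. The number of topics (2) is still something we have to choose before running the sampler.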

In the next blog, I will use LDA in Python to find the clusters (latent topics) of an article.

References:

1- Hesham Asem, NLP 5.3.1 LDA الجزء الأول (Part One)

2- Udemy, NLP - Natural Language Processing with Python

Khulood Nasher

Data Scientist, Health Physicist, NLP Researcher, & Arabic Linguist. https://www.linkedin.com/in/khuloodnasher https:/khuloodnasher1.wixsite.com/resume