Topic Modeling with Latent Dirichlet Allocation (LDA) in Python
Topic modeling analyzes the documents in a large corpus and suggests the topics in each document. In this blog, I'm going to explain topic modeling with Latent Dirichlet Allocation (LDA) in Python. As I explained in my previous blog, LDA is an unsupervised machine learning technique for NLP that helps find the topics of documents: documents are modeled as probability distributions over a group of latent topics, and topics in turn are modeled as probability distributions over a mixture of words. In other words, any particular document has a discrete probability distribution over the latent topics, and these probabilities vary; a specific document may have a much higher probability for one topic than for the others. Each topic, on the other hand, has a probability distribution over the vocabulary in which some words are more probable than others, which lets the NLP researcher infer the latent (underlying) topic from the highest-probability words that share the same semantics. To read more about LDA in NLP, please visit my previous blog on this link.
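To make this generative story concrete, here is a small toy sketch (not the training algorithm) of how LDA assumes a document is produced. All numbers here, such as the Dirichlet concentration values and the corpus sizes, are illustrative assumptions:

```python
# Toy sketch of LDA's generative story, using made-up sizes and priors.
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size, doc_length = 3, 10, 20

# Each topic is a probability distribution over the vocabulary.
topic_word = rng.dirichlet(np.ones(vocab_size) * 0.1, size=n_topics)
# Each document is a probability distribution over the topics.
doc_topics = rng.dirichlet(np.ones(n_topics) * 0.5)

words_in_doc = []
for _ in range(doc_length):
    z = rng.choice(n_topics, p=doc_topics)       # pick a topic for this word position
    w = rng.choice(vocab_size, p=topic_word[z])  # pick a word from that topic
    words_in_doc.append(w)

print(doc_topics)    # often one topic dominates the document's mixture
print(words_in_doc)  # word indices drawn from the mixed topics
```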
Now let's start with the Python code. For this purpose, I will use a corpus of articles from National Public Radio (NPR). Since I'm going to use a Colab notebook, I need to mount Google Drive. To load the NPR articles, I read the file through the pandas library, adding the file path as follows:
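A minimal sketch of this loading step is below. The file name npr.csv, the Drive path, and the Article column name are assumptions about how the corpus is stored; adjust them to match your own copy:

```python
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')  # mount Google Drive inside the Colab runtime

# Hypothetical path and file name; change to wherever you saved the corpus.
npr = pd.read_csv('/content/drive/My Drive/npr.csv')
print(npr.head())  # each row holds one full article in the 'Article' column
print(len(npr))    # roughly 12,000 documents
```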
As we see above, the NPR data has one long document in each row, and we don't know the latent topic of each document. But before we can use any machine learning algorithm, the text data has to be vectorized. Therefore, I will import CountVectorizer from sklearn, define its parameters, and then fit my data on it as follows:
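Here is a sketch of that vectorizing step, with the parameters explained just below. The column name 'Article' is again an assumption about the CSV layout:

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(npr['Article'])  # sparse document-term matrix
print(dtm.shape)  # (number of documents, size of vocabulary)
```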
Above, the max_df parameter limits words that appear very often in our corpus: setting it to 0.95 tells the vectorizer to ignore words that appear in more than 95% of the documents. The min_df parameter sets the minimum number of documents a word must occur in; we set min_df to two to ignore words that appear in fewer than two documents. Also, we set stop_words to 'english' so that these non-informative words are removed when vectorizing our corpus.
After we fit our corpus to the vectorizer and transform this huge text into numbers, we can see that we have a sparse matrix with almost 12,000 documents and about 55,000 words. These 55,000 words will be parsed to figure out the latent topics. We can take a first look at the words by performing a random selection over the roughly 55,000 vocabulary entries through the random library, collecting them through the get_feature_names method as follows:
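A short sketch of that random inspection is below. Note that newer versions of sklearn name the method get_feature_names_out(), while older versions use the get_feature_names() mentioned above:

```python
import random

words = cv.get_feature_names_out()  # get_feature_names() on older sklearn
for _ in range(20):
    print(random.choice(words))  # print 20 randomly chosen vocabulary entries
```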
As we see above, these 20 words are randomly selected, so we are not yet able to guess the latent topics from this random selection.
Now we are ready to analyze our corpus with LDA, so first I will import LatentDirichletAllocation from sklearn and fit it as follows:
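A minimal sketch of the fitting step, using the parameter choices discussed just below:

```python
from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components=7, random_state=42)
LDA.fit(dtm)  # this can take several minutes on a corpus of this size
```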
As we see above, I set the number of topics to seven. This parameter is just a personal guess about how many latent topics might be in our data. I set random_state to 42 so that the results are reproducible across runs. Then I fit my vectorized text to LDA. Since my corpus is about 12,000 documents, it took me about five minutes to run LDA, but it might take longer on other machines.
Now let's collect the latent topic in each document along with the most probable words attached to each topic. The most probable words can be found through the argsort function, which sorts indices from the lowest value to the highest, so the last entries correspond to the highest-probability words. Since we need to iterate over index and topic pairs, we will use the enumerate function on LDA.components_, where the components are the topics in our corpus, as follows:
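Here is a sketch of that loop; the choice of 15 words per topic is an arbitrary assumption, and it reuses the words array collected from the vectorizer above:

```python
# argsort() orders indices from the smallest weight to the largest,
# so the last 15 indices point at the highest-weighted words.
for index, topic in enumerate(LDA.components_):
    print(f"Top 15 words for topic #{index}:")
    print([words[i] for i in topic.argsort()[-15:]])
    print()
```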
So we see above that the first topic contains words like 'companies', 'money', and 'government', which suggests that this topic is about the government budget, while in topic 7, for example, the most probable words are 'student', 'data', 'schools', and 'education', which suggests that topic 7 is about education.
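Finally, to attach the dominant latent topic to each document, as mentioned earlier, one simple sketch is to transform the document-term matrix and take the argmax over the topic probabilities (the column name 'Topic' is my own choice):

```python
# transform() returns a (documents x topics) probability matrix;
# argmax picks the most probable topic for each article.
topic_results = LDA.transform(dtm)
npr['Topic'] = topic_results.argmax(axis=1)
print(npr.head())
```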
So in this blog, I explained how LDA can perform topic modeling through sklearn; however, the Gensim library does LDA better than sklearn. I will introduce LDA through Gensim in the near future.
LDA is a handy NLP technique for fast analysis of huge text collections. It can be applied in many fields such as news articles, tweets, Facebook reviews, and more, which makes our lives easier and information more accessible.