Non-negative Matrix Factorization

Khulood Nasher
4 min readFeb 28, 2021


Non-negative Matrix Factorization (NNMF), also called positive matrix factorization, is another NLP technique for topic modeling. In a previous blog, I presented topic modeling with Latent Dirichlet Allocation (LDA). To read more about LDA, please click here.

NNMF differs from LDA because it depends on creating two matrices initialized with random numbers. The first matrix represents the relationship between words and topics, while the second matrix represents the relationship between topics and documents; together they form the mathematical basis for categorizing texts, as in LDA. NNMF is also used in image-processing applications.

NNMF is often faster than LDA. That can be explained simply: LDA depends on the frequency of words, and topics are selected according to how strongly those words are represented, whereas NNMF starts from random correlation values between words and topics and adjusts those weights as training is repeated. NNMF is also favored for its dimensionality reduction.

Why is NNMF called Non-negative Matrix Factorization?

According to its name, it takes only non-negative values (a positive matrix) of word counts. This non-negative matrix A will be decomposed into two non-negative matrices (B and C). Our text data will be read as a data frame through the pandas library, where each row is a document of raw text.

The big non-negative matrix A (words/documents) will be approximated by the product of two non-negative matrices: matrix B (words/topics) × matrix C (topics/documents). The numeric values in A are the counts of words in each document, and the words with the largest weights in matrix B suggest the name of each topic. The number of topics is suggested by the NLP analyst at the beginning, and the analyst also interprets the type of each topic based on the collection of words that share the same features. Matrix C then maps these topics back onto our original documents.
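To make the decomposition concrete, here is a minimal sketch with sklearn's NMF on a toy word-count matrix (the values and topic count are invented for illustration). Note that sklearn's fit_transform returns the documents × topics factor and components_ holds the topics × words factor, so the roles of B and C above are transposed depending on whether documents sit in rows or columns:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative matrix A: 4 documents (rows) x 6 words (columns).
# The first two documents share one vocabulary, the last two another.
A = np.array([
    [3, 0, 2, 0, 1, 0],
    [1, 0, 4, 0, 0, 1],
    [0, 2, 0, 3, 0, 2],
    [0, 1, 0, 4, 1, 3],
], dtype=float)

# Decompose A into two non-negative factors with 2 topics.
model = NMF(n_components=2, init="random", random_state=42, max_iter=1000)
W = model.fit_transform(A)   # documents x topics
H = model.components_        # topics x words

print(W.shape, H.shape)      # (4, 2) (2, 6)
print(np.round(W @ H, 1))    # the product approximates A; every entry is non-negative
```

Because every entry of the two factors is non-negative, each document is expressed as an additive mixture of topics, which is what makes the factors interpretable.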

Preprocessing the Text:

The following steps are suggested as preprocesses of the text data before modeling by NNMF:

1- Removing stopwords and setting the parameters for the minimum and maximum document frequency a word may have across all documents.

2- Vectorizing the words through a count vectorizer or TF-IDF. TfidfVectorizer can be imported from the sklearn library.

3- Setting the number of topics.

We get the counts of words, as mentioned above, through the count vectorizer or through term frequency-inverse document frequency (TF-IDF). TF-IDF gives a word less importance when it appears frequently across many documents, so such words receive lower values.

Here I vectorize my text data through TF-IDF. Our texts were transformed into a big sparse matrix where the 12,000 documents are represented in rows and the 55,000 words are represented in columns.
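A minimal sketch of that vectorization step (the four example documents here are invented; the real corpus above had 12,000 documents and 55,000 words):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "patients respond well to the new treatment in the clinical study",
    "the hospital reported improved patient health after the treatment",
    "teachers and schools prepare students for college education",
    "the college expanded its education programs for new students",
]

# stop_words removes common English words; max_df and min_df bound how
# common or rare a word may be across documents for it to be kept.
tfidf = TfidfVectorizer(stop_words="english", max_df=0.95, min_df=1)
dtm = tfidf.fit_transform(docs)   # sparse matrix: documents x words

print(dtm.shape)   # (4, number of kept words)
```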

Modeling:

In the modeling step, we will import NMF from sklearn, create an instance of the class with the suggested number of topics (the number of components), then fit the instance and transform our vectorized text data:

from sklearn.decomposition import NMF

# create an instance of the class
nmf = NMF(n_components=7, init='random')

# fit the model to the tfidf matrix and get the document-topic matrix
doc_topics = nmf.fit_transform(dtm)

We can view the words in the vocabulary through 'tfidf.get_feature_names' and pass the index number of a word. The top 15 words in a topic can then be viewed through the 'argsort' function, which sorts values from least to highest, so the last indices correspond to the highest-weight words. We need to iterate with an index, therefore we will use the enumerate function on 'nmf.components_', where each component row is one topic in our corpus, as follows:

As we can see above, the first topic contains words like research, patients, health, and disease, which suggests that this topic is about health-care research, while in topic 7, for example, the most heavily weighted words are 'teacher, college, schools, education', which suggests that topic 7 is about education.

NNMF is an easy NLP technique for fast analysis of huge collections of text. It can be applied in many fields such as news, tweets, Facebook reviews, and more, which makes our lives easier and information more accessible.

References:

1- Non-negative Matrix Factorization, Coursera

2- Non-negative Matrix Factorization, Udemy, NLP — Natural Language Processing with Python
