Gensim Library

Khulood Nasher
3 min read · Mar 26, 2021


What is the Gensim Library?

Gensim is an open-source Python library commonly used for topic modeling on large corpora. It is a popular choice for topic modeling because it handles large amounts of text efficiently and is easy to use.

Installing Gensim

To install the latest version of Gensim, run the following command in a Python notebook:

pip install --upgrade gensim

Create a Dictionary from a list of Sentences

With Gensim we can create a dictionary from a list of sentences and vectorize the words by giving each unique word a numeric id. First we import the ‘corpora’ module from gensim and create a simple document of short sentences. We then tokenize the sentences, pass the tokens to the ‘Dictionary’ class, and read the id assigned to each token from the dictionary's ‘token2id’ attribute.

In my output, the text was tokenized and the tokens were added to a dictionary that detected 30 unique tokens, meaning that each distinct word is stored in the dictionary only once. The last line of the output shows that each word got an id. For example, the word ‘there’ has an id of ‘0’, the word ‘better’ has an id of ‘1’, the word ‘cultures’ has an id of ‘2’, and so on.

To follow the Python code for this post, please visit my GitHub.

Displaying the words in vertical order

To display the tokens above along with their id numbers vertically, I iterate over the dictionary I created with gensim and print each word with its id from ‘token2id.items()’.

As we see above, the words and their ids were printed one per line. I used a column width of ‘12’ for each word and ‘10’ for each id number so that the output lines up neatly.

Reading a Text File with Gensim

Most NLP tasks are performed on text files, so it is important to learn how to open a text file, read it, and analyze it. Gensim has a function called ‘simple_preprocess’ that converts a document into a list of lowercase tokens, which a ‘Dictionary’ can then map to numeric ids available through its ‘token2id’ attribute.

As we can see above, a text file called ‘NLU.txt’ was read, its words were turned into lowercase tokens, and finally every word of the document got an id.

Creating a bag of words from a Text File

Bag of words is an important concept in NLP because it records how many times each word occurs in a text, which gives a rough indication of the word's importance. That helps with our main goal of topic modeling. Gensim can build a bag of words for any document or text file: we read the file, use ‘simple_preprocess’ to split it into lowercase tokens, add the tokens to a dictionary, and then call the dictionary's ‘doc2bow’ method to get the count of each word.

As we can see above, Gensim created the bag of words easily. The words ‘ability’ and ‘across’ each have a count of 2, whereas the word ‘after’ appears only once, so its count is 1.


Khulood Nasher

Data Scientist, Health Physicist, NLP Researcher, & Arabic Linguist. https://www.linkedin.com/in/khuloodnasher https:/khuloodnasher1.wixsite.com/resume