Text Clustering

Khulood Nasher
Jan 30, 2021

NLP offers multiple methods for analyzing text data. One of the most popular is text classification, where the texts come with labels and we train a model to predict those labels, as we did when classifying movie reviews as positive or negative. For details, please visit my blog on Movie Classification.

But what if I have text data, such as movie reviews, that is not labeled? How can I train my algorithm on unlabeled text? This is a serious problem for anyone assigned to analyze movie reviews, tweets, or any other kind of unlabeled text data. The researcher may have limited resources, and manually labeling thousands of product reviews or millions of tweets is slow and costly. Manually labeling a huge corpus is not a practical solution when the deadline is close.

Fortunately, there is a solution to this challenge that lets NLP researchers analyze unlabeled texts: text clustering.

What is Text Clustering?

Text clustering is a method of organizing unlabeled texts into groups, or clusters, which is where the name comes from. It is an unsupervised learning technique that groups texts according to their similarity.

To understand the concept, imagine you have just started a new job as a librarian. On your first day you receive a hundred books and need to decide where each one belongs. You notice that some books are about history, so you put those together. Others are about linguistics, so you put them in another pile. You also find books about math, computers, sports, and so on. By the end of the day the books are in several groups: history, language, math, science, computers. Within each group, the books share the same topic.

However, a book placed in one group can still share some similarity with another group. Some history books resemble religion books because certain topics appear in both. Some linguistics books resemble computer books because of computational linguistics. So the books end up in large clusters based on a main shared feature, and inside each cluster there are smaller groups, or subgroups, that share more specific features.

The librarian's task of grouping books into groups and subgroups according to the similarity of their titles and scope is the same as the NLP researcher's task when applying clustering to text data. Both are clustering based on text similarity.

Text clustering has many applications, such as sorting news into specific categories for news agencies and other interested parties, extracting information from large sources such as books and articles, and grouping reviews, tweets, or Facebook comments.

So clustering means splitting data, whether numeric or textual, into groups according to similarity, and to do that we need rules or methods based on similarity. It is common sense to place similar texts in the same or nearby groups and dissimilar texts in distant groups. The question is: what kind of similarity does text clustering rely on? In other words, what are the different methods of text clustering?
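To make "similarity" concrete, here is a minimal sketch of one common measure, cosine similarity over word counts. This is an assumption chosen for illustration: the measure, the function name `cosine_similarity`, and the three sample reviews are not prescribed by anything above, and other measures work too.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words that appear in both texts.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is not similar to anything
    return dot / (norm_a * norm_b)

# Made-up sample texts (hypothetical data).
review_1 = "great movie great acting"
review_2 = "great acting in this movie"
review_3 = "package arrived late and damaged"

sim_close = cosine_similarity(review_1, review_2)  # texts share several words
sim_far = cosine_similarity(review_1, review_3)    # texts share no words
print(sim_close, sim_far)
```

The first pair scores higher than the second, so a clustering algorithm built on this measure would tend to put the two movie reviews together and the delivery complaint elsewhere. Note that raw word counts ignore meaning, so common words like "the" can inflate the score between unrelated texts.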

Clustering on texts can be applied either to words or to documents.

Word Clustering

When we cluster words by similarity, we mean similarity of meaning, i.e. semantics. This similarity is measured through word embeddings. An embedding represents every word as a vector with many dimensions, and the values of that vector for a word such as ‘play’ are learned from the words that appear near it. Therefore, if our data is a corpus of ten thousand words, an unsupervised algorithm can cluster these words based on their embedding vectors: words whose vectors are close together fall in the same group, while words whose vectors are far apart fall in different groups. Word clustering can be useful in applications such as speech recognition and chatbots: once a word is recognized, knowing which group of words it belongs to helps the system predict the most probable next word.
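The idea of grouping words by their embedding vectors can be sketched with a tiny k-means loop. Everything here is an assumption for illustration: the two-dimensional vectors are invented (real embeddings have many more dimensions and are learned from a corpus), k-means is just one possible unsupervised algorithm, and the helper names `mean` and `kmeans` are hypothetical.

```python
import math

# Hypothetical 2-dimensional "embeddings", invented for this example only.
embeddings = {
    "play":   (0.90, 0.10),
    "game":   (0.80, 0.20),
    "sport":  (0.85, 0.15),
    "verb":   (0.10, 0.90),
    "noun":   (0.20, 0.80),
    "syntax": (0.15, 0.85),
}

def mean(points):
    """Coordinate-wise average of a list of equal-length vectors."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(vectors, centroids, iterations=10):
    """Minimal k-means: assign each word to its nearest centroid, then re-average."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for word, vec in vectors.items():
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(vec, centroids[i]))
            clusters[nearest].append(word)
        # Keep the old centroid if a cluster ended up empty.
        centroids = [
            mean([vectors[w] for w in members]) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return clusters

# Start from two words chosen by hand as initial centroids (an arbitrary choice).
clusters = kmeans(embeddings, centroids=[embeddings["play"], embeddings["verb"]])
print(clusters)
```

With these made-up vectors, the words about games end up in one cluster and the words about grammar in the other, because each word sits much closer to one starting centroid than to the other.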

Text Clustering

So now we know how words can be clustered based on the similarity of their embeddings. But what if I have a task to analyze texts made up of full sentences, such as tweets or product reviews? How can I cluster sentences?

Texts made up of many sentences can be clustered by collecting appropriate features of each text and feeding them to an unsupervised clustering algorithm. Some useful text features are:

TF-IDF values, text length, word length, numbers, connotation, etc.
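As a rough sketch of how such features might be computed, the snippet below builds one feature row per document from TF-IDF values, text length, and average word length. It assumes the plain textbook formula (term frequency times log of inverse document frequency); libraries often use smoothed variants, so their numbers will differ. The sample documents and the helper name `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Made-up sample documents (hypothetical data).
docs = [
    "great movie great acting",
    "boring movie",
    "fast delivery great price",
]

tokenized = [doc.lower().split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Document frequency: the number of documents each word appears in.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_vector(tokens):
    """Textbook TF-IDF: (count / document length) * log(N / document frequency)."""
    counts = Counter(tokens)
    vector = []
    for word in vocabulary:
        tf = counts[word] / len(tokens)
        idf = math.log(len(docs) / doc_freq[word])
        vector.append(tf * idf)
    return vector

features = []
for doc, tokens in zip(docs, tokenized):
    row = tfidf_vector(tokens)
    row.append(len(doc))                                   # text length in characters
    row.append(sum(len(w) for w in tokens) / len(tokens))  # average word length
    features.append(row)

for doc, row in zip(docs, features):
    print(doc, [round(value, 3) for value in row])
```

Each row in `features` can then be passed to a clustering algorithm. In practice the length features would usually be rescaled first, since their values are much larger than the TF-IDF values and would otherwise dominate any distance calculation.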

There are several powerful clustering algorithms. We will discuss them in a future post and apply them to clustering words and documents.

Sources:

1- Hashem Asem, NLP Text Clustering.

Khulood Nasher

Data Scientist, Health Physicist, NLP Researcher, & Arabic Linguist. https://www.linkedin.com/in/khuloodnasher https:/khuloodnasher1.wixsite.com/resume