Text Clustering by KMeans
Clustering is another word for grouping. In a previous blog I introduced the concept of text clustering and showed how to arrange words or texts into groups based on their similarity. When I’m clustering vocabulary, this similarity can be determined from the embedding values; when I’m clustering texts, it can be determined from other text features such as TF-IDF vectors. To read more, please visit my blog on Text Clustering.
Today, I’m going to go deeper into text clustering by introducing one of the important unsupervised machine learning algorithms, K-Means, and applying it to text clustering.
What is K-Means?
K-Means is an unsupervised machine learning algorithm that deals with unlabeled (unclassified) data. It arranges the data into groups that look like clusters around a center, where each center is the mean of the observations assigned to it. The number of clusters is represented by ‘k’, which is also the number of centers around which all the data will be distributed. As we can see below, we start with a mix of data and try to divide it into groups by finding centers and assigning each observation to the center it is most similar to. We end up with a number of clusters in which the observations in the same group share common features and are less similar to the observations in other groups.
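To make the assign-and-average idea concrete, here is a minimal sketch of K-Means in numpy. It is not how sklearn implements it. The toy points, the function name `kmeans`, and the choice to start from the first k observations as centers are my own assumptions for illustration.

```python
import numpy as np

def kmeans(points, k, n_iters=100):
    """Minimal K-Means sketch on an (n, d) array of observations."""
    # Assumption: use the first k observations as the starting centers.
    centers = points[:k].copy()
    for _ in range(n_iters):
        # Assignment step: each observation joins its nearest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center moves to the mean of its observations.
        # A center that received no observations stays where it is.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Invented toy data: two visibly separate groups of 2-D points.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Each pass repeats the same two steps, assign then average, until the centers stop moving. Real implementations usually pick better starting centers and try several starts.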
Now let’s walk through text clustering with K-Means in Python. I will use the following text, which I copied in parts from different websites and intentionally picked from two different topics. I want to see how K-Means clusters my text:
As we see above, we needed to import the numpy library because the text will be vectorized as a matrix. We also imported the regex library ‘re’ to do the cleaning. When text is tokenized, it is divided into its main components, and the ‘split’ method is a simple way to do that. Above, however, I split the text by lines rather than by whitespace, which is the default separator when none is given. I also slice the result from 1 to -1 so I can skip the lines that hold the quotation marks.
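The line-splitting step described above might look like the following sketch. The sentences are placeholders I invented for illustration, not the text from the original post, and the variable names `text` and `corpus` are assumptions.

```python
# Placeholder text: invented sentences on two topics (Amazon and the NFL).
text = """
Amazon is an online retailer that ships packages to customers.
Amazon Prime members get fast shipping on online orders.
The NFL is a professional football league with teams.
NFL teams play football games every season.
"""

# Split on newlines instead of the default whitespace, so each line
# stays together as one document.
lines = text.split("\n")

# The lines holding the triple quotes produce an empty first and last
# entry, so slicing from 1 to -1 drops them.
corpus = lines[1:-1]

for line in corpus:
    print(line)
```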
My text will look like the following:
Now, to clean my corpus, I will lowercase all the words and then apply a regular expression as follows:
After cleaning, my corpus is ready to be vectorized by the TF-IDF vectorizer and then clustered by K-Means through sklearn as follows:
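A minimal sketch of such a cleaning step is shown here. The exact pattern is my assumption (keep letters, replace everything else with a space), and the two sample lines and the helper name `clean` are invented for illustration.

```python
import re

# Invented sample lines standing in for the real corpus.
corpus = [
    "Amazon Prime members get FAST shipping on 2 orders!",
    "The NFL has 32 teams.",
]

def clean(line):
    # Lowercase first so the pattern only has to match a-z.
    line = line.lower()
    # Assumption: replace anything that is not a letter or whitespace
    # (digits, punctuation) with a space.
    line = re.sub(r"[^a-z\s]", " ", line)
    # Collapse the runs of whitespace left behind and trim the ends.
    return re.sub(r"\s+", " ", line).strip()

cleaned = [clean(line) for line in corpus]
print(cleaned)
```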
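One way this step could look with sklearn is sketched below. The corpus is an invented placeholder, and `stop_words="english"`, `n_init=10` and `random_state=0` are my own choices to keep the example small and repeatable, not settings taken from the original post.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus, already cleaned; invented for illustration.
corpus = [
    "amazon is an online retailer that ships packages to customers",
    "amazon prime members get fast shipping on online orders",
    "the nfl is a professional football league with teams",
    "nfl teams play football games every season",
]

# Turn each line into one TF-IDF weighted row of a sparse matrix.
# Assumption: drop common English words such as "the" and "is".
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

# k = 2 because the corpus was drawn from two topics.
# n_init and random_state are set only to make the run repeatable.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X)

# One cluster index per line of the corpus.
labels = model.labels_
print(labels)
```

The cluster index itself is arbitrary: K-Means only promises that similar lines share an index, not which topic gets 0 and which gets 1.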
I defined k=2, which is the number of clusters. I wanted to test whether the K-Means algorithm could assign every line to the group it belongs to. As I mentioned at the beginning, I quoted my text from two different topics. The first topic was about Amazon and got cluster index zero, and the second topic was about the NFL and got cluster index one.
I tried different lines and K-Means was able to put each line into its right cluster. As we see above, the two lines were related to the ‘NFL’, which is the second cluster in my text. K-Means predicts [1, 1] for my lines, which means that both the first line and the second line belong to the second cluster.
We can try different parts of our corpus to check whether K-Means assigns every line to its right cluster. In this small test, K-Means clustered the texts accurately.
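Here is a hedged sketch of that kind of check, this time wrapping both steps in an sklearn pipeline so new lines go through the same vectorizer before prediction. The corpus and the two test lines are invented placeholders, and the settings are my assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Placeholder corpus, already cleaned; invented for illustration.
corpus = [
    "amazon is an online retailer that ships packages to customers",
    "amazon prime members get fast shipping on online orders",
    "the nfl is a professional football league with teams",
    "nfl teams play football games every season",
]

# Vectorize and cluster in one object, so predict() reuses the fitted
# vocabulary. The parameter values here are assumptions.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
pipeline.fit(corpus)

# Two new lines, both about the NFL.
new_lines = [
    "football teams in the nfl play games",
    "the nfl season has football games",
]
predicted = pipeline.predict(new_lines)
print(predicted)
```

Both new lines should land in whichever cluster holds the NFL lines of the corpus. Whether that cluster is printed as 0 or 1 can differ from run to run, so compare against the index the training lines received instead of expecting a fixed number.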
Sources:
1- Hashem Asem, NLP Text Clustering.