Bag of Words
In this blog, I will provide a simple explanation of the concept of a bag of words. To find the Python code, please visit my GitHub.
I will use a quote that I find full of strong-willed words. It comes from Steve Jobs, the co-founder of Apple Inc., in his 2005 Stanford commencement address. To watch the whole speech, please visit Steve Jobs.
I will convert my quote above into a bag of words. But before I do that, I have to answer the main question:
What does a bag of words mean?
A bag of words is essentially a table of word counts. It is a vectorization technique that represents text numerically by how often each word occurs in the corpus, and it converts our text data to a document-term matrix. Knowing how many times each word is mentioned in our corpus gives us a first look at what the text is about, and it gives us a numeric form that machine learning tools can analyze.
But before we go ahead and vectorize our quote as a bag of words, it is better to do some cleaning, so that the output is more meaningful and focuses on the informative words of our corpus.
First, I have to tokenize my text data. Tokenization is the process of splitting text into a list of words or sentences. In this example, we will split our quote into sentences using the sentence tokenizer (sent_tokenize) from the nltk library:
After tokenizing, we can see that our quote has been split into 6 sentences.
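As a rough sketch of this step, the snippet below splits a passage into sentences with a plain regular expression instead of nltk, so it runs without downloading any tokenizer data. The splitting rule is my own simplification and not how sent_tokenize works internally, `split_sentences` is a hypothetical helper, and the `quote` string is an assumed transcription of the passage from the speech. With nltk installed, `sent_tokenize(quote)` from `nltk.tokenize` would take the place of the helper.

```python
import re

# Assumed transcription of the passage from the 2005 Stanford address.
quote = (
    "Your time is limited, so don't waste it living someone else's life. "
    "Don't be trapped by dogma - which is living with the results of other "
    "people's thinking. Don't let the noise of others' opinions drown out "
    "your own inner voice. And most important, have the courage to follow "
    "your heart and intuition. They somehow already know what you truly "
    "want to become. Everything else is secondary."
)


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (a naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


sentences = split_sentences(quote)
print(len(sentences))  # number of sentences found
print(sentences[-1])   # the last sentence
```

A rule this naive would break on abbreviations such as "Dr." or "e.g.", which is one reason to prefer a trained tokenizer on real text.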
Now we have to clean the text, and we do it with the regular expression library re.
The cleaning part will include:
1. lower-casing all words,
2. splitting the sentences into lists of words,
3. removing the stop words, which are words that appear a lot in our text but do not carry much information, such as have, is, you, does, etc.,
4. removing punctuation such as commas and periods,
5. stemming or lemmatizing the words.
For this purpose, we will import:
· the regular expression library re,
· the stop words (stopwords from nltk.corpus),
· PorterStemmer from nltk.stem.porter for stemming,
· WordNetLemmatizer from nltk.stem for lemmatization.
I will create a list called corpus to hold the cleaned text. The cleaning is done by iterating over the sentences, six of them here. In every iteration, re.sub (sub is a function of the re module) replaces every character that is not a letter from a to z, capital or small, with a space, so that only the letters are kept. Then the words are lower-cased using the .lower() method, and each sentence is split into a list of words.
After that, I will remove the stop words and stem the remaining words. I do this in a list comprehension that keeps every word that is not a stop word and applies stemming to it. It reads roughly as: for each word in the sentence, if the word is not in the set of English stop words, stopwords.words('english'), return ps.stem(word), where ps is the PorterStemmer object, as follows:
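The cleaning loop described above can be sketched as follows. To keep the snippet runnable without nltk data, it uses a small hand-picked stop-word set and a pluggable `normalize` function; both are my own assumptions for illustration, as is the choice to join the kept words back into one string per sentence. With nltk you would pass `set(stopwords.words('english'))` as the stop words and `PorterStemmer().stem` as `normalize`.

```python
import re

# Small hand-picked stop-word set for illustration only (an assumption);
# nltk's stopwords.words('english') would supply the full list.
STOP_WORDS = {"your", "is", "so", "don", "t", "it", "s", "be", "by", "the", "of"}


def clean_sentences(sentences, stop_words, normalize=lambda word: word):
    """Return one cleaned string per input sentence."""
    corpus = []
    for sentence in sentences:
        letters_only = re.sub("[^a-zA-Z]", " ", sentence)  # non-letters -> space
        words = letters_only.lower().split()               # lower case, then split
        kept = [normalize(word) for word in words if word not in stop_words]
        corpus.append(" ".join(kept))
    return corpus


sentences = [
    "Your time is limited, so don't waste it living someone else's life.",
    "Everything else is secondary.",
]
corpus = clean_sentences(sentences, STOP_WORDS)
print(corpus)
# ['time limited waste living someone else life', 'everything else secondary']
```

Note that replacing the apostrophe with a space leaves fragments such as "don", "t" and "s" behind, which is why they appear in the stop-word set.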
The output of my cleaned text is the following:
Let’s compare the original text to the cleaned stemmed text in the first sentence:
As we can see, because of stemming, the cleaned text now contains words that are not meaningful, such as “els”, “wast”, “someon”, “peopl”, “voic”, and “secondari”. For this reason, I will use another NLP technique called lemmatization. Both stemming and lemmatization reduce a word to a base form. Stemming, however, simply chops off the end of the word by rule, so the result is often not a real word. Lemmatization instead uses a vocabulary and the morphology of the word to return its lemma, the dictionary form, which is a meaningful word. For example, stemming returned “secondari” for the word secondary, which is not meaningful, while lemmatization does not deform the word and returns “secondary”.
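To make the contrast concrete, here is a deliberately tiny sketch. `toy_stem` strips a few suffixes by rule and `toy_lemmatize` looks words up in a small table. Both are hypothetical stand-ins that I made up for illustration; they are not the Porter algorithm or WordNet. In nltk the corresponding calls are `PorterStemmer().stem(word)` and `WordNetLemmatizer().lemmatize(word)`.

```python
def toy_stem(word):
    """Chop a known suffix off by rule, without checking the result is a word."""
    for suffix, replacement in (("ed", ""), ("y", "i"), ("e", "")):
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= 3:
            return word[:stem_length] + replacement
    return word


# A lemmatizer consults a vocabulary; this three-entry table is illustrative.
LEMMAS = {"voices": "voice", "opinions": "opinion", "lives": "life"}


def toy_lemmatize(word):
    """Return the dictionary form if it is known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)


for word in ("secondary", "waste", "voices"):
    print(word, toy_stem(word), toy_lemmatize(word))
```

The rule-based function happily produces "secondari" and "wast", while the lookup-based one can only ever return words that exist in its table or the input itself.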
For this reason, I will replace the stemming call ps.stem(word) in my list comprehension with the lemmatization call wordnet.lemmatize(word), as follows:
The cleaned lemmatized text will be as follows:
Here we can see that the stemmed words above have been replaced by meaningful words. For example, where stemming gave ‘limit’, the lemmatized text has ‘limited’; ‘wast’ is now ‘waste’, ‘live’ is now ‘living’, and ‘secondari’ is now ‘secondary’. Words such as ‘limited’ and ‘living’ are kept as they are because the lemmatizer treats every word as a noun unless it is told otherwise. Every word in our quote now appears in a meaningful form.
Finally, we will create our bag of words from the above quote. I will import the CountVectorizer class from sklearn.feature_extraction.text, which is responsible for creating a bag of words, and create a CountVectorizer object. Then I will fit and transform (fit_transform) our cleaned, lemmatized text with it, which converts our quote to a matrix of zeros and positive integers. The integers in the array show the count of every unique word in each sentence.
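The idea behind this step can be sketched in plain Python. The hypothetical `bag_of_words` helper below builds an alphabetically sorted vocabulary and counts each word per sentence. For an already cleaned corpus like this one it should give the same kind of matrix as `CountVectorizer().fit_transform(corpus).toarray()`, but it is only an illustration and not scikit-learn's implementation. The two input strings are assumed cleaned versions of the first and last sentences of the quote.

```python
from collections import Counter


def bag_of_words(corpus):
    """Return (vocabulary, matrix) with one row of counts per document."""
    tokenized = [document.split() for document in corpus]
    vocabulary = sorted({word for words in tokenized for word in words})
    matrix = []
    for words in tokenized:
        counts = Counter(words)  # a missing word counts as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


corpus = [
    "time limited waste living someone else life",
    "everything else secondary",
]
vocabulary, matrix = bag_of_words(corpus)
print(vocabulary)
for row in matrix:
    print(row)
# ['else', 'everything', 'life', 'limited', 'living', 'secondary', 'someone', 'time', 'waste']
# [1, 0, 1, 1, 1, 0, 1, 1, 1]
# [1, 1, 0, 0, 0, 1, 0, 0, 0]
```

Each column corresponds to one vocabulary word, so the word order inside a sentence is lost; only the counts survive, which is where the "bag" in the name comes from.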
So, the bag of words for our cleaned, lemmatized text is:
As we can see, we still have six sentences, each now vectorized as one row. Each column stands for one unique word in the corpus, and because no word is repeated within a sentence here, every word that is present is counted as ‘1’. If we look at sentence #6, for example, it has three unique words: ‘everything’, ‘else’, and ‘secondary’. These three words get ‘1’ in that row, and all other words, which do not exist in sentence #6 but exist in other sentences, get ‘0’.
If a word occurs more than once in a sentence, its number of occurrences in that sentence is recorded. Let’s assume that the above quote is changed so that some words are repeated several times, as follows:
After cleaning and lemmatizing the above text, we can see that the word ‘limited’ occurs 5 times and the word ‘time’ occurs 3 times in the first sentence, ‘noise’ occurs 2 times in the third sentence, and ‘else’ occurs 2 times in sentence #5 and 1 time in sentence #6. If we match these against the output array, we can recognize the bag of words: the count of each word in each sentence is represented as a document-term matrix.
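Here is a short sketch of how repeated words show up as counts above 1. The sentence is made up for illustration (it is not the modified quote used above), and `Counter` from the standard library stands in for the vectorizer.

```python
from collections import Counter

# Hypothetical cleaned sentence with repeated words.
sentence = "time limited limited time limited else"

counts = Counter(sentence.split())
vocabulary = sorted(counts)                  # alphabetical column order
row = [counts[word] for word in vocabulary]  # one row of the matrix
print(vocabulary)  # ['else', 'limited', 'time']
print(row)         # [1, 3, 2]
```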
Bag of words is one of the NLP vectorizing tools, and I will talk about it again when I explain the Gensim library in the near future.