Classification with GloVe
Global Vectors for Word Representation (GloVe) is an unsupervised learning algorithm that creates vector representations of words from co-occurrence statistics gathered over a huge corpus such as Wikipedia. To learn more about GloVe, please see my previous blog post on GloVe.
In this blog, I am going to present how to perform text classification through transfer learning by using a pre-trained model, i.e. GloVe.
I downloaded one of the files that Stanford University developers trained on 6 billion words from here:
http://nlp.stanford.edu/data/glove.6B.zip
There are 4 files, all trained on the same 6 billion words but with different vector dimensions: glove.6B.50d, glove.6B.100d, glove.6B.200d, and glove.6B.300d. Each file covers a vocabulary of about 400,000 words; the 6 billion refers to the number of tokens in the training corpus. Each line holds a word followed by its embedding values, a vector of 50, 100, 200, or 300 dimensions depending on the file.
The GloVe file is read and the embedding values are collected into a dictionary that maps each word to its vector. A number of operations can then be performed on the words of the dataset, such as comparing the meanings of words, searching for the words closest to a specific word, or performing arithmetic on word vectors.
In my project, I downloaded and unzipped the GloVe file from Stanford, then read it and saved only the vectors for words that exist in the vocabulary of my data. The total vocabulary of my dataset was collected into a set object.
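The operations mentioned above can be sketched with a toy dictionary. This is only an illustration: the three-dimensional vectors and the helper names cosine_similarity and closest_words are made up for the example, and real GloVe vectors would come from the downloaded file.

```python
import numpy as np

# Hypothetical toy embeddings; real GloVe vectors have 50 to 300 dimensions.
embeddings = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man": np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
    "apple": np.array([0.2, 0.1, 0.1]),
}


def cosine_similarity(u, v):
    """Compare two word vectors by the cosine of the angle between them."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def closest_words(vector, embeddings, exclude=(), top_n=1):
    """Return the top_n words whose vectors are most similar to `vector`."""
    scores = {
        word: cosine_similarity(vector, vec)
        for word, vec in embeddings.items()
        if word not in exclude
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Arithmetic on word vectors: king - man + woman
analogy = embeddings["king"] - embeddings["man"] + embeddings["woman"]
nearest = closest_words(analogy, embeddings, exclude={"king", "man", "woman"})
print(nearest)
```

Cosine similarity is used here because it ignores vector length and compares direction only, which is a common choice for word embeddings.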
I added every token from every tweet in the data to a set and stored the set in the variable total_vocabulary.
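A minimal sketch of this step, assuming the tweets are already tokenized into lists of words. The sample tweets in data are hypothetical; only the name total_vocabulary comes from the text.

```python
# Hypothetical sample data: each tweet is already a list of tokens.
data = [
    ["glove", "vectors", "are", "useful"],
    ["tweets", "are", "short"],
]

# A set keeps each distinct token once, no matter how often it appears.
total_vocabulary = set(word for tweet in data for word in tweet)

print(len(total_vocabulary))
```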
After collecting the total vocabulary of my dataset, I matched its words with the corresponding GloVe vectors.
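The matching step might look like the sketch below. The function name load_glove_vectors and the three sample lines are assumptions made for illustration; the sample lines stand in for the real file so the example runs without the download. Each line of a GloVe text file holds a word followed by its space-separated values.

```python
import numpy as np


def load_glove_vectors(lines, vocabulary):
    """Build {word: vector} for the words of `vocabulary` found in `lines`."""
    glove = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word = parts[0]
        # Keep only words that occur in our own dataset to save memory.
        if word in vocabulary:
            glove[word] = np.array(parts[1:], dtype=np.float32)
    return glove


# Hypothetical stand-in for the real file, using 3 dimensions instead of 100.
sample_lines = [
    "the 0.1 0.2 0.3\n",
    "cat 0.4 0.5 0.6\n",
    "dog 0.7 0.8 0.9\n",
]
total_vocabulary = {"cat", "dog", "unicorn"}
glove = load_glove_vectors(sample_lines, total_vocabulary)

# With the unzipped download, the same function would be called as:
# with open("glove.6B.100d.txt", encoding="utf-8") as f:
#     glove = load_glove_vectors(f, total_vocabulary)
```

Words that GloVe does not know, such as "unicorn" here, simply do not appear in the resulting dictionary.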
Get GloVe Embeddings
Since GloVe is pre-trained on a huge corpus, it supplies one vector for every word, so a tweet with many words produces many vectors, and tweets of different lengths produce different numbers of embedding values. A classifier, however, needs one fixed-length input per tweet. To solve this, we restrict our use of GloVe to the mean word embedding: each tweet is represented by the average of its word vectors. Here I also limit myself to the 100-dimensional vectors from the 6-billion-word GloVe files.
A vectorizer class is used to compute the mean word embedding of each tweet from its word vectors. It implements fit and transform methods so that the class is compatible with scikit-learn requirements.
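A minimal sketch of such a mean-embedding vectorizer follows. The class name MeanEmbeddingVectorizer and the two-dimensional toy vectors are assumptions for illustration, not the original code. Tokens without a GloVe vector are skipped, and a tweet with no known tokens gets a zero vector.

```python
import numpy as np


class MeanEmbeddingVectorizer:
    """Represent each tokenized document by the mean of its word vectors."""

    def __init__(self, word_vectors):
        self.word_vectors = word_vectors
        # Infer the vector size from any entry; 0 if the dictionary is empty.
        if word_vectors:
            self.dimensions = len(next(iter(word_vectors.values())))
        else:
            self.dimensions = 0

    def fit(self, X, y=None):
        # Nothing to learn; present so the class follows the fit/transform pattern.
        return self

    def transform(self, X):
        rows = []
        for tokens in X:
            vectors = [self.word_vectors[w] for w in tokens if w in self.word_vectors]
            if vectors:
                rows.append(np.mean(vectors, axis=0))
            else:
                rows.append(np.zeros(self.dimensions))
        return np.array(rows)


# Hypothetical toy vectors; with GloVe 100d each row would have 100 values.
glove = {"good": np.array([1.0, 0.0]), "bad": np.array([0.0, 1.0])}
docs = [["good", "bad"], ["good", "unknown"], ["unknown"]]
features = MeanEmbeddingVectorizer(glove).fit(docs).transform(docs)
print(features.shape)
```

Whatever the length of a tweet, the output has one row per tweet and one column per embedding dimension.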
Random Forest Classifier — GloVe Mean Word Embeddings
Finally, the GloVe-vectorized data is ready to be used with sklearn and applied to any algorithm. We split the data into training and testing sets the same way as usual, adjust our inputs to use the GloVe-vectorized data produced above, and then feed the training and testing sets into a random forest algorithm.
Using the pre-trained GloVe model gave me a classification accuracy of 89%, which is a good score, and it could still be improved by trying different GloVe dimensions and different vectorizers.
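A minimal sketch of the split and the random forest step, assuming the mean embeddings are already in a feature matrix. The data here is synthetic and only stands in for the vectorized tweets, so the printed accuracy says nothing about the real dataset. The parameter values (test_size, n_estimators, random_state) are illustrative choices, not the ones used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 200 tweets with 100-dimensional mean embeddings.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 100)),  # class 0
    rng.normal(1.0, 1.0, size=(100, 100)),  # class 1, shifted mean
])
y = np.array([0] * 100 + [1] * 100)

# Hold out 20% of the rows for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, rf.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```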
References:
1- Flatiron School, Learn.co, Module 4, Appendix