Word Vectorization

Khulood Nasher
4 min read · Jan 9, 2021

In this blog, I’m going to discuss an important approach used in NLP for analyzing text, one with many applications: how to vectorize text.

What is Word Vectorization?

Vectorization is a method of representing the words of a text as vectors, that is, as numeric data. It processes textual data (a corpus): the input is words and the output is vectors, i.e. numbers. Vectorization works by grouping mathematically similar words together. Words are considered similar because they share a relation such as meaning, or because they appear in closely related contexts, and this is learned from a huge corpus of text without human intervention. In NLP, word vectorization is also called embedding. The purpose of applying word vectorization is to predict words that are similar in meaning, i.e. semantically related words.
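To make the idea concrete, here is a minimal sketch (assuming the gensim library is installed) that turns a tiny toy corpus into word vectors. The sentences and the vector size are made up for illustration; they are not from the original article.

```python
# Minimal sketch: words in, numeric vectors out, using gensim's Word2Vec.
# The toy corpus and vector_size below are illustrative assumptions.
from gensim.models import Word2Vec

corpus = [
    ["the", "movie", "was", "fun", "and", "entertaining"],
    ["the", "film", "was", "funny", "and", "cool"],
    ["the", "lecture", "was", "long", "and", "boring"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, seed=1)

# Each word is now represented by a numeric vector (a NumPy array of floats).
print(model.wv["fun"].shape)   # (50,)
print(model.wv["fun"][:5])     # first five numbers of the vector
```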

Word2vec

Word2vec is a neural network algorithm in machine learning that learns the similarities between words from a large corpus, such as Wikipedia, based on how each word appears in that text. For example, a word like fun is most likely to be close to words such as happy, nice, funny, entertainment, cool, etc. The algorithm infers the meaning of a word from its context and mathematically calculates the list of words that are similar to it. In other words, Word2vec is an algorithm that computes the semantics of words in text mathematically and suggests the closest semantic neighbors.
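As a sketch of what this looks like in practice, the snippet below queries a publicly available set of pretrained vectors through gensim’s downloader. The specific model name is an assumption for illustration (vectors trained on Wikipedia text), and the exact neighbors and scores will vary by model.

```python
# Sketch: find the nearest neighbors of "fun" in a pretrained model.
# "glove-wiki-gigaword-100" is one publicly available set of vectors
# trained on Wikipedia text -- an illustrative choice, not the article's.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads on first use

# Words whose vectors are mathematically closest to "fun".
for word, score in vectors.most_similar("fun", topn=5):
    print(f"{word}\t{score:.3f}")
```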

Word2vec can be trained in two different ways. The first goes from context to word, while the second is the reverse: from word to context.

CBOW versus skip-gram

Word2vec uses one of two approaches to vectorize words: either the Continuous Bag of Words (CBOW) method or the skip-gram method. CBOW uses the context to predict a target word, while skip-gram uses a word to predict a target context.

The CBOW method takes several surrounding words of the text and, through a neural network, tries to predict the word most likely to appear in the context given by those surrounding words. Skip-gram is the inverse approach. It takes a little longer to train: given a single word, it uses a shallow, autoencoder-like neural network projection and tries to output the weighted probabilities of the other words likely to appear around that input word. So we have two inverse approaches: either the context words are given and the model predicts the target word, or the input word is given and the model tries to predict the surrounding context. CBOW is faster, while skip-gram is slower but does a better job with uncommon words.
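In gensim, the choice between the two approaches is a single training flag. The sketch below is a minimal illustration with a made-up corpus; `sg=0` selects CBOW (the default) and `sg=1` selects skip-gram.

```python
# Sketch: CBOW vs. skip-gram in gensim is controlled by the `sg` flag.
# The toy corpus is illustrative only.
from gensim.models import Word2Vec

corpus = [["king", "rules", "the", "kingdom"],
          ["queen", "rules", "the", "kingdom"]]

cbow_model     = Word2Vec(corpus, sg=0, vector_size=50, min_count=1)  # CBOW (default)
skipgram_model = Word2Vec(corpus, sg=1, vector_size=50, min_count=1)  # skip-gram
```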

Each word is going to be represented by a vector. In spaCy, each of these vectors has 300 dimensions, though embeddings typically range from 100 to 1000 dimensions. The higher the number of dimensions, the longer the training takes, but also the more context can be captured around each word, since more dimensions can hold more information.
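A quick sketch of inspecting these vectors in spaCy, assuming a model that ships with vectors (for example `en_core_web_md`, installed with `python -m spacy download en_core_web_md`):

```python
# Sketch: looking at spaCy's 300-dimensional word vectors.
import spacy

nlp = spacy.load("en_core_web_md")   # a model that includes word vectors

token = nlp("fun")[0]
print(token.vector.shape)                  # (300,) -- 300 dimensions per word
print(token.similarity(nlp("happy")[0]))   # cosine similarity between the two vectors
```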

Cosine Similarity


What this means is that we’ve assigned each word to a vector in a 300-dimensional space. We can use cosine similarity to measure how similar word vectors are to each other. Cosine similarity essentially measures the angle between two vectors, i.e. how close their directions are. This is easy to picture in two-dimensional space, but it extends to n dimensions. If two words were congruent, the angle between their vectors would be zero and the cosine of that angle would be one. The further apart the meanings of two words move, the greater the angle between the vectors representing them and the lower the cosine, until the cosine reaches zero at an angle of 90 degrees; at that point the similarity between the two words is zero, which means they are not similar in meaning.
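The computation itself is just a dot product divided by the vector lengths. Here is a small sketch with NumPy; the toy 3-dimensional vectors are invented for illustration (real word vectors would have around 300 dimensions).

```python
# Sketch: cosine similarity = cos(angle between two vectors)
#        = dot(a, b) / (||a|| * ||b||)
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, invented for illustration.
v_fun   = np.array([0.9, 0.8, 0.1])
v_happy = np.array([0.8, 0.9, 0.2])
v_table = np.array([0.1, 0.0, 0.9])

print(cosine_similarity(v_fun, v_happy))  # close to 1: similar in meaning
print(cosine_similarity(v_fun, v_table))  # closer to 0: unrelated
```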

Now what’s really interesting is that we can perform vector arithmetic with word vectors: we can calculate a new vector by adding or subtracting existing ones. For example, I can take the 300-dimensional vector for king, subtract the vector for man, and then add the vector for woman. This creates a new vector that is not directly associated with any word, so we then look for the existing word vectors most similar to it. Hopefully, after something like king minus man plus woman, the closest word vector is something like queen, which essentially means the model captures royalty along one dimension and moves along another dimension for gender. This enables us to establish really interesting relationships between word vectors, including male versus female, or even verb tense, so we can understand that walking is to walked as swimming is to swam.
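With pretrained vectors, this analogy query is one line in gensim: words added to the new vector go in `positive`, words subtracted go in `negative`. As before, the model name is an illustrative assumption, and the exact word and score returned will depend on the vectors used.

```python
# Sketch: the "king - man + woman" analogy with pretrained vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # illustrative choice of model

result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # typically something like [('queen', 0.78)]
```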


Khulood Nasher

Data Scientist, Health Physicist, NLP Researcher, & Arabic Linguist. https://www.linkedin.com/in/khuloodnasher https://khuloodnasher1.wixsite.com/resume