Embeddings

Khulood Nasher
3 min read · Jan 24, 2021


We know that NLP is an abbreviation for natural language processing. But NLP is really the sum of NLU and NLG, that is, natural language understanding and natural language generation. In order to understand words, NLP represents each word as a vector in a space, i.e. as numeric data. Because the meaning of a word is captured by embedding it in that space, this representation is called an embedding. Embeddings serve many NLP applications such as sentiment analysis, chatbots, machine translation, IoT, and more. In sentiment analysis, for example, embeddings help the algorithm separate positive expressions such as good, great, help, and support from negative expressions such as hate, bad, unfair, etc.

Embeddings can be produced by many types of vectorization, such as TFIDF, word2vec, doc2vec, or pretrained models. To understand embeddings, I will define some of these vectorizers.

What is TFIDF?

TFIDF is a sparse vectorizer: words are represented in a document-term matrix that holds the count of each word in each document, which leaves the matrix full of zeros, as in the example below.

        doc1  doc2  doc3  doc4
  w1      2     0     0     1
  w3      0     3     0     0

As we can see, word 1 (w1), for example, appears twice in document 1, is not mentioned in document 2 or document 3, and appears once in document 4. Word 3, on the other hand, appears three times in document 2 and is not mentioned in any other document. The numeric representation of words produced by this method is therefore considered sparse because of the many zeros in the matrix. Those zeros are not desirable because they complicate matrix operations such as multiplication.
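To make this concrete, here is a minimal sketch using scikit-learn's CountVectorizer on a tiny made-up corpus (the documents and words are placeholders, not the ones in the table above); the printed matrix is mostly zeros, which is exactly the sparsity described above.

from sklearn.feature_extraction.text import CountVectorizer

# Four tiny made-up documents
docs = [
    "data data science",            # document 1
    "model model model training",   # document 2
    "science of learning",          # document 3
    "data and learning",            # document 4
]

# Build the document-term count matrix
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(counts.toarray())   # most entries are 0: a sparse representation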

TFIDF is considered a baseline model compared to other vectorizers. But predicting the next word, or a word's meaning, from frequency alone is not very helpful, particularly with words that appear many times across the corpus, such as stopwords, or any word that distracts the model from picking up the strongest meanings. To overcome this weakness of raw frequency, the inverse document frequency (IDF) was introduced as a weighting factor for how important a word is across the whole collection of documents: non-informative words such as stopwords are downweighted, which gives more strength to the informative words.
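As a rough illustration of how the inverse document frequency downweights words that occur in every document, the sketch below prints the IDF values learned by scikit-learn's TfidfVectorizer on a made-up three-document corpus; the sentences are invented for illustration only.

from sklearn.feature_extraction.text import TfidfVectorizer

# "the" occurs in every document, so its IDF (and final weight) is the lowest;
# rarer, more informative words such as "crashed" keep higher IDF values.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market crashed",
]

tfidf = TfidfVectorizer()
tfidf.fit(docs)

for word, idf in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(f"{word:>8}  idf = {idf:.2f}")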

However, there was still a need to move beyond the awkward representation of TFIDF to a smarter representation such as word2vec.

What is WORD2VEC?

TFIDF is considered a basic model because it depends on counting how often words occur in the corpus. It has some defects because of its sparsity and high dimensionality: every word is represented as a vector, and word relations are measured with cosine similarity. If the angle between two vectors is small, the vectors are close to each other, which means the words share a similar meaning or a strong relation; if the angle between two vectors is large, the words are far apart in meaning. Word2vec, however, is a more advanced vectorizer because its word representations are created by training a classifier over the corpus to distinguish words that occur in similar contexts from words that do not, which pulls words with close meanings together and pushes words with distant meanings apart.
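To illustrate both points, here is a minimal sketch: cosine similarity computed directly with NumPy for two hypothetical 2-dimensional word vectors, followed by a toy gensim Word2Vec model trained on a few invented sentences (the corpus and hyperparameters are placeholders; a real model needs far more data).

import numpy as np
from gensim.models import Word2Vec

# Cosine similarity: a small angle between vectors gives a value near 1,
# which we read as the two words having similar meanings.
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

v_good = np.array([0.9, 0.1])   # hypothetical vector for "good"
v_great = np.array([0.8, 0.2])  # hypothetical vector for "great"
print(cosine(v_good, v_great))  # close to 1.0

# Word2vec learns dense vectors by training on words and their context windows.
sentences = [
    ["the", "service", "was", "good"],
    ["the", "service", "was", "great"],
    ["the", "movie", "was", "bad"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)
print(model.wv.similarity("good", "great"))  # noisy on a corpus this small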

We will explain word2vec in more detail in the next blog posts.

