Representing Texts Numerically
To train a neural network on textual data, the words of the text must first be preprocessed into numeric data. In this blog, I will explain how words can be represented numerically.
Let’s assume I have the following sentence:
My name is Sara.
Each word in this sentence must be converted to a number before it can be fed to a neural network. For example, each word in the sentence above can be assigned a unique number as follows.
I can do that through the Tokenizer class from the keras.preprocessing.text module in Python.
Tokenization is the process of splitting text into its basic elements, such as characters, words, punctuation marks, or sentences. These components are called tokens.
To see it in python:
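The training sentences used in the original screenshots are not reproduced here, so the snippet below uses a small made-up corpus that starts with the example sentence above; the exact index numbers depend on the corpus you fit on.

```python
from keras.preprocessing.text import Tokenizer

# Hypothetical training corpus (the blog's original sentences are not shown here)
sentences = [
    'My name is Sara.',
    'My hobby is reading.',
    'My favourite food is rice and chicken.'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)

print(tokenizer.word_index)
# e.g. {'my': 1, 'is': 2, 'name': 3, 'sara': 4, 'hobby': 5, 'reading': 6, ...}
```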
The tokenizer splits the text into a list of tokens; by default the splitter is the space character. The tokens are then fitted by taking each word from the sentences and giving it a unique index number. The word_index attribute of the Tokenizer class returns a word-to-index dictionary where the words are the keys and their unique index numbers are the values.
The num_words parameter sets the maximum number of words to keep from our text corpus; we can give it any generous maximum in case we want to add extra words later. By default, punctuation is removed and the tokens are lower-cased, so capitalized and lower-case forms of a word are treated as the same token here.
If we want to vectorize sentences, we can use the texts_to_sequences method as follows:
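A sketch of this step, continuing from the tokenizer fitted above:

```python
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
# e.g. [[1, 3, 2, 4], [1, 5, 2, 6], [1, 7, 8, 2, 9, 10, 11]]
```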
The texts_to_sequences method vectorizes the sentences, and each token (word) keeps its unique index number.
Now let’s test our tokenizer to see how it handles new text:
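The test sentence below is made up for illustration; any sentence that mixes known and unknown words shows the same effect.

```python
test_sentences = ['My friend name is Ahmed']
print(tokenizer.texts_to_sequences(test_sentences))
# e.g. [[1, 3, 2]] -- only 'my', 'name', and 'is' are recognized
```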
As we can see here, the only words that were recognized are the words from the training data: ‘my’ with index number 1, ‘is’ with index number 2, and ‘name’ with index number 3. The words that were not recognized get no placeholder at all. To avoid losing any of the sentence’s meaning, we can give all the missing words a number; this number is unique but shared by all words that are not in the training data. For this purpose, we can use a parameter called ‘oov_token’. If ‘oov_token’ is set in the tokenizer, it assigns an index number to every word that is out of the vocabulary of the training texts, as follows:
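A sketch of the same step with oov_token set. Note that Keras reserves index 1 for the out-of-vocabulary token, so the other indices shift up by one.

```python
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)

print(tokenizer.word_index)
# e.g. {'<OOV>': 1, 'my': 2, 'is': 3, 'name': 4, 'sara': 5, ...}

print(tokenizer.texts_to_sequences(['My friend name is Ahmed']))
# e.g. [[2, 1, 4, 3, 1]] -- the unknown words 'friend' and 'ahmed' map to the <OOV> index 1
```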
We can now notice that adding the oov_token assigns the numeric index 1 to any new word in the testing data.
Next, we want to make sure that all sentences have the same length. Sentences in the input text can have varying lengths: we might have sentences with three words, others with two or four. However, an LSTM (an algorithm for training on the processed texts) requires sentences of the same length, so the tokenized sentences must be converted into fixed-length vectors. To do this, we use padding, which takes the maximum length of the sentences and normalizes all of them to it, as follows:
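A minimal sketch using pad_sequences from keras.preprocessing.sequence, continuing with the tokenizer fitted above:

```python
from keras.preprocessing.sequence import pad_sequences

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences)   # default: pre-padding up to the longest sentence
print(padded)
# e.g.
# [[ 0  0  0  2  4  3  5]
#  [ 0  0  0  2  6  3  7]
#  [ 2  8  9  3 10 11 12]]
```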
We notice here that the shorter sentences got zeros as placeholders for the missing words, as we can see in the first and second sentences, so that all sentences have the same length. This is called pre-padding because the zeros are added at the beginning. We can instead add the zeros at the end, which is called post-padding, by setting the parameter padding=’post’; this gives any sentence shorter than the maximum length zeros at the end so it ends up with the same length, as follows:
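Continuing the sketch with post-padding:

```python
padded_post = pad_sequences(sequences, padding='post')
print(padded_post)
# e.g.
# [[ 2  4  3  5  0  0  0]
#  [ 2  6  3  7  0  0  0]
#  [ 2  8  9  3 10 11 12]]
```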
This is very important because when we feed these sentences to a neural network they must all have the same length. The default padding is ‘pre’, which means the zeros are added at the beginning.
Another important thing is setting the maximum length (maxlen) of all sentences. Let’s say we don’t know the maximum length of our sentences; we can simply assume a large maximum, but we should make sure this assumption is longer than any of our sentences so we do not lose any words. For example, with maxlen=9, zeros are added to any sentence that has fewer than 9 words, and all sentences end up with the same length, as follows:
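A sketch of the same call with an assumed maxlen of 9:

```python
padded = pad_sequences(sequences, maxlen=9)
print(padded)
# e.g.
# [[ 0  0  0  0  0  2  4  3  5]
#  [ 0  0  0  0  0  2  6  3  7]
#  [ 0  0  2  8  9  3 10 11 12]]
```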
As we can see above, by setting the maximum length of the sentences to 9, zeros are added to any sentence that has fewer than 9 words.
And if you want to know the index number of a specific word, you can simply look it up in the word_index dictionary as follows:
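For example, with the tokenizers sketched above:

```python
print(tokenizer.word_index['name'])
# e.g. 4 with the <OOV> tokenizer (3 with the tokenizer fitted without oov_token)
```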
Finally, I will give a brief explanation of the embedding layer, whose job is to arrange words according to their meaning.
Embedding Layer:
The embedding layer takes each of these words and turns it into a vector of n dimensions. In doing so, the embedding layer adds hidden semantics to the words.
As we can see here, this thick cloud is full of dots. Every dot represents a word, and every word is a vector with n dimensions. Let’s assume one of the words is ‘happy’. The embedding layer arranges similar words and puts them close to each other: it clusters words according to their relationships, so words with similar meanings and word derivatives are placed close together, while words with opposite meanings and opposite derivatives end up on the opposite side. For example, ‘happy’ can be on the right with ‘fun’ close to it, while on the opposite side we can see words such as ‘sad’ or ‘boring’.
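The next blog covers this in detail, but as a rough sketch of how such a layer can be created in Keras, continuing from the padded sequences above (the vocabulary size of 100 and vector size of 8 are arbitrary illustrative choices):

```python
from keras.layers import Embedding

# Map each of up to 100 word indices to an 8-dimensional vector
# (the vectors start random and are learned during training)
embedding = Embedding(input_dim=100, output_dim=8)

# Passing the padded sequences through the layer turns each sentence of
# 9 word indices into a 9 x 8 matrix of word vectors
vectors = embedding(padded)
print(vectors.shape)   # e.g. (3, 9, 8)
```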
Word embeddings can be trained on your own data with methods such as Word2Vec or Doc2Vec, or pre-trained embeddings such as GloVe can be used.
More details about word embedding will be introduced in the next blog.
Resources:
1. https://www.tensorflow.org/tutorials/text/word_embeddings
2. https://www.coursera.org/learn/classification-vector-spaces-in-nlp#syllabus