spaCy Library in Python
spaCy is a free, open-source library that is widely used in natural language processing (NLP). It supports 61 languages, but the Arabic language is not yet supported. It offers many important features that assist natural language understanding (NLU), such as tokenization, vectorization, information extraction, part-of-speech tagging, named entity recognition, sentence segmentation, processing texts with statistical models and deep learning, and more. To get the Python code of this blog, please visit my Github.
Importing spaCy
Importing the library and loading the English language model can be done as shown below. We should make sure to create the nlp object.
In order to process text with spaCy, a Doc object must first be created by calling the nlp object on the text as follows:
Tokenization
First, I have to tokenize my text data. Tokenization is the process of splitting text into a list of meaningful parts, mainly words or sentences. In this example, we will first split our text into words using the '.text' attribute inside a list comprehension as follows:
I can also tokenize my text by iterating over the range of the text's length and appending each token to an empty list as follows:
In both ways, I was able to split the text into a list of words, ready for further NLP processing.
Attributes of the Token Class:
The token class in spaCy has many useful attributes such as:
token.idx: shows the starting character index of the token.
token.is_alpha: shows whether the token consists of alphabetic characters.
token.is_punct: shows whether the token is punctuation.
token.is_space: shows whether the token is whitespace.
token.shape_: shows the orthographic shape of the word (e.g. 'Xxxxx' for a capitalized word).
token.is_stop: shows whether the token is a stop word.
Let's apply all the above attributes to our text example as follows:
To analyze the token ‘Library’ in our text example:
'Library' has a starting index of 41; it is alphabetic, it is not punctuation, and it is not a space. Its shape is that of a capitalized word ('Xxxxx'), and it is not a stop word.
Extracting sentences
Sentence extraction is an important step in analyzing texts in NLP: once we have individual sentences, we can examine the parts of speech or extract the entities within each one. The spaCy library enables us to extract the sentences of a text through the 'sents' attribute as follows:
As we can see above, the 'sents' attribute helps us extract all the sentences of our text, detecting boundaries such as the period '.' that ends each sentence.
Now, it is time to start thinking about cleaning our text. One of the important steps is removing stop words.
spaCy Stop words
spaCy is able to recognize the stop words of the English language. As we know, stop words carry little meaning in a sentence, and it is better to remove them as part of the cleaning task, so that NLP analysis can focus on the most meaningful words of the text. We can find the number of stop words in the English language, and view ten of them, as follows:
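A sketch of inspecting spaCy's English stop-word list. Note that STOP_WORDS is an unordered set, so "the first 10" is just an arbitrary sample, and the exact count may vary between spaCy versions:

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Total number of English stop words shipped with spaCy
print(len(STOP_WORDS))

# An arbitrary sample of 10 stop words (the set has no fixed order)
print(list(STOP_WORDS)[:10])
```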
Words such as 'whoever', 'per', 'also', 'that', and 'the' are examples of stop words that carry little meaning, and it is better to remove them so that we can focus on the important words when analyzing texts.
Cleaning from stop words
We can remove stop words in different ways. One of the basic methods is a 'for' loop that appends every word that is not a stop word, as follows:
As we can see, the cleaned text above includes only the important words, such as 'learn', 'Spacy', 'today', etc.
Part of Speech (POS):
Knowing whether a word is a verb, a noun, an adjective, an adverb, or any other part of speech is very important in analyzing textual data. spaCy can determine the POS of each word in the text. spaCy can do even more by providing a fine-grained tag for each word: whether a noun is singular or plural, whether a verb is in the present or past tense, or whether an adjective is in its base, comparative, or superlative form. The POS of a token is found through 'token.pos_', and its tag through 'token.tag_'. spaCy can also explain a tag through 'spacy.explain(tag)'.
Let’s apply it on our example text:
In the above cell, we can notice what a great job spaCy does in analyzing the tokens grammatically. If we look at how spaCy analyzes the word 'learn', for example, we can see that the part of speech is VERB, the tag is 'VB', and spaCy's explanation of the tag is 'verb, base form'. The POS of the token 'mainly' is ADV, its tag is 'RB', and spaCy explains the tag as an adverb.
Extracting Noun Phrases
spaCy can detect noun phrases in the text through noun chunks. Noun phrases are phrases that have a noun as their main component, possibly accompanied by an adjective, a determiner, etc. Let's see this on our text example as follows:
As we can see above, all the noun phrases were detected, such as 'We', which is a pronoun, 'Spacy', which is a proper noun, and 'a great Library', which includes the determiner 'a', the adjective 'great', and the head noun 'Library'.
Lemmatization in spaCy
A lemma is the root form of a word that is itself a meaningful word. This is in contrast to stemming, which also produces a root of the word, but the root is sometimes not meaningful. spaCy is able to find the lemma through the '.lemma_' attribute. To collect the lemmas of our text example:
As we see above, spaCy detects the lemma of every word: the lemma of the word 'used' is 'use', and the lemma of 'is' is 'be'.
Finally, we have learned that spaCy is a great library for tasks such as finding the part of speech, and we will provide more of its applications in future blogs.