Named Entity Recognition
What is Named Entity Recognition?
Named Entity Recognition (NER) is one of the most important applications in Natural Language Understanding (NLU). It is part of extracting information from unstructured text data. It focuses mainly on identifying proper nouns and giving each one an entity label. Names of organizations, currencies, countries, and persons, as well as quantities of weight, distances, rankings, dates, times, percentages, languages, religions, etc., can be recognized and labeled with a category through the spaCy library. To get more information about the spaCy library, please see my previous blog. spaCy has an attribute called '.ents' that enables us to extract named entities. To get the Python code for this blog, please visit my GitHub. First we have to import spaCy and create the nlp object as follows:
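A minimal sketch of that setup, assuming spaCy v3. The post does not name the model it loads, so the small English pipeline 'en_core_web_sm' is an assumption here. If that model is not installed, the sketch falls back to a blank English pipeline, which can tokenize text but has no trained entity recognizer.

```python
import spacy

try:
    # Pretrained English pipeline (assumed model name). Install it with:
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Fallback: tokenizer only, no trained NER component.
    nlp = spacy.blank("en")

# List the pipeline components that will run on each text.
print(nlp.pipe_names)
```

If 'ner' appears in the printed component names, the pipeline can predict named entities.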
In order to analyze text in spaCy, it must be processed through the nlp object. It's a good idea to start the text analysis with tokenization as follows:
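Tokenization can be sketched as below. The sample sentence is hypothetical (the original example text is not reproduced here), and a blank English pipeline is used because tokenizing does not require a trained model.

```python
import spacy

# A blank pipeline contains only the English tokenizer.
nlp = spacy.blank("en")

# Hypothetical sample sentence.
my_text = nlp("Amazon opened an office in Los Angeles, California.")

# Each token exposes its raw string through the .text attribute.
tokens = [token.text for token in my_text]
print(tokens)
```

Note that the comma and the final period come out as separate tokens, while 'Los' and 'Angeles' are two tokens until an entity recognizer groups them.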
The '.text' attribute in spaCy lets us see every unit of the text, including words, punctuation, and symbols. However, we might need to search for specific information in the text, such as named entities.
To extract the named entities from the above text and collect their label categories, we can use the '.ents' property of the processed text, where we get each entity's name through the '.text' attribute and its label tag through '.label_' as follows:
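Here is a hedged sketch of that loop, assuming spaCy v3. To keep it runnable without a model download, it swaps the trained recognizer for a rule-based 'entity_ruler' with a few hand-written patterns on a hypothetical sentence. With a trained pipeline, the loop over '.ents' is the same, but the entities come from the model's predictions.

```python
import spacy

# Rule-based stand-in for a trained NER model (assumption for this sketch).
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Khulood Nasher"},
    {"label": "ORG", "pattern": "Amazon"},
    {"label": "GPE", "pattern": "Los Angeles"},
])

# Hypothetical sample sentence.
my_text = nlp("Khulood Nasher visited Amazon in Los Angeles.")

# Collect (entity text, entity label) pairs in document order.
entities = [(ent.text, ent.label_) for ent in my_text.ents]
for text, label in entities:
    print(f"{text} -> {label}")
```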
spaCy is able to recognize 'Khulood Nasher' as a person name ('PERSON'), and 'Los Angeles' and 'California' with the 'GPE' tag, which means geopolitical entity (countries, cities, states). 'Amazon' and 'Idaho State University' get the 'ORG' tag, which is an organization entity. 'November 9th, 2020' and 'annual' are tagged as 'DATE', '5 miles away' as 'QUANTITY', '78,230' as 'MONEY', and 'Arabic' as 'NORP', which refers to the category of nationalities or religious or political groups. '6' is a 'CARDINAL' entity, and 'second' is an 'ORDINAL' entity. I would like to give a description of the common named entity tags in the spaCy library in the following table:
NER Label Tags:
To see the explanation of an entity label, we can use 'spacy.explain' as follows:
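A short sketch of looking up label descriptions. 'spacy.explain' reads from spaCy's built-in glossary, so no model needs to be loaded; the particular labels chosen here are just examples.

```python
import spacy

# Look up the glossary description for a few entity labels.
labels = ["NORP", "LANGUAGE", "GPE", "ORG"]
explanations = {label: spacy.explain(label) for label in labels}

for label, meaning in explanations.items():
    print(f"{label}: {meaning}")
```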
As we see above, a word like 'Arabic' gets the 'NORP' tag, which is explained as nationalities or religious or political groups, while the word 'English' has the 'LANGUAGE' tag, which is explained as 'Any named language'.
Counting the Number of a Specific Entity
To get the number of occurrences of a specific named entity, let's say the number of times the entity 'ORG' occurs in our text, we can apply the 'len' function to a list comprehension, so it will be: len([entit for entit in my_text.ents if entit.label_ == 'ORG']) as follows:
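The counting step can be sketched as follows, assuming spaCy v3. As a stand-in for a trained model, the entities come from a rule-based 'entity_ruler' on a hypothetical sentence, so the counts below describe this sketch, not the original example text. A 'Counter' over all labels is added as an optional extra.

```python
from collections import Counter

import spacy

# Rule-based stand-in for a trained NER model (assumption for this sketch).
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Idaho State University"},
    {"label": "ORG", "pattern": "Amazon"},
    {"label": "GPE", "pattern": "California"},
])

# Hypothetical sample sentence.
my_text = nlp("She studied at Idaho State University, then joined Amazon in California.")

# Count entities with one specific label.
org_count = len([entit for entit in my_text.ents if entit.label_ == "ORG"])
print("ORG entities:", org_count)

# Count every label at once.
label_counts = Counter(entit.label_ for entit in my_text.ents)
print(label_counts)
```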
As we see above, the number of 'ORG' entities in our text is three, which is correct.
Extracting Named Entity through a Function
It is a good idea to wrap named entity extraction with spaCy in a function so that it can be applied to any given text. The '.ents' property is the main component for extracting named entities and, as we did previously, we can collect more information about each entity through '.text', '.label_', and spacy.explain() applied to '.label_'. Our function should read like this: if there are entities in our text, print the entity text, the entity label, and the entity explanation for every entity; otherwise, report that no named entities were found. Therefore our function is simply as follows:
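One way such a function might look is sketched below, assuming spaCy v3. The name 'show_entities' is hypothetical, and returning the rows in addition to printing them is a choice made here for easier reuse. The demo uses a rule-based 'entity_ruler' on hypothetical sentences, so it does not show the context-dependent decisions a trained model makes.

```python
import spacy


def show_entities(doc):
    """Print text, label, and explanation for each entity; return the rows."""
    rows = [(ent.text, ent.label_, spacy.explain(ent.label_)) for ent in doc.ents]
    if rows:
        for text, label, meaning in rows:
            print(f"{text} - {label} - {meaning}")
    else:
        print("No named entities found.")
    return rows


# Rule-based stand-in for a trained NER model (assumption for this sketch).
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Idaho State University"}])

with_entities = show_entities(nlp("She teaches at Idaho State University."))
without_entities = show_entities(nlp("How are you doing today?"))
```

The function only reads 'doc.ents', so it works unchanged with any pipeline that fills in entities.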
In the above example, “May I borrow Harry Potter book in May?”, the function was able to recognize the second “May” as a named entity with the 'DATE' tag and did not confuse it with the first “May”, which is not a named entity because it is a modal verb. The function also extracts the entity 'Harry Potter' and labels it as 'PERSON'. The second text, “How are you doing today?”, does not have any named entity; therefore, as we defined in our function above, it returns 'no named entities found'.
The items in '.ents' are actually token spans that have their own attributes, such as:
'.text': The text of the entity
'.label_': The entity tag
'.start': The index of the entity's first token in the text
'.end': The index of the token just after the entity's last token (exclusive)
'.start_char': The character offset where the entity starts in the text
'.end_char': The character offset just after the entity's last character (exclusive)
We can see the above entity attributes on our example text as follows:
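A small sketch of reading these span attributes. The sentence is hypothetical, and the entity is attached by hand with a 'Span' to stand in for a trained model's prediction, so the offsets printed here belong to this sentence only.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Hypothetical sentence; tokens: My(0) name(1) is(2) Khulood(3) Nasher(4) .(5)
my_text = nlp("My name is Khulood Nasher.")

# Attach the entity by hand: tokens 3 and 4, with an exclusive end index of 5.
my_text.ents = [Span(my_text, 3, 5, label="PERSON")]

for ent in my_text.ents:
    attributes = (ent.text, ent.label_, ent.start, ent.end, ent.start_char, ent.end_char)
    print(attributes)

# The character offsets slice the original string back to the entity text.
print(my_text.text[ent.start_char:ent.end_char])
```

Because both end values are exclusive, 'ent.end - ent.start' gives the number of tokens in the entity and slicing with the character offsets returns exactly the entity text.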
As we can see above, 'Khulood Nasher' is the entity text and its label tag is 'PERSON'. It is a span whose starting token index is 3 and whose ending token index is 5. Its starting character index is 9, which is the position of the letter 'K', and its ending character index is 25.
Finally, we can summarize spaCy's job in NER with two words, 'extracting/labeling' proper nouns, and we can likewise summarize spaCy's task in POS with two words, 'extracting grammar' of the word. In the next blog, I'm going to present how to visualize NER and POS.