Sentence Segmentation

Khulood Nasher
Dec 26, 2020


What is Sentence Segmentation?

Sentence segmentation is the analysis of text at the sentence level. In NLP, we analyze text data either by its meaningful words, i.e. tokens, or by its sentences. To split the text into its basic components, i.e. tokens, we can use the spaCy library as follows:
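The original walkthrough shows this step as a notebook screenshot; here is a minimal sketch. The sample text is an illustrative stand-in, chosen so that token 5 is 'high' and token 7 is 'Jewelry' to match the indices discussed below, and is not the exact text from the notebook.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

# Illustrative stand-in text for the original example
doc1 = nlp('Online shopping demand is very high. Jewelry is selling fast; gold is the favorite.')

# Print every token: words, punctuation, and symbols
for token in doc1:
    print(token.text)
```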

To follow along with the Python code, please visit my GitHub.

As we see above, I split my text into tokens (words, punctuation, and symbols) by using '.text'. But if I want to split my text into sentences, I use the '.sents' generator as follows:
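A sketch of the same step on the stand-in doc1:

```python
# Iterate over the .sents generator to get one Span per sentence
for sent in doc1.sents:
    print(sent.text)
```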

As we can see, spaCy recognizes the end of each sentence and splits the text above at the period '.', which acts as the sentence splitter.

To check whether a specific word is the start of a sentence, spaCy provides the attribute 'is_sent_start', which tests whether the token at a given index begins a sentence, as follows:
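A sketch of that check, using the stand-in text where the token at index 7 is 'Jewelry':

```python
# is_sent_start is True for a token that begins a sentence
print(doc1[7].text, doc1[7].is_sent_start)   # Jewelry True
```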

As we saw previously, the token 'jewelry' is at index 7, and when we check whether it is the start of a sentence, spaCy recognizes it and returns True. But what if I choose a different word, for example the word at index 5, which is 'high'? Does spaCy recognize this word as the start of a sentence or not?
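Checking the same attribute on the stand-in token at index 5:

```python
# Token 5 ('high') sits in the middle of the first sentence, so
# is_sent_start is falsy (False in spaCy v3, None in older versions)
print(doc1[5].text, doc1[5].is_sent_start)
```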

Indeed, spaCy recognizes the word 'high' as not being the start of a sentence. But what about indexing the sentences directly: can we say doc1.sents[0]? Let's see it in Python:
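A sketch of the attempted indexing and the error it triggers:

```python
# doc1.sents is a generator, so indexing it raises a TypeError
try:
    print(doc1.sents[0])
except TypeError as err:
    print(err)   # 'generator' object is not subscriptable
```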

As we see above, it raises an error because 'sents' is a generator object, not a list. To solve the problem, we simply write 'list(doc1.sents)[0]', as follows:
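```python
# Wrap the generator in a list first, then index into it
print(list(doc1.sents)[0])
```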

By wrapping the 'sents' generator in a list, I was able to slice my text and retrieve any sentence by its index.

To see the whole text as a list of sentences, it is better to use a list comprehension as follows:
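```python
# A list comprehension turns the generator into a list of sentence texts
sentences = [sent.text for sent in doc1.sents]
print(sentences)
```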

Above, my text was split into sentences by the 'sents' generator, and the sentences became elements of a list thanks to the list comprehension. Now I can retrieve the token indices where any sentence of my text starts and ends through the 'start' and 'end' attributes, as follows:
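A sketch on the stand-in doc1:

```python
# .start and .end are token offsets: start is inclusive, end is exclusive
for sent in doc1.sents:
    print(sent.start, sent.end, sent.text)
```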

As we noticed above, the first sentence starts at index zero with the token 'Online' and ends at index 7, which is the token 'Jewelry'; that token is the start of the next sentence and is not part of the first one. But what about a sentence that ends with a semicolon ';': can spaCy recognize it? In other words, is there a way to customize my sentence analysis by choosing the sentence splitter? Well, this is possible.

Customize Sentence Segmentation

When we process a document in spaCy as an NLP object, the text goes through a processing pipeline. It starts with the tokenizer, the main step that splits the text into tokens, and continues with the tagger, which assigns a tag to each word. These tags define the part of speech (POS) of each word in the text. The next step in the pipeline is the parser, which determines the relationships between the words (dependencies). Then comes the entity recognizer, which identifies named entities such as persons, organizations, countries, etc. This pipeline is the default, but we can step in and customize it to fit our needs. Each step of the pipeline returns the document and can be performed independently.
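As a sketch of such a customization, the component below marks the token after every semicolon as a sentence start, so ';' also acts as a sentence splitter. It uses the spaCy v3 registration API; in spaCy 2.x, which was current when this post was written, the function object would be passed directly to nlp.add_pipe instead.

```python
from spacy.language import Language

# Custom boundary rule: treat ';' as a sentence splitter
@Language.component('semicolon_boundaries')
def semicolon_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i + 1].is_sent_start = True
    return doc

# Insert the component before the parser so the parser respects the new boundaries
nlp.add_pipe('semicolon_boundaries', before='parser')
print(nlp.pipe_names)   # the default pipeline plus 'semicolon_boundaries' before 'parser'

doc2 = nlp('Online shopping demand is very high. Jewelry is selling fast; gold is the favorite.')
for sent in doc2.sents:
    print(sent.text)
```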


Khulood Nasher

Data Scientist, Health Physicist, NLP Researcher, & Arabic Linguist. https://www.linkedin.com/in/khuloodnasher https://khuloodnasher1.wixsite.com/resume