Searching Arabic Texts with Python Regular Expressions
In the previous two posts, REGEX-PART-ONE and REGEX-PART-TWO, I explained Regular Expressions in Python with some examples. In this post, I will apply Regular Expressions to Arabic texts. The Python code is available on my GitHub.
I will start by searching for a specific word in a text string. The word I will search for is ‘كروم’, which means chrome. We will see that REGEX in Python supports the Arabic language.
As we can see, Regex finds the Arabic word ‘كروم’ at positions 24 to 28 of my text string. The end position is exclusive, so the match covers the four characters of the word.
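The search described above can be sketched as follows. The sample sentence here is a hypothetical stand-in (roughly "I downloaded the Chrome browser from Google"), not the text used in the original post, so the reported positions will differ from 24 to 28.

```python
import re

# Hypothetical sample sentence, assumed for illustration only:
# "I downloaded the Chrome browser from Google"
text = "قمت بتحميل متصفح كروم من جوجل"

# re.search returns a match object for the first occurrence, or None
match = re.search("كروم", text)

if match:
    # span() returns (start, end) with an exclusive end index
    print(match.group(), match.span())
```

Slicing the string with `text[match.start():match.end()]` gives back the matched word, which is a quick way to check what the reported positions refer to.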
Extracting Numeric Data
Numeric data can be matched with the identifier \d. If we want to collect all the numeric data in my text, we can simply apply the ‘findall’ method with ‘\d+’, which means one or more digits.
Even though the numbers were written in Arabic, Regex was able to recognize all the numbers in the text above. This works because, for string patterns in Python 3, \d matches any Unicode decimal digit, not only 0 to 9.
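Here is a minimal sketch of that step. The sample sentence is hypothetical (roughly "The company was founded in 1998 and has 25 branches") and uses Arabic-Indic digits.

```python
import re

# Hypothetical sample sentence with Arabic-Indic digits, assumed for illustration
text = "تأسست الشركة عام ١٩٩٨ ولديها ٢٥ فرعا"

# \d+ matches each run of one or more consecutive digits
numbers = re.findall(r"\d+", text)
print(numbers)

# int() also accepts Unicode decimal digits, so the matches convert directly
values = [int(n) for n in numbers]
print(values)
```

If only the ASCII digits 0 to 9 should match, passing `flags=re.ASCII` to `findall` restricts \d accordingly.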
Extracting Date Pattern
Now, we will be more specific and try to extract the Arabic numeric date pattern. In Arabic, the date starts with two digits for the day, then two digits for the month, then four digits for the year.
It is still easy to search for the Arabic numeric date pattern. We will use the identifier ‘\d’ with the number of repetitions for each group: day, month and year. We also have to include the forward slash symbol because our date pattern contains it. This means the Regex search pattern will be something like: ‘\d{2}/\d{2}/\d{4}’
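The date pattern can be tried out as below. The sample sentence (roughly "The meeting is on 15/03/2021 and the deadline is 01/04/2021") is hypothetical, and the capturing groups in the second pattern are an optional addition of mine for pulling out the day, month and year separately.

```python
import re

# Hypothetical sample sentence with two dates, assumed for illustration
text = "الاجتماع يوم ١٥/٠٣/٢٠٢١ والموعد النهائي ٠١/٠٤/٢٠٢١"

# Two digits, slash, two digits, slash, four digits
date_pattern = r"\d{2}/\d{2}/\d{4}"
dates = re.findall(date_pattern, text)
print(dates)

# Same pattern with capturing groups to separate the parts of the first date
first = re.search(r"(\d{2})/(\d{2})/(\d{4})", text)
day, month, year = first.groups()
print(int(day), int(month), int(year))
```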
OR Operator
The OR operator is written as the pipe symbol ‘|’. We can use it to match either of two choices. If I want to search for the word ‘google’ or ‘chrome’, which in Arabic is ‘جوجل|كروم’, I can do it with any of the REGEX methods, such as search.
REGEX returns ‘كروم’ because search reports the first match it meets in the text, regardless of the order of the choices in the pattern. Arabic is written from right to left, and REGEX scans the string in that same reading order, so the word ‘كروم’, which comes first in my text, is matched first, at positions 24 to 28. Spaces also occupy positions in the string.
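A small sketch of this behaviour, again on a hypothetical sentence (roughly "I downloaded the Chrome browser from Google") rather than the original text. It also swaps the two choices to show that the earliest occurrence in the text decides the result.

```python
import re

# Hypothetical sample sentence, assumed for illustration only
text = "قمت بتحميل متصفح كروم من جوجل"

# search stops at the earliest position where either choice matches
match = re.search("جوجل|كروم", text)
print(match.group(), match.span())

# Swapping the choices does not change which word is found first
swapped = re.search("كروم|جوجل", text)
print(swapped.group(), swapped.span())
```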
Let’s say I want to find all the matches of ‘جوجل|كروم’ in the above text. In this case it is better to use the ‘findall’ method.
As we can see, the ‘findall’ method lets us find all the matches of both words.
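To illustrate, here is a short sketch with a hypothetical sentence (roughly "The Chrome browser is from Google, and the company Google updates Chrome constantly") in which each word appears twice.

```python
import re

# Hypothetical sample sentence, assumed for illustration only
text = "متصفح كروم من جوجل، وشركة جوجل تحدث كروم باستمرار"

# findall returns every non-overlapping match, in the order they occur
matches = re.findall("جوجل|كروم", text)
print(matches)
print(len(matches))
```

If the positions are needed as well, `re.finditer` yields match objects with the same `span()` method used with search.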