Regular Expression in Python Part (2)
In the previous post, I presented the first part of my explanation of regular expressions (see Regex_part_one). Today, I will continue with more regular expression methods and examples, including some example texts in Arabic. The Python code is available on my GitHub.
I will start with a search pattern for the letter 'c' and the alphanumeric characters that follow it. My search pattern will therefore be r'c\w+'.
The output is every run of characters that starts with the letter 'c' and continues with '\w+', which is the rest of the word characters up to the end of the word. Note that a match does not have to begin at the start of a word: 'culty', for example, is the tail of a longer word such as 'faculty'. For my text string, the output was the list 'culty', 'ceive', 'ccount', 'ccess', 'calendar'.
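The search described above can be sketched as follows. The sample sentence is my own assumption, chosen so that it produces the same list; it is not the text used in the original post. The second pattern, with the word boundary `\b`, is an optional variation that keeps only words that really begin with 'c'.

```python
import re

# Hypothetical sample text (an assumption, not the post's original string).
my_text = "The faculty will receive an account to access the calendar."

# 'c' followed by one or more word characters, anywhere in a word.
matches = re.findall(r"c\w+", my_text)
print(matches)  # ['culty', 'ceive', 'ccount', 'ccess', 'calendar']

# Variation: \b anchors the 'c' to the start of a word.
whole_words = re.findall(r"\bc\w+", my_text)
print(whole_words)  # ['calendar']
```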
Searching Emails:
Email addresses follow the pattern: alphanumeric characters + '@' symbol + alphanumeric characters + '.' + more characters.
To represent the email pattern in regex, a first attempt at the search pattern simply looks like r'\w+@\w+'.
As we can see, the output contains only part of the email address: the characters before the '@' symbol and the characters after it, up to the first '.'. The part that includes '.' and the characters that come after it is missing. To get the whole email address, I will allow '.' in the part after '@' by putting it inside square brackets together with '\w', so that part will be '[\w.]+'. The whole search pattern for the email address will be: r'\w+@[\w.]+'
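A minimal sketch of both email patterns side by side. The address and the sentence are hypothetical placeholders. One caveat worth knowing: because '[\w.]+' also accepts dots, an address that ends a sentence would pick up the trailing period, so the sample keeps the address in the middle of the sentence.

```python
import re

# Hypothetical text and address, used only for illustration.
my_text = "Please write to student01@example.edu for more details"

# First attempt: stops at the '.' because '.' is not a word character.
partial = re.findall(r"\w+@\w+", my_text)
print(partial)  # ['student01@example']

# Allowing '.' after the '@' captures the whole address.
full = re.findall(r"\w+@[\w.]+", my_text)
print(full)  # ['student01@example.edu']
```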
What about customizing my search pattern to extract everything after the '@' symbol? That is still simple in regex: the search pattern will be r'@\w+', which means start with the '@' symbol and match the word characters that come right after it.
Again we got the characters after the '@' symbol, but we are still missing the part that has the '.' symbol. As we did above, we just need to add '.' by putting '\w.' inside square brackets, so my search pattern for everything after '@' will look like r'@[\w.]+'
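The two "everything after '@'" patterns can be compared like this. The sample text is again a hypothetical placeholder, and note that the '@' itself is part of each match.

```python
import re

# Hypothetical text and address, used only for illustration.
my_text = "Please write to student01@example.edu for more details"

# Word characters only: the match ends at the first '.'.
after_at_partial = re.findall(r"@\w+", my_text)
print(after_at_partial)  # ['@example']

# Word characters and dots: the match covers the full domain.
after_at_full = re.findall(r"@[\w.]+", my_text)
print(after_at_full)  # ['@example.edu']
```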
Now, if I want to extract only the last part of the domain from my email address, which is the characters that come after '.', I just need to put that part in a group. So my search pattern will be r'@\w+\.(\w+)', which means: start with the '@' symbol, then one or more word characters, then a literal '.' (written as '\.' because an unescaped '.' matches any character), then the group of characters that come after it.
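Here is a small sketch of the grouping idea, using hypothetical addresses. When a pattern contains one group, `re.findall` returns only the group's content. The second half shows why the dot is worth escaping: with an unescaped '.', a string that has no dot after the '@' can still produce a misleading match.

```python
import re

# Hypothetical address, used only for illustration.
my_text = "Please write to student01@example.edu for more details"

# The parentheses capture only what follows the literal dot.
domain_end = re.findall(r"@\w+\.(\w+)", my_text)
print(domain_end)  # ['edu']

# A hypothetical address-like string with no dot after the '@'.
no_dot = "user@localhost"
print(re.findall(r"@\w+\.(\w+)", no_dot))  # [] (no literal dot, no match)
print(re.findall(r"@\w+.(\w+)", no_dot))   # ['t'] ('.' matched the letter 's')
```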
Searching Multiple Words:
Let's say that I need to search for multiple words in my text string, such as the following words:
my_pattern = ['faculty', 'staff', 'name', 'students', 'Stanford', 'education']
To search for this list of words using regex, I can use a for loop.
The code is very simple: it iterates over every word in the list and checks whether that word occurs in the text. For a word that exists in the text it reports that the word is a match, and for a word that does not exist it reports that the word is not a match.
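The loop described above might look like the following sketch. The text string is a hypothetical example, and the `results` dictionary is my own addition so the outcome can be inspected afterwards. `re.search` is case-sensitive by default, so 'Stanford' would not match 'stanford'.

```python
import re

# Hypothetical sample text (an assumption, not the post's original string).
my_text = "Stanford faculty and staff help students every day"
my_pattern = ['faculty', 'staff', 'name', 'students', 'Stanford', 'education']

results = {}  # hypothetical helper: word -> whether it was found
for word in my_pattern:
    # re.search returns a match object if the word occurs anywhere, else None.
    if re.search(word, my_text):
        print(f"{word} is a match")
        results[word] = True
    else:
        print(f"{word} is not a match")
        results[word] = False
```

If the list could contain characters that are special in regex (such as '.' or '+'), wrapping each word in `re.escape(word)` keeps it literal.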
Extracting Date:
A date often comes in the format 2 digits (month)-2 digits (day)-4 digits (year). To extract this pattern, we simply write it as r'\d{2}-\d{2}-\d{4}'. To capture only part of the date, we add parentheses around each part that we are interested in. So if we want to extract the year, we put parentheses around the third part, and our date pattern will look like r'\d{2}-\d{2}-(\d{4})'.
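A short sketch of the date patterns, run on a hypothetical sentence with made-up dates. With no group, `re.findall` returns the full matches; with one group it returns only that group; with several groups it returns one tuple per match.

```python
import re

# Hypothetical text with made-up dates in MM-DD-YYYY form.
my_text = "Classes start on 09-21-2020 and end on 12-11-2020."

full_dates = re.findall(r"\d{2}-\d{2}-\d{4}", my_text)
print(full_dates)  # ['09-21-2020', '12-11-2020']

# One group around the year: only the year is returned.
years = re.findall(r"\d{2}-\d{2}-(\d{4})", my_text)
print(years)  # ['2020', '2020']

# Three groups: each match becomes a (month, day, year) tuple.
parts = re.findall(r"(\d{2})-(\d{2})-(\d{4})", my_text)
print(parts)  # [('09', '21', '2020'), ('12', '11', '2020')]
```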
Split Function:
The split function in regex divides a string of text into pieces wherever a space, a delimiter, or any other pattern of our choice occurs. Let's say that I want the splitter to be the space. The first thing I will do is call the split function from the re library, so my code starts with re.split(). Since the splitter will be the space, I will use the identifier '\s', which matches any whitespace character. This means that my call will look like re.split(r'\s', my_text).
To limit the number of splits applied to my string, all we need to do is set the parameter 'maxsplit' to a definite number. In the above example, I set 'maxsplit' to 1. Therefore my string is split only once, at the first space: the result has two elements, the first word and the rest of the string. The space used as the splitter is not included in the output.
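Both split calls can be sketched as below, using a hypothetical string. I pass `maxsplit` as a keyword argument, which is the form the Python documentation recommends.

```python
import re

# Hypothetical sample text, used only for illustration.
my_text = "Regular expressions are powerful"

# Split at every whitespace character.
words = re.split(r"\s", my_text)
print(words)  # ['Regular', 'expressions', 'are', 'powerful']

# Split only once: first word, then the rest of the string.
first_and_rest = re.split(r"\s", my_text, maxsplit=1)
print(first_and_rest)  # ['Regular', 'expressions are powerful']
```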
Replace Method with Sub Function:
The sub() function searches a string for a pattern and replaces every match with another text or character of our choice.
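A minimal sketch of `re.sub`, run on a hypothetical string. The call takes the pattern, the replacement, and the text, and returns a new string; the optional `count` argument limits how many matches are replaced.

```python
import re

# Hypothetical sample text, used only for illustration.
my_text = "The faculty meeting is on 09-21-2020"

# Replace a word with another word.
swapped = re.sub(r"faculty", "staff", my_text)
print(swapped)  # The staff meeting is on 09-21-2020

# Replace every digit with '#'.
masked = re.sub(r"\d", "#", my_text)
print(masked)  # The faculty meeting is on ##-##-####

# count=1 replaces only the first match.
masked_once = re.sub(r"\d", "#", my_text, count=1)
print(masked_once)  # The faculty meeting is on #9-21-2020
```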