Regular Expressions in Python (Part 1)
What Is a Regular Expression?
A regular expression, or regex for short, is a tool for describing and searching for patterns in text.
What Is a Pattern?
A pattern can be an email address, which has the arrangement somewords + '@' + somewords + '.' + somewords; a 10-digit phone number with 3 digits, then '-', then 3 digits, then '-', then 4 digits; a website address; a social security number; or any other arrangement of characters.
Regular expressions have many uses, such as extracting specific text, cleaning text, replacing text with other words, and more.
Regular expressions are supported in many languages, including Python. They use many special expressions, such as \d for a digit, \w for a word character, and \s for whitespace, and with practice these are easy to memorize.
Additionally, Python's 're' module has many methods for performing quick searches.
Most Common REGEX Methods
The six most common methods are match, search, findall, sub, split, and compile.
re.match() checks for a match only at the beginning of the string.
re.search() checks for a match anywhere in the string and returns the first one.
re.findall() returns all non-overlapping matches of the pattern in the string.
re.split() splits the string into smaller strings wherever the pattern matches.
re.sub() searches for the pattern and replaces each match with a new substring.
re.compile() converts a pattern into a reusable pattern object.
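As a quick illustration of the six methods listed above, here is a minimal sketch. The sample sentence and the variable names are my own assumptions, chosen only for illustration.

```python
import re

# Hypothetical sample sentence (an assumption for illustration).
text = "Call our customer service phone number at 1-888-280-4331."

at_start = re.match(r"Call", text)     # matches: "Call" is at the start
not_start = re.match(r"phone", text)   # None: "phone" is not at the start
anywhere = re.search(r"phone", text)   # first match anywhere in the string
all_hits = re.findall(r"\d+", text)    # every run of digits, as a list
pieces = re.split(r"\s+", text)        # split on runs of whitespace
masked = re.sub(r"\d", "#", text)      # replace every digit with '#'
digits = re.compile(r"\d+")            # reusable compiled pattern object

print(at_start.group())       # Call
print(not_start)              # None
print(anywhere.group())       # phone
print(all_hits)               # ['1', '888', '280', '4331']
print(pieces[:3])             # ['Call', 'our', 'customer']
print(masked)
print(digits.findall(text))   # same result as all_hits
```

A compiled pattern exposes the same methods (search, findall, sub, and so on), which is handy when one pattern is reused many times.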
Special Characters
$ ^ |
$: matches at the end of the string
^: matches at the beginning of the string
|: means 'or'
To match one of these characters literally, escape it with a backslash, e.g. \$.
Now, I'm going to use these common methods and some common expressions.
To get the Python code for this blog, please visit my GitHub. But first, we have to import the regex library 're'. We can start by checking whether a specific pattern exists; for that we can use the 'search' method, which looks for the first match only.
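Here is a minimal sketch of that first step. The sample text is a stand-in I wrote for illustration (an assumption, not necessarily the exact text used in this blog), built so that it contains the words and the phone number discussed here.

```python
import re

# Stand-in sample text (an assumption for illustration).
text = ("Call our customer service phone number at 1-888-280-4331. "
        "This phone-number is toll free.")

# search() returns a match object for the first occurrence, or None.
match = re.search(r"number", text)

print(match.group())  # number
print(match.span())   # (32, 38)
```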
The 'search' method matches my pattern, 'number', against my text. The match starts at index 32 and ends at index 38; spaces in the text count toward these positions, and the end index is exclusive, so the match covers characters 32 through 37. However, 'search' reports only the first instance of the word 'number'; the second occurrence is not considered. To get every occurrence, we use the 'findall' method as follows:
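A minimal sketch of 'findall', again on a stand-in sample text of my own (an assumption for illustration):

```python
import re

# Stand-in sample text (an assumption for illustration).
text = ("Call our customer service phone number at 1-888-280-4331. "
        "This phone-number is toll free.")

# findall() returns every non-overlapping match as a list of strings.
matches = re.findall(r"number", text)

print(matches)       # ['number', 'number']
print(len(matches))  # 2
```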
And to get the span of every instance of my pattern, we iterate over the matches with the 'finditer' method:
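A minimal sketch of that loop, using a stand-in sample text (an assumption for illustration):

```python
import re

# Stand-in sample text (an assumption for illustration).
text = ("Call our customer service phone number at 1-888-280-4331. "
        "This phone-number is toll free.")

# finditer() yields one match object per occurrence, so each span is available.
spans = [m.span() for m in re.finditer(r"number", text)]

for start, end in spans:
    print(start, end, text[start:end])

# spans is [(32, 38), (69, 75)]
```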
In the above text, we have a phone number, i.e. '1-888-280-4331'. The format of this phone number is: one digit, dash, three digits, dash, three digits, dash, four digits. It is expressed in regex as follows:
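One way to write that format out literally is one '\d' per digit. This is a sketch on a stand-in sample text (an assumption for illustration):

```python
import re

# Stand-in sample text (an assumption for illustration).
text = ("Call our customer service phone number at 1-888-280-4331. "
        "This phone-number is toll free.")

# One \d per digit: 1 digit, dash, 3 digits, dash, 3 digits, dash, 4 digits.
phone = re.search(r"\d-\d\d\d-\d\d\d-\d\d\d\d", text)

print(phone.group())  # 1-888-280-4331
```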
To avoid writing the identifier ‘\d’ multiple times, REGEX allows us to use Quantifiers.
What are REGEX Quantifiers?
Quantifiers specify the number of times that the identifier is used.
+: matches one or more repetitions, e.g. \w+ matches one or more word characters (letters, digits, or underscore)
*: matches zero or more repetitions
?: matches zero or one repetition
{4}: occurs exactly four times
{1,3}: occurs between one and three times
{2,}: occurs two or more times
Let's run the above example again, this time using quantifiers on our identifier '\d':
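A sketch of the quantifier version, compared against the long form on a stand-in sample text (an assumption for illustration):

```python
import re

# Stand-in sample text (an assumption for illustration).
text = ("Call our customer service phone number at 1-888-280-4331. "
        "This phone-number is toll free.")

# Long form: one \d per digit.
long_form = re.search(r"\d-\d\d\d-\d\d\d-\d\d\d\d", text)

# Short form: {n} repeats the preceding \d exactly n times.
short_form = re.search(r"\d-\d{3}-\d{3}-\d{4}", text)

print(short_form.group())                       # 1-888-280-4331
print(short_form.group() == long_form.group())  # True
```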
As we can see, we got exactly the same output with less repetition.
Slicing in REGEX
We can also slice the result of a regex search by using the group method. For example, if I want the last 4 digits of the phone number above, I call group(4), but first we have to put parentheses around each group of digits so that the group method works for us, as follows:
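A minimal sketch of grouping, on a stand-in sample text (an assumption for illustration). Each pair of parentheses becomes one numbered group:

```python
import re

# Stand-in sample text (an assumption for illustration).
text = ("Call our customer service phone number at 1-888-280-4331. "
        "This phone-number is toll free.")

# Four pairs of parentheses give four groups, numbered from 1.
match = re.search(r"(\d)-(\d{3})-(\d{3})-(\d{4})", text)

print(match.group(4))  # 4331
print(match.group(0))  # 1-888-280-4331 (the whole match)
print(match.groups())  # ('1', '888', '280', '4331')
```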
As we can see above, I added parentheses so that the group method works, and we can put the '-' inside a group if we want it included in that group's output. Group numbering starts at one; group(0) or group() returns the whole match, as shown in the above example.
OR Operator in REGEX
The OR operator is written as the pipe symbol '|'. We can use it with the 'search' or 'findall' methods, or with any other regex method, as follows:
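A sketch of the OR operator with both methods, on a stand-in sample text (an assumption for illustration):

```python
import re

# Stand-in sample text (an assumption for illustration).
text = ("Call our customer service phone number at 1-888-280-4331. "
        "This phone-number is toll free.")

# findall() collects every occurrence of either alternative.
all_matches = re.findall(r"phone|number", text)

# search() stops at whichever alternative appears first in the text.
first_a = re.search(r"phone|number", text)
first_b = re.search(r"customer|phone", text)

print(all_matches)      # ['phone', 'number', 'phone', 'number']
print(first_a.group())  # phone
print(first_b.group())  # customer
```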
With 'findall', the OR operator matches all instances of 'phone' or 'number'. With 'search', it returns only the first match. That is why searching my text for 'phone|number' grabbed the first occurrence, which was 'phone', while searching for 'customer|phone' matched 'customer', because it appears first in my text string.
Search Patterns: Start with/ End with
To search for a pattern at the start of the string, we add the symbol '^' before the search pattern. So, in the above text, if I want the first word of the string, I use r'^\w+' as the search pattern.
If I want to search at the end of my text, my search pattern ends with '$', and it will be something like r'\S+$', as follows:
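A sketch of both anchors on a stand-in sample text (an assumption for illustration):

```python
import re

# Stand-in sample text (an assumption for illustration).
text = ("Call our customer service phone number at 1-888-280-4331. "
        "This phone-number is toll free.")

# ^ anchors the match to the start of the string.
first_word = re.search(r"^\w+", text)

# $ anchors the match to the end; \S+ takes the last run of non-space characters.
last_token = re.search(r"\S+$", text)

print(first_word.group())  # Call
print(last_token.group())  # free.
```

Note that \S matches any non-space character, so the trailing period is part of the last token.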
Searching for Specific Sequence of Characters
To search for a specific sequence of characters, let's say 'er' in the above text, our pattern will look something like r'\S+er'. This means: find one or more non-space characters followed by the sequence 'er'. According to our above text, the matches should be found at 'customer', 'number', and 'ser' from the word 'service'.
But what if I make a small change so that my search pattern looks like r'er\S+'? Is there a difference between r'er\S+' and r'\S+er'? The answer is yes. The pattern r'er\S+' means: find the sequence 'er' followed by one or more non-space characters. Therefore the output is 'ervice'. The pattern r'\S+er' means: find one or more non-space characters that end with the 'er' sequence. Therefore the output is ['customer', 'ser', 'number'].
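The two patterns can be compared side by side. This sketch uses a short stand-in sentence of my own (an assumption for illustration):

```python
import re

# Short stand-in sentence (an assumption for illustration).
text = "Call our customer service phone number"

# One or more non-space characters, ending in 'er'.
ends_with_er = re.findall(r"\S+er", text)

# 'er' followed by one or more non-space characters.
starts_with_er = re.findall(r"er\S+", text)

print(ends_with_er)    # ['customer', 'ser', 'number']
print(starts_with_er)  # ['ervice']
```

In 'customer' and 'number' the 'er' is followed by a space or the end of the string, so r'er\S+' cannot match there; only 'service' has characters after 'er'.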
Let's say I want to search for a pattern that contains some word characters, then '-', then some more word characters. How can I search for this pattern? Well, it is easy with regex. My pattern will be something like r'\w+-\w+', which means some characters, then '-', then some characters, as follows:
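A sketch of this pattern on a stand-in sample text (an assumption for illustration). The second pattern, which repeats the '-\w+' part, is my own variant added to show how a longer hyphenated token can be captured whole:

```python
import re

# Stand-in sample text (an assumption for illustration).
text = ("Call our customer service phone number at 1-888-280-4331. "
        "This phone-number is toll free.")

# Word characters, one dash, word characters.
pairs = re.findall(r"\w+-\w+", text)

# Variant (my addition): allow the dash-plus-word part to repeat.
chains = re.findall(r"\w+(?:-\w+)+", text)

print(pairs)   # ['1-888', '280-4331', 'phone-number']
print(chains)  # ['1-888-280-4331', 'phone-number']
```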
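As we can see above, the pattern 'some word characters + '-' + some word characters' finds its matches in my text at 'phone-number' and within '1-888-280-4331'. Because each match contains only one dash, the phone number is matched in pieces rather than as a whole.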
Exclusion
Any set of characters can be excluded by putting it inside square brackets, preceded by the '^' symbol.
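A sketch of exclusion with digits as the excluded set, on a stand-in sample text (an assumption for illustration):

```python
import re

# Stand-in sample text (an assumption for illustration).
text = ("Call our customer service phone number at 1-888-280-4331. "
        "This phone-number is toll free.")

# [^\d] matches any single character that is NOT a digit.
chars = re.findall(r"[^\d]", text)

# Adding + groups consecutive non-digit characters into longer chunks.
chunks = re.findall(r"[^\d]+", text)

print(chars[:4])  # ['C', 'a', 'l', 'l']
print(chunks[0])  # 'Call our customer service phone number at '
print(chunks)
```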
In the above example, I excluded the digits by writing my pattern as r'[^\d]'. The output was every character of my text, including spaces, but not the digits, each as a separate item. If I want to exclude specific content such as digits and keep the rest of the text in longer runs rather than single characters, I can just add a plus sign, as I did in the above example, where my search pattern becomes r'[^\d]+'.
Cleaning from Punctuation
To clean punctuation from text, I apply the exclusion approach that I used above. So, my search pattern has square brackets containing the '^' symbol followed by the punctuation marks that I want to remove.
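A sketch of punctuation cleaning. The sample sentence and the particular set of punctuation marks are my own assumptions, chosen for illustration; whitespace (\s) is excluded too so that the result comes back as separate words:

```python
import re

# Hypothetical sample sentence (an assumption for illustration).
messy = "Hello, world! Is regex easy? Yes, it is."

# Keep runs of characters that are not punctuation or whitespace.
tokens = re.findall(r"[^.,!?\s]+", messy)

print(tokens)  # ['Hello', 'world', 'Is', 'regex', 'easy', 'Yes', 'it', 'is']
```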
With Regex, cleaning is very easy and fast. This exclusion method can work for anything such as cleaning texts from special characters like the hashtags in tweets or removing stop words.
The output here is a list of tokens, meaning the words that were separated by spaces. To turn this cleaned text back into a single string, I apply the 'join' method to a space as follows:
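A sketch of the join step. The sample sentence and punctuation set are my own assumptions for illustration:

```python
import re

# Hypothetical sample sentence (an assumption for illustration).
messy = "Hello, world! Is regex easy? Yes, it is."

# Tokenize by excluding punctuation and whitespace.
tokens = re.findall(r"[^.,!?\s]+", messy)

# Join the tokens with a single space to rebuild one clean string.
cleaned = " ".join(tokens)

print(cleaned)  # Hello world Is regex easy Yes it is
```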
Finally, this blog is a simple summary of regular expressions. In the next blog, I'm going to present more methods and examples, and I'm also going to apply them to Arabic texts.