Python Regex

What is Regex?

Regex is short for regular expression, a powerful technique for searching text. With regex, we define an expression based on the pattern that the required text follows, instead of writing out the exact text itself. This is far more flexible than searching for one specific string, which only finds exact occurrences of that string.

We can match text that follows a complex pattern using a regular expression.

For example, we can write a regular expression that matches all the words in a text that are enclosed in double quotes, or one that matches all the dates mentioned in a text file. We will discuss all this further, but first we have to import the regex module into our Python program.

Importing Regex in Python

To use regex in our Python program, we import a module called re, which is part of the standard library. We can import it like this:

			

import re

				

Using Regex

We call the re.compile() function to use regex. It takes a pattern as an argument and returns a pattern object. We then pass the input text to a method of this pattern object, such as findall(), and it returns the matching text. Like this:

			

searchDates = re.compile(r"\s\w{0,8}\s\d,\s\d{4}")

result = searchDates.findall(text)
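The compile-then-search flow described above can be tried without any input file. The following is a minimal sketch, assuming a few made-up sample lines and a deliberately simple pattern (`\d+`, one or more digits); the names `number_pattern`, `lines`, and `results` are only illustrative.

```python
import re

# Compile the pattern once; r"\d+" means "one or more digits".
number_pattern = re.compile(r"\d+")

# Hypothetical sample lines, made up for this sketch.
lines = [
    "Room 12 is on floor 3.",
    "No numbers here.",
    "Call extension 4021.",
]

# The same pattern object can be reused on any number of strings.
results = [number_pattern.findall(line) for line in lines]
print(results)  # [['12', '3'], [], ['4021']]
```

Compiling is mostly a convenience when the same pattern is applied many times; for a one-off search, calling re.findall() directly with the pattern works as well.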

				

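A regex pattern is made up of ordinary characters and special sequences. Ordinary characters such as letters and digits match themselves, so we can match a string directly by just writing it. Sequences that begin with the ‘\’ symbol, such as \d or \s, each have their own meaning. Characters that are special in regex, such as . or +, must be escaped with the ‘\’ symbol when we want to match them literally. It is also a good habit to write patterns as raw strings (r"...") so that Python does not interpret the backslashes itself.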
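To make the difference between literal and escaped characters concrete, here is a small sketch. The sample sentence and variable names are assumptions made up for illustration.

```python
import re

text = "Version 3.14 costs $5 (approx)."

# Ordinary characters match themselves.
literal = re.findall(r"costs", text)          # ['costs']

# "." is special (any character), so escape it to match a literal dot.
version = re.findall(r"3\.14", text)          # ['3.14']

# "$" and "(" are special too; a backslash makes them literal.
price = re.findall(r"\$\d", text)             # ['$5']
note = re.findall(r"\(approx\)", text)        # ['(approx)']

# re.escape() adds the needed backslashes to an arbitrary string.
escaped = re.escape("$5 (approx).")
found = re.search(escaped, text) is not None  # True

print(literal, version, price, note, found)
```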

Now let's take an example. Suppose we have to find the dates mentioned in the input text. The dates are in the given format:

MonthName Day, Year

A regular expression to match text of this pattern is:

\w+\s\d{1,2},\s\d{4}

Let's discuss what this complex looking expression really means.

First we see \w. This matches a single word character (a letter, a digit, or an underscore). The + symbol specifies that the element preceding it must occur one or more times. Thus, \w+ matches a run of one or more word characters, which in our case is the month name.

\s stands for a single whitespace character.

\d specifies a digit. The {} specify how many times the preceding element must occur for a match. In our case, the day can have either 1 or 2 digits (day numbers range from 1 to 31). Thus, \d{1,2} means a number of one or two digits must occur here.

, symbol represents itself.

\d{4} specifies a year having 4 digits.
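The breakdown above can be checked on a short string. This is a minimal sketch, assuming a made-up sample sentence; the re.VERBOSE flag is used only so that each piece of the pattern can carry its own comment.

```python
import re

# Hypothetical sample text, made up for this sketch.
sample = "The show opened on May 3, 2020 and closed on November 17, 2020."

# The pattern discussed above, written compactly.
compact = re.compile(r"\w+\s\d{1,2},\s\d{4}")

# The same pattern spelled out piece by piece. With re.VERBOSE,
# whitespace and comments inside the pattern are ignored.
verbose = re.compile(r"""
    \w+      # month name: one or more word characters
    \s       # one whitespace character
    \d{1,2}  # day: one or two digits
    ,        # a literal comma
    \s       # one whitespace character
    \d{4}    # year: exactly four digits
""", re.VERBOSE)

dates = compact.findall(sample)
print(dates)                             # ['May 3, 2020', 'November 17, 2020']
print(verbose.findall(sample) == dates)  # True
```

Note that \w+ accepts any word, not only real month names, so this pattern finds text shaped like a date rather than validating it.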

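Here is the code for the case mentioned above. Note that the pattern in this code is a stricter variant of the one just discussed: it requires a whitespace character before the month name, allows at most 8 word characters for the month name, and accepts only a single-digit day. This is why every match in the output starts with a space.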

			

import re

# Read the whole file; the with block closes it automatically
with open("./bob_dylan_bio.txt", 'r') as file:
    text = file.read()

searchDates = re.compile(r"\s\w{0,8}\s\d,\s\d{4}")

result = searchDates.findall(text)

print(result)

				

Output:

[' May 3, 2020', ' November 7, 2020', ' November 7, 2020', ' October 1, 2020']

Useful Regex Characters

Here are some useful regex characters:

Symbol    Description
\w        Matches an alphanumeric character (including the underscore character)
\d        Matches a digit from 0-9
\s        Matches one whitespace character
\W        Matches a non-alphanumeric, non-underscore character (opposite of \w)
\S        Matches a non-whitespace character (opposite of \s)
\D        Matches a non-digit character (opposite of \d)
\g<n>     Refers to the text matched by group n (used in replacement strings, for example with sub())
^         Matches the beginning of the string
$         Matches the end of the string
-         Represents a range of characters or numbers inside [] (for example [a-z] or [0-9])
[]        Matches one character out of those inside the [] brackets
{n}       Matches the preceding element exactly n times
()        Groups regex expressions
|         Acts as an OR condition (matches either the expression before it or the expression after it)
*         Matches the preceding element 0 or more times
.         Matches any single character (excluding the newline) 1 time
+         Matches the preceding element 1 or more times
?         Matches the preceding element 0 or 1 times
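A few of the symbols in the table can be tried out directly. The sketch below uses short made-up strings; the variable names are only illustrative, and the comments show the result of each call.

```python
import re

# Character class with a range: any one of the letters a, b, or c.
in_range = re.findall(r"[a-c]", "abcdef")            # ['a', 'b', 'c']

# Anchors: ^ ties the match to the start, $ to the end.
starts = bool(re.search(r"^Hello", "Hello world"))   # True
ends = bool(re.search(r"Hello$", "Hello world"))     # False

# Quantifiers applied to the preceding character "b".
star = re.findall(r"ab*", "a ab abbb")               # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")               # ['ab', 'abbb']
optional = re.findall(r"ab?", "a ab abbb")           # ['a', 'ab', 'ab']
exactly3 = re.findall(r"\d{3}", "12 345 6789")       # ['345', '678']

# Alternation and the dot (the dot does not match a newline).
either = re.findall(r"cat|dog", "cats and dogs")     # ['cat', 'dog']
any_char = re.findall(r"c.t", "cat cot c\nt")        # ['cat', 'cot']

# Uppercase classes are the opposites of the lowercase ones.
non_digits = re.findall(r"\D+", "a1b22c")            # ['a', 'b', 'c']

print(in_range, starts, ends, star, plus, optional)
print(exactly3, either, any_char, non_digits)
```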

Regex Methods

Some useful Regex methods:

Method       Description
search()     Returns a match object for the first match found anywhere in the string, else returns None
findall()    Returns a list of all matching strings (or of the captured groups, if the pattern contains groups)
sub()        Returns the string obtained by replacing the matches with another string
match()      Returns a match object if the pattern matches at the start of the string, else returns None
fullmatch()  Returns a match object if the pattern matches the whole string, else returns None
escape()     Escapes special characters in a string
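The differences between these methods are easiest to see side by side. Here is a minimal sketch on a made-up three-word string; all variable names are illustrative.

```python
import re

text = "cat bat rat"

# search() looks anywhere in the string.
found = re.search(r"bat", text)
print(found.group(), found.start())        # bat 4

# match() only succeeds at the very start of the string.
at_start = re.match(r"cat", text)          # match object for "cat"
not_at_start = re.match(r"bat", text)      # None

# fullmatch() needs the pattern to cover the whole string.
partial = re.fullmatch(r"cat", text)       # None
whole = re.fullmatch(r"\w+ \w+ \w+", text) # match object

# findall() returns plain strings, not match objects.
words = re.findall(r"\w+", text)           # ['cat', 'bat', 'rat']

# sub() returns a new string; the original is left unchanged.
replaced = re.sub(r"at", "ow", text)       # 'cow bow row'

# escape() puts a backslash before special characters.
escaped = re.escape("a.b")                 # 'a\\.b', i.e. a\.b

print(words, replaced, escaped)
```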

Regex Examples

Code to match the email addresses in a document:

			

import re

with open("./emails.txt", 'r') as file:
    text = file.read()

# Raw string; the dot is escaped so it matches only a literal "."
result = re.findall(r"\w+@\w+\.com", text)

print(result)

				

Output:

['peter@dummyemail.com', 'harry@dummyemail.com', 'potter@dummyemail.com', 'john@dummyemail.com', 'doe@dummyemail.com']
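One detail in email patterns like this deserves a closer look: an unescaped . matches any character, while \. matches only a literal dot. The sketch below compares the two on a made-up string; the addresses and variable names are assumptions for illustration.

```python
import re

# Hypothetical sample: the second "address" has an x where the dot should be.
sample = "alice@example.com bob@examplexcom"

# Unescaped dot: "." accepts any character, so the malformed one matches too.
loose = re.findall(r"\w+@\w+.com", sample)

# Escaped dot: only a literal "." is accepted before "com".
strict = re.findall(r"\w+@\w+\.com", sample)

print(loose)   # ['alice@example.com', 'bob@examplexcom']
print(strict)  # ['alice@example.com']
```

Even the stricter pattern is a simplification: it only covers .com addresses whose name and domain consist of word characters.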

Code to match IP addresses in the given text:

			

import re

with open("./ip_address_text.txt", 'r') as file:
    text = file.read()

# Raw string so that \d and \. reach the regex engine unchanged
result = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", text)

print(result)

				

Output:

['127.0.0.1', '255.0.0.0', '192.168.29.207', '255.255.255.0', '192.168.29.255']
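The pattern above checks only the shape of an address: \d{1,3} also accepts numbers such as 999, which are not valid in an IPv4 address. One possible way to filter the candidates, sketched here under the assumption that the standard ipaddress module is acceptable for the job, is to validate each match after the regex search. The sample text and the helper name is_valid_ipv4 are made up for this sketch.

```python
import ipaddress
import re

# Hypothetical sample text containing one out-of-range "address".
sample = "Gateway 192.168.1.1, bogus 999.300.1.2, mask 255.255.255.0"

# Step 1: the regex finds everything shaped like four dotted numbers.
candidates = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", sample)


def is_valid_ipv4(candidate):
    """Return True if candidate parses as an IPv4 address."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


# Step 2: keep only the candidates whose numbers are in range.
valid = [c for c in candidates if is_valid_ipv4(c)]

print(candidates)  # ['192.168.1.1', '999.300.1.2', '255.255.255.0']
print(valid)       # ['192.168.1.1', '255.255.255.0']
```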

Code to add double quotes to all the words in the given text:

			

import re

with open("./hobbies.txt", 'r') as file:
    text = file.read()

# Raw strings keep \w and \g<1> intact for the regex engine
result = re.sub(r"(\w+)", r'"\g<1>"', text)

print(result)

				

Output:

"Swimming" "Dancing" "Acting" "Gaming" "Singing"

Note: In the above code, \g<1> refers to the text matched by group 1, that is, the part of the pattern inside the first pair of parentheses.
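Group references become more useful when a pattern has several groups. The following sketch, using a made-up two-word string, shows numbered references and, as an alternative, a named group; the group name word is an assumption chosen for illustration.

```python
import re

text = "Swimming Dancing"

# One group: wrap every word in double quotes.
quoted = re.sub(r"(\w+)", r'"\g<1>"', text)

# Two groups: refer to each by its number to swap the words.
swapped = re.sub(r"(\w+) (\w+)", r"\g<2> \g<1>", text)

# A named group, (?P<word>...), can be referenced by name instead.
bracketed = re.sub(r"(?P<word>\w+)", r"[\g<word>]", text)

print(quoted)     # "Swimming" "Dancing"
print(swapped)    # Dancing Swimming
print(bracketed)  # [Swimming] [Dancing]
```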