# Regular expression (re)

Regular expressions are essentially a tiny, highly specialized programming language embedded inside Python and made available through the *re* module. Using *re*, one can specify the rules for the set of possible strings that one wants to match, for example strings in English sentences, e-mail addresses, or HTML files. In short, it provides an extremely powerful way for us to do *string matching*.

We first import the *re* module. This module has the following features:
- *re* provides regular expression tools for advanced string processing.
- We can use *re.search()* to see if a string matches a regular expression. Note that it returns a *Match object* when the pattern is found and *None* otherwise, so the result can be used directly in an *if* test.
- We can use *re.findall()* to extract the portions of a string that match a regular expression. Note that it returns a list of strings.

**Reference**: https://pymotw.com/3/re/

**Created and updated by** John C. S. Lui on August 14, 2020.

**Important note:** *If you want to use and modify this notebook file, please acknowledge the author.*


## Finding patterns in text

One common use of *re* is to search for patterns in text. The *search()* function takes the pattern and text to scan, and returns a *Match object* when the pattern is found. If the pattern is not found, search() returns *None*.  Let's look at an example.


In [None]:
# first import the regular expression module
import re

pattern = 'this'
text = 'This is really stupid, because this is nut.'

match = re.search(pattern, text)    # scan pattern within the text

start = match.start()         # index where the match starts
end   = match.end()           # index just past the end of the match

#  Let's look at the 'format' output
print('Found "{}"\nin "{}"\nfrom {} to {} ("{}").'.format(
    match.re.pattern, match.string, start, end, text[start:end]))
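
Because *search()* returns *None* when nothing is found, calling *match.start()* on a failed search raises an *AttributeError*. The following is a minimal sketch of guarding against that case; the helper name *describe_match* is hypothetical and only used for illustration.

```python
import re

def describe_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match = re.search(pattern, text)
    if match is None:                      # search() found nothing
        return 'No match for "{}"'.format(pattern)
    return 'Found "{}" at {}:{}'.format(match.group(), match.start(), match.end())

text = 'This is really stupid, because this is nut.'
found = describe_match('this', text)       # lowercase 'this' occurs once
missing = describe_match('that', text)     # 'that' does not occur

print(found)     # Found "this" at 31:35
print(missing)   # No match for "that"
```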

## Compiling expressions

Although *re* includes module-level functions for working with regular expressions as text strings, it is more efficient to compile the expressions a **program uses frequently**. The *compile()* function converts an expression string into a compiled pattern object (*re.Pattern* in Python 3).  Let's study this.

In [None]:
# import the re
import re

# Precompile the patterns we want to search, in this case, they are 'this' and 'that'
regexes = [
    re.compile(p)
    for p in ['this', 'that']
]

text = 'Does this text match the pattern?'   # this is the text we want to search

print('Text: {}\n'.format(text))

for regex in regexes:
    print('Seeking "{}" ->'.format(regex.pattern), end=' ')  # pattern we want to search

    if regex.search(text):
        print('We found a match !!!!')
    else:
        print('Sorry, no match')
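
As a small sketch of reusing one compiled pattern across several strings: a compiled pattern offers the same methods as the module (*search()*, *findall()*, *finditer()*). The sample lines and the name *word_re* are assumptions made for this illustration.

```python
import re

word_re = re.compile(r'[a-z]+')     # compiled once, reused for every line

# Hypothetical sample input
lines = ['Does this text', 'match the pattern?']

# Count the runs of lowercase letters in each line
counts = [len(word_re.findall(line)) for line in lines]

print(type(word_re).__name__)   # Pattern
print(counts)                   # [3, 3]
```

Note that the module-level functions also keep a cache of recently compiled patterns, so the gain from compiling by hand is usually modest; the compiled object is still convenient when the same pattern is passed around a program.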

## How to do **multiple** matches?

So far, we can only match the *first* instance of the pattern. What if we want to find all instances?  In this case, we use the *findall()* function, which returns all of the substrings of the input that match the pattern without overlapping.  Let's take a look.

In [None]:
import re

pattern = 'this'
text = 'This is really stupid, because this is nut, and this is crazy.'

for match in re.findall(pattern, text):
    print('Found "{}"'.format(match))
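
Two details of *findall()* are easy to miss: matches never overlap, and matching is case-sensitive unless a flag says otherwise. The sketch below illustrates both; the use of the *re.IGNORECASE* flag is an addition that does not appear in the cell above.

```python
import re

# Matches do not overlap: 'aaaa' contains 'aa' at 0:2 and 2:4, not at 1:3
non_overlapping = re.findall('aa', 'aaaa')

text = 'This is really stupid, because this is nut, and this is crazy.'

# The capitalized 'This' is skipped by default ...
case_sensitive = re.findall('this', text)

# ... and found when case is ignored
case_insensitive = re.findall('this', text, flags=re.IGNORECASE)

print(non_overlapping)    # ['aa', 'aa']
print(case_sensitive)     # ['this', 'this']
print(case_insensitive)   # ['This', 'this', 'this']
```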

## What if we want to find all possible start and end indexes?

We can use the *finditer()* function, which returns an **iterator** that produces Match instances instead of the strings returned by *findall()*.

In [None]:
# Let's repeat the above program if we want to find the specific positions of each find

import re

pattern = 'this'
text = 'This is really stupid, because this is nut, and this is crazy.'

for match in re.finditer(pattern, text):
    start = match.start()
    end   = match.end()
    print('Found "{}" at {}:{}'.format(text[start:end], start, end))
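
As a variation on the cell above, each Match instance also offers *group()* for the matched text and *span()* for the start and end indexes as one tuple. A minimal sketch:

```python
import re

pattern = 'this'
text = 'This is really stupid, because this is nut, and this is crazy.'

# Collect (matched text, (start, end)) for every match
spans = [(match.group(), match.span()) for match in re.finditer(pattern, text)]

for found, (start, end) in spans:
    print('Found "{}" at {}:{}'.format(found, start, end))
```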

## Pattern syntax

Regular expressions support powerful patterns. Patterns can 
* repeat
* be anchored to different logical locations within the input
* be expressed in compact forms 

All these features are used by combining literal text values with meta-characters that are part of the regular expression pattern syntax implemented by *re*.

The following are some of the meta-characters supported by *re*:
- ^  &nbsp; &nbsp; : Matches the **beginning** of a line
- $  &nbsp; &nbsp; : Matches the **end** of a line
- .  &nbsp;&nbsp; &nbsp; : Matches **any** character
- \s &nbsp; &nbsp;: Matches **whitespace**
- \S &nbsp; &nbsp;: **non-whitespace** character
- \*  &nbsp; &nbsp;&nbsp;: **Repeats** a character *zero or more times*
- \*? &nbsp; : **Repeats** a character *zero or more times* (non-greedy)
- \+  &nbsp; &nbsp; : **Repeats** a character one or more times
- \+?  &nbsp;: **Repeats** a character one or more times (non-greedy)
- [aeiou]  &nbsp; : Matches a single character in this listed **set**
- [^XYZ]   &nbsp; : Matches a single character **not in** the listed **set**
- [a-z0-9] &nbsp; : The set of characters can include a **range**
- (  &nbsp; : Indicates where string **extraction is to start**
- )  &nbsp; : Indicates where string **extraction is to end**

For complete information, please refer to the documentation.

Let's see some examples.
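
The cells below read a local file, *mbox-short.txt*. For readers who do not have that file, here is a minimal self-contained sketch of the same three ideas using an in-memory list; the sample lines are hypothetical stand-ins and are not the contents of the real file.

```python
import re

# Hypothetical stand-in for the lines of a mail file
lines = [
    'From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk',
    'John is going From: HK to US',
    'Fxxm: this is nut1',
    'Subject: hello',
]

# '^' anchors the pattern to the beginning of the line
starts_with_from = [line for line in lines if re.search(r'^From:', line)]

# '.' matches any single character, so 'F..m:' accepts 'From:' and 'Fxxm:'
f_any_two_m = [line for line in lines if re.search(r'^F..m:', line)]

# '\S+' is one or more non-whitespace characters
addresses = [addr for line in lines for addr in re.findall(r'\S+@\S+', line)]

print(starts_with_from)
print(f_any_two_m)
print(addresses)   # ['cslui@cse.cuhk.edu.hk', 'vc@cuhk.edu.hk']
```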

In [1]:
# Example: only match lines that start with the string 'From:'

import re
handle = open('mbox-short.txt')  # open a file
for line in handle:      # process each line at a time
    line = line.rstrip() # strip trailing whitespace, including the newline
    if re.search('^From:', line):  # match 'From:' at the beginning of a line
        print('Found it, and the line is: ', line)

handle.close()  # close the opened file

Found it, and the line is:  From: John to Lui
Found it, and the line is:  From: John to the VC:  "You are fired !!!"
Found it, and the line is:  From: the VC to John:  "Are you nut?"
Found it, and the line is:  From: cslui to luics
Found it, and the line is:  From:cslui to luics
Found it, and the line is:  From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk


In [2]:
# Example: only match lines that start with the string 'From:', 'Fxxm:',
# 'F12m:', or 'F!@m:'

import re
handle = open('mbox-short.txt')  # open a file
for line in handle:      # process each line at a time
    line = line.rstrip() #  strip trailing whitespace, including the newline
    if re.search('^F..m:', line):  # match 'F..m:' at the beginning of a line
        print('Found it, and the line is: ', line)
handle.close()  # close the opened file

Found it, and the line is:  From: John to Lui
Found it, and the line is:  From: John to the VC:  "You are fired !!!"
Found it, and the line is:  From: the VC to John:  "Are you nut?"
Found it, and the line is:  Fxxm: this is nut1
Found it, and the line is:  F12m: this is nut2
Found it, and the line is:  F!@m: this is nut3
Found it, and the line is:  From: cslui to luics
Found it, and the line is:  From:cslui to luics
Found it, and the line is:  From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk


In [3]:
# Match lines that start with "From:", followed by one or more characters (".+"),
# followed by an at-sign ("@")

import re
handle = open('mbox-short.txt')  # open a file
for line in handle:      # process each line at a time
    line = line.rstrip()  
    if re.search('^From:.+@', line):  # 'From:' at the start, then one or more characters, then '@'
        print('Found it, and the line is: ', line)

handle.close()  # close the opened file

Found it, and the line is:  From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk


In [4]:
# Extract email addresses

import re

my_string = 'Hello from cslui@cse.cuhk.edu.hk to pclee@cse.cuhk.edu.hk about the meeting @2PM'

# match one or more non-whitespace characters, then @, then one or more non-whitespace characters
# (a raw string keeps Python from treating "\S" as a string escape)
my_list = re.findall(r'\S+@\S+', my_string)

print('my_list: ', my_list)

my_list:  ['cslui@cse.cuhk.edu.hk', 'pclee@cse.cuhk.edu.hk']


## String pattern matching library

Let's look for substrings that start with a single lowercase letter, uppercase letter, or digit ("[a-zA-Z0-9]"), followed by zero or more non-blank characters ("\S*"), followed by an **at-sign** (@), followed by zero or more non-blank characters ("\S*"), followed by an upper or lower case letter ("[a-zA-Z]").  In other words, we are looking for substrings that look like **email addresses**. The pattern is only an approximation, so a string such as 'F!@m' also matches.

In [5]:
# Let's examine the program
import re
handle = open("mbox-short.txt")
for line in handle:     # process each line
    line = line.rstrip()
    x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0:
        print(x)
        
handle.close()   # close the opened file

['F!@m']
['cslui@cse.cuhk.edu.hk', 'vc@cuhk.edu.hk']
['lyu@cse.cuhk.edu.hk']
['king@cse.cuhk.edu.hk']
['eric@cse.cuhk.edu.hk']


## String Matching Library
- The *re* module provides regular expression tools for advanced string processing.
- You can use *re.search()* to see if a string matches a regular expression, similar to using the *find()* method for strings.  Note that it returns a *Match object* or *None*, which behaves like *True* or *False* in an *if* test.
- You can use *re.findall()* to extract the portions of a string that match your regular expression, similar to a combination of *find()* and slicing: *var[5:10]*. Note that it returns a list of strings.
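
A minimal sketch of what these calls actually return, using one sample line taken from the outputs in this notebook:

```python
import re

line = 'John is going From: HK to US'

# str.find() returns an index, or -1 when the substring is absent
pos = line.find('From:')

# re.search() returns a Match object, or None when there is no match
m = re.search('From:', line)
anchored = re.search('^From:', line)   # 'From:' is not at the start of the line

# find() plus slicing versus re.findall()
sliced = line[pos:pos + 5]
extracted = re.findall(r'[A-Z]{2}', line)   # runs of exactly two uppercase letters

print(pos, m.start())      # 14 14
print(anchored)            # None
print(sliced, extracted)   # From: ['HK', 'US']
```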

In [6]:
# using find() in string vs. re.search() in re

handle = open('mbox-short.txt')
for line in handle:
    line = line.rstrip()
    if line.find('From:') >= 0:
        print(line)

handle.close()
print('----------------------')

import re


handle = open('mbox-short.txt')
for line in handle:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)

handle.close()

From: John to Lui
John is going From: HK to US
From: John to the VC:  "You are fired !!!"
From: the VC to John:  "Are you nut?"
From: cslui to luics
From:cslui to luics
From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk
----------------------
From: John to Lui
John is going From: HK to US
From: John to the VC:  "You are fired !!!"
From: the VC to John:  "Are you nut?"
From: cslui to luics
From:cslui to luics
From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk


In [7]:
# using startswith() in string vs. re.search() in re

handle = open('mbox-short.txt')
for line in handle:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)

handle.close()
print('----------------------')

import re

handle = open('mbox-short.txt')
for line in handle:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)

handle.close()

From: John to Lui
From: John to the VC:  "You are fired !!!"
From: the VC to John:  "Are you nut?"
From: cslui to luics
From:cslui to luics
From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk
----------------------
From: John to Lui
From: John to the VC:  "You are fired !!!"
From: the VC to John:  "Are you nut?"
From: cslui to luics
From:cslui to luics
From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk


In [8]:
# The "." character matches any character, and "*" repeats the preceding character zero or more times.
# "\S" matches any non-whitespace character.
# Let's illustrate

pattern1 = '^X.*:'     # starts with 'X', then zero or more characters, then ':'
pattern2 = r'^X-\S+:'  # starts with 'X-', then one or more non-whitespace characters, then ':'

s1 = 'X-Sieve: CMU Sieve 2.3'
s2 = 'X-DSPAM-Result: Innocent'
s3 = 'X-Plane is behind schedule: two weeks'
my_list = [s1, s2, s3]

import re

for my_string in my_list:
    if re.search(pattern1, my_string):
        print("Found pattern 1 '" + pattern1 + "', my_string: " + my_string)
 
    if re.search(pattern2, my_string):
        print("Found pattern 2 '" + pattern2 + "', my_string: " + my_string)
    print('-------------')


Found pattern 1 '^X.*:', my_string: X-Sieve: CMU Sieve 2.3
Found pattern 2 '^X-\S+:', my_string: X-Sieve: CMU Sieve 2.3
-------------
Found pattern 1 '^X.*:', my_string: X-DSPAM-Result: Innocent
Found pattern 2 '^X-\S+:', my_string: X-DSPAM-Result: Innocent
-------------
Found pattern 1 '^X.*:', my_string: X-Plane is behind schedule: two weeks
-------------


In [9]:
# Note that re.search() only tells us whether the string matches the re (it returns a Match object or None).
# If we want the matching strings to be EXTRACTED, we use re.findall()

import re
x = 'My 2 favorite numbers are 19 and 42'
y = re.findall('[0-9]+', x)   # find all runs of one or more digits
print('y = ', y)

y = re.findall('[AEIOU]+', x)   # x has no uppercase vowels, so the list is empty
print('y = ', y)
y = re.findall('[AEIOU]+', 'ABC ddkfj xAA')   # find all runs of A, E, I, O, or U
print('y = ', y)

y =  ['2', '19', '42']
y =  []
y =  ['A', 'AA']


In [10]:
# The repeat characters "*" and "+" push outward in both directions (greedy-fashion) 
# to match the largest possible string.
# Let's illustrate

import re
x = 'From: Using the : characters'     
pattern1 = '^F.+:'

y = re.findall(pattern1, x)   # do substring search in a greedy fashion
print('y = ', y)

y =  ['From: Using the :']


In [11]:
# If you don't want the greedy mode, add a "?" after the repeat character to make it non-greedy

import re
x = 'From: Using the : characters'     
pattern1 = '^F.+?:'

y = re.findall(pattern1, x)   # do substring search in a non-greedy fashion
print('y = ', y)

y =  ['From:']


## Fine tuning string extraction

We can refine the match for *re.findall()* and separately determine which portion of the match is to be extracted by using parentheses.

In [12]:
import re

x = "From stephen.marquard@uct.ac.za Sat Jan 5 09:15:15 2008"
pattern1 = r'\S+@\S+'      # one or more non-whitespace characters, then "@", then one or more non-whitespace characters
y = re.findall(pattern1, x)
print('y = ', y)

pattern2 = r'^From.*? (\S+@\S+)'  # note the use of "(" and ")": we only want to extract that part
z = re.findall(pattern2, x)
print('z = ', z)

y =  ['stephen.marquard@uct.ac.za']
z =  ['stephen.marquard@uct.ac.za']
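
When a pattern contains more than one pair of parentheses, *re.findall()* returns a list of tuples, one item per group. The sketch below splits the address from the cell above into user and host; the variable names are chosen only for illustration.

```python
import re

x = "From stephen.marquard@uct.ac.za Sat Jan 5 09:15:15 2008"

# Two groups: the text before '@' and the text after it
pairs = re.findall(r'(\S+)@(\S+)', x)

# The same groups are available on a Match object via group(1), group(2)
m = re.search(r'(\S+)@(\S+)', x)
user, host = m.group(1), m.group(2)

print(pairs)        # [('stephen.marquard', 'uct.ac.za')]
print(user, host)   # stephen.marquard uct.ac.za
```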


In [13]:

# Given an email address, we want to find the hostname.
# For the following example, we want to find 'cse.cuhk.edu.hk'

x = "From idiotic.professor@cse.cuhk.edu.hk Sat Jan 5 09:15:15 2008"

atpos = x.find('@')   # use string's method to find the position of the first '@'
sppos = x.find(' ', atpos)   # find the index of the first space after atpos

print('atpos = ', atpos, '; sppos = ', sppos)
hostname  = x[atpos+1:sppos]
print('hostname is: ', hostname)

atpos =  22 ; sppos =  38
hostname is:  cse.cuhk.edu.hk


In [14]:
# Sometimes we split a line one way, and then grab one of the pieces of the line 
# and split that piece again

x = "From idiotic.professor@cse.cuhk.edu.hk Sat Jan 5 09:15:15 2008"

words = x.split()      # split the line into a list of words
email = words[1]       # the second word is the email address
pieces = email.split('@')   # split into username and hostname
print('hostname is: ', pieces[1])


hostname is:  cse.cuhk.edu.hk


In [15]:
# in re module, we can do the following
import re

x = "From idiotic.professor@cse.cuhk.edu.hk Sat Jan 5 09:15:15 2008"

# The pattern starts with '@'; '[^ ]' matches a non-blank character
# and '*' repeats it zero or more times. Without parentheses the match
# includes the '@'; use '@([^ ]*)' to extract only the hostname.
pattern = '@[^ ]*'

hostname = re.findall(pattern, x)
print('hostname is: ', hostname)

hostname is:  ['@cse.cuhk.edu.hk']
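
The result above still carries the leading '@'. A minimal sketch of dropping it by wrapping the part we want in parentheses, so that *re.findall()* returns only the group:

```python
import re

x = "From idiotic.professor@cse.cuhk.edu.hk Sat Jan 5 09:15:15 2008"

# Only the text inside "(" and ")" is returned
hostnames = re.findall(r'@([^ ]*)', x)

# A stricter variant that also requires the line to start with 'From '
anchored = re.findall(r'^From .*@([^ ]*)', x)

print('hostname is: ', hostnames[0])   # hostname is:  cse.cuhk.edu.hk
print(anchored)                        # ['cse.cuhk.edu.hk']
```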


# New Lecture

In [16]:
import re

def test_patterns(text, patterns):
    """
    Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Look for each pattern in the text and print the results
    for pattern, desc in patterns:
        print("pattern and its description: '{}' ({})\n".format(pattern, desc))
        print("text is: '{}'\n".format(text))
        for match in re.finditer(pattern, text):
            s = match.start()    # found beginning index
            e = match.end()      # found ending index
            print ('pattern:', pattern, 'found in starting index=', s, ";", 'ending index=', e)
        print()
    return


test_patterns('abbaaabbbbaaaaabaaaabbbbbbbabbbb',
               [('ab', "'a' followed by 'b'"),])

pattern and its description: 'ab' ('a' followed by 'b')

text is: 'abbaaabbbbaaaaabaaaabbbbbbbabbbb'

pattern: ab found in starting index= 0 ; ending index= 2
pattern: ab found in starting index= 5 ; ending index= 7
pattern: ab found in starting index= 14 ; ending index= 16
pattern: ab found in starting index= 19 ; ending index= 21
pattern: ab found in starting index= 27 ; ending index= 29



In [17]:
# Let's try some regular expression syntax like *, +, ?, {}
test_patterns(
    'abbaabbba',
    [('ab*', 'a followed by zero or more b'),
     ('ab+', 'a followed by one or more b'),
     ('ab?', 'a followed by zero or one b'),
     ('ab{3}', 'a followed by three b'),
     ('ab{2,3}', 'a followed by two to three b')],
)

pattern and its description: 'ab*' (a followed by zero or more b)

text is: 'abbaabbba'

pattern: ab* found in starting index= 0 ; ending index= 3
pattern: ab* found in starting index= 3 ; ending index= 4
pattern: ab* found in starting index= 4 ; ending index= 8
pattern: ab* found in starting index= 8 ; ending index= 9

pattern and its description: 'ab+' (a followed by one or more b)

text is: 'abbaabbba'

pattern: ab+ found in starting index= 0 ; ending index= 3
pattern: ab+ found in starting index= 4 ; ending index= 8

pattern and its description: 'ab?' (a followed by zero or one b)

text is: 'abbaabbba'

pattern: ab? found in starting index= 0 ; ending index= 2
pattern: ab? found in starting index= 3 ; ending index= 4
pattern: ab? found in starting index= 4 ; ending index= 6
pattern: ab? found in starting index= 8 ; ending index= 9

pattern and its description: 'ab{3}' (a followed by three b)

text is: 'abbaabbba'

pattern: ab{3} found in starting index= 4 ; ending index= 8

pattern

## Greedy and the non-greedy mode of searching

When processing a repetition instruction, *re* consumes as much of the input as possible while matching the pattern. This so-called **greedy** behavior may result in fewer individual matches, or the matches may include more of the input text than intended. How can we **turn off** greedy behavior? We can achieve this by following the repetition instruction with *?*.  Let's illustrate.
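
As a compact sketch of the difference, using *re.findall()* directly instead of the *test_patterns()* helper; the tag-like sample string is an assumption added for illustration.

```python
import re

text = 'abbaabbba'

# Greedy: take as many b's as allowed (up to three)
greedy = re.findall(r'ab{2,3}', text)

# Non-greedy: stop as soon as the minimum (two b's) is satisfied
lazy = re.findall(r'ab{2,3}?', text)

# Hypothetical tag-like input: greedy '.*' swallows everything up to the last '>'
tags_greedy = re.findall(r'<.*>', '<b>bold</b>')
tags_lazy = re.findall(r'<.*?>', '<b>bold</b>')

print(greedy)        # ['abb', 'abbb']
print(lazy)          # ['abb', 'abb']
print(tags_greedy)   # ['<b>bold</b>']
print(tags_lazy)     # ['<b>', '</b>']
```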

In [18]:

test_patterns(
    'abbaabbba',
    [('ab*?', 'a followed by zero or more b'),
     ('ab+?', 'a followed by one or more b'),
     ('ab??', 'a followed by zero or one b'),
     ('ab{3}?', 'a followed by three b'),
     ('ab{2,3}?', 'a followed by two to three b')],
)

pattern and its description: 'ab*?' (a followed by zero or more b)

text is: 'abbaabbba'

pattern: ab*? found in starting index= 0 ; ending index= 1
pattern: ab*? found in starting index= 3 ; ending index= 4
pattern: ab*? found in starting index= 4 ; ending index= 5
pattern: ab*? found in starting index= 8 ; ending index= 9

pattern and its description: 'ab+?' (a followed by one or more b)

text is: 'abbaabbba'

pattern: ab+? found in starting index= 0 ; ending index= 2
pattern: ab+? found in starting index= 4 ; ending index= 6

pattern and its description: 'ab??' (a followed by zero or one b)

text is: 'abbaabbba'

pattern: ab?? found in starting index= 0 ; ending index= 1
pattern: ab?? found in starting index= 3 ; ending index= 4
pattern: ab?? found in starting index= 4 ; ending index= 5
pattern: ab?? found in starting index= 8 ; ending index= 9

pattern and its description: 'ab{3}?' (a followed by three b)

text is: 'abbaabbba'

pattern: ab{3}? found in starting index= 4 ; ending in

## Character sets

A *character set* is a group of characters, any one of which can match at that point in the pattern. For example, *[ab]* would match either *a* or *b*.  Let's illustrate.

In [None]:
test_patterns(
    'abbaabbba',
    [('[ab]', 'either a or b'),
     ('a[ab]+', 'a followed by 1 or more a or b'),
     ('a[ab]+?', 'a followed by 1 or more a or b, not greedy')],
)

## Character set as exclusion

A character set can also be used to exclude specific characters. The caret (*^*) at the start of the set means to look for characters that are not in the set following the caret.  Let's illustrate.

In [None]:
# This pattern finds all of the substrings that do not contain 
# the characters -, ., or a space.

test_patterns(
    'This is some text -- with punctuation.',
    [('[^-. ]+', 'sequences without -, ., or space')],
)
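
The same exclusion idea as a short sketch with *re.findall()*, which makes the result easy to inspect as a list; the word 'regex' is just a sample input.

```python
import re

text = 'This is some text -- with punctuation.'

# Runs of characters that are not '-', '.', or a space
words = re.findall(r'[^-. ]+', text)

# A set and its exclusion on the same input
vowels = re.findall(r'[aeiou]', 'regex')
non_vowels = re.findall(r'[^aeiou]', 'regex')

print(words)        # ['This', 'is', 'some', 'text', 'with', 'punctuation']
print(vowels)       # ['e', 'e']
print(non_vowels)   # ['r', 'g', 'x']
```

Note that the caret only means exclusion when it is the first character inside the brackets; outside a set it is the start-of-line anchor.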

## Character ranges

As character sets grow larger, typing every character that should (or should not) match becomes tedious. A more compact format using character ranges can be used to define a character set to include all of the contiguous characters between the specified start and stop points.

In [None]:
test_patterns(
    'This is some text -- with punctuation.',
    [('[a-z]+', 'sequences of lowercase letters'),
     ('[A-Z]+', 'sequences of uppercase letters'),
     ('[a-zA-Z]+', 'sequences of letters of either case'),
     ('[A-Z][a-z]+', 'one uppercase followed by lowercase')],
)
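
For reference, a minimal sketch of what three of these range patterns find in the sample sentence, collected with *re.findall()*:

```python
import re

text = 'This is some text -- with punctuation.'

lowercase = re.findall(r'[a-z]+', text)          # the capital 'T' is not in a-z
uppercase = re.findall(r'[A-Z]+', text)
capitalized = re.findall(r'[A-Z][a-z]+', text)   # one uppercase, then lowercase

print(lowercase)     # ['his', 'is', 'some', 'text', 'with', 'punctuation']
print(uppercase)     # ['T']
print(capitalized)   # ['This']
```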

In [None]:
# As a special case of a character set, the meta-character dot, 
# or period (.), indicates that the pattern should 
# match any single character in that position.

test_patterns(
    'abbaabbba',
    [('a.', 'a followed by any one character'),
     ('b.', 'b followed by any one character'),
     ('a.*b', 'a followed by anything, ending in b'),
     ('a.*?b', 'a followed by anything, ending in b, not greedy')],
)
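
One detail about the dot: by default it matches any character except a newline. The sketch below shows this with the *re.DOTALL* flag, which is not used elsewhere in this notebook and is included here only as an illustration.

```python
import re

text = 'abc a\nc'   # the second candidate has a newline between 'a' and 'c'

# By default '.' does not match '\n'
default_dot = re.findall(r'a.c', text)

# With re.DOTALL, '.' matches the newline as well
dotall_dot = re.findall(r'a.c', text, flags=re.DOTALL)

print(default_dot)   # ['abc']
print(dotall_dot)    # ['abc', 'a\nc']
```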

## Escape codes

A more compact representation uses escape codes for several predefined character sets. The escape codes recognized by re are listed in the table below.

**Regular Expression Escape Codes**<br>
Code	&nbsp; &nbsp; &nbsp; Meaning<br>
\d	    &nbsp; &nbsp; &nbsp; a digit<br>
\D	    &nbsp; &nbsp; &nbsp; a non-digit<br>
\s	    &nbsp; &nbsp; &nbsp; whitespace (tab, space, newline, etc.)<br>
\S	    &nbsp; &nbsp; &nbsp; non-whitespace<br>
\w	    &nbsp; &nbsp; &nbsp; alphanumeric (letters, digits, and underscore)<br>
\W	    &nbsp; &nbsp; &nbsp; non-alphanumeric<br>

In [None]:
test_patterns(
    'A prime #1 example!',
    [(r'\d+', 'sequence of digits'),
     (r'\D+', 'sequence of non-digits'),
     (r'\s+', 'sequence of whitespace'),
     (r'\S+', 'sequence of non-whitespace'),
     (r'\w+', 'alphanumeric characters'),
     (r'\W+', 'non-alphanumeric')],
)
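
A short sketch of a few of these escape codes with *re.findall()*, on the same sample text; the string 'snake_case id' is an extra sample added to show that *\w* also accepts the underscore.

```python
import re

text = 'A prime #1 example!'

digits = re.findall(r'\d+', text)      # runs of digits
words = re.findall(r'\w+', text)       # runs of letters, digits, underscore
non_space = re.findall(r'\S+', text)   # runs of non-whitespace

# '\w' treats the underscore as a word character
identifiers = re.findall(r'\w+', 'snake_case id')

print(digits)        # ['1']
print(words)         # ['A', 'prime', '1', 'example']
print(non_space)     # ['A', 'prime', '#1', 'example!']
print(identifiers)   # ['snake_case', 'id']
```

The patterns are written as raw strings (*r'...'*) so that Python passes the backslash through to *re* unchanged.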

## Anchoring

In addition to describing the content of a pattern to match, anchoring instructions can specify the relative location in the input text where the pattern should appear. The table below lists valid anchoring codes.

**Code**	&nbsp; &nbsp; &nbsp; **Meaning** <br>
^	&nbsp; &nbsp; &nbsp; start of string, or line <br>
$	&nbsp; &nbsp; &nbsp; end of string, or line <br>
\A	&nbsp; &nbsp; &nbsp; start of string <br>
\Z	&nbsp; &nbsp; &nbsp; end of string <br>
\b	&nbsp; &nbsp; &nbsp; empty string at the beginning or end of a word <br>
\B	&nbsp; &nbsp; &nbsp; empty string not at the beginning or end of a word <br>

In [None]:
test_patterns(
    'This is some text -- with punctuation.',
    [(r'^\w+', 'word at start of string'),
     (r'\A\w+', 'word at start of string'),
     (r'\w+\S*$', 'word near end of string'),
     (r'\w+\S*\Z', 'word near end of string'),
     (r'\w*t\w*', 'word containing t'),
     (r'\bt\w+', 't at start of word'),
     (r'\w+t\b', 't at end of word'),
     (r'\Bt\B', 't, not start or end of word')],
)
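
The "or line" part of *^* and *$* applies only when the *re.MULTILINE* flag is set; without it they anchor to the whole string. A minimal sketch, where the two-line sample text is an assumption for illustration:

```python
import re

text = 'first line\nsecond line'   # hypothetical two-line input

# Without a flag, '^' and '$' refer to the whole string
default_start = re.findall(r'^\w+', text)
default_end = re.findall(r'\w+$', text)

# With re.MULTILINE they also match at the start and end of each line
multiline_start = re.findall(r'^\w+', text, flags=re.MULTILINE)
multiline_end = re.findall(r'\w+$', text, flags=re.MULTILINE)

# '\b' marks a word boundary: only words that start with 't'
t_words = re.findall(r'\bt\w+', 'This is some text -- with punctuation.')

print(default_start, default_end)       # ['first'] ['line']
print(multiline_start, multiline_end)   # ['first', 'second'] ['line', 'line']
print(t_words)                          # ['text']
```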


## Constraining the search

In situations where it is known in advance that only a subset of the full input should be searched, the regular expression match can be further constrained by telling re to limit the search range. For example, if the pattern must appear at the front of the input, then using *match()* instead of *search()* will anchor the search without having to explicitly include an anchor in the search pattern.


In [None]:
import re

text = 'This is some text -- with punctuation.'
pattern = 'is'

print('Text   :', text)
print('Pattern:', pattern)

m = re.match(pattern, text)
print('Match  :', m)
s = re.search(pattern, text)
print('Search :', s)
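
A minimal sketch of the other ways to constrain a search: *re.fullmatch()* requires the whole string to match, and the methods of a compiled pattern accept an optional starting position. The variable names are chosen only for illustration.

```python
import re

text = 'This is some text -- with punctuation.'

m = re.match('is', text)          # None: the text does not start with 'is'
s = re.search('is', text)         # first 'is' anywhere (inside 'This')

whole = re.fullmatch('is', 'is')  # the entire string must match
partial = re.fullmatch('is', text)

compiled = re.compile('is')
at_two = compiled.match(text, 2)  # match() anchored at index 2
later = compiled.search(text, 3)  # search() starting from index 3

print(m, s.span())                  # None (2, 4)
print(whole is not None, partial)   # True None
print(at_two.span(), later.span())  # (2, 4) (5, 7)
```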