Regex Concepts

To escape characters you need to use \ escape characters, for eg if you want to search for a “?” or “.” or “", use backslash \., \?, \\. If you want to search for an email address eg: name@example.com, use: name@example\.com

. -> matches any character except newline character

\d matches digits (anything from 0-9)

\D matches anything BUT a digit

\w searches for any alphanumeric character (a-z,A-Z,0-9)

\W matches anything that is not an alphanumeric character

\s matches whitespaces (space, tab, newline)

\S matches anthing that is not a whitespace

Anchors - dont match characters but special positions

special posiotions like starting of the line, ending of the line etc

\b matches word at boundries

\B matches word that are not word boundries

for eg: Ha HaHa, \bHa matches Ha HaHa but not Ha HaHa. \bHa\b matches word boundries both at the end and at the begining: Ha HaHa only matches where Ha\b matches Ha HaHa

^ matches start of the line

^Ha matches only Ha bro HaHa HaHa, because the line starts with an Ha

$ matches end of line

$Ha only matches Ha bro HaHa HaHa beacause it is at the end of the line

character sets

[] -> matches character set inside the brackets.(NO SPACES) eg [-*&a1] etc. Matches only the first character in the sequence.

- has a special meaning in character sets []. when at the start, it matches for the character ‘-’, but when in between it is used to specify a range. eg to match numbers from 1 to 7: [1-7], or matching a-d: [a-d]

caret ^ also has special meaning and stands for “everything except”. For eg to match al characters that are not a lower case letter: [^a-z]

Quantifiers: matching more than one character at a time

    • matches 0 or more instances
    • matches 1 or more
  • ? matches 0 or one instances
  • {} matches exact numbers. eg a{3} -> a three times works same for \d \s \w stuffs
  • {num1, num2} matches an range of numbers and follows {min, max} range. eg a{1,3} one to three repetitions of a. works same for \d \s \w stuffs

Groups: allows us to specify different matches

groups are defined using parenthesis ()

for eg: to match Mr, Ms, Mrs:

M(r|s|rs)

pipe operator (|) is used to for “OR” in groups

Back references

the values in groups are stored in something called back groups.

in vscode you can call them with $grpnum (eg $2) to replace them in replace mode in “find”

however usually it is \num eg \2 for group 2

Examples

  • matching phone numbers

321-555-4321

123.555.1234

soln -> \d\d\d\W\d\d\d\W\d\d\d\d, dig dig dig non-whitespace dig…..

  • character matching

    cat

    mat

    pat

    bat

    match everything except bat

    [^b]at

In Python

remember to convert the string containing regular expressions to raw form: eg r"hello". This is done so that python does not take expressions with a leading backslash and escape sequesces of it’s own.

import re

urls = 
'''for groups
match.group(grp_num) eg match.group(3)
'''

# Substitutions
subbed_urls = pattern.sub(r'')