Regex - quick guide
05 April 2024
What is Regex
Regex stands for Regular Expression. It is a string of text that defines a search pattern for matching other text. For example, the regular expression ^Australian.*2023 finds text that starts with "Australian" and contains the number "2023" somewhere after it.
The example begins with an anchor, ^, which anchors "Australian" to the start of the text.
Just after "Australian" is a dot (.), a character class that matches any character.
The dot is followed by a star (*), a quantifier that means 0 or more of the previous character.
Finally, there is "2023", which matches the exact character sequence 2023. Because .* can match any number of characters, the 2023 can be anywhere in the text after "Australian".
When we say character we mean any letter, number or symbol. Anything you can type is a character, including, for example, a space. A pattern such as .* will normally only match up to a new line (\n), because the dot does not match the new line character.
The regular expression ^Australian.*2023 will match the following text (often referred to as strings, or strings of characters):
- Australian Open 2023
- Australian Parliament Committee hearings 2023-2024, Canberra.
- Australian zimmerflex-#202344777662AQ-z
 
It won’t match:
- The Australian Open 2023.
- 2023 Australian of the year.
- Australian of the year 2022.
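The matching and non-matching strings above can be checked with a short script. This is a minimal sketch using Python's built-in re module; the helper name matches and the two list names are illustrative choices, not part of the guide.

```python
import re

# The pattern from the introduction, compiled once for reuse.
PATTERN = re.compile(r"^Australian.*2023")

should_match = [
    "Australian Open 2023",
    "Australian Parliament Committee hearings 2023-2024, Canberra.",
    "Australian zimmerflex-#202344777662AQ-z",
]
should_not_match = [
    "The Australian Open 2023.",
    "2023 Australian of the year.",
    "Australian of the year 2022.",
]


def matches(text: str) -> bool:
    """Return True if the pattern is found in text."""
    return PATTERN.search(text) is not None


for text in should_match + should_not_match:
    print(f"{matches(text)!s:5}  {text}")
```

The first three lines print True and the last three print False.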
 
Regex Components
Anchors
Anchors are used at the beginning and the end of a string or expression.
- ^ marks the beginning of the string or expression.
- $ marks the end of the string or expression.
e.g. ^Fred Flintstone$ would match a string that is only "Fred Flintstone", with nothing before or after.
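As a rough illustration of what the anchors change, the sketch below compares an anchored pattern with the same pattern without anchors, using Python's re module. The longer test string is made up for the example.

```python
import re

anchored = re.compile(r"^Fred Flintstone$")
unanchored = re.compile(r"Fred Flintstone")

exact = "Fred Flintstone"
longer = "Mr Fred Flintstone and family"  # illustrative string, not from the guide

# The anchored pattern only accepts the exact string.
print(bool(anchored.search(exact)))     # True
print(bool(anchored.search(longer)))    # False

# Without anchors the name can appear anywhere in the text.
print(bool(unanchored.search(longer)))  # True
```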
Quantifiers
Quantifiers specify how many of the previous character, character class or group of characters you want:
- * finds 0 or more, e.g. fa*b finds "fb", "fab", "faaaaaaaaaab"
- + finds 1 or more, e.g. fa+b finds "fab", "faaaaaaaaaab"
- ? finds 0 or 1, e.g. fa?b finds "fb", "fab"
- {n} finds exactly n, e.g. a{3} finds "aaa"
- {x,n} finds between x and n (inclusive), e.g. fa{1,3}b finds "fab", "faab", "faaab"
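The quantifier examples can be tried out with a small table of cases. This sketch assumes Python's re module and uses re.fullmatch so that the whole string has to fit the pattern; the strings in the "should not match" column are added here for contrast and are not from the guide.

```python
import re

# (pattern, strings that should match in full, strings that should not)
cases = [
    (r"fa*b", ["fb", "fab", "faaaaaaaaaab"], ["fcb"]),
    (r"fa+b", ["fab", "faaaaaaaaaab"], ["fb"]),
    (r"fa?b", ["fb", "fab"], ["faab"]),
    (r"a{3}", ["aaa"], ["aa", "aaaa"]),
    (r"fa{1,3}b", ["fab", "faab", "faaab"], ["fb", "faaaab"]),
]


def full(pattern: str, text: str) -> bool:
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


for pattern, good, bad in cases:
    for text in good + bad:
        print(f"{pattern:10} {text:14} {full(pattern, text)}")
```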
Grouping
You can group a sequence of characters together for a purpose. Regex has a notion of a capture, where the text matched by the pattern inside a capture group can be reused in a result, for example when you want to find and replace a string. We won’t cover captures in detail here.
Groups are used for matching a sequence a number of times, or for creating a set of sequences that could be matched. A simple set of parentheses around a sequence of characters, e.g. (cat), creates a capture group that can then have a quantifier added. Examples:
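- schrodinger’s( cat)? was here matches "schrodinger’s cat was here" and "schrodinger’s was here". The space sits inside the group so that it is optional along with the word; with the space outside the group, the second string would need two spaces to match.
- the (.at)+ sat on the mat matches "the cat sat on the mat", "the ratcat sat on the mat", "the #atgatpatcatmatfat6atbat sat on the mat"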
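A short sketch of an optional group and a repeated group, assuming Python's re module and a plain apostrophe in the test strings. It also shows why the position of the space matters when a whole word is optional: if the space stays outside the group, the string without the word no longer matches.

```python
import re

# Space inside the group: the word and its leading space are optional together.
optional = re.compile(r"schrodinger's( cat)? was here")
# Space outside the group: a space is required on both sides of the optional word.
space_outside = re.compile(r"schrodinger's (cat)? was here")

print(bool(optional.fullmatch("schrodinger's cat was here")))   # True
print(bool(optional.fullmatch("schrodinger's was here")))       # True
print(bool(space_outside.fullmatch("schrodinger's was here")))  # False

# A repeated group matches each repetition but keeps only the last one captured.
repeated = re.compile(r"the (.at)+ sat on the mat")
m = repeated.fullmatch("the ratcat sat on the mat")
print(m.group(1))  # cat
```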
Groups can be split into OR blocks using the pipe special character |, for example:
- the (cat|dog) was here matches "the cat was here" and "the dog was here"
- the cats? (was|were) here matches "the cat was here", "the cats were here", "the cats was here"
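The OR examples can be sketched in Python as follows. The group(1) call shows which alternative was captured; the final test string, "the cat is here", is an extra added here to show a non-match.

```python
import re

pet = re.compile(r"the (cat|dog) was here")
verb = re.compile(r"the cats? (was|were) here")

# group(1) holds whichever alternative matched inside the parentheses.
found = pet.fullmatch("the dog was here")
print(found.group(1))  # dog

samples = [
    "the cat was here",
    "the cats were here",
    "the cats was here",
    "the cat is here",  # not from the guide: neither alternative fits
]
results = {}
for text in samples:
    m = verb.fullmatch(text)
    results[text] = m.group(1) if m else None
    print(text, "->", results[text])
```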
Sets of Characters
You can define a range or set of characters that could be matched by putting them in square brackets. Unlike a group, which defines a specific sequence of characters, a set means any one of these characters, so [CRM]at matches "Cat", "Rat" and "Mat".
The set of characters can be represented as a range, for example:
- [1-3]0 matches any digit from 1 to 3 followed by 0, so it matches "10", "20", "30"
- [a-z] matches any lower case letter from a to z
- [a-zA-Z] matches any lower or upper case letter from a to z
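A quick way to see sets and ranges at work is re.findall, which returns every match in a string. This is a sketch with made-up sample strings; the + quantifier is added to the letter ranges so that whole runs of letters are returned rather than single characters.

```python
import re

rhymes = re.findall(r"[CRM]at", "Cat Rat Mat Bat")
tens = re.findall(r"[1-3]0", "10 20 30 40")
lower = re.findall(r"[a-z]+", "Hello World")
letters = re.findall(r"[a-zA-Z]+", "Hello World 42")

print(rhymes)   # ['Cat', 'Rat', 'Mat']
print(tens)     # ['10', '20', '30']
print(lower)    # ['ello', 'orld']
print(letters)  # ['Hello', 'World']
```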
You can put a ^ at the beginning of the set to match any character that is not in the set, for example:
- ^[^:]* matches everything from the beginning of the string up to, but not including, the first colon (:), which is useful for search and replace. Here the first ^ is the anchor and the ^ inside the brackets negates the set. If you had the string "2023-05-23 12:52:45" it would match "2023-05-23 12"
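The timestamp example can be reproduced in Python. This sketch first extracts the part before the first colon, then uses the same pattern in a search and replace; the replacement text START is an arbitrary placeholder chosen for the illustration.

```python
import re

timestamp = "2023-05-23 12:52:45"

# Everything from the start of the string up to the first colon.
before_colon = re.match(r"^[^:]*", timestamp).group(0)
print(before_colon)  # 2023-05-23 12

# Search and replace: swap that leading part for a placeholder.
replaced = re.sub(r"^[^:]*", "START", timestamp, count=1)
print(replaced)  # START:52:45
```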
Shorthand Character Classes
There are shortcuts for common sets of characters, called shorthand character classes, which each define a set of characters to match. We met the . class in the introduction, which matches every character.
- . every character except for new lines
- \n the new line
- \t a tab
- \d a digit (0-9), same as [0-9]
- \D NOT a digit (0-9), same as [^0-9]
- \w a word character (latin), same as [a-zA-Z0-9_]
- \W NOT a word character, same as [^a-zA-Z0-9_]
- \s spaces of any kind (space, tab, new line)
- \S NOT a space (space, tab, new line)
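The shorthand classes can be explored with the sketch below, again using Python's re module. One assumption to note: in Python, \d, \w and \s also match non-ASCII digits, letters and whitespace by default, so the re.ASCII flag is passed to get the exact sets listed above. The sample string is made up for the example.

```python
import re

sample = "Room 42,\tfloor_3\nopen"

# re.ASCII restricts \d, \w and \s to the ASCII sets listed above.
digits = re.findall(r"\d", sample, re.ASCII)
words = re.findall(r"\w+", sample, re.ASCII)
spaces = re.findall(r"\s", sample, re.ASCII)

print(digits)  # ['4', '2', '3']
print(words)   # ['Room', '42', 'floor_3', 'open']
print(spaces)  # [' ', '\t', '\n']

# \d with re.ASCII finds the same characters as the explicit set [0-9].
print(digits == re.findall(r"[0-9]", sample))  # True

# The dot does not match a new line unless re.DOTALL is set.
dot_plain = re.fullmatch(r".", "\n")
dot_all = re.fullmatch(r".", "\n", re.DOTALL)
print(dot_plain is None, dot_all is not None)  # True True
```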
There are many more; see https://www.regular-expressions.info/shorthand.html
A complete guide to Regular Expressions can be found at https://www.regular-expressions.info/