Write Regular Expressions
The simplest regular expression is nothing more than a string of characters (for simplicity, we’ll just call this a “string”). XYZ is a regular expression that will match the string “XYZ”.
Just matching XYZ isn’t usually very useful, though. If you’re looking for “XYZ”, you might also need to find “Xyz” or “xyz”. You can do that several different ways. In regular expressions, there are almost always different ways to match the same string. The secret to these different approaches — and one of the secrets to regular expressions in general — is metacharacters.
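As a rough illustration of "several different ways", here is a minimal sketch that assumes Python's built-in re module. The page does not name a specific regex engine, so both the engine and the ignore-case flag shown here are assumptions, and other engines may behave differently.

```python
import re

candidates = ["XYZ", "Xyz", "xyz"]

# A plain string of characters matches only itself, capitalization included.
literal = re.compile(r"XYZ")
literal_hits = [s for s in candidates if literal.fullmatch(s)]

# Two of several ways to accept other capitalizations:
# one character class per letter, or the engine's ignore-case flag.
by_class = re.compile(r"[Xx][Yy][Zz]")
by_flag = re.compile(r"XYZ", re.IGNORECASE)
class_hits = [s for s in candidates if by_class.fullmatch(s)]
flag_hits = [s for s in candidates if by_flag.fullmatch(s)]

print(literal_hits)  # ['XYZ']
print(class_hits)    # ['XYZ', 'Xyz', 'xyz']
print(flag_hits)     # ['XYZ', 'Xyz', 'xyz']
```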
Metacharacters
Metacharacters in regex are not characters in the string you’re trying to match; they are signals to the regular expression engine to do something special. For instance, placing a string in brackets like this [XYZ] changes your regex not because you’re searching for “[” and “]”, but because the brackets are signals to perform the search in a particular way. It’s good to be aware of regex metacharacters because if you include one without realizing it, your regex will probably still do something, just not what you expect. Brackets are not the only metacharacters, but there are not many. Here is the whole list:
Asterisk *
The asterisk means “zero or more matches” of the character or group just before it. That is, the regex xy matches the string “xy”, while the regex xy* matches “x”, “xy”, “xyy”, and so on.
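The asterisk behavior can be checked with a short sketch, assuming Python's re module as the engine (an assumption, since the page does not name one). The fullmatch call requires the whole string to fit the pattern.

```python
import re

star = re.compile(r"xy*")

# The asterisk applies to the "y" only: zero or more "y" after one "x".
star_results = {s: bool(star.fullmatch(s)) for s in ["x", "xy", "xyy", "y"]}

print(star_results)  # {'x': True, 'xy': True, 'xyy': True, 'y': False}
```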
Backslash \
One metacharacter, the backslash ( \ ), is especially important because it changes the meaning of whatever character comes next. For instance, the metacharacter “.” is a wildcard; it stands for any character. But sometimes you need to search for the actual character “.”. To do that, you put a backslash in front of the dot, and the backslash changes the meaning of the dot from “a metacharacter meaning match any character” to “just the dot character.”
When the backslash precedes certain characters it changes their meaning so they become, in effect, additional metacharacters. For example, \d means “any digit”, so it’s a different way to specify the class [0-9].
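A minimal sketch of both backslash roles, assuming Python's re module (the sample text is made up for illustration). One caveat specific to this assumed engine: on Python 3 text strings, \d also accepts digits from other scripts unless the re.ASCII flag is set, so it is only equivalent to [0-9] for ASCII input like the sample here.

```python
import re

text = "v1.2 or v1x2"

# Unescaped dot: a wildcard, so "." and "x" both fit.
wildcard_hits = re.findall(r"1.2", text)

# Escaped dot: only a literal period fits.
escaped_hits = re.findall(r"1\.2", text)

# \d as shorthand for a digit, compared with the explicit class.
shorthand_hits = re.findall(r"\d", text)
class_hits = re.findall(r"[0-9]", text)

print(wildcard_hits)   # ['1.2', '1x2']
print(escaped_hits)    # ['1.2']
print(shorthand_hits)  # ['1', '2', '1', '2']
```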
Bar |
The vertical bar means “or”. You can use it to combine regular expressions so you have, in effect, one longer and more complex regex. For example, [a-z]|[A-Z] matches any lower-case character or any upper-case character (there are, of course, other ways to obtain that same result). Write the bar without spaces around it, because a space in a regex is itself a character to match.
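The sketch below, which assumes Python's re module, shows the bar combining two classes. It also shows that spaces placed around the bar become part of the pattern, which is an easy mistake to make when copying an example.

```python
import re

sample = "aB3 z"

# Either a lower-case letter or an upper-case letter.
letter_hits = re.findall(r"[a-z]|[A-Z]", sample)

# With spaces around the bar, the alternatives become
# "lower-case letter then space" or "space then upper-case letter".
spaced_hits = re.findall(r"[a-z] | [A-Z]", sample)

print(letter_hits)  # ['a', 'B', 'z']
print(spaced_hits)  # []
```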
Brackets [ ]
When you enclose some characters in brackets, matching works a bit differently. Characters inside brackets are a “character class”. This means they are treated as a set or group and the search looks for any one of them. That is, [XYZ] will find three matches in “XYZ” — the “X”, the “Y”, and the “Z”. That same regex will also match “X” or “Z”, and will have two matches in “XZ” and “ZY”.
You can use brackets and “-” to designate a whole range of characters. For example, [A-Z] matches all upper-case letters, and [0-9] matches all digits.
Most of the other metacharacters lose their special meanings when you enclose them in brackets. For example, while the plus is usually a regex metacharacter, [+] simply matches a plus character.
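The bracket rules so far (sets, ranges, and metacharacters losing their meaning) can be sketched as follows, assuming Python's re module and made-up sample strings.

```python
import re

# A character class matches any one of its members at a time.
three_hits = re.findall(r"[XYZ]", "XYZ")
two_hits = re.findall(r"[XYZ]", "XZ")

# Ranges written with "-".
upper_hits = re.findall(r"[A-Z]", "Ab1C")
digit_hits = re.findall(r"[0-9]", "Ab1C")

# Inside brackets the plus is just a plus character.
plus_hits = re.findall(r"[+]", "1+1=2")

print(three_hits)  # ['X', 'Y', 'Z']
print(two_hits)    # ['X', 'Z']
print(upper_hits)  # ['A', 'C']
print(digit_hits)  # ['1']
print(plus_hits)   # ['+']
```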
The caret has a special meaning inside brackets, but only when it’s the first character. When the caret is the first character inside brackets it means “find the complement of the following set”, that is, any character not in the set. For example, the regex [A-Z] matches any upper-case letter, but [^A-Z] matches any character that is not an upper-case letter, including lower-case letters, digits, spaces, and punctuation.
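A short sketch of the complement, assuming Python's re module. Note that the negated class picks up every character outside the set, not only letters.

```python
import re

sample = "Ab1-C"

upper_hits = re.findall(r"[A-Z]", sample)
not_upper_hits = re.findall(r"[^A-Z]", sample)

# When the caret is not the first character, it is an ordinary member of the set.
caret_member_hits = re.findall(r"[A-Z^]", "A^b")

print(upper_hits)         # ['A', 'C']
print(not_upper_hits)     # ['b', '1', '-']
print(caret_member_hits)  # ['A', '^']
```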
The backslash character also has a special meaning inside brackets; it changes the meaning of the character following it. If you need to match a backslash character, you change its meaning with another backslash. Because the backslash keeps its special meaning even inside brackets, these two regular expressions are the same: \\ and [\\]. They each match the string “\”.
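To see the doubled backslash in practice, here is a sketch assuming Python's re module. The sample path is hypothetical. Python raw strings (the r"..." form) are used so that the regex engine receives the two backslashes unchanged.

```python
import re

# A Python string holding "C:\temp" with one actual backslash.
path = "C:\\temp"

# An escaped backslash, outside and inside brackets.
bare_hits = re.findall(r"\\", path)
in_class_hits = re.findall(r"[\\]", path)

print(len(path))                   # 7
print(bare_hits == in_class_hits)  # True
print(len(bare_hits))              # 1
```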
Braces { }
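Braces (or “curly braces”) are used to enclose the specific number of matches you want. The count applies to the single character or group just before the braces, so xyz{3} matches “xyzzz” (three repetitions of “z”), while (xyz){3}, which uses the parentheses described below, matches exactly 3 repetitions of “xyz”, as in “xyzxyzxyz”. You can also include a comma, as in (xyz){3,}, to indicate at least that many repetitions (and any number more), or a comma and a second number, as in (xyz){3,5}, to indicate at least 3 and no more than 5 repetitions.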
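The following sketch, assuming Python's re module, checks how a brace count binds to the item just before it and how the comma forms set lower and upper bounds. The variable names are only for illustration.

```python
import re

# Without a group, the count applies to the "z" alone.
z_only = bool(re.fullmatch(r"xyz{3}", "xyzzz"))
not_three_xyz = bool(re.fullmatch(r"xyz{3}", "xyzxyzxyz"))

# With a group, the count applies to the whole "xyz".
exactly_three = bool(re.fullmatch(r"(xyz){3}", "xyz" * 3))

# {3,} means at least 3; {3,5} means between 3 and 5.
at_least = [bool(re.fullmatch(r"(xyz){3,}", "xyz" * n)) for n in (2, 3, 4)]
bounded = [bool(re.fullmatch(r"(xyz){3,5}", "xyz" * n)) for n in (2, 3, 5, 6)]

print(z_only, not_three_xyz, exactly_three)  # True False True
print(at_least)                              # [False, True, True]
print(bounded)                               # [False, True, True, False]
```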
Caret ^
The caret — when it’s not enclosed by brackets — means “at the start of a string”. This is not recommended for use with Cloudlock regex policies.
Dollar $
The dollar sign means “at the end of a string”. This is not recommended for use with Cloudlock regex policies.
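For reference, the two anchors behave as follows in a general-purpose engine. This sketch assumes Python's re module and is only meant to show what the caret and dollar do; it is not a suggestion to use them in a policy.

```python
import re

# Caret: the pattern must begin at the start of the string.
starts_with = bool(re.search(r"^xyz", "xyz123"))
not_at_start = bool(re.search(r"^xyz", "123xyz"))

# Dollar: the pattern must finish at the end of the string.
ends_with = bool(re.search(r"xyz$", "123xyz"))
not_at_end = bool(re.search(r"xyz$", "xyz123"))

print(starts_with, not_at_start)  # True False
print(ends_with, not_at_end)      # True False
```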
Dot .
We’ve already given away what the dot does; it matches any character (technically any character except a newline, the usually-invisible character created by the Enter key on your keyboard).
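A brief sketch of the dot and its newline exception, assuming Python's re module. The re.DOTALL flag is specific to that assumed engine and is shown only to make the exception visible.

```python
import re

sample = "xyz x-z x\nz"

# By default the dot matches any character except a newline.
default_hits = re.findall(r"x.z", sample)

# With Python's DOTALL flag the dot matches a newline as well.
dotall_hits = re.findall(r"x.z", sample, re.DOTALL)

print(default_hits)  # ['xyz', 'x-z']
print(dotall_hits)   # ['xyz', 'x-z', 'x\nz']
```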
Parentheses ( )
Enclosing characters in parentheses makes them a “group”. You can use metacharacters with a group as if the group is a single character. For example, (xyz)+ matches one or more occurrences of “xyz”, such as “xyzxyz”.
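The difference a group makes can be sketched like this, assuming Python's re module: with parentheses the plus repeats all of “xyz”, and without them it repeats only the “z”.

```python
import re

# Grouped: one or more repetitions of the whole "xyz".
grouped_repeat = bool(re.fullmatch(r"(xyz)+", "xyzxyz"))

# Ungrouped: the plus applies to "z" only.
ungrouped_repeat = bool(re.fullmatch(r"xyz+", "xyzxyz"))
ungrouped_z_run = bool(re.fullmatch(r"xyz+", "xyzzz"))

print(grouped_repeat)    # True
print(ungrouped_repeat)  # False
print(ungrouped_z_run)   # True
```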
Plus +
The plus means “one or more matches”. The regex xy+ matches “xy”, “xyy”, “xyyy” and so on, but not “x”.
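A quick check of the plus, assuming Python's re module as the engine:

```python
import re

plus = re.compile(r"xy+")

# At least one "y" is required after the "x".
plus_results = {s: bool(plus.fullmatch(s)) for s in ["x", "xy", "xyy", "xyyy"]}

print(plus_results)  # {'x': False, 'xy': True, 'xyy': True, 'xyyy': True}
```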
Question mark ?
The question mark means “zero or one matches”. The regex xy? matches “x” and “xy”, but not “xyy”.
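The sketch below assumes Python's re module. It also illustrates a point worth keeping in mind: “but not xyy” holds when the whole string must match, whereas a search inside a longer string can still find the “xy” at the start of “xyy”.

```python
import re

optional = re.compile(r"xy?")

# Whole-string matching: zero or one "y" after the "x".
whole_results = {s: bool(optional.fullmatch(s)) for s in ["x", "xy", "xyy"]}

# Searching inside a string finds the leading "xy" of "xyy".
found_inside = optional.search("xyy").group()

print(whole_results)  # {'x': True, 'xy': True, 'xyy': False}
print(found_inside)   # xy
```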
Predictability in Patterns
Some kinds of information are easier to “match” than others. This is usually a matter of predictability. A US zip code, for example, always includes 5 digits, and sometimes includes an additional four, with a hyphen in between the two sets. That consistency makes it predictable. A “date of birth” can be a different matter because it can be expressed several different ways. “January third, 1980”, “1/3/80”, and “1980-01-03” can all mean the same birthdate. Successful use of regular expression matching depends on
- anticipating the predictability of the information you need to match, and
- finding a way to match it every time without matching other information that’s not of interest.
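The zip code case described above can be sketched with the metacharacters from this page: five digits, optionally followed by a hyphen and four more digits. This is a minimal illustration that assumes Python's re module and made-up sample values; it checks only the shape of the text, not whether the code is a real zip code.

```python
import re

# Five digits, then an optional group of a hyphen plus four digits.
zip_code = re.compile(r"\d{5}(-\d{4})?")

samples = ["02139", "02139-4307", "2139", "02139-43", "021394307"]
zip_results = {s: bool(zip_code.fullmatch(s)) for s in samples}

for sample, ok in zip_results.items():
    print(sample, ok)
```

A date of birth would need several alternative patterns to cover forms such as “1/3/80” and “1980-01-03”, which is why it is the harder case.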