Cisco Cloudlock Regular Expressions Guide

Introduction

This guide is a basic introduction to regular expressions (regex) as they are used in Cloudlock policies. It is intended to provide a starting point for using and interpreting regular expressions. There is much more to know about regex, however, and many resources are available.

Cisco Cloudlock uses the Java implementation of regex. There are any number of others, which differ in some ways, but those are beyond the scope of this guide.

Regular Expressions in Cisco Cloudlock

Regular expressions are used in Cisco Cloudlock policies to detect sequences of characters in documents and other objects stored in supported cloud platforms. Credit Card and Social Security Number policies use built-in, proprietary regular expressions (as well as other techniques) to find matching patterns. Custom regex policies can detect any pattern, but you must specify those patterns yourself.
As you use this guide, you can practice by using the regex tester built into the Cloudlock Policy tool:

You enter a regular expression as shown in #1, above, some test text as shown at #2, and any matches are highlighted at the bottom. There are similar regex testers available on the web; these are helpful too, but if you use one, make sure it supports Java-style regex, because that is the implementation Cloudlock uses.

Because regular expressions are so compact, they can be difficult to interpret, at least at first, and testing is important. A good process to follow in setting up a regex-based component in your cloud security monitoring program is to use a regex tester to refine your expression, then test it on sample data to make sure it identifies violations without flagging “false positives” — at least not to excess. Only when you’re satisfied with your test results should a regex be considered “ready to deploy” and placed into use.

Any single regular expression in Cisco Cloudlock is limited to 2048 characters (which would be an extremely long regex). If you find you’re getting close to the character limit, contact Cloudlock; we may be able to help achieve the same result with a smaller, more efficient regex.

Complementary Tools

While it can be essential to identify strings in documents and objects, you often need additional tools to make sure your monitoring system is pinpointing real issues. Cloudlock policies include tools that work in combination with regex to give the best results. These include:

  • Exceptions — these are, in fact, regular expressions too. But in this case you can specify when “a match is not a match”. For example, if your regular expression is looking for exposed customer records, but you know there is one labeled “John Q. Customer” that’s just a sample, you can make that an exception so it doesn’t generate an incident.
  • Proximity — It can be difficult to be sure that a particular set of characters really is what it might be. If it’s within a few characters of a label such as “DOB” or “Birthdate”, though, that can be a giveaway. Include another regex in the Cloudlock Proximity tool to help with this.
  • Threshold — sometimes quantity matters. The threshold setting enables you to identify (or identify and flag as more significant) an object containing up to 1000 pattern matches.

Reading Regular Expressions

Regular expressions generally look something like this:

\b([A-Z1-9]{5}-){4}[A-Z1-9]{5}\b 

Even experts agree that they can be difficult to read. To understand a regular expression you need to step through it carefully. Regular expressions are made up of component parts, each of which adds something to the search. One thing you can’t generally do very much, unfortunately, is format them to be more legible. Adding spaces, for example, would help readability, but would “break” your search because regex can match spaces (and other invisible “whitespace” characters) just the same as visible characters.
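As a rough illustration, the example expression can be read piece by piece and tried with the standard java.util.regex classes. This is only a sketch: the class name ReadingDemo, the helper found, and the sample text are made up for this example, and the Cloudlock tester may present results differently. Note that in Java source each regex backslash is doubled inside a string literal.

```java
import java.util.regex.Pattern;

public class ReadingDemo {
    // True if the regex matches anywhere inside the text.
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // \b                 word boundary
        // ([A-Z1-9]{5}-){4}  four blocks of 5 characters, each followed by "-"
        // [A-Z1-9]{5}        a fifth block of 5 characters
        // \b                 word boundary
        String key = "\\b([A-Z1-9]{5}-){4}[A-Z1-9]{5}\\b";

        // The same expression with spaces added for "readability".
        // Each space must now match a literal space in the text.
        String spaced = "\\b ([A-Z1-9]{5}-){4} [A-Z1-9]{5} \\b";

        String text = "Key: ABCDE-12345-FGHIJ-67891-KLMNO.";
        System.out.println(found(key, text));     // true
        System.out.println(found(spaced, text));  // false
    }
}
```

The second expression fails because the sample text has no spaces inside the key, which is why spaces cannot be used purely for layout.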

Writing Regular Expressions

The simplest regular expression is nothing more than a string of characters (for simplicity, we’ll just call this a “string”). XYZ is a regular expression that will match the string “XYZ”.
Just matching XYZ isn’t usually very useful, though. If you’re looking for “XYZ”, you might also need to find “Xyz” or “xyz”. You can do that several different ways. In regular expressions, there are almost always different ways to match the same string. The secret to these different approaches — and one of the secrets to regular expressions in general — is metacharacters.

Metacharacters

Metacharacters in regex are not characters in the string you’re trying to match; they are signals to the regular expression engine to do something special. For instance, placing a string in brackets like this [XYZ] changes your regex not because you’re searching for “[“ and “]”, but because the brackets are signals to perform the search in a particular way. It’s good to be aware of regex metacharacters because if you include one without realizing it, your regex will probably still do something — just not what you expect. Brackets are not the only metacharacters, but there are not many. Here is the whole list:

Asterisk *
The asterisk means “zero or more matches”. That is, the regex xy matches the string “xy”, while the regex xy* matches “x”, “xy”, “xyy”, and so on.

Backslash \
One metacharacter, the backslash ( \ ), is especially important because it changes the meaning of whatever character comes next. For instance, the metacharacter “ . “ is a wildcard; it stands for any character. But sometimes you need to search for the actual character “ . “ — to do that, you put a backslash in front of the dot, and the backslash changes the meaning of the dot from “a metacharacter meaning match any character” to “just the dot character.”
When the backslash precedes certain characters it changes their meaning so they become, in effect, additional metacharacters. For example, \d means “any digit”, so it’s a different way to specify the class [0-9].
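The effect of the backslash can be checked with a small Java program. This is a minimal sketch using java.util.regex; the class name BackslashDemo, the helper findAll, and the sample text are illustrative only. In Java source the backslash is written twice inside a string literal, while the expressions in this guide use a single backslash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackslashDemo {
    // Returns every non-overlapping match of the regex in the text, in order.
    public static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "v1.2 or v1x2";
        // Unescaped dot: any character may sit between the digits.
        System.out.println(findAll("1.2", text));    // [1.2, 1x2]
        // Escaped dot: only a literal "." is accepted.
        System.out.println(findAll("1\\.2", text));  // [1.2]
        // \d is shorthand for the class [0-9].
        System.out.println(findAll("\\d", text));    // [1, 2, 1, 2]
    }
}
```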

Bar |
The vertical bar means “or”. You can use it to combine regular expressions so you have, in effect, one longer and more complex regex. For example, [a-z]|[A-Z] matches any lower-case character or any upper-case character (there are, of course, other ways to obtain that same result). Do not put spaces around the bar unless you want to match them; a space in a regex matches a literal space.

Brackets [ ]
When you enclose some characters in brackets, matching works a bit differently. Characters inside brackets are a “character class”. This means they are treated as a set or group and the search looks for any one of them. That is, [XYZ] will find three matches in “XYZ” — the “X”, the “Y”, and the “Z”. That same regex will also match “X” or “Z”, and will have two matches in “XZ” and “ZY”.
You can use brackets and “-” to designate a whole range of characters. For example, [A-Z] matches all upper-case letters, and [0-9] matches all digits.
Most of the other metacharacters lose their special meanings when you enclose them in brackets. For example, while the plus is usually a regex metacharacter, [+] simply matches a plus character.
The caret has a special meaning inside brackets, but only when it’s the first character. When the caret is the first character inside brackets it means “find the complement of the following set” (loosely speaking, the set’s “opposite”). For example, the regex [A-Z] matches any upper-case letter, but [^A-Z] matches any character that is not an upper-case letter, including lower-case letters, digits, spaces, and punctuation.
The backslash character also has a special meaning inside brackets; it changes the meaning of the character following it. If you need to match a backslash character, you change its meaning with another backslash. Because the backslash keeps its special meaning even inside brackets, these two regular expressions are the same: \\ and [\\]. They each match the string “\”.
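The bracket rules can be tried out with a few short calls. The following is a sketch using java.util.regex; the class name CharClassDemo, the helper findAll, and the test strings are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharClassDemo {
    // Returns every non-overlapping match of the regex in the text, in order.
    public static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // A class matches one character at a time.
        System.out.println(findAll("[XYZ]", "XZ"));      // [X, Z]
        // A leading caret negates the class: everything except A-Z.
        System.out.println(findAll("[^A-Z]", "Ab-3"));   // [b, -, 3]
        // Inside brackets the plus is an ordinary character.
        System.out.println(findAll("[+]", "1+1"));       // [+]
        // The regex [\\] matches one backslash; Java doubles each backslash
        // again inside a string literal, hence four of them here.
        System.out.println(findAll("[\\\\]", "a\\b"));   // [\]
    }
}
```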

Braces { }
Braces (or “curly braces”) are used to enclose the specific number of matches you want. A quantifier in braces applies to the single character or group just before it. For example, (xyz){3} matches exactly 3 repetitions of “xyz”, as in “xyzxyzxyz”, while xyz{3} matches “xy” followed by three “z” characters (“xyzzz”). You can also include a comma, as in (xyz){3,}, to indicate at least that many repetitions (and any number more), and a comma and a second specifier, as in (xyz){3,5}, to indicate at least 3 and no more than 5 repetitions.
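A quantifier in braces applies only to the element immediately before it, so repeating a whole string requires a group. The sketch below shows the difference with java.util.regex; the class name BraceDemo, the helper findAll, and the sample strings are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BraceDemo {
    // Returns every non-overlapping match of the regex in the text, in order.
    public static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "xyzzz xyzxyzxyz";
        // {3} applies to the group, so "xyz" is repeated three times.
        System.out.println(findAll("(xyz){3}", text));  // [xyzxyzxyz]
        // {3} applies only to "z", so this means "xy" plus three "z".
        System.out.println(findAll("xyz{3}", text));    // [xyzzz]
        // Between 3 and 5 digits; longer runs are matched 5 at a time.
        System.out.println(findAll("\\d{3,5}", "12 123 1234567"));  // [123, 12345]
    }
}
```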

Caret ^
The caret — when it’s not enclosed by brackets — means “at the start of a string”. This is not recommended for use with Cloudlock regex policies.

Dollar $
The dollar sign means “at the end of a string”. This is not recommended for use with Cloudlock regex policies.

Dot .
We’ve already given away what the dot does; it matches any character (technically any character except a newline, the usually-invisible character created by the Enter key on your keyboard).

Parentheses ( )
Enclosing characters in parentheses makes them a “group”. You can use metacharacters with a group as if the group is a single character. For example, (xyz)+ matches one or more occurrences of “xyz”, such as “xyzxyz”.

Plus +
The plus means “one or more matches”. The regex xy+ matches “xy”, “xyy”, “xyyy” and so on, but not “x”.

Question mark ?
The question mark means “zero or one matches”. The regex xy? matches “x” and “xy”, but not “xyy”.
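The asterisk, plus, and question mark examples in this list can be verified side by side. This is a small sketch that uses Pattern.matches, which requires the whole string to match; the class name QuantifierDemo and the helper wholeMatch are made up for this example.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // True only if the entire text matches the regex.
    public static boolean wholeMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Asterisk: zero or more "y".
        System.out.println(wholeMatch("xy*", "x"));         // true
        System.out.println(wholeMatch("xy*", "xyy"));       // true
        // Plus: one or more "y".
        System.out.println(wholeMatch("xy+", "x"));         // false
        System.out.println(wholeMatch("xy+", "xyyy"));      // true
        // Question mark: zero or one "y".
        System.out.println(wholeMatch("xy?", "xy"));        // true
        System.out.println(wholeMatch("xy?", "xyy"));       // false
        // A quantifier after a group repeats the whole group.
        System.out.println(wholeMatch("(xyz)+", "xyzxyz")); // true
    }
}
```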

Predictability in Patterns

Some kinds of information are easier to “match” than others. This is usually a matter of predictability. A US zip code, for example, always includes 5 digits, and sometimes includes an additional four, with a hyphen in between the two sets. That consistency makes it predictable. A “date of birth” can be a different matter because it can be expressed several different ways. “January third, 1980”, “1/3/80”, and “1980-01-03” can all mean the same birthdate. Successful use of regular expression matching depends on

  • anticipating the predictability of the information you need to match, and
  • finding a way to match it every time without matching other information that’s not of interest.

Creating and Testing a Regex Policy

This is a step-by-step walkthrough of creating a regex policy to monitor occurrences of a (fictional) company’s customer ID numbers in documents and other objects. The customer ID numbers can appear in one of two forms:

  • 1 letter and 7 numbers
  • 8 numbers

The way to begin, of course, is to create a new Custom Regex policy. The next step is to start creating the regular expression itself.

Step 1: Match the first character
The first character of the customer ID can be a letter or a number. To search for a character that might be a letter or a number we can use three basic expressions:
[a-z] matches any lower-case letter
[A-Z] matches any upper-case letter
[0-9] matches any numeric character
Those can be tested individually, and combined as [a-zA-Z0-9]. A simple test shows that this matches upper-case letters, lower-case letters, and digits.

Step 2: Match the next 7 characters
The following 7 characters of the customer ID are digits. The predefined character class \d matches digits, and we can specify exactly 7 by using braces, like this: \d{7}. When tested, this should match any string of 7 digits, and not a string including any characters other than digits.

Step 3: Combine the expressions
At this point we have two separate expressions: [a-zA-Z0-9] and \d{7}. The goal is to have a single regex, and putting them together couldn’t be easier: [a-zA-Z0-9]\d{7}. Testing the expression shows that it matches the valid customer numbers X1234567 and 12345678, but does not match 1234x567, which is not a valid customer number. This is close to what we need, but it could result in too many matches — documents could easily contain similar strings that have nothing to do with customer numbers.
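The combined expression can also be checked outside the policy tool. The sketch below runs it with java.util.regex against the three customer numbers mentioned above, plus one unrelated string that shows the over-matching risk. The class name CustomerIdDemo, the helper findAll, and the "Order" sample are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CustomerIdDemo {
    // One letter or digit followed by exactly seven digits.
    public static final String CUSTOMER_ID = "[a-zA-Z0-9]\\d{7}";

    // Returns every non-overlapping match of the regex in the text, in order.
    public static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll(CUSTOMER_ID, "X1234567"));  // [X1234567]
        System.out.println(findAll(CUSTOMER_ID, "12345678"));  // [12345678]
        System.out.println(findAll(CUSTOMER_ID, "1234x567"));  // []
        // Any 8-digit run matches too, such as a date in an order reference.
        System.out.println(findAll(CUSTOMER_ID, "Order 20240115-77"));  // [20240115]
    }
}
```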

Step 4: Add a proximity expression
We want to match only the strings that represent actual customer numbers. We’ve created a way to identify strings that have the right format, but we can look for additional clues as well. For instance, the string “Customer Number” or “Customer No.” might be likely to appear somewhere near the number. We can create a proximity expression to find those strings close by the possible customer numbers we locate. The following table shows how the expression could be created, step by step:

  1. Regex: Customer Number
     Results: Matches “Customer Number”, misses “Customer number”
  2. Regex: [Cc]ustomer [Nn]umber
     Results: Matches “customer number”, misses “customer no.”
     Notes: [Cc] matches upper- or lower-case “c”
  3. Regex: [Cc]ustomer [Nn]umber|[Cc]ustomer No
     Results: Matches “Customer number”, “Customer No”, and the start of “Customer Not listed”; misses “customer no.”
     Notes: The vertical bar stands for “or”.
  4. Regex: [Cc]ustomer [Nn]umber|[Cc]ustomer [Nn]o\.
     Results: Matches “Customer number”, “Customer No.”, and “customer no.”; misses “Customer Not listed”
     Notes: The dot by itself matches any character, so the backslash is needed.
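The final proximity expression, with the dot escaped as the note describes, can be checked with a short Java sketch. The class name ProximityDemo and the helper found are assumptions made for illustration, and this only tests the label expression itself, not how Cloudlock measures the distance between the label and the customer number.

```java
import java.util.regex.Pattern;

public class ProximityDemo {
    // "Customer number" or "Customer no.", with either case on the C and N.
    public static final String LABEL =
            "[Cc]ustomer [Nn]umber|[Cc]ustomer [Nn]o\\.";

    // True if the regex matches anywhere inside the text.
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found(LABEL, "Customer Number: X1234567"));  // true
        System.out.println(found(LABEL, "customer no. 12345678"));      // true
        System.out.println(found(LABEL, "Customer No. 12345678"));      // true
        // Without a literal dot after "No", the label is not matched.
        System.out.println(found(LABEL, "Customer Not listed"));        // false
    }
}
```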

Hint

You can also use the regex tester in the policy’s Regex panel to test the proximity expression. Remember to save your regex first!

Step 5: Skip Known Exceptions
Sometimes you know there is a string in the documents you’re monitoring that your regex will match, but does not really represent a problem. Training documents, for example, might include a customer ID that follows the pattern and is close to the string “Customer Number”, but is known as an example. To make sure something like this does not result in an incident, you can add an exception regex to the policy. If your organization’s “example ID” is always “C9999999”, simply add an expression matching that pattern in the policy’s Exception field.

Now you have a policy that can match customer numbers with greater accuracy and avoid matches that are “false positives”. As a final step, you might want to differentiate the significance of a document containing one customer ID as compared to one containing 100 or more.

Step 6: Thresholds and Severity
A document containing one customer ID should trigger an incident, and so should a document containing dozens or hundreds. But you might not want the severity of those incidents to be the same. You can create incidents with different severity by using the Threshold value. To do so, follow these steps:

  1. Copy your regex and proximity expression into a second policy.
  2. In the original policy, set the threshold to 1, and set the Severity to “Alert”.
  3. In the second policy, set the threshold to 100 (or a value you choose), and set the Severity of that policy to “Critical”.

With these two policies, a document containing a large number of customer IDs generates a Critical incident, while a single ID still generates an incident, but at the lower Alert level. (The document with multiple IDs will actually trigger two incidents, because the Alert-level policy will still result in a match.)
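The interaction of the two policies can be pictured with a small model. This is a hypothetical sketch, not Cloudlock's implementation: it assumes a policy fires when the number of matches in a document reaches its threshold, and the class name ThresholdDemo and both method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThresholdDemo {
    static final Pattern CUSTOMER_ID = Pattern.compile("[a-zA-Z0-9]\\d{7}");

    // Counts the non-overlapping customer ID matches in a document.
    public static int countMatches(String text) {
        int count = 0;
        Matcher m = CUSTOMER_ID.matcher(text);
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Assumed rule: each policy fires once the count reaches its threshold.
    public static List<String> severities(int matchCount) {
        List<String> fired = new ArrayList<>();
        if (matchCount >= 1) {
            fired.add("Alert");      // original policy, threshold 1
        }
        if (matchCount >= 100) {
            fired.add("Critical");   // second policy, threshold 100
        }
        return fired;
    }

    public static void main(String[] args) {
        String oneId = "Customer Number: C1234567";
        String manyIds = "C1234567 ".repeat(100);
        System.out.println(severities(countMatches(oneId)));    // [Alert]
        System.out.println(severities(countMatches(manyIds)));  // [Alert, Critical]
    }
}
```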

Using Exceptions in a Regex Policy

Regex policies can include exceptions, which are also regular expressions, but used to exempt objects that would otherwise trigger incidents. You can use exceptions in cases where you are monitoring, for example, customer accounts, but also have one or more test accounts used in testing and training materials.

Example 1

Assume that customer account numbers are of the form: ####-####-#### (where “#” is any digit). To monitor for the presence of such account numbers, you could use this regular expression:

\d{4}-\d{4}-\d{4} 

Here \d means any digit, {4} means 4 instances, and - simply means -.

If you use as sample text: “1234-5678-9012, 9999-333-22, 1111-2222-3333, 0000-5454-4343,” (the second sequence is not a valid customer account), your Policy configuration panel would look like this, with the properly-formed account numbers correctly matched:

Now assume that your organization uses customer account numbers beginning with “0000” for testing and training purposes. They are correctly formed (see the third match, above), but they should not trigger incidents because they frequently appear in objects and/or documents stored in your platform, and they are known to be safe – that is, not real customer accounts.

An exception is also a regular expression. However, rather than monitoring the whole scope of the policy, as the “main” regex does, the exception expression tests for matches only within the results returned by the main regex.

In this example, the results consist of three matches: 1234-5678-9012, 1111-2222-3333 and 0000-5454-4343. You might use this regex to match your test accounts: 0000-\d{4}-\d{4} (where “0000” simply matches four repeated zeroes). Enter that expression as an Exception, and the Policy panel looks like this:
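Outside the Policy panel, the same filtering can be approximated in plain Java. This is a sketch under the assumption stated above, that an exception is tested only against the text of each main-regex match; the class name ExceptionDemo and the method matchesAfterExceptions are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExceptionDemo {
    // Finds main-regex matches, then drops any match in which
    // one of the exception regexes is found.
    public static List<String> matchesAfterExceptions(
            String mainRegex, String text, String... exceptionRegexes) {
        List<String> kept = new ArrayList<>();
        Matcher m = Pattern.compile(mainRegex).matcher(text);
        while (m.find()) {
            String hit = m.group();
            boolean exempt = false;
            for (String ex : exceptionRegexes) {
                if (Pattern.compile(ex).matcher(hit).find()) {
                    exempt = true;
                    break;
                }
            }
            if (!exempt) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "1234-5678-9012, 9999-333-22, 1111-2222-3333, 0000-5454-4343,";
        String main = "\\d{4}-\\d{4}-\\d{4}";
        // No exceptions: all three well-formed account numbers are reported.
        System.out.println(matchesAfterExceptions(main, text));
        // [1234-5678-9012, 1111-2222-3333, 0000-5454-4343]
        // With the test-account exception, the 0000 account is dropped.
        System.out.println(matchesAfterExceptions(main, text, "0000-\\d{4}-\\d{4}"));
        // [1234-5678-9012, 1111-2222-3333]
    }
}
```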

Example 2

This example is simply an exercise to experiment with how the regex exceptions system works.
First enter the following main regex, which will find the word “purple” followed by any other word:

(?i)purple\s\S*\b

Here (?i) means ignore case, \s means any single whitespace character, \S* means any number of non-whitespace characters, and \b means a boundary between words.
Next enter the sample text “purple prose, Purple Heart; purple rose - PURPLE people, purple cows”. This results in five matches:

Finally, enter one or more exceptions, each of which is one of the words following “purple”. Here we’ve entered “(?i)prose” and “(?i)people”:

Remember that exceptions monitor only the results from the main regex. There is no need to include “purple” in the exceptions, because every result already includes that word. All we need to match is a pattern within a result. You can experiment by entering additional exceptions to see results reflected in the regex tester.

You can enter up to 15 exceptions for each regex policy. If you need more exceptions for a given policy, you can combine them by using the regex “or” operator (vertical bar: |) as long as the combined expression does not exceed the limit of 2048 characters.

Sometimes you need a further refinement to allow for words contained within other words. In this example, “rose” is contained within “prose”, so when “rose” is an exception, it also removes “prose”:

This was not the goal; regular expressions should result in precise, not accidental matches. Adding a word boundary before the exception expression — \b(?i)rose — can ensure your exception expression (or any regular expression) is more precise:
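The difference the word boundary makes can be reproduced with java.util.regex. As before, this is only a sketch that assumes exceptions are tested against each main-regex match; the class name PurpleDemo and the method keep are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PurpleDemo {
    static final String MAIN = "(?i)purple\\s\\S*\\b";
    static final String TEXT =
            "purple prose, Purple Heart; purple rose - PURPLE people, purple cows";

    // Returns the main-regex matches in which no exception regex is found.
    public static List<String> keep(String... exceptionRegexes) {
        List<String> kept = new ArrayList<>();
        Matcher m = Pattern.compile(MAIN).matcher(TEXT);
        while (m.find()) {
            String hit = m.group();
            boolean exempt = false;
            for (String ex : exceptionRegexes) {
                if (Pattern.compile(ex).matcher(hit).find()) {
                    exempt = true;
                    break;
                }
            }
            if (!exempt) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "rose" is also found inside "prose", so two results are removed.
        System.out.println(keep("(?i)rose"));
        // [Purple Heart, PURPLE people, purple cows]
        // The word boundary limits the exception to the whole word "rose".
        System.out.println(keep("\\b(?i)rose"));
        // [purple prose, Purple Heart, PURPLE people, purple cows]
    }
}
```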

Appendix A: Predefined Character Classes & Shortcuts

. Any character (except for line terminators)
\d A digit (same as [0-9] )
\D A non-digit (same as [^0-9] )
\s A whitespace character (same as [ \t\n\x0B\f\r] )
\S A non-whitespace character (same as [^\s] )
\w A word character (same as [a-zA-Z_0-9] — see below for more information about “word characters”)
\W A non-word character (same as [^\w] )
\s+ One or more consecutive whitespace characters
(?i) Ignore case; applies to the part of the regex that follows it, so it is usually placed at the start

Word Characters

A word character is defined as any member of the following Unicode categories:

  • Ll (Letter, Lowercase)
  • Lu (Letter, Uppercase)
  • Lt (Letter, Titlecase)
  • Lo (Letter, Other)
  • Lm (Letter, Modifier)
  • Nd (Number, Decimal Digit)
  • Pc (Punctuation, Connector)
    This category includes ten characters, the most commonly used of which is the LOWLINE character (_), U+005F.
