Demystifying Regular Expressions (Regex): Definition and Practical Usage

1 May 2024

Melanie

In computing, we are often called upon to perform text-processing tasks. One near-universal tool, the regex, often provides the most powerful solutions in this field. However, regular expressions are widely misunderstood, largely because of their sometimes... confusing appearance.

What is a regular expression?

A regular expression, more commonly known as a regex, is at first glance a simple string of characters. What makes it different? A special syntax which, when interpreted, describes a larger set of strings. This descriptive string is called a pattern.

A regex is not a programming language in the strict sense of the term; rather, virtually every language provides a library for working with regular expressions. What’s more, the syntax varies very little from one language to the next, which makes patterns easy to carry between platforms, hence the term “universal” used in the introduction.
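As a small illustration of a pattern describing a larger set of strings, here is a minimal sketch using Python's standard `re` module. The pattern and the word list are made up for this example and do not come from the article.

```python
import re

# "u?" makes the letter u optional, so this one pattern describes two spellings.
pattern = re.compile(r"colou?r")

# Hypothetical sample words, chosen only for illustration.
words = ["color", "colour", "colr"]

# fullmatch() succeeds only if the entire word fits the pattern.
matches = [word for word in words if pattern.fullmatch(word)]
print(matches)  # ['color', 'colour']
```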

What's the point of a regular expression?

As you may have guessed, regexes enable us to isolate a certain type of text and carry out specific processing on the matches found.

The simplest case is to replace one word with another in a document. We can also loosen the match to tolerate certain variations of that word. For example, a regex can locate occurrences of the expression “regex” and its variants in a paragraph and replace them with “regular expression(s)”.
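The original snippet is not reproduced here, so the following is only a sketch of how such a replacement might look in Python. The sample sentence, the choice of variants ("regex", "regexp", and their plurals) and the helper name `expand` are assumptions made for illustration.

```python
import re

# Hypothetical sample paragraph (not taken from the article).
text = "A regex is a pattern. Many regexes look cryptic, but a regexp can be short."

# \b      word boundary
# regexp? "regex" with an optional trailing "p"
# (e?s)?  optional plural ending "s" or "es", captured as group 1
pattern = re.compile(r"\bregexp?(e?s)?\b", re.IGNORECASE)

def expand(match):
    # Use the plural form only when a plural ending was captured.
    return "regular expressions" if match.group(1) else "regular expression"

result = pattern.sub(expand, text)
print(result)
```

Passing a function to `sub()` instead of a fixed string is what lets the replacement follow the singular or plural form of each match.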

Regexes are also used in cybersecurity: when you create an account on a new site, it generally requires a password that respects a certain number of rules. To check that the password you enter complies with those rules, a regex is applied. Similarly, to validate an e-mail address, you need to ensure the presence of the at sign, @ (to name but one constraint); this is relatively easy to express as a regex.
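Both checks can be sketched in a few lines of Python. The password policy below (at least 8 characters, one lowercase letter, one uppercase letter, one digit) and the function names are assumptions chosen for illustration, and the e-mail pattern is deliberately loose: it only checks the overall shape of an address, not its full validity.

```python
import re

# Hypothetical password policy, one small regex per rule.
PASSWORD_RULES = [
    r".{8,}",  # at least 8 characters
    r"[a-z]",  # at least one lowercase letter
    r"[A-Z]",  # at least one uppercase letter
    r"\d",     # at least one digit
]

# Loose e-mail shape: some text, an @, some text, a dot, some text.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def is_valid_password(candidate):
    # The password is accepted only if every rule finds a match.
    return all(re.search(rule, candidate) for rule in PASSWORD_RULES)

def is_plausible_email(candidate):
    # fullmatch() requires the whole string to fit the pattern.
    return EMAIL_RE.fullmatch(candidate) is not None

print(is_valid_password("Secret123"))             # True
print(is_valid_password("secret123"))             # False, no uppercase letter
print(is_plausible_email("melanie@example.com"))  # True
print(is_plausible_email("melanie.example.com"))  # False, no @
```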

Finally, regexes can be found in natural language processing (NLP), web scraping, pattern recognition… This list is by no means exhaustive, and you will almost certainly run into a Data Science problem that involves them!

A very controversial appearance

It’s this off-putting look that gives the tool a bad name, even though it is far from complicated to master! In fact, only a handful of rules govern the syntax, and we’re going to break them down.

Let’s take a look at the following expression:

[Image: regex for validating a nickname, color-coded in several parts]

To understand how a regex works, the first step is to break it down into subgroups. Groups also exist as a formal concept: they are delimited by parentheses. Some parentheses do not capture (such as those of a lookahead), but as a general rule, groups are numbered in the order in which their opening parentheses appear.

A group is capturing by default; it becomes non-capturing when the characters “?:” are placed right after the opening parenthesis. To be clear, a capturing group is a part of the overall pattern whose matched text can be retrieved separately after a search.
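The difference between capturing and non-capturing groups can be seen in a short Python sketch. The date string and both patterns are made up for this illustration.

```python
import re

# Hypothetical sample string.
date = "2024-05-01"

# Three capturing groups, numbered 1 to 3 in order of their opening parenthesis.
capturing = re.match(r"(\d{4})-(\d{2})-(\d{2})", date)
print(capturing.groups())  # ('2024', '05', '01')

# "?:" makes the middle group non-capturing: it still has to match,
# but it is skipped in the numbering and in the results.
non_capturing = re.match(r"(\d{4})-(?:\d{2})-(\d{2})", date)
print(non_capturing.groups())  # ('2024', '01')
```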

The circumflex “^” at the beginning of a pattern is a special character; it means that the regular expression only matches at the beginning of the line.
Conversely, the dollar sign “$” ensures that the end of the match corresponds to the end of a line.
[A-ZÉ]: square brackets mean that we accept one character (and only one, unless otherwise indicated, as is the case here) from the set provided. A-Z corresponds to “any letter between uppercase A and uppercase Z” (case-sensitive!), to which we add the letter É. This first part of the regex therefore matches an uppercase letter.
(pattern)+: the + sign matches the preceding pattern one or more times, with no upper limit.
((pattern1)|(pattern2)): the symbol | means “we match either pattern1 or pattern2”.
( ([a-zéèà\- ])(?!\3{2}) ): the first parenthesis works in the same way as the first group. Only characters listed in the brackets are matched: lowercase a to z, then the accented letters, a space, and the special character -. To match a special character literally, escape it with a backslash (\).
The second parenthesis is a negative lookahead. A lookahead, as its name implies, inspects what comes next in the string and adds a condition that must hold for the expression to match.
Here, this translates into the condition “the expression ([a-zéèà\- ]) matches only if it is not followed by the expression \3{2}”.
\3{2}: when a number is escaped (i.e. preceded by \), it refers to the text captured by the group with that number. \3 thus corresponds to the third group of the regex!
On the other hand, a number enclosed in braces indicates the exact number of occurrences of the preceding element.
In other words, this pattern means “match a character from the set [a-zéèà\- ] provided it does not appear three times in a row”.
(?<=\s)([A-ZÉ])(?![A-ZÉ]): the first set of parentheses works like a lookahead but looks backwards; it is a positive lookbehind. \s matches any whitespace character (a space, a tab, a line break). Following the same logic, this expression translates into the sentence “match an uppercase letter (A, B, …, Z, or É) only if it is preceded by whitespace and NOT followed by another uppercase letter”.
Finally, (.+) means that we accept one or more characters of any kind (except, by default, a line break), with no upper limit.
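The basic building blocks from the list above (anchors, a character set, + and |) can be tried out in Python. This is only a sketch: the two patterns and the sample strings are simplified stand-ins invented for illustration, not the nickname regex discussed in the article.

```python
import re

# ^ and $ anchor the match to the start and end of the string.
# [A-ZÉ] accepts exactly one uppercase letter (or É),
# [a-zéèà]+ accepts one or more lowercase letters from the set.
name = re.compile(r"^[A-ZÉ][a-zéèà]+$")

print(bool(name.search("Élodie")))   # True
print(bool(name.search("élodie")))   # False, first letter is lowercase
print(bool(name.search("Élodie!")))  # False, "!" is not in the set

# | accepts either alternative; groups are numbered by opening parenthesis.
pet = re.compile(r"^((cat)|(dog))$")

print(pet.search("dog").groups())  # ('dog', None, 'dog')
print(pet.search("bird"))          # None
```

In the `groups()` result, group 2 is `None` because the `(cat)` alternative did not take part in the match.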

With all these rules in place, this whole regex suddenly becomes very intuitive!
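The lookahead, backreference and lookbehind rules can be checked the same way. The sketch below uses simplified, standalone patterns invented for illustration; in particular it refers to \1 rather than \3 because each of these small patterns contains only one capturing group.

```python
import re

# Each letter is captured by group 1, then the negative lookahead (?!\1{2})
# rejects it if the same letter follows twice more (three in a row).
no_triple = re.compile(r"^(?:([a-z])(?!\1{2}))+$")

print(bool(no_triple.search("hello")))   # True, "ll" is only two in a row
print(bool(no_triple.search("helllo")))  # False, "lll" is three in a row

# Positive lookbehind (?<=\s): must be preceded by whitespace.
# Negative lookahead (?![A-ZÉ]): must not be followed by another uppercase letter.
initial = re.compile(r"(?<=\s)([A-ZÉ])(?![A-ZÉ])")

# Hypothetical sample string.
initials = initial.findall("Anne Marie DUPONT")
print(initials)  # ['M']
```

In the last example, "A" is rejected because nothing precedes it, and "D" is rejected because it is followed by another uppercase letter.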

regex101: a platform for checking your regexes

To check the behavior of a freshly created expression, there’s a handy site: https://regex101.com/

With this interface, select the regex flavor of your programming language, enter your regex, then type any test string in the dedicated space. When the pattern matches text, the site highlights it and explains the match.

If the previous section left you wondering, take a look at the text it does (or doesn’t) capture:

This pattern accepts only nicknames that are valid according to the defined syntax.

To sum up

Regexes are both simple to use and extremely versatile; mastering this tool is a major asset. While it’s usually possible to solve a problem without using regexes, the regex solution is often the most effective of all.

We have a cheat sheet on regexes to help you practice:

Take a look at our NLP training courses. NLP is one of the most prolific areas of AI in the news, and you can read our article on how the famous ChatGPT works to see for yourself!
