Regular expressions, commonly known as regex, are a tool for text processing and pattern matching.
In Python, the re module offers a robust implementation of regex, allowing developers to handle complex text manipulation efficiently.
In this article, we'll get to grips with regular expressions and provide practical code examples — from the basic to more advanced — so you can understand how to use regex in Python.
Basics of Regular Expressions
Regular expressions are sequences of characters that define search patterns. They can be used for a variety of tasks, such as searching, editing, and manipulating text.
In this section, let's explore some basic components of regex.
Literals
Literals are the simplest form of regex patterns because they match the exact characters in a search string.
For example, to find the exact word "cat" in the text "The cat sat on the mat", you use the literal pattern cat, which matches those three characters in that order.
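A minimal sketch of the literal match described above, using the sentence from the text (re.search() is covered in more detail later in the article):

```python
import re

text = "The cat sat on the mat"

# The pattern contains only literal characters, so it matches exactly "cat".
match = re.search(r"cat", text)

print(match.group())  # → cat
print(match.start())  # → 4 (index where the match begins)
```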
Metacharacters
In regular expressions, metacharacters are symbols with special meanings and purposes:
- . matches any character except a newline.
- ^ matches the start of a string.
- $ matches the end of a string.
- * matches 0 or more repetitions of the preceding element.
- + matches 1 or more repetitions of the preceding element.
- ? matches 0 or 1 repetition of the preceding element.
- {} matches a specific number of repetitions of the preceding element.
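To make these metacharacters concrete, here is a small sketch. The sample strings are assumptions chosen for illustration, and re.findall() and re.search() are explained later in the article:

```python
import re

# . matches any single character except a newline
dot = re.findall(r"c.t", "cat cot c\nt")
print(dot)  # → ['cat', 'cot']

# ^ and $ anchor the pattern to the start and end of the string
starts = bool(re.search(r"^The", "The end"))
ends = bool(re.search(r"end$", "The end"))
print(starts, ends)  # → True True

# *, + and ? control how often the preceding element (here "b") may repeat
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")
optional = re.findall(r"ab?", "a ab abbb")
print(star)      # → ['a', 'ab', 'abbb']
print(plus)      # → ['ab', 'abbb']
print(optional)  # → ['a', 'ab', 'ab']
```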
Character Classes
Character classes match any character inside a given set. Common examples are:
- [abc] matches any of the characters a, b, or c.
- \d matches any digit (equivalent to [0-9] for ASCII text).
- \w matches any word character (alphanumeric plus underscore).
- \s matches any whitespace character.
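The following sketch applies each character class to one short string. The sample string is an assumption made up for this illustration:

```python
import re

# Hypothetical sample: letters, digits, an underscore, a space, and a tab.
text = "ab1 _c\t2"

in_set = re.findall(r"[abc]", text)
digits = re.findall(r"\d", text)
word_chars = re.findall(r"\w", text)
whitespace = re.findall(r"\s", text)

print(in_set)      # → ['a', 'b', 'c']
print(digits)      # → ['1', '2']
print(word_chars)  # → ['a', 'b', '1', '_', 'c', '2']
print(whitespace)  # → [' ', '\t']
```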
Quantifiers
Quantifiers specify how many times the preceding character or group may occur:
- *: 0 or more.
- +: 1 or more.
- ?: 0 or 1.
- {n}: Exactly n.
- {n,}: n or more.
- {n,m}: Between n and m.
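As a sketch of the counted quantifiers, the example below picks out numbers by their length. The sample string is an assumption, and \b (a word boundary, discussed later in the article) is added so that each number is tested as a whole:

```python
import re

# Hypothetical sample with numbers of one to five digits.
text = "7 42 123 4567 89012"

exactly_three = re.findall(r"\b\d{3}\b", text)
three_or_more = re.findall(r"\b\d{3,}\b", text)
two_to_four = re.findall(r"\b\d{2,4}\b", text)

print(exactly_three)  # → ['123']
print(three_or_more)  # → ['123', '4567', '89012']
print(two_to_four)    # → ['42', '123', '4567']
```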
The re Module in Python
To work with regex in Python, we use the module re (provided by the standard Python library, so you don't need to install it separately).
In this section, we'll provide some basic examples of how to use the module re and its functionalities.
The re.search() Function
The re.search() function searches for the first match of the regex pattern in a string.
Suppose, for example, that you want to extract any numbers in a string. We could write the following Python code:
This results in:
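A minimal sketch of this example, with the output shown in a comment. The sample text is an assumption chosen for illustration; the pattern is the one explained next:

```python
import re

# Assumed sample text; any string containing digits would work.
text = "There are 123 apples and 456 oranges"
pattern = r"\d+"

# re.search() returns a match object for the first match, or None.
match = re.search(pattern, text)
if match:
    print(match.group())  # → 123
```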
Let's explain the pattern = r"\d+":
- The prefix r defines a raw string that treats backslashes (\) as literal characters and not as escape characters.
- \d matches any digit (equivalent to [0-9]).
- + indicates that the previous pattern (the digits) must appear one or more times.
Finally, if the re.search() function finds a match, it returns a match object (stored here in the match variable); otherwise, it returns None. The group() method of the match object then returns the substring corresponding to the pattern found.
The re.match() Function
The re.match() function checks for a match only at the beginning of a string.
For example, if we want to search for a numerical match at the beginning of the previously defined text string, we could write the following:
And, as we would expect, here's the result:
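As a sketch under the same assumption as before (a sample text that does not start with a digit), with a second, hypothetical string added for contrast:

```python
import re

# Assumed sample text; note that it starts with a letter, not a digit.
text = "There are 123 apples and 456 oranges"

no_match = re.match(r"\d+", text)
print(no_match)  # → None

# For contrast: a hypothetical string that does start with digits.
at_start = re.match(r"\d+", "123 apples")
print(at_start.group())  # → 123
```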
In this case, there's no match with the applied regex rules in the provided string.
The re.findall() Function
The re.findall() function finds all matches of the regex pattern in a string, returning the output in a list.
For example, if you want to find all the numerical patterns in a string, you could write something like:
And the result is:
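A minimal sketch, again assuming a sample text that contains two numbers:

```python
import re

# Assumed sample text with two numbers in it.
text = "There are 123 apples and 456 oranges"

# re.findall() returns every non-overlapping match as a list of strings.
numbers = re.findall(r"\d+", text)
print(numbers)  # → ['123', '456']
```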
In this case, two numerical expressions are found in the string.
The re.sub() Function
The re.sub() function replaces a matched regex pattern with a replacement string.
For example, if you want to replace "123" with "many":
This results in:
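A sketch of this replacement, assuming the same kind of sample text as in the earlier examples:

```python
import re

# Assumed sample text.
text = "There are 123 apples and 456 oranges"

# Replace the literal pattern "123" with the word "many".
new_text = re.sub(r"123", "many", text)
print(new_text)  # → There are many apples and 456 oranges
```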
Note that the re.sub() function applies to all the matches it finds. So, for example, the following code sample:
Results in:
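A sketch of this behavior, assuming a sample text in which "123" appears twice. The optional count argument of re.sub() is shown as well, in case only some matches should be replaced:

```python
import re

# Assumed sample text with two occurrences of "123".
text = "123 apples and 123 oranges"

# By default, every match is replaced.
all_replaced = re.sub(r"123", "many", text)
print(all_replaced)  # → many apples and many oranges

# count=1 limits the replacement to the first match.
first_only = re.sub(r"123", "many", text, count=1)
print(first_only)  # → many apples and 123 oranges
```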
When To Use Each Function
re.match(), re.search(), and re.findall() are similar. So when should we use one over the other?
- re.search() searches for the first place where the pattern matches a string. Use it when you need to find the first occurrence of a pattern in a string, regardless of where it is.
- re.match() searches for a match only at the beginning of a string. Use it to check if a string starts with a certain pattern.
- re.findall() finds all matches of a pattern in a string and returns them as a list of strings. Use it when you need to find all pattern occurrences in a string.
Advanced Regex Techniques for Python
In this section, we'll dive into some more advanced regex techniques for Python.
Lookaheads and Lookbehinds
Lookaheads and lookbehinds are types of zero-width assertions that match a position in a string based on what precedes or follows it (without including the preceding or following elements in the match itself).
A lookahead asserts that a certain pattern must follow the current position in the string but does not include this pattern in the matched result.
A lookbehind, on the other hand, asserts that a certain pattern must precede the current position in the string, but does not include this pattern in the matched result.
Let's consider a lookahead example. We want to match sequences of one or more word characters immediately followed by a period:
This results in:
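A minimal sketch of the lookahead, assuming a sample text in which the word "test" is followed by a period twice:

```python
import re

# Assumed sample text: "test" appears twice, each time before a period.
text = "This is a test. Another test."

# \w+ must be followed by a period, but the period is not part of the match.
matches = re.findall(r"\w+(?=\.)", text)
print(matches)  # → ['test', 'test']
```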
Why is this the result? Because the word "test" is followed by a dot, and the sequence is matched two times in the text variable. We use the r"\w+(?=\.)" pattern, because:
- \w+ matches one or more word characters (letters, digits, and underscores).
- (?=\.) is a positive lookahead asserting that the word characters must be followed by a period. However, the period itself is not included in the match result because lookaheads are zero-width assertions.
Note: A positive lookahead (syntax: (?=...)) asserts that what follows the current position matches the specified pattern, while a negative lookahead (syntax: (?!...)) states that what follows the current position does not match the specified pattern.
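To illustrate the negative form, here is a small sketch that keeps only the words not followed by a period. The sample text and the use of \b (a word boundary) to keep whole words are assumptions made for this illustration:

```python
import re

# Assumed sample text.
text = "This is a test."

# Whole words (\b...\b) that are NOT immediately followed by a period.
not_before_period = re.findall(r"\b\w+\b(?!\.)", text)
print(not_before_period)  # → ['This', 'is', 'a']
```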
Now, let's consider a lookbehind example. We want to match one or more word characters immediately preceded by a whitespace character. So, consider the following:
This results in:
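A minimal sketch of the lookbehind, assuming a sample sentence that begins with the word "This":

```python
import re

# Assumed sample text; "This" has no whitespace before it.
text = "This is a test."

# Words that are immediately preceded by a whitespace character.
matches = re.findall(r"(?<=\s)\w+", text)
print(matches)  # → ['is', 'a', 'test']
```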
Let's explain the code:
- (?<=\s) is a positive lookbehind asserting that the current position in the string must be preceded by a whitespace character (\s), such as a space, tab, or new line. The whitespace itself is not included in the match result because lookbehinds are zero-width assertions.
- \w+ matches one or more word characters (letters, digits, and underscores).
So, the regex pattern (?<=\s)\w+ looks for sequences of one or more word characters that are immediately preceded by a whitespace character. And since we used the re.findall() function, the code prints all the words of the text except for "This" because it is at the beginning of the sentence (so it has no whitespace before it).
Non-capturing Groups
Non-capturing groups (syntax: (?:...)) are used to group parts of a pattern without capturing them for later reference. This technique is useful when you need to group parts of your pattern to apply quantifiers, alternation, or other regex operations but do not need to store the match for later use.
Let's consider a practical example. Suppose you have a sequence of numbers like "123-45-6789". You want to extract the part of the number that has two digits. Here's how to do so with regex in Python:
This results in:
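A sketch of this example, with the pattern assembled from the components explained below (the exact original listing is assumed to look similar):

```python
import re

text = "123-45-6789"

# The first group is non-capturing, so group numbering starts at (\d{2}).
pattern = r"(?:\d{3})-(\d{2})-(\d{4})"

match = re.search(pattern, text)
if match:
    print(match.group(1))  # → 45
    print(match.groups())  # → ('45', '6789')
```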
Let's explain the code:
- (?:\d{3}) is a non-capturing group that matches exactly three digits.
- The - character matches the hyphen literally.
- (\d{2}) is a capturing group that matches exactly two digits and captures this match for later reference.
- (\d{4}) is another capturing group that matches exactly four digits and captures this match for later reference.
- match.group(1) retrieves the content of the first capturing group from the match object. So, in this case, it prints the two digits captured by the first capturing group (\d{2}), which is 45.
Real-World Examples and Use Cases
Now, let's examine some real-world scenarios to give a more practical idea of how and when to use regex in Python.
Data Validation
Regular expressions are commonly used to validate data formats such as email addresses, phone numbers, and dates.
For example:
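A minimal sketch of such a validation. The pattern is assembled from the components explained below and is a simplification, not a complete email validator; the second address is a hypothetical invalid input added for contrast:

```python
import re

# Simplified email pattern, built from the pieces described in the text.
pattern = r"^[\w\.-]+@[\w\.-]+\.\w+$"

email = "example@example.com"
is_valid = re.match(pattern, email) is not None
print(is_valid)  # → True

# Hypothetical invalid input: no @ character.
is_invalid = re.match(pattern, "not-an-email") is not None
print(is_invalid)  # → False
```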
In this case, the result of the print() function is True, meaning that the string example@example.com has been recognized as an email address by the regex pattern. Here's how this works:
- ^ asserts the position at the start of the string.
- [\w\.-]+ matches one or more word characters (alphanumeric characters and underscores), dots, or hyphens.
- @ matches the @ character.
- \. matches a literal dot.
- \w+ matches one or more word characters (alphanumeric characters and underscores).
- $ asserts the position at the end of the string.
So this pattern requires word characters, dots, or hyphens before and after the @ character, followed by a literal dot and more word characters. That mirrors the basic shape of an email address, although it is a simplification and does not cover every valid address.
Extracting IP Addresses
Another typical use case is extracting an IP address from a log file.
In Python, we can do it like so:
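A sketch of the extraction, assuming a made-up log line. The pattern is assembled from the components explained below and checks only the shape of the address, not that each part is a valid number:

```python
import re

# Hypothetical log line, invented for this illustration.
log_line = "2024-01-01 12:00:00 Connection from 192.168.1.10 accepted"

# Three groups of "1-3 digits plus a dot", then a final group of 1-3 digits.
pattern = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"

ip_addresses = re.findall(pattern, log_line)
print(ip_addresses)  # → ['192.168.1.10']
```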
Let's explain the pattern:
- \b asserts a word boundary, ensuring the IP address is not part of a larger string of digits.
- \d{1,3} matches 1 to 3 digits.
- \. matches a literal dot.
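The digits-and-dot part is repeated three times and followed by a final group of 1 to 3 digits, which gives the dotted, four-part shape of an IPv4 address. The pattern checks only this shape, not that each part lies in the range 0 to 255.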
Performance Considerations
Regular expressions can be computationally expensive, especially with complex patterns.
So, in this section, we'll provide some tips for optimizing regex performance.
Tip 1: Avoid Recompiling Patterns
If you use the same regex multiple times, compile it once with re.compile(). Here's a typical use case:
This results in:
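A minimal sketch of this use case, assuming a hypothetical list of strings to scan:

```python
import re

# Compile the pattern once, outside the loop.
pattern = re.compile(r"\d+")

# Hypothetical inputs, invented for this illustration.
texts = ["There are 123 apples", "No digits here", "456 oranges"]

results = []
for text in texts:
    # The compiled pattern object exposes search(), match(), findall(), etc.
    match = pattern.search(text)
    results.append(match.group() if match else None)

print(results)  # → ['123', None, '456']
```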
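In this example, the pattern is compiled once, before the loop. Without re.compile(), every call such as re.search(pattern, text) has to resolve the pattern string again on each iteration (the re module caches recently used patterns, but the lookup still adds overhead), which increases execution time.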
Tip 2: Use Specific Patterns
Using specific regex patterns rather than broad ones can significantly improve performance. Specific patterns reduce the number of possible matches the regex engine has to consider, speeding up the matching process.
Let's consider the following:
This results in:
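A sketch of how the two patterns could be compared with timeit. The input text and repetition count are assumptions, and the timings depend on the machine and the input, so no expected numbers are shown. Note also that the two patterns do not return the same match:

```python
import re
import timeit

# Hypothetical input: a long single-line string containing "123" many times.
text = "abc 123 def " * 1000

broad = re.compile(r".*123.*")
specific = re.compile(r"\b123\b")

# Time 200 searches with each pattern.
broad_time = timeit.timeit(lambda: broad.search(text), number=200)
specific_time = timeit.timeit(lambda: specific.search(text), number=200)
print(f"broad: {broad_time:.4f}s, specific: {specific_time:.4f}s")

# The broad pattern swallows the whole line; the specific one returns "123".
broad_match = broad.search(text).group()
specific_match = specific.search(text).group()
print(len(broad_match), specific_match)
```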
In this case, we have:
- .*123.* is a less efficient pattern because .* matches any character (except a newline) 0 or more times, both before and after "123". The regex engine has to consider a large number of potential matches.
- \b123\b is more efficient because it is more specific. \b asserts a word boundary, ensuring that "123" is matched only as a whole word. This reduces the number of possible matches the regex engine needs to evaluate.
Tip 3: Always Use Raw Strings
Using raw strings (by prefixing the pattern with r) is a good practice, especially when dealing with backslashes and escape sequences. When you use raw strings, Python treats backslashes as literal characters rather than escape characters. This can prevent unexpected behavior or errors in your regex patterns.
Raw strings do not make matching faster, though: the prefix only changes how the literal is written in the source code. For example, r"\d+" and "\\d+" are the same string, so they compile to the same pattern. The benefit is readability and correctness, not speed.
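As a quick check of what the r prefix actually changes, the sketch below compares a raw pattern with its manually escaped spelling. Both produce the same string, so any timing difference between them would have to come from somewhere else:

```python
import re

raw = r"\d+"
escaped = "\\d+"  # the same pattern, with the backslash escaped by hand

same_string = raw == escaped
print(same_string)  # → True

# Both spellings therefore find the same matches.
same_matches = re.findall(raw, "a1b22") == re.findall(escaped, "a1b22")
print(same_matches)  # → True
```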
As for avoiding errors, here's an example:
This results in:
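A sketch of the problem, assuming a simplified email pattern and a made-up sample text. The pattern is written without the r prefix on purpose:

```python
import re

# Hypothetical sample text.
text = "Contact: user@example.com"

# No r prefix: Python turns each "\b" into a backspace character (\x08)
# before the regex engine ever sees the pattern.
pattern = "\b[a-z]+@[a-z]+[.][a-z]+\b"

matches = re.findall(pattern, text)
print(matches)  # → []
```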
We haven't used a raw string, so Python treats the backslashes in the pattern as escape characters. This leads to unintended behavior: specifically, \b is interpreted as a backspace character instead of a word boundary, and the pattern fails to match email addresses correctly.
The correct pattern to avoid this unintended behavior is:
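A sketch of the corrected version, using the same simplified email pattern and made-up sample text as assumptions, this time with the r prefix:

```python
import re

# Hypothetical sample text.
text = "Contact: user@example.com"

# With the r prefix, \b reaches the regex engine as a word boundary.
pattern = r"\b[a-z]+@[a-z]+[.][a-z]+\b"

matches = re.findall(pattern, text)
print(matches)  # → ['user@example.com']
```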
And that's it for our whistle-stop tour of regex in Python!
Wrapping Up
In this article, we covered the basics, advanced techniques, real-world use cases, and performance optimization strategies of regular expressions in Python.
By understanding and utilizing these concepts, you can handle complex text-processing tasks with ease and precision.
Happy coding!