How to Use Regular Expressions in Python

Federico Trotta

Regular expressions, commonly known as regex, are a tool for text processing and pattern matching.

In Python, the re module offers a robust implementation of regex, allowing developers to handle complex text manipulation efficiently.

In this article, we'll get to grips with regular expressions and provide practical code examples, from the basic to the more advanced, so you can understand how to use regex in Python.

Basics of Regular Expressions

Regular expressions are sequences of characters that define search patterns. They can be used for a variety of tasks, such as searching, editing, and manipulating text.

In this section, let's explore some basic components of regex.

Literals

Literals are the simplest form of regex patterns because they match the exact characters in a search string.

For example, if you want to find the exact word "cat" in the text "The cat sat on the mat", you'll have to use a literal that matches the pattern cat.
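
As a minimal sketch of the literal match described above, the snippet below uses re.search() from the standard re module. The variable names are illustrative only.

```python
import re

text = "The cat sat on the mat"

# A literal pattern matches exactly the characters it contains.
match = re.search(r"cat", text)

print(match.group())  # cat
print(match.span())   # (4, 7)
```

The span shows where the match starts and ends, which can help when you need the position and not just the text.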

Metacharacters

In regular expressions, metacharacters are symbols with special meanings and purposes:

  • . matches any character except a newline.
  • ^ matches the start of a string.
  • $ matches the end of a string.
  • * matches 0 or more repetitions of the preceding element.
  • + matches 1 or more repetitions of the preceding element.
  • ? matches 0 or 1 repetition of the preceding element.
  • {} matches a specific number of repetitions of the preceding element.
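
To make the list above more concrete, here is a small sketch that tries several of these metacharacters on short sample strings. The sample strings and variable names are assumptions chosen purely for illustration.

```python
import re

# . matches any single character except a newline
dot_matches = re.findall(r"c.t", "cat cot c\nt")
print(dot_matches)  # ['cat', 'cot']

# ^ anchors the pattern at the start of the string
starts_with_the = bool(re.search(r"^The", "The end"))
starts_with_end = bool(re.search(r"^end", "The end"))
print(starts_with_the, starts_with_end)  # True False

# $ anchors the pattern at the end of the string
ends_with_end = bool(re.search(r"end$", "The end"))
print(ends_with_end)  # True

# * allows zero or more, + requires one or more, ? allows zero or one
star_matches = re.findall(r"ab*", "a ab abbb")
plus_matches = re.findall(r"ab+", "a ab abbb")
optional_matches = re.findall(r"colou?r", "color colour")
print(star_matches)      # ['a', 'ab', 'abbb']
print(plus_matches)      # ['ab', 'abbb']
print(optional_matches)  # ['color', 'colour']
```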

Character Classes

Character classes match any character inside a given set. Common examples are:

  • [abc]: matches any of the characters a, b, or c.
  • \d matches any digit (equivalent to [0-9]).
  • \w matches any word character (alphanumeric plus underscore).
  • \s matches any whitespace character.
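
The following sketch applies each of the character classes listed above to sample strings. The strings are assumptions made up for this example.

```python
import re

text = "Room 42\tis open"

# [abc] matches any one of the listed characters
set_matches = re.findall(r"[abc]", "a big cab")
print(set_matches)  # ['a', 'b', 'c', 'a', 'b']

# \d matches single digits
digits = re.findall(r"\d", text)
print(digits)  # ['4', '2']

# \w+ matches runs of word characters
words = re.findall(r"\w+", text)
print(words)  # ['Room', '42', 'is', 'open']

# \s matches whitespace, including tabs
spaces = re.findall(r"\s", text)
print(spaces)  # [' ', '\t', ' ']
```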

Quantifiers

Quantifiers specify how many times a character or group must occur:

  • *: 0 or more.
  • +: 1 or more.
  • ?: 0 or 1.
  • {n}: Exactly n.
  • {n,}: n or more.
  • {n,m}: Between n and m.
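
Here is a short sketch of the brace quantifiers in action. It uses re.fullmatch(), a standard re function not otherwise covered in this article, which succeeds only when the whole string fits the pattern. The sample values are assumptions for illustration.

```python
import re

samples = ["7", "42", "123", "12345"]

# {3}: exactly three digits
exactly_three = [s for s in samples if re.fullmatch(r"\d{3}", s)]
print(exactly_three)  # ['123']

# {2,}: two or more digits
two_or_more = [s for s in samples if re.fullmatch(r"\d{2,}", s)]
print(two_or_more)  # ['42', '123', '12345']

# {2,3}: between two and three digits
two_to_three = [s for s in samples if re.fullmatch(r"\d{2,3}", s)]
print(two_to_three)  # ['42', '123']
```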

Check out other examples with tutorials.

The re Module in Python

To work with regex in Python, we use the re module, which is part of the Python standard library, so you don't need to install it separately.

In this section, we'll provide some basic examples of how to use the re module and its functions.

The re.search() Function

The re.search() function searches for the first match of the regex pattern in a string.

Suppose, for example, that you want to extract the first number in a string. You could write the following Python code:

Python
import re

# Define the pattern
pattern = r"\d+"

# Define the text to analyze
text = "There are 123 apples."

# Apply the regex
match = re.search(pattern, text)
print(match.group())

This results in:

plaintext
123

Let's explain the pattern = r"\d+":

  • The prefix r defines a raw string that treats backslashes (\) as literal characters and not as escape characters.
  • \d matches any digit (equivalent to [0-9]).
  • + indicates that the previous pattern (the digits) must appear one or more times.

Finally, if re.search() finds a match, it returns a match object, which we store in the match variable (if there is no match, it returns None). The group() method of the match object then returns the substring that matched the pattern.
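
Because re.search() returns None when nothing matches, calling group() directly on the result can raise an AttributeError. One possible way to guard against that is sketched below; first_number is a hypothetical helper name, not part of the re module.

```python
import re

def first_number(text):
    """Hypothetical helper: return the first run of digits, or None."""
    match = re.search(r"\d+", text)
    # Only call group() when a match object was actually returned.
    return match.group() if match else None

found = first_number("There are 123 apples.")
missing = first_number("There are no apples.")

print(found)    # 123
print(missing)  # None
```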

The re.match() Function

The re.match() function checks for a match only at the beginning of a string.

For example, if we want to search for a numerical match at the beginning of the previously defined text string, we could write the following:

Python
import re

# Define the pattern
pattern = r"\d+"

# Define the text to analyze
text = "There are 123 apples."

# Apply the regex
match = re.match(pattern, text)
print(match)

And, as we would expect, here's the result:

plaintext
None

In this case, there's no match because the string starts with "There" rather than a digit, and re.match() only looks at the beginning of the string.
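
For contrast, this sketch runs the same pattern against a string that does begin with digits. The first sample string is an assumption invented for this example.

```python
import re

pattern = r"\d+"

# The digits sit at the very start, so re.match() succeeds.
at_start = re.match(pattern, "123 apples are here.")

# The digits appear later in the string, so re.match() returns None.
not_at_start = re.match(pattern, "There are 123 apples.")

print(at_start.group())  # 123
print(not_at_start)      # None
```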

The re.findall() Function

The re.findall() function finds all matches of the regex pattern in a string, returning the output in a list.

For example, if you want to find all the numerical patterns in a string, you could write something like:

Python
import re

pattern = r"\d+"
text = "There are 123 apples in 2 trees."

matches = re.findall(pattern, text)
print(matches)

And the result is:

plaintext
['123', '2']

In this case, two numerical expressions are found in the string.
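
One detail worth knowing: when the pattern contains capturing groups, re.findall() returns the groups rather than the whole match. The sketch below pairs each number with the word that follows it. The pattern is an assumption made up for this illustration.

```python
import re

text = "There are 123 apples in 2 trees."

# Two capturing groups: a number, then the word after it.
pairs = re.findall(r"(\d+) (\w+)", text)

# With more than one group, each result is a tuple of the captured parts.
print(pairs)  # [('123', 'apples'), ('2', 'trees')]
```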

The re.sub() Function

The re.sub() function replaces a matched regex pattern with a replacement string.

For example, if you want to replace "123" with "many":

Python
import re

# Define the pattern
pattern = r"\d+"

# Define the text to analyze
text = "There are 123 apples."

# Apply the regex
replaced_text = re.sub(pattern, "many", text)
print(replaced_text)

This results in:

plaintext
There are many apples.

Note that the re.sub() function applies to all the matches it finds. So, for example, the following code sample:

Python
import re

# Define the pattern
pattern = r"\d+"

# Define the text to analyze
text = "There are 123 apples in 3 trees."

# Apply the regex
replaced_text = re.sub(pattern, "many", text)
print(replaced_text)

Results in:

plaintext
There are many apples in many trees.
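
If replacing every match is not what you want, re.sub() also accepts a count argument, and the replacement can be a function that receives each match object. The sketch below shows both options on the same text; the doubling logic is only an illustration.

```python
import re

text = "There are 123 apples in 3 trees."

# count=1 limits the substitution to the first match.
first_only = re.sub(r"\d+", "many", text, count=1)
print(first_only)  # There are many apples in 3 trees.

# A function replacement computes a new string for each match.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)
print(doubled)  # There are 246 apples in 6 trees.
```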

When To Use Each Function

re.match(), re.search(), and re.findall() are similar. So when should we use one over the other?

  • re.search() searches for the first place where the pattern matches a string. Use it when you need to find the first occurrence of a pattern in a string, regardless of where it is.
  • re.match() searches for a match only at the beginning of a string. Use it to check if a string starts with a certain pattern.
  • re.findall() finds all matches of a pattern in a string and returns them as a list of strings. Use it when you need to find all pattern occurrences in a string.

Advanced Regex Techniques for Python

In this section, we'll dive into some more advanced regex techniques for Python.

Lookaheads and Lookbehinds

Lookaheads and lookbehinds are types of zero-width assertions that match a position in a string based on what precedes or follows it (without including the preceding or following elements in the match itself).

A lookahead asserts that a certain pattern must follow the current position in the string but does not include this pattern in the matched result.

A lookbehind, on the other hand, asserts that a certain pattern must precede the current position in the string, but does not include this pattern in the matched result.

Let's consider a lookahead example. We want to match sequences of one or more word characters immediately followed by a period:

Python
import re

pattern = r"\w+(?=\.)"
text = "This is a test. Followed by another test."

matches = re.findall(pattern, text)
print(matches)

This results in:

plaintext
['test', 'test']

Why is this the result? Because the word "test" is followed by a dot, and the sequence is matched two times in the text variable. We use the r"\w+(?=\.)" pattern, because:

  • \w+ matches one or more word characters (letters, digits, and underscores).
  • (?=\.) is a positive lookahead asserting that the word characters must be followed by a period. However, the period itself is not included in the match result because lookaheads are zero-width assertions.

Note: A positive lookahead (syntax: (?=...)) asserts that what follows the current position matches the specified pattern, while a negative lookahead (syntax: (?!...)) states that what follows the current position does not match the specified pattern.
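
As a sketch of the negative form, the snippet below collects the words that are not immediately followed by a period. The \b word boundaries are an addition made for this example: without them, the engine could backtrack and return a partial word such as "tes".

```python
import re

text = "This is a test. Followed by another test."

# \b...\b forces whole words; (?!\.) rejects words followed by a period.
not_before_period = re.findall(r"\b\w+\b(?!\.)", text)

print(not_before_period)
# ['This', 'is', 'a', 'Followed', 'by', 'another']
```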

Now, let's consider a lookbehind example. We want to match one or more word characters immediately preceded by a whitespace character. So, consider the following:

Python
import re

pattern = r"(?<=\s)\w+"
text = "This is a test. Followed by another test."

matches = re.findall(pattern, text)
print(matches)

This results in:

plaintext
['is', 'a', 'test', 'Followed', 'by', 'another', 'test']

Let's explain the code:

  • (?<=\s) is a positive lookbehind asserting that the current position in the string must be preceded by a whitespace character (\s), such as a space, tab, or new line. The whitespace itself is not included in the match result because lookbehinds are zero-width assertions.
  • \w+ matches one or more word characters (letters, digits, and underscores).

So, the regex pattern (?<=\s)\w+ looks for sequences of one or more word characters that are immediately preceded by a whitespace character. And since we used the re.findall() function, the code prints all the words of the text except for "This" because it is at the beginning of the sentence (so it has no whitespace before it).
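
Lookbehinds are handy when a marker character identifies the value you want but should not be part of the result. The sketch below, built on a made-up sentence, extracts the amounts that follow a dollar sign and then uses a negative lookbehind (syntax: (?<!...)) to find the numbers that are not prices.

```python
import re

text = "Apples cost $3 and pears cost $12, but 7 plums are free."

# Positive lookbehind: digits that come right after a dollar sign.
prices = re.findall(r"(?<=\$)\d+", text)
print(prices)  # ['3', '12']

# Negative lookbehind: digits not preceded by a dollar sign or another digit.
other_numbers = re.findall(r"(?<![$\d])\d+", text)
print(other_numbers)  # ['7']
```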

Non-capturing Groups

Non-capturing groups (syntax: (?:...)) are used to group parts of a pattern without capturing them for later reference. This technique is useful when you need to group parts of your pattern to apply quantifiers, alternation, or other regex operations but do not need to store the match for later use.

Let's consider a practical example. Suppose you have a sequence of numbers like "123-45-6789". You want to extract the part of the number that has two digits. Here's how to do so with regex in Python:

Python
import re

pattern = r"(?:\d{3})-(\d{2})-(\d{4})"
text = "123-45-6789"

match = re.search(pattern, text)
print(match.group(1))

This results in:

plaintext
45

Let's explain the code:

  • (?:\d{3}) is a non-capturing group that matches exactly three digits.
  • - matches the hyphen character literally.
  • (\d{2}) is a capturing group that matches exactly two digits and captures this match for later reference.
  • (\d{4}) is another capturing group that matches exactly four digits and captures this match for later reference.
  • match.group(1) retrieves the content of the first capturing group from the match object. So, in this case, it prints the two digits captured by the first capturing group (\d{2}), which is 45.
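
To see what the non-capturing group changes, this sketch compares the pattern above with a variant in which all three parts are captured. The variable names are illustrative.

```python
import re

text = "123-45-6789"

# First part grouped but not captured.
non_capturing = re.search(r"(?:\d{3})-(\d{2})-(\d{4})", text)

# All three parts captured.
all_capturing = re.search(r"(\d{3})-(\d{2})-(\d{4})", text)

print(non_capturing.groups())  # ('45', '6789')
print(all_capturing.groups())  # ('123', '45', '6789')

# group(0) is always the full match, whatever is captured.
print(non_capturing.group(0))  # 123-45-6789
```

With the non-capturing group, the two-digit part stays at index 1 even though it is the second group written in the pattern.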

Real-World Examples and Use Cases

Now, let's examine some real-world scenarios to give a more practical idea of how and when to use regex in Python.

Data Validation

Regular expressions are commonly used to validate data formats such as email addresses, phone numbers, and dates.

For example:

Python
import re

pattern = r"^[\w\.-]+@[\w\.-]+\.\w+$"
email = "example@example.com"

match = re.match(pattern, email)
print(bool(match))

In this case, the result of the print() function is True, meaning that the string example@example.com has been recognized as an email address by the regex pattern. Here's how this works:

  • ^ asserts the position at the start of the string.
  • [\w\.-]+ matches one or more word characters (alphanumeric characters and underscores), dots, or hyphens.
  • @ matches the @ character.
  • \. matches a literal dot.
  • \w+ matches one or more word characters (alphanumeric characters and underscores).
  • $ asserts the position at the end of the string.

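So this pattern looks for word characters, dots, or hyphens before and after the @ character, followed by a literal dot and a final run of word characters. That is the general shape of an email address, although the pattern is a simplification and does not cover every valid address.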
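
One way to put this pattern to work is to wrap it in a small function and try it on a few inputs. The helper name looks_like_email and the sample addresses are assumptions for this sketch.

```python
import re

# The simplified pattern from this section; not a full email validator.
EMAIL_PATTERN = r"^[\w\.-]+@[\w\.-]+\.\w+$"

def looks_like_email(candidate):
    """Hypothetical helper: True if candidate fits the simplified pattern."""
    return bool(re.match(EMAIL_PATTERN, candidate))

samples = [
    "example@example.com",
    "first.last@mail.example.org",
    "user@localhost",
    "no-at-sign.example.com",
]

results = {sample: looks_like_email(sample) for sample in samples}
for sample, ok in results.items():
    print(f"{sample}: {ok}")
```

The first two samples pass, while the last two fail because they lack a dot after the @ or the @ itself.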

Extracting IP Addresses

Another typical use case is extracting an IP address from a log file.

In Python, we can do it like so:

Python
import re

pattern = r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
log = "User logged in from IP 192.168.0.1"

ip = re.search(pattern, log)
print(ip.group())

Let's explain the pattern:

  • \b asserts a word boundary, ensuring the IP address is not part of a larger string of digits.
  • \d{1,3} matches 1 to 3 digits.
  • \. matches a literal dot.

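The sequence \d{1,3}\. appears three times and is followed by a final \d{1,3}, which gives the four dot-separated numbers of an IPv4 address. Note that the pattern checks only the shape: it does not verify that each number is between 0 and 255.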
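
Since the pattern accepts any group of one to three digits, it also matches strings such as 999.1.1.1. A possible follow-up, sketched below, is to extract all candidates with re.findall() and then check them with the standard ipaddress module. The log line and the helper name is_valid_ipv4 are assumptions for this example.

```python
import ipaddress
import re

pattern = r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"
log = "Login from 192.168.0.1, then from 10.0.0.256 and 999.1.1.1"

# Step 1: the regex finds everything shaped like an IPv4 address.
candidates = re.findall(pattern, log)
print(candidates)  # ['192.168.0.1', '10.0.0.256', '999.1.1.1']

def is_valid_ipv4(candidate):
    """Hypothetical helper: True if every octet is in the 0-255 range."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False

# Step 2: keep only the candidates that are real addresses.
valid = [c for c in candidates if is_valid_ipv4(c)]
print(valid)  # ['192.168.0.1']
```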

Performance Considerations

Regular expressions can be computationally expensive, especially with complex patterns.

So, in this section, we'll provide some tips for optimizing regex performance.

Tip 1: Avoid Recompiling Patterns

If you use the same regex multiple times, compile it once with re.compile(). Here's a typical use case:

Python
import re
import time

# Sample texts
texts = ["123", "abc 456", "def 789 ghi"] * 10000

# Pattern without compiling
pattern1 = r"\d+"
start_time = time.time()
for text in texts:
    match = re.search(pattern1, text)
end_time = time.time()
time_without_compile = end_time - start_time

# Compiling the pattern
pattern2 = re.compile(r"\d+")
start_time = time.time()
for text in texts:
    match = pattern2.search(text)
end_time = time.time()
time_with_compile = end_time - start_time

print(f"without compile:{time_without_compile: .3}, with compile:{time_with_compile: .3}")

This results in:

plaintext
without compile: 0.0286, with compile: 0.0178

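In this example, every call to the module-level re.search() has to look up the compiled pattern in re's internal cache (and compile it if it is not there), whereas the object returned by re.compile() is reused directly. Skipping that lookup lowers execution time.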
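
In everyday code, a common way to apply this tip is to compile the pattern once at module level and reuse it inside functions. The sketch below assumes a hypothetical helper called extract_numbers.

```python
import re

# Compiled once when the module is loaded, then reused on every call.
DIGITS = re.compile(r"\d+")

def extract_numbers(lines):
    """Hypothetical helper: return the digit runs found in each line."""
    return [DIGITS.findall(line) for line in lines]

result = extract_numbers(["123", "abc 456", "def 789 ghi", "none"])
print(result)  # [['123'], ['456'], ['789'], []]
```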

Tip 2: Use Specific Patterns

Using specific regex patterns rather than broad ones can significantly improve performance. Specific patterns reduce the number of possible matches the regex engine has to consider, speeding up the matching process.

Let's consider the following:

Python
import re
import time

# Sample texts
texts = ["abc123xyz", "123", "a123b", "x123y", "nonumber", "123", "test"] * 50000

# Less efficient pattern
pattern1 = r".*123.*"
start_time = time.time()
for text in texts:
    match = re.search(pattern1, text)
end_time = time.time()
time_with_less_efficient_pattern = end_time - start_time

# More efficient pattern
pattern2 = r"\b123\b"
start_time = time.time()
for text in texts:
    match = re.search(pattern2, text)
end_time = time.time()
time_with_more_efficient_pattern = end_time - start_time

print(f"time with not efficient pattern: {time_with_less_efficient_pattern: .3}, time with efficient pattern: {time_with_more_efficient_pattern: .3}")

This results in:

plaintext
time with not efficient pattern: 0.412, time with efficient pattern: 0.385

In this case, we have:

  • .*123.*: A less efficient pattern because .* matches any character (except a newline) 0 or more times, both before and after "123". The regex engine has to consider a large number of potential matches.
  • \b123\b is more efficient because it is more specific. \b asserts a word boundary, ensuring that "123" is matched only as a whole word. This reduces the number of possible matches the regex engine needs to evaluate.
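
Keep in mind that a more specific pattern may also accept fewer strings, so the two patterns above are not interchangeable. The sketch below checks which sample texts each one accepts, and adds the plain literal 123 as a third option for comparison (an addition made for this example).

```python
import re

texts = ["abc123xyz", "123", "a123b", "x123y", "nonumber", "123", "test"]

# Broad pattern: accepts any text containing 123.
broad = [t for t in texts if re.search(r".*123.*", t)]

# Specific pattern: accepts 123 only as a whole word.
specific = [t for t in texts if re.search(r"\b123\b", t)]

# Plain literal: accepts the same texts as the broad pattern.
literal = [t for t in texts if re.search(r"123", t)]

print(broad)     # ['abc123xyz', '123', 'a123b', 'x123y', '123']
print(specific)  # ['123', '123']
print(literal == broad)  # True
```

If you only need to know whether "123" occurs anywhere, the plain literal gives the same answer as the broad pattern without the leading and trailing .* parts.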

Tip 3: Always Use Raw Strings

Using raw strings (by prefixing the pattern with r) is a good practice, especially when dealing with backslashes and escape sequences. When you use raw strings, Python treats backslashes as literal characters rather than escape characters. This can prevent unexpected behavior or errors in your regex patterns. It does not make matching faster, though: Python does not recognize \d as an escape sequence, so "\d+" and r"\d+" produce the same pattern string (recent Python versions emit a warning for the non-raw form).

The following timing comparison confirms this. The two runs differ only by measurement noise:

Python
import re
import time

# Define a regex pattern without using a raw string
pattern_without_raw = "\d+"

# Define the same regex pattern using a raw string
pattern_with_raw = r"\d+"

# Create some sample text to search
text = "123 456 789 012" * 3000000

# Search using the pattern without raw string
start_time = time.time()
matches_without_raw = re.findall(pattern_without_raw, text)
end_time = time.time()
time_without_raw = end_time - start_time

# Search using the pattern with raw string
start_time = time.time()
matches_with_raw = re.findall(pattern_with_raw, text)
end_time = time.time()
time_with_raw = end_time - start_time

# Print the results and performance comparison
print(f"Time taken without raw string:{time_without_raw: .3}")
print(f"Time taken with raw string: {time_with_raw: .3}")

Here's the result:

plaintext
Time taken without raw string: 1.65
Time taken with raw string: 1.6

On the correctness side, here's an example of what can go wrong without a raw string:

Python
import re

# Define a regex pattern to match email addresses
pattern_without_raw = "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"

# Sample text containing email addresses
text = "Contact us at email@example.com or visit our website www.example.com"

# Try to find email addresses using the pattern without a raw string
matches_without_raw = re.findall(pattern_without_raw, text)

# Print the matches
print(f"Matches without raw string: {matches_without_raw}")

This results in:

plaintext
Matches without raw string: []

We haven't used a raw string, so Python treats the backslashes in the pattern as escape characters. This leads to unintended behavior: specifically, \b is interpreted as a backspace character instead of a word boundary, and the pattern fails to match email addresses correctly.

The correct pattern to avoid this unintended behavior is:

Python
pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
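
To see the difference between the two forms directly, the sketch below compares a short pattern written with and without the r prefix. The pattern \bcat\b is an assumption chosen to keep the example small.

```python
import re

plain = "\bcat\b"   # \b becomes a single backspace character (\x08)
raw = r"\bcat\b"    # \b stays as backslash + b, a regex word boundary

print(len(plain), len(raw))  # 5 7
print(plain[0] == "\x08")    # True

plain_matches = re.findall(plain, "a cat")
raw_matches = re.findall(raw, "a cat")

print(plain_matches)  # []
print(raw_matches)    # ['cat']
```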

And that's it for our whistle-stop tour of regex in Python!

Wrapping Up

In this article, we covered the basics, advanced techniques, real-world use cases, and performance optimization strategies of regular expressions in Python.

By understanding and utilizing these concepts, you can handle complex text-processing tasks with ease and precision.

Happy coding!

Guest author Federico is a freelance Technical Writer who specializes in writing technical articles and documenting digital products. His mission is to democratize software through technical content.
