The need to manipulate strings comes up quite often, whether it's to validate user-provided values or transform text into structured data that can be used programmatically.
Most often, we'll reach for regular expressions to accomplish this task, but sometimes there's a better solution to the problem: parser combinators. In this two-part article, we'll explore how they work.
Before moving on, let's define what 'parsing' is:
A parser is a software component that takes input data (frequently text) and builds a data structure.
Source: Wikipedia.
In other words, when talking about transforming text into structured data, we essentially mean parsing.
Let's get into it!
Parser Combinators: The What, How, and Why
So what are parser combinators? Well, as their name implies, they are parsers that can be combined — to make bigger, better parsers.
In effect, parser combinators enable the writing of complex parsers from simpler ones. You're less likely to make mistakes, and the parsers are a lot easier to maintain due to better readability.
In contrast, regular expressions are often a poor choice for non-trivial parsing. Writing them is error-prone, they're challenging to maintain, and their format doesn't lend itself well to documentation/explanation.
To illustrate working with parser combinators, we'll work on transforming a string representation of a Dutch phone number into a canonical representation. This can prove useful, as a web form can capture a user's phone number to later be used programmatically, for example.
The "Parser" and "Combinator" Bits: NimbleParsec
We'll be using NimbleParsec as the resulting parsers are more efficient, but other libraries such as Combine also exist and work well.
As an aside, note that it's been pointed out that NimbleParsec is more akin to a parser generator than a parser combinator: that's true, but the difference won't matter when learning how to make use of parser generators.
Parser Examples in Elixir
As mentioned before, parsing is the act of taking unstructured data (typically strings) and transforming them into structured data.
We've got some examples right in Elixir.
We can turn a string containing a URI into a data structure via URI.new/1
:
Similarly, we can convert a string representation of an integer into an actual integer by using Integer.parse/2
:
Each of these examples will accept a string and either return a parsed result (possibly with "leftover" string content, as in the Integer.parse/2
example) or an error indication.
Combinator Examples in Elixir
What if we had lists indicating how many visitors came from URIs (formatted as "URI & count" pairs)? It would be nice to have a single function to do the heavy lifting for us:
Easy, I hear you say — just do something like:
That's indeed pretty straightforward. But what if there are several URI/visits pairs per line? Then the logic is already much more involved. And what if these URI/visits pairs are in reverse order (visits first, then URI)?
Using a parser combinator approach, we look at the problem differently. We've got a parser for URIs, and another one for integers. Let's write a parser for white space, then combine those three parsers into a single one for our specific use case. In pseudo-code, it would look like this:
And since we're now in combinator-land, we could then use parse_visit_count
and combine it with other parsers to step up the complexity. Neat, right?
Code Walkthrough: Our Parser Combinator Example
Let's work through a concrete example of using a parser combinator: we'll parse a phone number into a data structure.
The phone number we'll use for our examples is one from the Netherlands: +31 20 42 84 105.
According to Wikipedia, this number can be formatted in various ways (note we'll only be covering a subset of the possibilities within this article):
The area code ('A') is commonly separated with a dash ('-') and sometimes a space from the subscriber's number ('B'). The length of the area code for landlines is either 2 or 3 digits, depending on the population density of the area. This leaves 7 or 6 digits for the subscriber's number, resulting in a format of either 0AA-BBBBBBB or 0AAA-BBBBBB. [...] The trunk prefix '0' is dropped when prefixed by the country code: +31 AA BBBBBBBB, [...], etcetera. [...]
In other words, this phone number can also be formatted as (among others):
- +31 20 4284105
- +31 20-42 84 105
- 020-42 84 105
- 020-4284105
- 020 42 84 105
Deconstructing the Problem
From a quick look at the formats above, we can tell that the main "chunks" we need to extract are:
- Country code - the "+31" prefix, which may be absent
- Area code - the "020", "(0)20", or "20" value
- Subscriber number - the remaining seven digits, which may be separated by spaces
Getting Started
We'll use NimbleParsec, so let's go ahead and create a new project with mix new demo
and add the dependency in mix.exs
:
Run mix deps.get
to ensure everything's ready for us. With that in place, let's start writing our phone number parser, which we'll do within its own module:
Within this PhoneNumber.Parser
module, we need to import NimbleParsec
so we can call on its functions (such as string
) more conveniently. Then, we define our parser. Right now, that consists of only parsing the exact phone number string via NimbleParsec's string/2
function (hence why the phone number value is hardcoded). And last but not least, we call defparsec/3
to "finalize" the parser and make it executable.
Let's give it a spin! Open an IEx session by typing iex -S mix
in the project's root folder from within a terminal. We can then try our parser:
The result is a tuple containing six terms. The first is the :ok
or :error
tag indicating whether the provided string could be parsed.
The second value is (in the :ok
case) a list of the parsed information, or (in the :error
case) the error reason.
The remaining values can be ignored as we won't get into them in this article.
Parsing Local Numbers
Now that we've created a parser for an exact string, let's dig deeper into NimbleParsec's functionality by expanding the variety of phone numbers our parser will be able to process.
Local phone numbers look like this:
- 020-42 84 105
- 020-4284105
- 020 42 84 105
There's an area code, then a separator which can be a space or a dash. This is followed by the subscriber number, which can contain spaces to make it easier to read.
Note also that the "area code" portion consists of the trunk prefix (which is "0") followed by two or three numbers indicating the area itself.
So conceptually, we're looking at writing a parser that looks like area_code <> separator <> subscriber_number
.
Let's start with the area code:
Let's give it a whirl in iex -S mix
:
We can see our parser is properly extracting out the area code: it's being returned as ["0", "20"]
because the area_code
combinator is composed of trunk_prefix
(yielding "0"), followed by string("20")
(logically yielding "20"). The remainder of the input string is left unparsed (as "-42 84 105"
). Success!
So what's going on? We create a trunk_prefix
parser that will match the specific "0" string by using string/2
.
Then, we define an area_code
parser that matches this trunk_prefix
followed by the "20" string. And finally, we define the parse
function as the parser specified by the area_code
combinator by calling defparsec/3
.
We're on a roll, so let's add to our current parser so that it also captures the separator. This separator can be either a dash or a space. It sounds like a perfect match for choice/3
, which will apply the first parser it finds that can parse the input.
In choice([string("-"), string(" ")])
, the choice([...])
combinator will attempt to parse a "-"
string. If it's not able to, it will attempt parsing a " "
string. Only if both of these options fail, does the entire choice
combinator fail. Conversely, if the first string("-")
option succeeds, none of the subsequent parsers provided to choice
will be tested.
You'll notice that we weren't able to simply pipe the two area_code
and separator
combinators into each other: separator
doesn't accept an argument, so we can't pass in area_code
. NimbleParsec does, however, make it easy to concatenate two combinators with concat/2
, so that's exactly how we plug these two functions into one another. Let's check things still work as expected:
Great! Both the dash and space separators are getting picked up, while the invalid slash yields an error.
Let's now write an overly accepting parser that we can come back to and tighten down: we want to capture all digit and space characters until "end of string". To do so, we'll use utf8_string/3
to parse digits, times/3
to get us repetition, and finally, eos/1
to represent "end of string". It looks like this:
We define our digit
combinator as a UTF8 string that contains values in the 0-9 range and is of length 1: in other words, it's a string of a single digit.
Note we've also improved the area_code
combinator: instead of matching only the specific "20" area code, it will match any two digits. The side effect is that the parse results will be slightly different: ["0", "2", "0", "-", ...]
instead of ["0", "20", "-", ...]
.
Finally, since eos()
has an optional combinator argument, there's no need to wrap it with a concat()
call within the pipeline.
Time for a Short Intermission
We've covered a decent amount of ground so far. Let's take a break and let all of that sink in.
Our parser is starting to take shape, but it's still not very good. We'll improve it in the next installment where we'll investigate writing custom parsers, among other topics.
Until next time, happy coding!
P.S. If you'd like to read Elixir Alchemy posts as soon as they get off the press, subscribe to our Elixir Alchemy newsletter and never miss a single post!