In this post, we'll use two of Elixir's most powerful features - pattern matching and recursion - to sanitize a string by removing invalid Unicode "Specials" characters.
Let's dive straight in!
The Problem: Strings with Unicode Specials
Unicode Specials is a short Unicode block of characters allocated at characters U+FFF0-FFFF. If one of these characters is present in a string of text, that text is not correctly encoded Unicode text. Here's the full set of those characters:
0xFFF0 0xFFF1 0xFFF2 0xFFF3 0xFFF4 0xFFF5 0xFFF6 0xFFF7 0xFFF8 0xFFF0 0xFFF10 0xFFF11 0xFFF12 0xFFF13 0xFFF14 0xFFFA 0xFFFB 0xFFFC 0xFFFD 0xFFFE 0xFFFF
You might encounter incorrectly encoded strings that contain a member of the Unicode Specials block. Let's say you need a way to process those strings anyway. Maybe you have some code that reads file contents or receives string input from another external source, like a form file upload or external API. In this case, you will want to remove these invalid characters to continue processing the text.
Elixir makes this easy with the help of binary pattern matching and recursion. We'll build out a function that removes Unicode specials from a string and replaces them with the
"�" replacement character, so your program can continue to process the string input. When we're done, you'll have a handy function you can use to sanitize any string and a deeper understanding of two Elixir language features that will prove useful in many other scenarios.
Let's get started!
The Solution: Sanitize Your String with Pattern Matching and Recursion in Elixir
Our goal is to produce a function -
CleanString.unicode_only/2 - that takes in any string and returns that string, with Unicode Specials replaced with the
"�" replacement character. Here are a few examples of our finished product in action:
CleanString.unicode_only("abc" <> <<0xFFFF::16>>) => "abc�" CleanString.unicode_only("a" <> <<0xFFFF::16>> <> "bc" <> <<0xFFFF::16>>) => "a�bc�" CleanString.unicode_only("abc" <> <<0xFFF2::16>> <> "def") => "abc�def" CleanString.unicode_only(<<0xFFF5::16>> <> "def") => "�def" CleanString.unicode_only(<<0xFFF5::16>> <> <<0xFFF4::16>> <> "def") => "��def"
Step 1: Break Down the Problem
Before we write any code, let's break down this problem one step at a time:
- For each character in a string
- Check if that character is present in the
- If it is, remove it from the string
- And replace it with the replacement character
- If it is not present in the
- Keep that character
Changing elements in place in a string is a tricky and potentially expensive operation, but Elixir gives us another way.
We can use binary pattern matching to iterate over each element in our string, recursively breaking off the "head" or first character of the string. Then, depending on whether or not that character is considered valid, we will add it to the head of a new string.
Pulling the head off a string and adding to the head of a new string are very performant tasks in Elixir--it doesn't matter how long the string is, Elixir will only ever need to remove from or add to the beginning of that string.
Let's break down our problem one more time, this time with our new strategy in place:
- For each character in a string
- Check if that character is present in the
- If it is, add a replacement character to the head of a new string
- If it is not, add that character to the head of a new string
This will result in a brand new string that contains only replacement or valid Unicode characters.
With our plan in place, we're ready to write some code.
Step 2: Define the Module in Elixir
First up, we'll define a module,
CleanString, that sets a module attribute,
@specials, to a list of binary strings that represent the characters in the specials block listed above:
defmodule CleanString do @specials [ <<255, 240>>, <<255, 241>>, <<255, 242>>, <<255, 243>>, <<255, 244>>, <<255, 245>>, <<255, 246>>, <<255, 247>>, <<255, 248>>, <<255, 240>>, <<255, 16>>, <<255, 17>>, <<255, 18>>, <<255, 19>>, <<255, 20>>, <<255, 250>>, <<255, 251>>, <<255, 252>>, <<255, 253>> <<255, 254>>, <<255, 255>> ] end
This list of binary strings is generated by taking each specials character and evaluating its hexadecimal bitstring version like this:
iex> <<0xFFFF::16>> <<255, 255>>
We want our module to store this list of binary strings so that we can check each character of the given string against this list and remove it if it's found.
Now that we have the basics of our module in place, let's start building the
Step 3: Define the
unicode_only/2 Function in Elixir
unicode_only/2 function will take in two arguments: the original string and the new "clean" string we're building.
We plan to call our function recursively. The first time, we call it with only one argument--the argument of the original string--like this:
CleanString.unicode_only("abc" <> <<0xFFFF::16>>) => "abc�"
In this scenario, we need to default the "clean string" second argument to an empty string. Let's define that default argument function head first:
defmodule CleanString do @specials [ # ... ] def unicode_only(string, new_string \\ "") end
Great, now we're ready to define the remainder of our function heads. Here's where Elixir's powerful binary pattern matching feature comes in.
We'll define one version of the function that pattern matches a scenario where the character at the head of the string is included in the
@specials list, and one in which it isn't. Let's start with that first scenario:
@replacement_character "�" def unicode_only(<<head::binary-size(2)>> <> tail, new_string) when head in @specials do unicode_only(tail, @replacement_character <> new_string) end
First, we've added a new module attribute,
@replacement_character, that stores our replacement character. Then, we've added a version of the
unicode_only/2 function that uses binary pattern matching to extract the first two bytes of the string and set them equal to a variable,
Why do we need to capture the first two bytes of the string to check if the first character is part of the Unicode Specials block? This is because each individual specials character is represented by a two-byte binary string.
Let's take a closer look at how we capture those two bytes now. The magic happens in the first argument given to
<<head::binary-size(2)>> <> tail
Elixir allows you to pattern match in the function head. This means that when
unicode_only/2 is called, the first argument given will be pattern matched against this definition. For example, if we call the function like this:
original_string = "abc" <> <<0xFFFF::16>> CleanString.unicode_only(original_string)
Then the following pattern match will be executed:
<<head::binary-size(2)>> <> tail = original_string
As mentioned above, this results in setting the variable
head equal to the first two bytes of the string. Meanwhile, the
tail variable is bound to the remainder of the string. Then, we use a guard clause to check if that variable is included in the
If it is, we don't want to add it to the new clean string we're constructing. Instead, we want to add the replacement character:
@replacement_character <> new_string
Now we need to perform this same operation for the next character in the original string, and the next one, and so on. This is where recursion comes in. We will call
unicode_only/2 again. This time with only the
tail of the original string as a first argument, and the newly constructed clean string as a second argument:
unicode_only(tail, @replacement_character <> new_string)
In this way, we recursively pull the head off the string, until the string is empty. Now, we have a function head that runs when the first two bytes of the string are included in
Let's define another version of our function that will run when the first character isn't included there.
def unicode_only(<<head::binary-size(1)>> <> tail, new_string) do new_string = head <> new_string unicode_only(tail, new_string) end
Once again, we use binary pattern matching in the function head. This time, we pattern match the first argument given to
unicode_only/2 against the following:
<<head::binary-size(1)>> <> tail
Here, we pull off the first byte of the string and set it equal to the
head variable, and we set the remainder of the string equal to
tail. We only want to grab the first byte of the string here because non-specials characters are represented by a single byte, for example:
iex> <<99>> "c"
Moving our attention back to the function body, let's talk about how this function behaves. In the scenario where the first character is not included in
@specials we do want to retain that character in our new string. So we add it to the head of the new string, like this:
head <> new_string
Then, we're ready to recursively call our function again, with the remaining
tail of the original string and our updated new string:
unicode_only(tail, head <> new_string)
We're almost done. We just need a way to break out of our recursive loop when the time is right. When do we want to stop recursing? When we've processed every character in the original string. When this happens, the original string will be empty, and the first argument passed to
unicode_only/2 in our recursive loop will be
"". Let's implement that function head now, once again using pattern matching:
def unicode_only("", new_string), do: String.reverse(new_string)
When the original string is finally empty, we don't want to call
unicode_only/2 again. Instead, we'll return the reversed new string.
We need to reverse the string because we added each processed character to the head of the new string to keep our code performant. So, we added each retained or replacement character in the opposite order from their placement in the original string. Reversing the string when we're done puts everything back in the right order.
Putting it all together, we end up with this:
defmodule CleanString do @specials [ <<255, 240>>, <<255, 241>>, <<255, 242>>, <<255, 243>>, <<255, 244>>, <<255, 245>>, <<255, 246>>, <<255, 247>>, <<255, 248>>, <<255, 240>>, <<255, 16>>, <<255, 17>>, <<255, 18>>, <<255, 19>>, <<255, 20>>, <<255, 250>>, <<255, 251>>, <<255, 252>>, <<255, 254>>, <<255, 255>> ] @replacement_character "�" def unicode_only(string, new_string \\ "") def unicode_only(<<head::binary-size(2)>> <> tail, new_string) when head in @specials do unicode_only(tail, @replacement_character <> new_string) end def unicode_only(<<head::binary-size(1)>> <> tail, new_string) do unicode_only(tail, head <> new_string) end def unicode_only("", new_string), do: String.reverse(new_string) end
With a set of functions implemented in around ten lines of code, we've created an elegant solution for a common, tough problem. Thanks to Elixir's powerful pattern matching functionality, which we used here to pattern match binary strings in function heads, we have a handy function for cleaning bad characters out of any string.
String sanitization is something that you may need to do in the wild, and Elixir is a perfect fit for the job. Beyond that specific scenario though, this solution demonstrates how nicely Elixir pattern matching and recursion go hand in hand.
Instead of implementing complex and difficult-to-read
if statements, our recursive code flow was built entirely with pattern matching. Each scenario was mapped to a different
unicode_only/2 function head.
Reach for function head pattern matching whenever you want to execute different function behavior based on different inputs, and you'll end up with clean, highly readable, and maintainable code. You can even pattern match on single characters, or groups of characters, in a binary string with the syntax you've seen here.
Taken together, pattern matching and recursion are a match made in Elixir heaven. Add these tools to your Elixir toolbox and take them out whenever you have a complex input processing problem to solve.
P.S. If you'd like to read Elixir Alchemy posts as soon as they get off the press, subscribe to our Elixir Alchemy newsletter and never miss a single post!