elixir

Sanitize Strings in Elixir with Pattern Matching and Recursion

Sophie DeBenedetto

Sophie DeBenedetto on

Sanitize Strings in Elixir with Pattern Matching and Recursion

In this post, we'll use two of Elixir's most powerful features - pattern matching and recursion - to sanitize a string by removing invalid Unicode "Specials" characters.

Let's dive straight in!

The Problem: Strings with Unicode Specials

Unicode Specials is a short Unicode block of characters allocated at characters U+FFF0-FFFF. If one of these characters is present in a string of text, that text is not correctly encoded Unicode text. Here's the full set of those characters:

0xFFF0
0xFFF1
0xFFF2
0xFFF3
0xFFF4
0xFFF5
0xFFF6
0xFFF7
0xFFF8
0xFFF0
0xFFF10
0xFFF11
0xFFF12
0xFFF13
0xFFF14
0xFFFA
0xFFFB
0xFFFC
0xFFFD
0xFFFE
0xFFFF

You might encounter incorrectly encoded strings that contain a member of the Unicode Specials block. Let's say you need a way to process those strings anyway. Maybe you have some code that reads file contents or receives string input from another external source, like a form file upload or external API. In this case, you will want to remove these invalid characters to continue processing the text.

Elixir makes this easy with the help of binary pattern matching and recursion. We'll build out a function that removes Unicode specials from a string and replaces them with the "�" replacement character, so your program can continue to process the string input. When we're done, you'll have a handy function you can use to sanitize any string and a deeper understanding of two Elixir language features that will prove useful in many other scenarios.

Let's get started!

The Solution: Sanitize Your String with Pattern Matching and Recursion in Elixir

Our goal is to produce a function - CleanString.unicode_only/2 - that takes in any string and returns that string, with Unicode Specials replaced with the "�" replacement character. Here are a few examples of our finished product in action:

CleanString.unicode_only("abc" <> <<0xFFFF::16>>)
=> "abc�"
CleanString.unicode_only("a" <> <<0xFFFF::16>> <> "bc" <> <<0xFFFF::16>>)
=> "a�bc�"
CleanString.unicode_only("abc" <> <<0xFFF2::16>> <> "def")
=> "abc�def"
CleanString.unicode_only(<<0xFFF5::16>> <> "def")
=> "�def"
CleanString.unicode_only(<<0xFFF5::16>> <> <<0xFFF4::16>> <> "def")
=> "��def"

Step 1: Break Down the Problem

Before we write any code, let's break down this problem one step at a time:

  • For each character in a string
  • Check if that character is present in the @specials list
  • If it is, remove it from the string
  • And replace it with the replacement character
  • If it is not present in the @specials list
  • Keep that character

Changing elements in place in a string is a tricky and potentially expensive operation, but Elixir gives us another way.

We can use binary pattern matching to iterate over each element in our string, recursively breaking off the "head" or first character of the string. Then, depending on whether or not that character is considered valid, we will add it to the head of a new string.

Pulling the head off a string and adding to the head of a new string are very performant tasks in Elixir--it doesn't matter how long the string is, Elixir will only ever need to remove from or add to the beginning of that string.

Let's break down our problem one more time, this time with our new strategy in place:

  • For each character in a string
  • Check if that character is present in the @specials list
  • If it is, add a replacement character to the head of a new string
  • If it is not, add that character to the head of a new string

This will result in a brand new string that contains only replacement or valid Unicode characters.

With our plan in place, we're ready to write some code.

Step 2: Define the Module in Elixir

First up, we'll define a module, CleanString, that sets a module attribute, @specials, to a list of binary strings that represent the characters in the specials block listed above:

defmodule CleanString do
    @specials [
    <<255, 240>>,
    <<255, 241>>,
    <<255, 242>>,
    <<255, 243>>,
    <<255, 244>>,
    <<255, 245>>,
    <<255, 246>>,
    <<255, 247>>,
    <<255, 248>>,
    <<255, 240>>,
    <<255, 16>>,
    <<255, 17>>,
    <<255, 18>>,
    <<255, 19>>,
    <<255, 20>>,
    <<255, 250>>,
    <<255, 251>>,
    <<255, 252>>,
    <<255, 253>>
    <<255, 254>>,
    <<255, 255>>
  ]
end

This list of binary strings is generated by taking each specials character and evaluating its hexadecimal bitstring version like this:

iex> <<0xFFFF::16>>
<<255, 255>>

We want our module to store this list of binary strings so that we can check each character of the given string against this list and remove it if it's found.

Now that we have the basics of our module in place, let's start building the unicode_only/2 function.

Step 3: Define the unicode_only/2 Function in Elixir

The unicode_only/2 function will take in two arguments: the original string and the new "clean" string we're building.

We plan to call our function recursively. The first time, we call it with only one argument--the argument of the original string--like this:

CleanString.unicode_only("abc" <> <<0xFFFF::16>>)
=> "abc�"

In this scenario, we need to default the "clean string" second argument to an empty string. Let's define that default argument function head first:

defmodule CleanString do
 
  @specials [
    # ...
  ]
 
  def unicode_only(string, new_string \\ "")
end

Great, now we're ready to define the remainder of our function heads. Here's where Elixir's powerful binary pattern matching feature comes in.

We'll define one version of the function that pattern matches a scenario where the character at the head of the string is included in the @specials list, and one in which it isn't. Let's start with that first scenario:

@replacement_character "�"
 
def unicode_only(<<head::binary-size(2)>> <> tail, new_string)
    when head in @specials do
  unicode_only(tail, @replacement_character <> new_string)
end

First, we've added a new module attribute, @replacement_character, that stores our replacement character. Then, we've added a version of the unicode_only/2 function that uses binary pattern matching to extract the first two bytes of the string and set them equal to a variable, head.

Why do we need to capture the first two bytes of the string to check if the first character is part of the Unicode Specials block? This is because each individual specials character is represented by a two-byte binary string.

Let's take a closer look at how we capture those two bytes now. The magic happens in the first argument given to unicode_only/2:

<<head::binary-size(2)>> <> tail

Elixir allows you to pattern match in the function head. This means that when unicode_only/2 is called, the first argument given will be pattern matched against this definition. For example, if we call the function like this:

original_string = "abc" <> <<0xFFFF::16>>
CleanString.unicode_only(original_string)

Then the following pattern match will be executed:

<<head::binary-size(2)>> <> tail = original_string

As mentioned above, this results in setting the variable head equal to the first two bytes of the string. Meanwhile, the tail variable is bound to the remainder of the string. Then, we use a guard clause to check if that variable is included in the @specials list.

If it is, we don't want to add it to the new clean string we're constructing. Instead, we want to add the replacement character:

@replacement_character <> new_string

Now we need to perform this same operation for the next character in the original string, and the next one, and so on. This is where recursion comes in. We will call unicode_only/2 again. This time with only the tail of the original string as a first argument, and the newly constructed clean string as a second argument:

unicode_only(tail, @replacement_character <> new_string)

In this way, we recursively pull the head off the string, until the string is empty. Now, we have a function head that runs when the first two bytes of the string are included in @specials.

Let's define another version of our function that will run when the first character isn't included there.

 def unicode_only(<<head::binary-size(1)>> <> tail, new_string) do
  new_string = head <> new_string
  unicode_only(tail, new_string)
end

Once again, we use binary pattern matching in the function head. This time, we pattern match the first argument given to unicode_only/2 against the following:

<<head::binary-size(1)>> <> tail

Here, we pull off the first byte of the string and set it equal to the head variable, and we set the remainder of the string equal to tail. We only want to grab the first byte of the string here because non-specials characters are represented by a single byte, for example:

iex> <<99>>
"c"

Moving our attention back to the function body, let's talk about how this function behaves. In the scenario where the first character is not included in @specials we do want to retain that character in our new string. So we add it to the head of the new string, like this:

head <> new_string

Then, we're ready to recursively call our function again, with the remaining tail of the original string and our updated new string:

unicode_only(tail, head <> new_string)

We're almost done. We just need a way to break out of our recursive loop when the time is right. When do we want to stop recursing? When we've processed every character in the original string. When this happens, the original string will be empty, and the first argument passed to unicode_only/2 in our recursive loop will be "". Let's implement that function head now, once again using pattern matching:

def unicode_only("", new_string), do: String.reverse(new_string)

When the original string is finally empty, we don't want to call unicode_only/2 again. Instead, we'll return the reversed new string.

We need to reverse the string because we added each processed character to the head of the new string to keep our code performant. So, we added each retained or replacement character in the opposite order from their placement in the original string. Reversing the string when we're done puts everything back in the right order.

Putting it all together, we end up with this:

defmodule CleanString do
 
  @specials [
    <<255, 240>>,
    <<255, 241>>,
    <<255, 242>>,
    <<255, 243>>,
    <<255, 244>>,
    <<255, 245>>,
    <<255, 246>>,
    <<255, 247>>,
    <<255, 248>>,
    <<255, 240>>,
    <<255, 16>>,
    <<255, 17>>,
    <<255, 18>>,
    <<255, 19>>,
    <<255, 20>>,
    <<255, 250>>,
    <<255, 251>>,
    <<255, 252>>,
    <<255, 254>>,
    <<255, 255>>
  ]
 
  @replacement_character "�"
 
  def unicode_only(string, new_string \\ "")
 
  def unicode_only(<<head::binary-size(2)>> <> tail, new_string)
      when head in @specials do
    unicode_only(tail, @replacement_character <> new_string)
  end
 
  def unicode_only(<<head::binary-size(1)>> <> tail, new_string) do
    unicode_only(tail, head <> new_string)
  end
 
  def unicode_only("", new_string), do: String.reverse(new_string)
end

Wrap Up

With a set of functions implemented in around ten lines of code, we've created an elegant solution for a common, tough problem. Thanks to Elixir's powerful pattern matching functionality, which we used here to pattern match binary strings in function heads, we have a handy function for cleaning bad characters out of any string.

String sanitization is something that you may need to do in the wild, and Elixir is a perfect fit for the job. Beyond that specific scenario though, this solution demonstrates how nicely Elixir pattern matching and recursion go hand in hand.

Instead of implementing complex and difficult-to-read if statements, our recursive code flow was built entirely with pattern matching. Each scenario was mapped to a different unicode_only/2 function head.

Reach for function head pattern matching whenever you want to execute different function behavior based on different inputs, and you'll end up with clean, highly readable, and maintainable code. You can even pattern match on single characters, or groups of characters, in a binary string with the syntax you've seen here.

Taken together, pattern matching and recursion are a match made in Elixir heaven. Add these tools to your Elixir toolbox and take them out whenever you have a complex input processing problem to solve.

Happy coding!

P.S. If you'd like to read Elixir Alchemy posts as soon as they get off the press, subscribe to our Elixir Alchemy newsletter and never miss a single post!

Share this article

RSS
Sophie DeBenedetto

Sophie DeBenedetto

Our guest author Sophie is a Senior Engineer at GitHub, co-author of Programming Phoenix LiveView, and co-host of the BeamRad.io podcast. She has a passion for coding education. Historically, she is a cat person but will admit to owning a dog.

-> All articles by Sophie DeBenedetto-> Become an AppSignal author

AppSignal monitors your apps

AppSignal provides insights for Ruby, Rails, Elixir, Phoenix, Node.js, Express and many other frameworks and libraries. We are located in beautiful Amsterdam. We love stroopwafels. If you do too, let us know. We might send you some!

Discover AppSignal
AppSignal monitors your apps