Cleaning Up Ruby Strings 13 Times Faster

Maud de Vries

When translating your thoughts into code, most likely, you use the methods that you are most familiar with. These are methods that are top of mind and come automatically to you: you see a string that needs cleaning up and your fingers type the methods that will get the result.

Often, the methods that you type automatically are the most generic Ruby methods, because they are the ones we read and write more than others. For example, #gsub is a generic method to substitute characters in strings. But Ruby has so much more to offer, with more specialized convenience methods for standard operations.

I love Ruby’s rich idiom mostly because it makes code more elegant and easier to read. If we want to benefit from this richness, we need to spend time refactoring even the simplest parts of our code—for instance, cleaning up a string—and it takes a bit of an effort to expand our vocabulary. The question is: is the extra effort worth it?

Four Ways to Remove Spaces

Here’s a string that represents a credit card number: “055 444 285”. To work with it, we want to remove the spaces. #gsub can do this; with #gsub you can substitute anything with anything. But there are other options.

string = "055 444 285"
string.gsub(/ /, '')
string.gsub(' ', '')
string.tr(' ', '')
string.delete(' ')

# => "055444285"

It’s the expressiveness that I like most about the convenience methods. The last one is a good example of this: it doesn’t get more obvious than “delete spaces”. When weighing the trade-offs between options, readability is my first priority, unless, of course, it causes performance problems. So, let’s see how much pain my favorite solution, #delete, really causes.

I benchmarked the examples above. Which one of these methods do you think is the fastest?

require 'benchmark/ips' # provided by the benchmark-ips gem

string = "055 444 285"

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('gsub')           { string.gsub(/ /, '') }
  x.report('gsub, no regex') { string.gsub(' ', '') }
  x.report('tr')             { string.tr(' ', '') }
  x.report('delete')         { string.delete(' ') }

  x.compare!
end

Guess the order from most to least performant, then compare it with the result:

Comparison:
  delete:          2326817.5 i/s
  tr:              2121629.8 i/s   - 1.10x  slower
  gsub, no regex:  868184.1 i/s    - 2.68x  slower
  gsub:            474970.5 i/s    - 4.90x  slower

I wasn’t surprised about the order, but the differences in speed still surprised me. #gsub is not only slower, but it also requires an extra effort for the reader to ‘decode’ the arguments. Let’s see how this comparison works out when cleaning up more than just spaces.
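If the benchmark-ips gem is not at hand, a rough version of the same comparison can be sketched with Ruby's standard-library Benchmark module. This is only an approximation of the benchmark above: the constant names and the iteration count are assumptions made for this sketch, and absolute timings will differ per machine and Ruby version.

```ruby
require 'benchmark'

STRING = "055 444 285"
ITERATIONS = 100_000 # arbitrary count chosen for this sketch

# The four candidates from the article, wrapped in lambdas.
CANDIDATES = {
  'gsub'           => -> { STRING.gsub(/ /, '') },
  'gsub, no regex' => -> { STRING.gsub(' ', '') },
  'tr'             => -> { STRING.tr(' ', '') },
  'delete'         => -> { STRING.delete(' ') }
}.freeze

# All candidates should agree before their speed is worth comparing.
RESULTS = CANDIDATES.transform_values(&:call)

# Wall-clock seconds needed to run each candidate ITERATIONS times.
TIMINGS = CANDIDATES.transform_values do |code|
  Benchmark.realtime { ITERATIONS.times { code.call } }
end

# Print the candidates from fastest to slowest.
TIMINGS.sort_by { |_label, seconds| seconds }.each do |label, seconds|
  puts format('%-15s %.4fs', label, seconds)
end
```

Unlike benchmark-ips, this sketch has no warmup phase and no error margin, so small differences between two candidates should not be read as meaningful.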

Pick Your Numbers

Take the following phone number: '(408) 974-2414'. Let’s say we only need the digits: 4089742414. I added a #scan as well because I like that it expresses more clearly that we aim for some particular things, instead of trying to remove all the things we don’t want.

require 'benchmark/ips' # provided by the benchmark-ips gem

string = '(408) 974-2414'

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('gsub')            { string.gsub(/[^0-9]/, '') }
  x.report('tr')              { string.tr("^0-9", "") }
  x.report('delete_chars')    { string.delete("^0-9") }
  x.report('scan')            { string.scan(/[0-9]/).join }
  x.compare!
end

Again, guess the order, then compare it with the answer:

Comparison:
  delete_chars:   2006750.8 i/s
  tr:             1856429.0 i/s   - 1.08x  slower
  gsub:           523174.7 i/s    - 3.84x  slower
  scan:           227717.4 i/s    - 8.81x  slower

Using a regex slows things down; that’s not surprising. And the intention-revealing expressiveness of #scan costs us dearly. But seeing how Ruby’s specialized methods handle cleaning up gave me a taste for more.
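The "^0-9" argument passed to #delete and #tr is a character set, not a regex, and a leading caret negates it. A minimal sketch of how the variants compare on the phone number from this section (the method name `digits_only_variants` is made up for this illustration):

```ruby
# Returns the result of each "keep only the digits" variant for one input.
def digits_only_variants(phone)
  {
    # In #delete and #tr, "^0-9" means "every character that is not a digit".
    'delete' => phone.delete('^0-9'),
    # With an empty replacement string, #tr removes the matched characters.
    'tr'     => phone.tr('^0-9', ''),
    # In a regex, the same negation needs a character class.
    'gsub'   => phone.gsub(/[^0-9]/, ''),
    # #scan collects the wanted characters instead of removing the rest.
    'scan'   => phone.scan(/[0-9]/).join
  }
end

puts digits_only_variants('(408) 974-2414').values.uniq.inspect
# → ["4089742414"]
```

All four return a new string and leave the original untouched, so they are interchangeable for this input.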

On the Money

Let’s try some ways of removing the substring "€ " from the string "€ 300". Some of the following solutions specify the exact substring "€ "; others simply remove all currency symbols or all non-numerical characters.

require 'benchmark/ips' # provided by the benchmark-ips gem

string = "€ 300"

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('delete specific chars')  { string.delete("€ ") }
  x.report('delete non-numericals')  { string.delete("^0-9") }
  x.report('delete prefix')          { string.delete_prefix("€ ") }
  x.report('delete prefix, strip')   { string.delete_prefix("€").strip }

  x.report('gsub')                   { string.gsub(/€ /, '') }
  x.report('gsub-non-nums')          { string.gsub(/[^0-9]/, '') }
  x.report('tr')                     { string.tr("€ ", "") }
  x.report('slice array')            { string.chars.slice(2..-1).join }
  x.report('split')                  { string.split.last }
  x.report('scan nums')              { string.scan(/\d/).join }
  x.compare!
end

You may expect, and correctly so, that the winner is one of the #deletes. But which one of the #delete variants do you expect to be the fastest? Plus: one of the other methods is faster than some of the #deletes. Which one?

Guess, then check the result:

Comparison:
        delete prefix:   4236218.6 i/s
 delete prefix, strip:   3116439.6 i/s - 1.36x  slower
                split:   2139602.2 i/s - 1.98x  slower
delete non-numericals:   1949754.0 i/s - 2.17x  slower
delete specific chars:   1045651.9 i/s - 4.05x  slower
                   tr:   951352.0 i/s  - 4.45x  slower
          slice array:   681196.2 i/s  - 6.22x  slower
                 gsub:   548588.3 i/s  - 7.72x  slower
        gsub-non-nums:   489744.8 i/s  - 8.65x  slower
            scan nums:   418978.8 i/s  - 10.11x  slower

I was surprised that even slicing an array is faster than #gsub, and I’m always pleased to see how fast #split is. Also note that deleting all non-numericals is faster than deleting the specific characters "€" and " ".
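The variants in this benchmark agree on "€ 300", but they do not all mean the same thing, which matters when picking the fastest one. The following sketch runs a few of them on a second, made-up input ("€ 1 300") to show where they diverge; the method name `strip_currency_variants` is an assumption for this illustration.

```ruby
# Returns the result of several "remove the euro sign" variants for one input.
def strip_currency_variants(string)
  {
    # #delete takes a character set: every "€" and every space is removed.
    'delete specific chars' => string.delete("€ "),
    'delete non-numericals' => string.delete("^0-9"),
    # #delete_prefix only removes the exact substring at the start.
    'delete prefix'         => string.delete_prefix("€ "),
    'delete prefix, strip'  => string.delete_prefix("€").strip,
    # Drops the first two characters, whatever they are.
    'slice array'           => string.chars.slice(2..-1).join,
    # Keeps only the last whitespace-separated part.
    'split'                 => string.split.last
  }
end

puts strip_currency_variants("€ 300").values.uniq.inspect
# → ["300"]

strip_currency_variants("€ 1 300").each do |label, result|
  puts format('%-22s %s', label, result.inspect)
end
# delete specific chars  "1300"
# delete non-numericals  "1300"
# delete prefix          "1 300"
# delete prefix, strip   "1 300"
# slice array            "1 300"
# split                  "300"
```

So the speed ranking only helps once it is clear which of these behaviours the input actually calls for.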

Follow the Money

Let’s remove the currency after the number. (I skipped the slower #gsub variants.)

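require 'benchmark/ips' # provided by the benchmark-ips gem

string = "300 USD" # assumed input; the exact string is not shown in the article

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('gsub')                        { string.gsub(/ USD/, '') }
  x.report('tr')                          { string.tr(" USD", "") }
  x.report('delete_chars')                { string.delete("^0-9") }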
  x.report('delete_suffix')               { string.delete_suffix(" USD") }
  x.report('to_i.to_s')                   { string.to_i.to_s }
  x.report("split")                       { string.split.first }
  x.compare!
end

There’s a draw between winners. Which 2 do you expect to compete for being the fastest?

And: guess how much slower #gsub is here.

Comparison:
       delete_suffix:      4354205.4 i/s
           to_i.to_s:      4307614.6 i/s - same-ish: difference falls within error
               split:      2870187.8 i/s - 1.52x  slower
        delete_chars:      1989566.1 i/s - 2.19x  slower
                  tr:      1853957.1 i/s - 2.35x  slower
                gsub:      524080.6 i/s - 13.22x  slower

There isn’t always a specialized method that will suit your needs. You can’t use #to_i if you need to keep a leading “0”. And #delete_suffix leans heavily on the assumption that the currency is US Dollars.
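These caveats are easy to check. The sketch below uses made-up inputs ("300 USD", "0300 USD" and "300 EUR", none of which come from the benchmark itself) and a hypothetical helper name, `strip_usd_variants`, to show where each method stops giving the wanted result.

```ruby
# Returns the result of several "remove the trailing currency" variants.
def strip_usd_variants(string)
  {
    'delete_suffix' => string.delete_suffix(" USD"),
    'to_i.to_s'     => string.to_i.to_s,
    'split'         => string.split.first,
    'delete_chars'  => string.delete("^0-9"),
    # #tr with an empty replacement removes the characters " ", "U", "S", "D".
    'tr'            => string.tr(" USD", "")
  }
end

puts strip_usd_variants("300 USD").values.uniq.inspect
# → ["300"]

# A leading zero does not survive the round trip through an Integer.
puts strip_usd_variants("0300 USD")['to_i.to_s'].inspect     # → "300"
puts strip_usd_variants("0300 USD")['delete_suffix'].inspect # → "0300"

# With another currency, #delete_suffix leaves the string unchanged,
# and #tr removes only the characters that happen to overlap.
puts strip_usd_variants("300 EUR")['delete_suffix'].inspect  # → "300 EUR"
puts strip_usd_variants("300 EUR")['tr'].inspect             # → "300ER"
puts strip_usd_variants("300 EUR")['split'].inspect          # → "300"
```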

The specialized methods are like precision tools: suitable for a specific task in a specific context. So there will always be cases where #gsub is exactly what we need. It is versatile, and it’s always top of mind. But it can be a bit harder to process and is often slower, even slower than I expected. To me, Ruby’s richness is also one of the reasons that make it so much fun to work with. The speed wins are a nice bonus.

Guest author Maud de Vries is a freelance Ruby on Rails developer, a Coach for (solo) entrepreneurs and she used to be an editor as well. The writer inside sometimes escapes.
