Cleaning Up Ruby Strings 13 Times Faster

Maud de Vries

When translating your thoughts into code, most likely, you use the methods that you are most familiar with. These are methods that are top of mind and come automatically to you: you see a string that needs cleaning up and your fingers type the methods that will get the result.

Often, the methods that you type automatically are the most generic Ruby methods, because they are the ones that we read and write more than others. For example, #gsub is a generic method for substituting characters in strings. But Ruby has so much more to offer, with more specialized convenience methods for standard operations.

I love Ruby's rich idiom mostly because it makes code more elegant and easier to read. If we want to benefit from this richness, we need to spend time refactoring even the simplest parts of our code—for instance, cleaning up a string—and it takes a bit of an effort to expand our vocabulary. The question is: is the extra effort worth it?

Four Ways to Remove Spaces

Here's a string that represents a credit card number: "055 444 285". To work with it, we want to remove the spaces. #gsub can do this; with #gsub you can substitute anything with anything. But there are other options.

string = "055 444 285"
string.gsub(/ /, '')
string.gsub(' ', '')
string.tr(' ', '')
string.delete(' ')

# => "055444285"

It's the expressiveness that I like most about the convenience methods. The last one is a good example of this: it doesn't get more obvious than "delete spaces". Thinking about trade-offs between options, readability is my first priority, unless, of course, it causes performance problems. So, let's see how much pain my favorite solution, #delete, really causes.

I benchmarked the examples above. Which one of these methods do you think is the fastest?

require 'benchmark/ips'

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('gsub')           { string.gsub(/ /, '') }
  x.report('gsub, no regex') { string.gsub(' ', '') }
  x.report('tr')             { string.tr(' ', '') }
  x.report('delete')         { string.delete(' ') }

  x.compare!
end

Guess the order from most to least performant, then compare it with the result:

delete:          2326817.5 i/s
tr:              2121629.8 i/s - 1.10x  slower
gsub, no regex:   868184.1 i/s - 2.68x  slower
gsub:             474970.5 i/s - 4.90x  slower

I wasn't surprised about the order, but the differences in speed still surprised me. #gsub is not only slower, but it also requires an extra effort for the reader to 'decode' the arguments. Let's see how this comparison works out when cleaning up more than just spaces.
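
If you want to try a comparison like this without installing the benchmark-ips gem, a rough version can be put together with Ruby's standard library. The sketch below is not the setup that produced the numbers above: the iteration count and the names `candidates` and `timings` are assumptions made for illustration, and it reports elapsed seconds rather than iterations per second. Calling each candidate through a lambda also adds a small, equal overhead to every variant.

```ruby
require 'benchmark'

string = "055 444 285"
iterations = 100_000 # assumed count, chosen only to keep the run short

# Each candidate is a lambda so the same loop can time all of them.
candidates = {
  'gsub'           => -> { string.gsub(/ /, '') },
  'gsub, no regex' => -> { string.gsub(' ', '') },
  'tr'             => -> { string.tr(' ', '') },
  'delete'         => -> { string.delete(' ') }
}

# Benchmark.realtime returns the elapsed wall-clock seconds for its block.
timings = candidates.transform_values do |candidate|
  Benchmark.realtime { iterations.times { candidate.call } }
end

# Print the candidates from fastest to slowest.
timings.sort_by { |_label, seconds| seconds }.each do |label, seconds|
  puts format('%-15s %.4f s', label, seconds)
end
```

The absolute timings will differ per machine and Ruby version, so treat the output as a quick sanity check rather than a measurement.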

Pick Your Numbers

Take the following phone number: '(408) 974-2414'. Let's say we only need the numericals => 4089742414. I added a #scan as well because I like that it expresses more clearly that we aim for some particular things, instead of trying to remove all the things we don't want.

string = '(408) 974-2414'

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('gsub')         { string.gsub(/[^0-9]/, '') }
  x.report('tr')           { string.tr("^0-9", "") }
  x.report('delete_chars') { string.delete("^0-9") }
  x.report('scan')         { string.scan(/[0-9]/).join }

  x.compare!
end

Again, guess the order, then compare it with the answer:

delete_chars:   2006750.8 i/s
tr:             1856429.0 i/s - 1.08x  slower
gsub:            523174.7 i/s - 3.84x  slower
scan:            227717.4 i/s - 8.81x  slower

It's not surprising that using a regex slows things down. And the intention-revealing expressiveness of #scan costs us dearly. But looking at how Ruby's specialized methods handle cleaning up gave me a taste for more.
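
The "^0-9" argument may look like a regex, but #delete and #tr take a character set, where a leading caret negates the set and a hyphen marks a range. As a small sketch of that behavior on the phone number above (the variable names are only for illustration):

```ruby
phone = '(408) 974-2414'

# A leading caret negates the character set: remove everything except 0-9.
digits_via_delete = phone.delete('^0-9')
# tr with an empty replacement string deletes the matched characters.
digits_via_tr = phone.tr('^0-9', '')
# The regex variants express the same idea with a character class.
digits_via_gsub = phone.gsub(/[^0-9]/, '')
digits_via_scan = phone.scan(/[0-9]/).join

puts digits_via_delete # prints 4089742414

# Without the caret the set is positive: the digits go, the rest stays.
leftovers = phone.delete('0-9')
puts leftovers.inspect # prints "() -"
```

All four variants return the same digits for this input; they differ only in how the intent is spelled and in how fast they run.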

On the Money

Let's try some ways of removing the substring "€ " from the string "€ 300". Some of the following solutions specify the exact substring "€ ", some will simply remove all currency symbols or all non-numerical characters.

string = "€ 300"

Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('delete specific chars')  { string.delete("€ ") }
  x.report('delete non-numericals')  { string.delete("^0-9") }
  x.report('delete prefix')          { string.delete_prefix("€ ") }
  x.report('delete prefix, strip')   { string.delete_prefix("€").strip }

  x.report('gsub')                   { string.gsub(/€ /, '') }
  x.report('gsub-non-nums')          { string.gsub(/[^0-9]/, '') }
  x.report('tr')                     { string.tr("€ ", "") }
  x.report('slice array')            { string.chars.slice(2..-1).join }
  x.report('split')                  { string.split.last }
  x.report('scan nums')              { string.scan(/\d/).join }

  x.compare!
end

You may expect, and correctly so, that the winner is one of the #deletes. But which one of the #delete variants do you expect to be the fastest? Plus: one of the other methods is faster than some of the #deletes. Which one?

Guess, then compare with the result:

        delete prefix:   4236218.6 i/s
 delete prefix, strip:   3116439.6 i/s - 1.36x  slower
                split:   2139602.2 i/s - 1.98x  slower
delete non-numericals:   1949754.0 i/s - 2.17x  slower
delete specific chars:   1045651.9 i/s - 4.05x  slower
                   tr:    951352.0 i/s - 4.45x  slower
          slice array:    681196.2 i/s - 6.22x  slower
                 gsub:    548588.3 i/s - 7.72x  slower
        gsub-non-nums:    489744.8 i/s - 8.65x  slower
            scan nums:    418978.8 i/s - 10.11x  slower

I was surprised that even slicing an array is faster than #gsub, and I'm always pleased to see how fast #split is. Note, too, that deleting all non-numericals is faster than deleting the specific characters "€" and " ".
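
These variants are only interchangeable because of what "€ 300" looks like. As a hedged illustration, the sketch below runs a few of them on a second, hypothetical input with a space as thousands separator ("€ 3 000" is not part of the benchmark above). It assumes Ruby 2.5 or later, where #delete_prefix is available.

```ruby
price  = "€ 300"
spaced = "€ 3 000" # hypothetical input with a space as thousands separator

# On the original string, these approaches agree.
same_result = [
  price.delete("€ "),
  price.delete_prefix("€ "),
  price.split.last,
  price.chars.slice(2..-1).join
].uniq
puts same_result.inspect # prints ["300"]

# delete treats its argument as a set of characters, so every space goes.
deleted  = spaced.delete("€ ")
# delete_prefix removes the exact leading substring once and nothing else.
prefixed = spaced.delete_prefix("€ ")
# split.last keeps only the final whitespace-separated token.
last     = spaced.split.last

puts [deleted, prefixed, last].inspect # prints ["3000", "3 000", "000"]
```

So the fastest option is only the right one when its assumptions about the input hold.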

Follow the Money

Let's remove the currency after the number. (I skipped the slower #gsub variants.)

# string holds a number followed by " USD"
Benchmark.ips do |x|
  x.config(time: 30, warmup: 2)

  x.report('gsub')          { string.gsub(/ USD/, '') }
  x.report('tr')            { string.tr(" USD", "") }
  x.report('delete_chars')  { string.delete("^0-9") }
  x.report('delete_suffix') { string.delete_suffix(" USD") }
  x.report('to_i.to_s')     { string.to_i.to_s }
  x.report('split')         { string.split.first }

  x.compare!
end

There's a draw between winners. Which 2 do you expect to compete for being the fastest?

And: guess _how much_ slower `#gsub` is here.

delete_suffix:   4354205.4 i/s
    to_i.to_s:   4307614.6 i/s - same-ish: difference falls within error
        split:   2870187.8 i/s - 1.52x  slower
 delete_chars:   1989566.1 i/s - 2.19x  slower
           tr:   1853957.1 i/s - 2.35x  slower
         gsub:    524080.6 i/s - 13.22x  slower

There isn't always a specialized method that will suit your needs. You can't use #to_i if you need to keep a leading "0". And #delete_suffix leans heavily on the assumption that the currency is US Dollars.
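
Both caveats are easy to see on a small example. The two inputs below are hypothetical, chosen to have a leading zero and a non-USD currency; they are not the string used in the benchmark.

```ruby
amount = "055 USD" # hypothetical amount with a leading zero
euros  = "055 EUR" # hypothetical amount in another currency

# to_i parses the leading digits as an integer, so the leading zero is lost.
via_to_i   = amount.to_i.to_s
# delete_suffix and split leave the digits untouched.
via_suffix = amount.delete_suffix(" USD")
via_split  = amount.split.first

# delete_suffix only removes an exact match; another currency passes through.
unchanged  = euros.delete_suffix(" USD")
# Deleting by character set does not care which currency follows.
via_delete = euros.delete("^0-9")

puts [via_to_i, via_suffix, via_split].inspect # prints ["55", "055", "055"]
puts [unchanged, via_delete].inspect           # prints ["055 EUR", "055"]
```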

The specialized methods are like precision tools: suitable for a specific task in a specific context. So there will always be cases where #gsub is exactly what we need. It is versatile, and it's always top of mind. But it can be a bit harder to process and is often slower, even slower than I expected. To me, Ruby's richness is also one of the things that make it so much fun to work with. The speed wins are a nice bonus.

Guest author Maud de Vries is a freelance Ruby on Rails developer, a Coach for (solo) entrepreneurs and she used to be an editor as well. The writer inside sometimes escapes.
