Ruby is not only a fun language, it also comes with an excellent standard library. Some parts of it are not that well known and are almost hidden Gems. Today guest writer Michael Kohl highlights a favorite: StringScanner.
Ruby's hidden Gems: StringScanner
One can get quite far without having to resort to installing third-party gems: the standard library covers everything from data structures like OpenStruct and Set through CSV parsing to benchmarking. However, there are some less well-known libraries available in Ruby's standard installation that can be very useful, one of which is StringScanner, which according to the documentation "provides lexical scanning operations on a string".
Scanning and parsing
So what does "lexical scanning" mean exactly? Essentially it describes the process of taking an input string and extracting meaningful bits of information from it, following certain rules. For example, this can be seen at the first stage of a compiler, which takes an expression like 2 + 1 as input and turns it into the following sequence of tokens:
[{ number: "2" }, { operator: "+" }, { number: "1" }]
Lexical scanners are usually implemented as finite-state automata and there are several well-known tools available that can generate them for us (e.g. ANTLR or Ragel).
However, sometimes our parsing needs aren't that elaborate, and a simpler library like the regular-expression-based StringScanner can come in very handy. It works by remembering the location of a so-called scan pointer, which is nothing more than an index into the string. The scanning process then tries to match the text right after the scan pointer against the provided expression. Apart from matching operations, StringScanner also provides methods for moving the scan pointer (forwards or backwards through the string), looking ahead (seeing what's next without modifying the scan pointer just yet), as well as finding out where in the string we currently are (the beginning or end of a line, the end of the entire string, etc.).
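To connect the token example with the scan pointer idea, here is a minimal sketch of a tokenizer for simple arithmetic expressions. The tokenize helper and the shape of the token hashes are assumptions made for illustration; only scan, skip, eos? and pos are StringScanner methods.

```ruby
require "strscan"

# Hypothetical helper (not part of StringScanner): turns a simple arithmetic
# expression into a list of single-key token hashes.
def tokenize(expression)
  scanner = StringScanner.new(expression)
  tokens = []

  until scanner.eos?
    if scanner.skip(/\s+/)
      next # whitespace separates tokens but is not a token itself
    elsif (number = scanner.scan(/\d+/))
      tokens << { number: number }
    elsif (operator = scanner.scan(%r{[-+*/]}))
      tokens << { operator: operator }
    else
      raise ArgumentError, "unexpected character at position #{scanner.pos}"
    end
  end

  tokens
end

tokens = tokenize("2 + 1")
# tokens == [{ number: "2" }, { operator: "+" }, { number: "1" }]
```

Each call to scan or skip only matches directly at the scan pointer, so the loop consumes the input strictly from left to right.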
Parsing Rails Logs
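Enough theory, let's see StringScanner in action. The following example takes a Rails log entry, whose first and last lines appear in the diagrams further down, and parses it into a hash holding the request method, path, IP address, timestamp, a success flag, the response code and the duration.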
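As a rough sketch, the input and the target could be written down as follows. Only the first and last lines of the entry are reproduced, since the lines in between are elided in this article, and the variable and key names are assumptions chosen for illustration.

```ruby
# First and last lines of the log entry; the lines in between are elided in
# the article and therefore left out here.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

# The hash we want to end up with (key names are assumptions of this sketch).
expected = {
  method: "GET",
  path: "/",
  ip: "127.0.0.1",
  timestamp: "2017-08-20 20:53:10 +0900",
  success: true,
  response_code: "200",
  duration: "79ms"
}
```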
NB: While this makes for a good example for StringScanner, a real application would be better off using Lograge and its JSON log formatter.
In order to use StringScanner we first need to require it:
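The library file is called strscan, so the require line reads:

```ruby
require "strscan"
```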
After this we can initialize a new instance by passing the log entry as an argument to the constructor. At the same time we'll also define an empty hash to hold the result of our parsing efforts:
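A minimal sketch of this setup, assuming a shortened two-line entry and the variable names entry, scanner and log (all three names are assumptions of this sketch):

```ruby
require "strscan"

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

scanner = StringScanner.new(entry) # the scan pointer starts at index 0
log = {}                           # will collect the parsed values
```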
We can now use the scanner's pos method to get the current location of our scan pointer. As expected, the result is 0, the index of the first character of the string:
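Sketched on a freshly created scanner (the entry is the shortened two-line version, and the variable names are assumptions):

```ruby
require "strscan"

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

scanner = StringScanner.new(entry)
scanner.pos # => 0
```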
Let's visualize this so the process will be easier to follow along:
Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
^
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
For further introspection of the scanner's state we can use beginning_of_line? and eos? to confirm that the scan pointer currently is at the beginning of a line and that we have not yet fully consumed our input:
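On a fresh scanner over the shortened two-line entry (variable names are assumptions), the two checks could look like this:

```ruby
require "strscan"

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

scanner = StringScanner.new(entry)
scanner.beginning_of_line? # => true
scanner.eos?               # => false
```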
The first bit of information we want to extract is the HTTP request method, which can be found right after the word "Started" followed by a space. We can use the scanner's appropriately named skip method to advance the scan pointer past that prefix. It returns the number of ignored characters, which in our case is 8. Additionally we can use matched? to confirm that everything worked as expected:
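A sketch of this step on the shortened two-line entry (variable names are assumptions):

```ruby
require "strscan"

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

scanner = StringScanner.new(entry)
skipped = scanner.skip(/Started /) # => 8
scanner.matched?                   # => true
```

Had the pattern not matched at the scan pointer, skip would have returned nil and left the pointer where it was.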
The scan pointer is now right before the request method:
Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
        ^
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
Now we can use scan_until to extract the actual value. It returns everything from the scan pointer up to the end of the regular expression match, which here is just the match itself. Since the request method is all in uppercase, we can use a simple character class and the + operator, which matches one or more characters:
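Sketched in code, again on the shortened two-line entry and with the key name :method as an assumption:

```ruby
require "strscan"

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

scanner = StringScanner.new(entry)
log = {}

scanner.skip(/Started /)                    # replay the previous step
log[:method] = scanner.scan_until(/[A-Z]+/) # => "GET"
```

Because the match starts directly at the scan pointer, scan with the same pattern would return the same value here; scan_until would additionally search forward if other text came first.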
After this operation the scan pointer will be right after the final "T" of the word "GET".
Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
           ^
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
To extract the requested path, we will therefore need to skip one space and then extract everything enclosed in double quotes. There are several ways to achieve this; one of them is via a capture group (the part of the regular expression enclosed in parentheses, i.e. (.+)), which matches one or more of any character:
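One possible pattern for this is sketched below. The exact regular expression is an assumption of this sketch, as are the variable names; the entry is the shortened two-line version.

```ruby
require "strscan"

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

scanner = StringScanner.new(entry)

# Replay the earlier steps so the pointer sits right after "GET".
scanner.skip(/Started /)
scanner.scan_until(/[A-Z]+/)

# One whitespace character, then the quoted path with a capture group inside.
matched = scanner.scan(/\s"(.+)"/) # => " \"/\""
```

The return value still contains the leading space and the quotes, which is why the capture group is the more useful part of this match.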
However, we will not be using the return value of this scan operation directly, but will use captures to get the value of the first capture group instead:
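A sketch of reading the capture group, assuming a Ruby version whose StringScanner provides captures. The pattern, the key name :path and the variable names are assumptions; the entry is the shortened two-line version.

```ruby
require "strscan"

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

scanner = StringScanner.new(entry)
log = {}

# Replay the earlier steps, including the scan with the capture group.
scanner.skip(/Started /)
scanner.scan_until(/[A-Z]+/)
scanner.scan(/\s"(.+)"/)

log[:path] = scanner.captures.first # => "/"
```

Indexing the scanner with scanner[1] returns the same first capture group and can be used where captures is not available.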
We successfully extracted the path and the scan pointer is now just past the closing double quote:
Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
               ^
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
To parse the IP address from the log, we once again use skip to ignore the string "for" surrounded by spaces and then use scan_until to match one or more non-whitespace characters (\s is the character class representing whitespace and [^\s] is its negation):
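In code this could look like the following sketch (shortened two-line entry; the key name :ip and the variable names are assumptions):

```ruby
require "strscan"

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

scanner = StringScanner.new(entry)
log = {}

# Replay the earlier steps so the pointer sits just past the quoted path.
scanner.skip(/Started /)
scanner.scan_until(/[A-Z]+/)
scanner.scan(/\s"(.+)"/)

scanner.skip(/ for /)                   # => 5
log[:ip] = scanner.scan_until(/[^\s]+/) # => "127.0.0.1"
```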
Can you tell where the scan pointer will be now? Think about it for a moment and then compare your answer to the solution:
Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
                             ^
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
Parsing the timestamp should feel very familiar by now. First we use trusty old skip to ignore the literal string " at " and then use scan_until to read until the end of the current line, which is represented by $ in regular expressions:
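A sketch of the timestamp step. To keep it short, the scanner is fast-forwarded by assigning pos instead of replaying every earlier step; the key name :timestamp and the variable names are assumptions, and the entry is the shortened two-line version.

```ruby
require "strscan"

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

scanner = StringScanner.new(entry)
log = {}

# Fast-forward to where the previous step left off, just past the IP address.
scanner.pos = entry.index(" at ")

scanner.skip(/ at /)                      # => 4
log[:timestamp] = scanner.scan_until(/$/) # => "2017-08-20 20:53:10 +0900"
```

In Ruby, $ matches at the end of every line, so the scan stops before the newline of the first line rather than at the end of the whole entry.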
The next piece of information we're interested in is the HTTP status code on the last line, so we'll use skip_until to take us all the way to the space after the word "Completed".
As the name suggests, this works similarly to scan_until, but instead of returning the matched string it returns the number of skipped-over characters. This puts the scan pointer right in front of the HTTP status code we're interested in.
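Sketched on the shortened two-line entry, starting from the end of the first line (variable names are assumptions). The number returned depends on how many lines sit between the first and the last one, so the value in the comment only holds for this shortened entry.

```ruby
require "strscan"

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

scanner = StringScanner.new(entry)

# Fast-forward to the end of the first line, where the timestamp step stopped.
scanner.pos = entry.index("\n")

skipped = scanner.skip_until(/Completed\s/) # => 11 for this shortened entry
scanner.rest.start_with?("200")             # => true
```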
Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
          ^
Now before we scan the actual HTTP response code, wouldn't it be nice if we could tell if the HTTP response code denotes a success (for the sake of this example any code in the 2xx range) or failure (all other ranges)? To achieve this we will make use of peek to look at the next character, without actually moving the scan pointer.
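A sketch of the look-ahead, with the key name :success and the variable names as assumptions and the shortened two-line entry as input:

```ruby
require "strscan"

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

scanner = StringScanner.new(entry)
log = {}

scanner.skip_until(/Completed\s/) # pointer is now in front of the status code
position = scanner.pos

log[:success] = scanner.peek(1) == "2" # => true
scanner.pos == position                # => true, peek did not move the pointer
```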
Now we can use scan to read the next three characters, all of them digits, represented by the regular expression /\d{3}/:
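In code, on the shortened two-line entry (the key name :response_code and the variable names are assumptions):

```ruby
require "strscan"

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

scanner = StringScanner.new(entry)
log = {}

scanner.skip_until(/Completed\s/)           # pointer in front of the status code
log[:response_code] = scanner.scan(/\d{3}/) # => "200"
```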
Once again the scan pointer will be right at the end of the previously matched regular expression:
Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
...
Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
             ^
The last bit of information we want to extract from our log entry is the execution time in milliseconds, which can be achieved by skipping over the string " OK in " and then reading everything up to and including the literal string "ms".
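A sketch of this final step (shortened two-line entry; the key name :duration and the variable names are assumptions):

```ruby
require "strscan"

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

scanner = StringScanner.new(entry)
log = {}

# Replay the earlier steps so the pointer sits right after the status code.
scanner.skip_until(/Completed\s/)
scanner.scan(/\d{3}/)

scanner.skip(/ OK in /)                   # => 7
log[:duration] = scanner.scan_until(/ms/) # => "79ms"
```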
And with that last bit in there, we have the hash we wanted.
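Putting all the steps together, the whole walkthrough might look like the sketch below. The parse_log_entry helper, the key names and the exact patterns are assumptions made for illustration, and the entry is the shortened two-line version used to keep the example self-contained.

```ruby
require "strscan"

# Hypothetical helper that strings the individual steps together.
def parse_log_entry(entry)
  scanner = StringScanner.new(entry)
  log = {}

  scanner.skip(/Started /)
  log[:method] = scanner.scan_until(/[A-Z]+/)

  scanner.scan(/\s"(.+)"/)
  log[:path] = scanner.captures.first

  scanner.skip(/ for /)
  log[:ip] = scanner.scan_until(/[^\s]+/)

  scanner.skip(/ at /)
  log[:timestamp] = scanner.scan_until(/$/)

  scanner.skip_until(/Completed\s/)
  log[:success] = scanner.peek(1) == "2"
  log[:response_code] = scanner.scan(/\d{3}/)

  scanner.skip(/ OK in /)
  log[:duration] = scanner.scan_until(/ms/)

  log
end

# Shortened entry: only the first and last lines shown in this article.
entry = <<~LOG
  Started GET "/" for 127.0.0.1 at 2017-08-20 20:53:10 +0900
  Completed 200 OK in 79ms (Views: 78.8ms | ActiveRecord: 0.0ms)
LOG

result = parse_log_entry(entry)
# result[:method]   # => "GET"
# result[:duration] # => "79ms"
```

The sketch assumes a well-formed entry and does no error handling. Every scan and skip returns nil when its pattern does not match, so production code would want to check those return values.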
Summary
Ruby's StringScanner occupies a nice middle ground between simple regular expressions and a full-blown lexer. It isn't the best choice for complex scanning and parsing needs, but its straightforward nature makes it easy for anyone with basic regular expression knowledge to extract information from input strings, and I've used it successfully in production code in the past. We hope you'll discover this hidden Gem.
PS: Let us know what you think are hidden Gems we should highlight next!