Introduction
Some time ago I had a phone interview to test my programming skills and after a few basic questions with some live coding – how would you implement memcpy()? how would you implement memmove() – and after an expansion for exponentiation that was out of my league without consulting Knuth – the challenge was given:
“Write a regular expression to match an IPv4 address.”
…and I felt really good about this – I consider myself to be a serious regexp geek, not Spencer / Wall / Schwartz / Rossum, but I have at least written a few lexers, parsers, matchers and a bunch of fileglob routines – so I started out with the obvious base for refinement:
\d+\.\d+\.\d+\.\d+
“Yes, but that will match any octets.”
Yes, yes, quite true – so let’s do some restricted matching on it:
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
…or in fact I wrote:
(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})
…because (having done this a few hundred times in the past two decades) I reasoned that the programmer would want to parse-out the octets at this point, and so it’d be worthwhile saving them; my interviewer followed through with:
“Yes, but that will still match values larger than 255!”
Well, OK – I thought – if you want to start being strict with the parser you could use the regular expression engine to reject inputs outside the range 0..255; you can do that, but you are starting to overload the code. Nonetheless playing along in the spirit of the endeavour (and with the clock ticking) – I began to mod the code, and eventually each octet started to look like:
(\d|\d\d|1\d\d|2(([0-4]\d)|5[0-5]))
…the total looking somewhat like:
(\d|\d\d|1\d\d|2(([0-4]\d)|5[0-5]))\.
(\d|\d\d|1\d\d|2(([0-4]\d)|5[0-5]))\.
(\d|\d\d|1\d\d|2(([0-4]\d)|5[0-5]))\.
(\d|\d\d|1\d\d|2(([0-4]\d)|5[0-5]))
…but I said to the guy “Look, you really don’t want to be doing it this way, it’ll spend all of its time backtracking…” which brought the prompt retort:
“It will not! There is no backtracking!” [ed: verbatim, I think]
Hmm. Okay. Maybe that was the wrong way to explain it?
What I was trying to convey was that the pattern matcher would be thrashing backwards and forwards (because of all the parenthetical OR-separated subgroups) looking for fixed points to anchor against; but we were very short for time and in fact the interview was soon terminated for overrunning – just after I pasted a code-fragment of how I would more verbosely but I felt more wisely go about solving the problem…
[update: see also the DFA vs: NFA comments in the Reddit thread]
Yet one week later the feedback from the interview was deemed to have been “negative” – and that was the end of that job application.
So be it. Maybe I need to be able to write my own exponentiation routine to be a real programmer?
However, not being one to let a performance issue drop I decided to document the experience for my blog, and hence this posting.
Anyone who’s read the first edition of The Camel Book[1] will remember the section that tells you that multiple, simple regular expressions are dealt with much faster than long and complex ones, and it’s kinda obvious why that happens when you consider what it’s trying to achieve – but the cost/scale of the problem often gets missed.
But let’s put that aside those concerns for a moment and look at three other issues:
Non-Performance Concerns
- Code Maintainability: the above is atrocious from the perspective of maintenance, it’s the sort of thing which unfairly gives Perl a bad reputation. It looks complex, the core pattern is repeated four times – meaning four things to poke, four places to make mistakes – plus the above is actually buggy, as we will soon demonstrate.
- Code Coupling: you should not be using your parser to do semantic validation; syntactic validation certainly, but in compiler design you don’t hotwire lex to check for integer overflow, instead you say this looks like an integer and pass it off to another module which does input validation and optimisation.
Perhaps my interviewer knew this, perhaps not; but it’s very easy (but incorrect) to assume that parsing comes for free and any validation processing that is put on top of that is high and extra cost.
- Code Complexity: what happens when the next specification change requires you to match only legitimate class-A/B/C addresses, and skip multicast. What then? It’s already bad enough…
So those are three more reasons not to solve the challenge this way, but I am making a performance claim so let’s get back to the experiment:
Lab Conditions
I created a standard blob of test data from 2Gb of Apache webserver logfiles and errorfiles; actually it was 700-ish megabytes of log data concatenated thrice, but that’s close enough for government work.
I then built a small Perl harness for pushing this 2Gb file through a regular expression, doing some small amount of pointless work to ensure that nothing is being optimised-away, while printing the matches for debugging.
Finally: I wrote a shellscript wrapper to run the whole thing against the clock.
All code for this posting is available at:code.google.com/p/regular-expression-experiment; the stuff in the posting will likely be a little munged by HTML making it inoperable.
The Perl Harness
Small and simple, the Perl harness is this:
$pat = qr!( REGULAR EXPRESSION )!ox;
while (<>) {
while (s/$pat/-/o) {
print "$1\n";
}
}
The Challenge
So let’s run again through the development and refinement of this regular expression challenge; first let’s define our own problem statement:
Take a large, arbitrary logfile; extract from it anything that looks like a legitimate IPv4 address in decimal-dotted-quad notation, possibly multiple ones per line, possibly embedded within strings of text (hostnames?) rather than nicely tokenised. For the moment ignore the problem of leading zeros and/or zero-padding. Make three timed passes over the data for each test. Quiescent machine with 4Gb RAM so the input file can prettymuch live in the disk buffer cache.
The Simple Approach
Let’s start with the simple pattern which will produce wrong results:
$pat = qr!( \d+\. \d+\. \d+\. \d+ )!ox;runtimes: 1m37.020s 1m37.204s 1m36.845s
We know this pattern will match illegal strings like “1000.1000.1000.1000” however it provides a useful fast-and-dumb baseline time of about 1m37s. If we could guarantee that no illegal data were in the file, this would be the time we would have to beat – but then that would be no fun.
Note that we only have one set of parenthesis in the pattern, saving out the whole thing for printing; if we also save the individual octets with more parenthses for later processing, what’s the extra cost?
$pat = qr!( (\d+)\. (\d+)\. (\d+)\. (\d+) )!ox;runtimes: 1m48.240s 1m48.152s 1m48.114s
Answer: it costs an extra 11 seconds with no other code changes; not much, but it does demonstrate that regular expression simplicity is a factor in performance, even if you change nothing else.
Now let’s try and legitimise the output and print only those strings with legitimate “octets” which are in the range 0..255:
$pat = qr!( (\d+)\. (\d+)\. (\d+)\. (\d+) )!ox;...
print "$1\n" if (($2 < 256) and ($3 < 256) and ($4 < 256) and ($5 < 256));
runtimes: 2m7.649s 2m8.172s 2m7.414s
The test shows that this works really quite well, and the four octet string-to-int-and-comparison cost for every line of output has added only another 20s to runtime.
SUMMARY FOR THIS SECTION: 2m7s is the time to beat.
The Slightly More Clever Matching Approach
So let’s try to get clever – after all, all that string-to-int conversion has to be more expensive than regular expressions, surely? A succession of development is necessary, however:
$pat = qr!( \d{1,3}\. \d{1,3}\. \d{1,3}\. \d{1,3} )!ox;
There’s a bug in the above code because the regexp is not adequately greedy and so if it encounters a string like foo-12.34.56.1000-bar then it will return 12.34.56.100 ignoring the trailing “0” that would disqualify this from being an IP address. [nb: EXAMPLE FIXED]
In short, this regexp would synthesise bad data, so let’s improve it:
$pat = qr!( (\d{1,3})\. (\d{1,3})\. (\d{1,3})\. (\d{1,3}) )!ox;...
print "$1\n" if (($1 < 256) and ($2 < 256) and ($3 < 256) and ($4 < 256));
Nope, you can’t do this because the match passes the bad data into the integer comparison test. It’s useful to do this to get a sense of the timings but bzzzzt! try again:
$pat = qr!( (\d{1,3})\. (\d{1,3})\. (\d{1,3})\. (\d{1,3})\b # ANCHOR TO WORD BOUNDARY )!ox;...
print "$1\n" if (($2 < 256) and ($3 < 256) and ($4 < 256) and ($5 < 256));
runtimes: 2m32.568s 2m33.067s 2m32.716s
That’s clever, isn’t it? Anchor the tail of the regexp to a word boundary so that the last octet must be comprised of between one and three digits at the end of a word; plus you still have to check for “octets” like 999.
And it takes about 25 seconds longer than the dumb-parse-and-compare method.
Ouch.
SUMMARY FOR THIS SECTION: “smarter pattern matching” = 120% FAIL.
The Complex Matching Approach
Let’s throw more science at the problem; first-up are the non-solutions:
$pat = qr!(
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5])\.
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5])\.
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5])\.
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5]) # nongreedy, RE will take shortest match
)!ox;
Oy veh, this is a revision of the behemoth from above. It’s also buggy. The problem is the left-to-right evaluation of the parethesised subgroups – in short if you feed 255.255.255.255 into this match, you’ll probably get 255.255.255.2 returned to you because the first \d matched the first character of the final octet, and returned that.
I have no idea whether there’s a mandatory left-to-right execution of matches within groups (POSIX?) so the results might end up being implementation-dependent – and that is always a bad thing.
Never fear – we can just reverse the order:
$pat = qr!(
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\.
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\.
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\.
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)
…except that this one suffers the same “unanchored” problem as the “slightly more clever” set, so that 12.34.56.789 synthesises 12.34.56.78, data that can’t be an IP address and didn’t exist in the original.
However we know how to fix this, don’t we?
$pat = qr!(
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\.
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\.
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\.
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\b # ANCHOR TO WORD BOUNDARY
)!ox;
runtimes: 3m8.960s 3m8.586s 3m8.445s
So solving this problem by pushing values (“…or a three digit number beginning with ’25’…”) into the regular expression has added 1 minute to the program runtime.
SUMMARY FOR THIS SECTION: “complex pattern matching” = 148% FAIL.
That’s really bad, and that’s what I was trying to get across to my interviewer; but let’s try and salvage this approach, just one last time…
The Hybrid Approach
$pat = qr!(
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5])\.
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5])\.
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5])\.
(\d+) # GREEDY BUT NEEDS TO BE BOUNDSCHECKED
)!ox;...
print "$1\n" if ($5 < 256);
runtimes: 2m54.555s 2m53.972s 2m53.931s
You get an intermediate result, recovering about 14 seconds of runtime by not trying to coersce the matcher into working out how big a decimal number should be, and instead letting it get on with identifying things that look like sequences of decimal numerals; and that's only with respect to the final octet.
SUMMARY FOR THIS SECTION: "hybrid pattern matching" = 136% FAIL.
So it looks like fast-and-dumb wins... except there's one more twist in the tale.
The Anchoring Bug
Remember we anchored a few of the patterns above with \b to tie the final octet to a word boundary? Well comparison of the output showed those algorithms to be rejecting strings of the form:
foo192.168.1.1bar
...because the final octet was not on a "word boundary" as defined by Perl regular expressions; in Perl \b is defined as a interchange between alphanumeric and non-alphanumeric and the final octet in the example above is not on such an interchange; so all of those implementations are also dropping potentially-viable inputs. Plus the anchoring issue also applies to the head-end, too, which we missed in the first round of regexps. Argh.
This aspect beat me - I stared at the regular expressions technical reference for a while trying to address this before deciding "to hell with this, it's going to still be slower than the parse-and-convert method".
Which (again) was my point all along.
The Summary
There are still a bunch of open questions and ambiguities in the above problem definition, and guesses that we have to make about potential inputs and whether they create viable outputs. Also: what about leading zero padding? Are they permissible? If so, how many zeroes should be permitted? Some of the regular expressions cited above permit leading zeros, others don't.
Not to mention: what about IPv6? 🙂
But to get the performance boosts I've described above, I can at least provide the following advice:
- you - or someone - really ought to be specifying the problem really clearly before implementation
- and then you should be implementing in a style that satisfies all of the performance and maintainability issues that are raised above.
- use your patternmatcher to match patterns, not to compare values.
- k.i.s.s. works really well for regexp.
- this is a question which is really hard to treat properly in 10 minutes.
- the camel book is still epic.
--
[1] I have January 1991 first edition, love it so much that I rescued it from my Saab after crashing it in 1993 (winter, bend, black ice) into a river and flooding the car; I washed the book in pure water to remove the silt and spent four weeks drying it in a cupboard before I could open it again.
Leave a Reply