280 lines
13 KiB
Markdown
280 lines
13 KiB
Markdown
|
Your friendly guide to understanding the performance characteristics of this
|
||
|
crate.
|
||
|
|
||
|
This guide assumes some familiarity with the public API of this crate, which
|
||
|
can be found here: http://doc.rust-lang.org/regex/regex/index.html
|
||
|
|
||
|
## Theory vs. Practice
|
||
|
|
||
|
One of the design goals of this crate is to provide worst case linear time
|
||
|
behavior with respect to the text searched using finite state automata. This
|
||
|
means that, *in theory*, the performance of this crate is much better than most
|
||
|
regex implementations, which typically use backtracking which has worst case
|
||
|
exponential time.
|
||
|
|
||
|
For example, try opening a Python interpreter and typing this:
|
||
|
|
||
|
>>> import re
|
||
|
>>> re.search('(a*)*c', 'a' * 30).span()
|
||
|
|
||
|
I'll wait.
|
||
|
|
||
|
At some point, you'll figure out that it won't terminate any time soon. ^C it.
|
||
|
|
||
|
The promise of this crate is that *this pathological behavior can't happen*.
|
||
|
|
||
|
With that said, just because we have protected ourselves against worst case
|
||
|
exponential behavior doesn't mean we are immune from large constant factors
|
||
|
or places where the current regex engine isn't quite optimal. This guide will
|
||
|
detail those cases and provide guidance on how to avoid them, among other
|
||
|
bits of general advice.
|
||
|
|
||
|
## Thou Shalt Not Compile Regular Expressions In A Loop
|
||
|
|
||
|
**Advice**: Use `lazy_static` to amortize the cost of `Regex` compilation.
|
||
|
|
||
|
Don't do it unless you really don't mind paying for it. Compiling a regular
|
||
|
expression in this crate is quite expensive. It is conceivable that it may get
|
||
|
faster some day, but I wouldn't hold out hope for, say, an order of magnitude
|
||
|
improvement. In particular, compilation can take any where from a few dozen
|
||
|
microseconds to a few dozen milliseconds. Yes, milliseconds. Unicode character
|
||
|
classes, in particular, have the largest impact on compilation performance. At
|
||
|
the time of writing, for example, `\pL{100}` takes around 44ms to compile. This
|
||
|
is because `\pL` corresponds to every letter in Unicode and compilation must
|
||
|
turn it into a proper automaton that decodes a subset of UTF-8 which
|
||
|
corresponds to those letters. Compilation also spends some cycles shrinking the
|
||
|
size of the automaton.
|
||
|
|
||
|
This means that in order to realize efficient regex matching, one must
|
||
|
*amortize the cost of compilation*. Trivially, if a call to `is_match` is
|
||
|
inside a loop, then make sure your call to `Regex::new` is *outside* that loop.
|
||
|
|
||
|
In many programming languages, regular expressions can be conveniently defined
|
||
|
and compiled in a global scope, and code can reach out and use them as if
|
||
|
they were global static variables. In Rust, there is really no concept of
|
||
|
life-before-main, and therefore, one cannot utter this:
|
||
|
|
||
|
static MY_REGEX: Regex = Regex::new("...").unwrap();
|
||
|
|
||
|
Unfortunately, this would seem to imply that one must pass `Regex` objects
|
||
|
around to everywhere they are used, which can be especially painful depending
|
||
|
on how your program is structured. Thankfully, the
|
||
|
[`lazy_static`](https://crates.io/crates/lazy_static)
|
||
|
crate provides an answer that works well:
|
||
|
|
||
|
#[macro_use] extern crate lazy_static;
|
||
|
extern crate regex;
|
||
|
|
||
|
use regex::Regex;
|
||
|
|
||
|
fn some_helper_function(text: &str) -> bool {
|
||
|
lazy_static! {
|
||
|
static ref MY_REGEX: Regex = Regex::new("...").unwrap();
|
||
|
}
|
||
|
MY_REGEX.is_match(text)
|
||
|
}
|
||
|
|
||
|
In other words, the `lazy_static!` macro enables us to define a `Regex` *as if*
|
||
|
it were a global static value. What is actually happening under the covers is
|
||
|
that the code inside the macro (i.e., `Regex::new(...)`) is run on *first use*
|
||
|
of `MY_REGEX` via a `Deref` impl. The implementation is admittedly magical, but
|
||
|
it's self contained and everything works exactly as you expect. In particular,
|
||
|
`MY_REGEX` can be used from multiple threads without wrapping it in an `Arc` or
|
||
|
a `Mutex`. On that note...
|
||
|
|
||
|
## Using a regex from multiple threads
|
||
|
|
||
|
**Advice**: The performance impact from using a `Regex` from multiple threads
|
||
|
is likely negligible. If necessary, clone the `Regex` so that each thread gets
|
||
|
its own copy. Cloning a regex does not incur any additional memory overhead
|
||
|
than what would be used by using a `Regex` from multiple threads
|
||
|
simultaneously. *Its only cost is ergonomics.*
|
||
|
|
||
|
It is supported and encouraged to define your regexes using `lazy_static!` as
|
||
|
if they were global static values, and then use them to search text from
|
||
|
multiple threads simultaneously.
|
||
|
|
||
|
One might imagine that this is possible because a `Regex` represents a
|
||
|
*compiled* program, so that any allocation or mutation is already done, and is
|
||
|
therefore read-only. Unfortunately, this is not true. Each type of search
|
||
|
strategy in this crate requires some kind of mutable scratch space to use
|
||
|
*during search*. For example, when executing a DFA, its states are computed
|
||
|
lazily and reused on subsequent searches. Those states go into that mutable
|
||
|
scratch space.
|
||
|
|
||
|
The mutable scratch space is an implementation detail, and in general, its
|
||
|
mutation should not be observable from users of this crate. Therefore, it uses
|
||
|
interior mutability. This implies that `Regex` can either only be used from one
|
||
|
thread, or it must do some sort of synchronization. Either choice is
|
||
|
reasonable, but this crate chooses the latter, in particular because it is
|
||
|
ergonomic and makes use with `lazy_static!` straight forward.
|
||
|
|
||
|
Synchronization implies *some* amount of overhead. When a `Regex` is used from
|
||
|
a single thread, this overhead is negligible. When a `Regex` is used from
|
||
|
multiple threads simultaneously, it is possible for the overhead of
|
||
|
synchronization from contention to impact performance. The specific cases where
|
||
|
contention may happen is if you are calling any of these methods repeatedly
|
||
|
from multiple threads simultaneously:
|
||
|
|
||
|
* shortest_match
|
||
|
* is_match
|
||
|
* find
|
||
|
* captures
|
||
|
|
||
|
In particular, every invocation of one of these methods must synchronize with
|
||
|
other threads to retrieve its mutable scratch space before searching can start.
|
||
|
If, however, you are using one of these methods:
|
||
|
|
||
|
* find_iter
|
||
|
* captures_iter
|
||
|
|
||
|
Then you may not suffer from contention since the cost of synchronization is
|
||
|
amortized on *construction of the iterator*. That is, the mutable scratch space
|
||
|
is obtained when the iterator is created and retained throughout its lifetime.
|
||
|
|
||
|
## Only ask for what you need
|
||
|
|
||
|
**Advice**: Prefer in this order: `is_match`, `find`, `captures`.
|
||
|
|
||
|
There are three primary search methods on a `Regex`:
|
||
|
|
||
|
* is_match
|
||
|
* find
|
||
|
* captures
|
||
|
|
||
|
In general, these are ordered from fastest to slowest.
|
||
|
|
||
|
`is_match` is fastest because it doesn't actually need to find the start or the
|
||
|
end of the leftmost-first match. It can quit immediately after it knows there
|
||
|
is a match. For example, given the regex `a+` and the haystack, `aaaaa`, the
|
||
|
search will quit after examing the first byte.
|
||
|
|
||
|
In constrast, `find` must return both the start and end location of the
|
||
|
leftmost-first match. It can use the DFA matcher for this, but must run it
|
||
|
forwards once to find the end of the match *and then run it backwards* to find
|
||
|
the start of the match. The two scans and the cost of finding the real end of
|
||
|
the leftmost-first match make this more expensive than `is_match`.
|
||
|
|
||
|
`captures` is the most expensive of them all because it must do what `find`
|
||
|
does, and then run either the bounded backtracker or the Pike VM to fill in the
|
||
|
capture group locations. Both of these are simulations of an NFA, which must
|
||
|
spend a lot of time shuffling states around. The DFA limits the performance hit
|
||
|
somewhat by restricting the amount of text that must be searched via an NFA
|
||
|
simulation.
|
||
|
|
||
|
One other method not mentioned is `shortest_match`. This method has precisely
|
||
|
the same performance characteristics as `is_match`, except it will return the
|
||
|
end location of when it discovered a match. For example, given the regex `a+`
|
||
|
and the haystack `aaaaa`, `shortest_match` may return `1` as opposed to `5`,
|
||
|
the latter of which being the correct end location of the leftmost-first match.
|
||
|
|
||
|
## Literals in your regex may make it faster
|
||
|
|
||
|
**Advice**: Literals can reduce the work that the regex engine needs to do. Use
|
||
|
them if you can, especially as prefixes.
|
||
|
|
||
|
In particular, if your regex starts with a prefix literal, the prefix is
|
||
|
quickly searched before entering the (much slower) regex engine. For example,
|
||
|
given the regex `foo\w+`, the literal `foo` will be searched for using
|
||
|
Boyer-Moore. If there's no match, then no regex engine is ever used. Only when
|
||
|
there's a match is the regex engine invoked at the location of the match, which
|
||
|
effectively permits the regex engine to skip large portions of a haystack.
|
||
|
If a regex is comprised entirely of literals (possibly more than one), then
|
||
|
it's possible that the regex engine can be avoided entirely even when there's a
|
||
|
match.
|
||
|
|
||
|
When one literal is found, Boyer-Moore is used. When multiple literals are
|
||
|
found, then an optimized version of Aho-Corasick is used.
|
||
|
|
||
|
This optimization is in particular extended quite a bit in this crate. Here are
|
||
|
a few examples of regexes that get literal prefixes detected:
|
||
|
|
||
|
* `(foo|bar)` detects `foo` and `bar`
|
||
|
* `(a|b)c` detects `ac` and `bc`
|
||
|
* `[ab]foo[yz]` detects `afooy`, `afooz`, `bfooy` and `bfooz`
|
||
|
* `a?b` detects `a` and `b`
|
||
|
* `a*b` detects `a` and `b`
|
||
|
* `(ab){3,6}` detects `ababab`
|
||
|
|
||
|
Literals in anchored regexes can also be used for detecting non-matches very
|
||
|
quickly. For example, `^foo\w+` and `\w+foo$` may be able to detect a non-match
|
||
|
just by examing the first (or last) three bytes of the haystack.
|
||
|
|
||
|
## Unicode word boundaries may prevent the DFA from being used
|
||
|
|
||
|
**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
|
||
|
instead of `\b` if you care about consistent performance more than correctness.
|
||
|
|
||
|
It's a sad state of the current implementation. At the moment, the DFA will try
|
||
|
to interpret Unicode word boundaries as if they were ASCII word boundaries.
|
||
|
If the DFA comes across any non-ASCII byte, it will quit and fall back to an
|
||
|
alternative matching engine that can handle Unicode word boundaries correctly.
|
||
|
The alternate matching engine is generally quite a bit slower (perhaps by an
|
||
|
order of magnitude). If necessary, this can be ameliorated in two ways.
|
||
|
|
||
|
The first way is to add some number of literal prefixes to your regular
|
||
|
expression. Even though the DFA may not be used, specialized routines will
|
||
|
still kick in to find prefix literals quickly, which limits how much work the
|
||
|
NFA simulation will need to do.
|
||
|
|
||
|
The second way is to give up on Unicode and use an ASCII word boundary instead.
|
||
|
One can use an ASCII word boundary by disabling Unicode support. That is,
|
||
|
instead of using `\b`, use `(?-u:\b)`. Namely, given the regex `\b.+\b`, it
|
||
|
can be transformed into a regex that uses the DFA with `(?-u:\b).+(?-u:\b)`. It
|
||
|
is important to limit the scope of disabling the `u` flag, since it might lead
|
||
|
to a syntax error if the regex could match arbitrary bytes. For example, if one
|
||
|
wrote `(?-u)\b.+\b`, then a syntax error would be returned because `.` matches
|
||
|
any *byte* when the Unicode flag is disabled.
|
||
|
|
||
|
The second way isn't appreciably different than just using a Unicode word
|
||
|
boundary in the first place, since the DFA will speculatively interpret it as
|
||
|
an ASCII word boundary anyway. The key difference is that if an ASCII word
|
||
|
boundary is used explicitly, then the DFA won't quit in the presence of
|
||
|
non-ASCII UTF-8 bytes. This results in giving up correctness in exchange for
|
||
|
more consistent performance.
|
||
|
|
||
|
N.B. When using `bytes::Regex`, Unicode support is disabled by default, so one
|
||
|
can simply write `\b` to get an ASCII word boundary.
|
||
|
|
||
|
## Excessive counting can lead to exponential state blow up in the DFA
|
||
|
|
||
|
**Advice**: Don't write regexes that cause DFA state blow up if you care about
|
||
|
match performance.
|
||
|
|
||
|
Wait, didn't I say that this crate guards against exponential worst cases?
|
||
|
Well, it turns out that the process of converting an NFA to a DFA can lead to
|
||
|
an exponential blow up in the number of states. This crate specifically guards
|
||
|
against exponential blow up by doing two things:
|
||
|
|
||
|
1. The DFA is computed lazily. That is, a state in the DFA only exists in
|
||
|
memory if it is visited. In particular, the lazy DFA guarantees that *at
|
||
|
most* one state is created for every byte of input. This, on its own,
|
||
|
guarantees linear time complexity.
|
||
|
2. Of course, creating a new state for *every* byte of input means that search
|
||
|
will go incredibly slow because of very large constant factors. On top of
|
||
|
that, creating a state for every byte in a large haystack could result in
|
||
|
exorbitant memory usage. To ameliorate this, the DFA bounds the number of
|
||
|
states it can store. Once it reaches its limit, it flushes its cache. This
|
||
|
prevents reuse of states that it already computed. If the cache is flushed
|
||
|
too frequently, then the DFA will give up and execution will fall back to
|
||
|
one of the NFA simulations.
|
||
|
|
||
|
In effect, this crate will detect exponential state blow up and fall back to
|
||
|
a search routine with fixed memory requirements. This does, however, mean that
|
||
|
searching will be much slower than one might expect. Regexes that rely on
|
||
|
counting in particular are strong aggravators of this behavior. For example,
|
||
|
matching `[01]*1[01]{20}$` against a random sequence of `0`s and `1`s.
|
||
|
|
||
|
In the future, it may be possible to increase the bound that the DFA uses,
|
||
|
which would allow the caller to choose how much memory they're willing to
|
||
|
spend.
|
||
|
|
||
|
## Resist the temptation to "optimize" regexes
|
||
|
|
||
|
**Advice**: This ain't a backtracking engine.
|
||
|
|
||
|
An entire book was written on how to optimize Perl-style regular expressions.
|
||
|
Most of those techniques are not applicable for this library. For example,
|
||
|
there is no problem with using non-greedy matching or having lots of
|
||
|
alternations in your regex.
|