regex
=====
This crate provides routines for searching strings for matches of a [regular
expression] (aka "regex"). The regex syntax supported by this crate is similar
to other regex engines, but it lacks several features that are not known how to
implement efficiently. This includes, but is not limited to, look-around and
backreferences. In exchange, all regex searches in this crate have worst case
`O(m * n)` time complexity, where `m` is proportional to the size of the regex
and `n` is proportional to the size of the string being searched.

[regular expression]: https://en.wikipedia.org/wiki/Regular_expression

[![Build status](https://github.com/rust-lang/regex/workflows/ci/badge.svg)](https://github.com/rust-lang/regex/actions)
[![Crates.io](https://img.shields.io/crates/v/regex.svg)](https://crates.io/crates/regex)

### Documentation

[Module documentation with examples](https://docs.rs/regex).
The module documentation also includes a comprehensive description of the
syntax supported.

Documentation with examples for the various matching functions and iterators
can be found on the
[`Regex` type](https://docs.rs/regex/*/regex/struct.Regex.html).

### Usage

To bring this crate into your repository, either add `regex` to your
`Cargo.toml`, or run `cargo add regex`.

Here's a simple example that matches a date in YYYY-MM-DD format and prints the
year, month and day:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"(?x)
(?P<year>\d{4})  # the year
-
(?P<month>\d{2}) # the month
-
(?P<day>\d{2})   # the day
").unwrap();

    let caps = re.captures("2010-03-14").unwrap();
    assert_eq!("2010", &caps["year"]);
    assert_eq!("03", &caps["month"]);
    assert_eq!("14", &caps["day"]);
}
```

If you have lots of dates in text that you'd like to iterate over, then it's
easy to adapt the above example with an iterator:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
    let hay = "On 2010-03-14, foo happened. On 2014-10-14, bar happened.";

    let mut dates = vec![];
    for (_, [year, month, day]) in re.captures_iter(hay).map(|c| c.extract()) {
        dates.push((year, month, day));
    }
    assert_eq!(dates, vec![
        ("2010", "03", "14"),
        ("2014", "10", "14"),
    ]);
}
```

### Usage: Avoid compiling the same regex in a loop

It is an anti-pattern to compile the same regular expression in a loop since
compilation is typically expensive. (It takes anywhere from a few microseconds
to a few **milliseconds** depending on the size of the regex.) Not only is
compilation itself expensive, but this also prevents optimizations that reuse
allocations internally to the matching engines.

In Rust, it can sometimes be a pain to pass regular expressions around if
they're used from inside a helper function. Instead, we recommend using the
[`once_cell`](https://crates.io/crates/once_cell) crate to ensure that
regular expressions are compiled exactly once. For example:

```rust
use {
    once_cell::sync::Lazy,
    regex::Regex,
};

fn some_helper_function(haystack: &str) -> bool {
    static RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"...").unwrap());
    RE.is_match(haystack)
}

fn main() {
    assert!(some_helper_function("abc"));
    assert!(!some_helper_function("ac"));
}
```

Specifically, in this example, the regex will be compiled when it is used for
the first time. On subsequent uses, it will reuse the previous compilation.

### Usage: match regular expressions on `&[u8]`

The main API of this crate (`regex::Regex`) requires the caller to pass a
`&str` for searching. In Rust, an `&str` is required to be valid UTF-8, which
means the main API can't be used for searching arbitrary bytes.

To match on arbitrary bytes, use the `regex::bytes::Regex` API. The API is
identical to the main API, except that it takes an `&[u8]` to search on instead
of an `&str`. The `&[u8]` APIs also permit disabling Unicode mode in the regex
even when the pattern would match invalid UTF-8. For example, `(?-u:.)` is
not allowed in `regex::Regex` but is allowed in `regex::bytes::Regex` since
`(?-u:.)` matches any byte except for `\n`. Conversely, `.` will match the
UTF-8 encoding of any Unicode scalar value except for `\n`.

This example shows how to find all null-terminated strings in a slice of bytes:

```rust
use regex::bytes::Regex;

let re = Regex::new(r"(?-u)(?<cstr>[^\x00]+)\x00").unwrap();
let text = b"foo\xFFbar\x00baz\x00";

// Extract all of the strings without the null terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(text)
      .map(|c| c.name("cstr").unwrap().as_bytes())
      .collect();
assert_eq!(vec![&b"foo\xFFbar"[..], &b"baz"[..]], cstrs);
```

Notice here that the `[^\x00]+` will match any *byte* except for `NUL`,
including bytes like `\xFF` which are not valid UTF-8. When using the main API,
`[^\x00]+` would instead match any valid UTF-8 sequence except for `NUL`.

### Usage: match multiple regular expressions simultaneously

This demonstrates how to use a `RegexSet` to match multiple (possibly
overlapping) regular expressions in a single scan of the search text:

```rust
use regex::RegexSet;

let set = RegexSet::new(&[
    r"\w+",
    r"\d+",
    r"\pL+",
    r"foo",
    r"bar",
    r"barfoo",
    r"foobar",
]).unwrap();

// Iterate over and collect all of the matches.
let matches: Vec<_> = set.matches("foobar").into_iter().collect();
assert_eq!(matches, vec![0, 2, 3, 4, 6]);

// You can also test whether a particular regex matched:
let matches = set.matches("foobar");
assert!(!matches.matched(5));
assert!(matches.matched(6));
```


### Usage: regex internals as a library

The [`regex-automata` directory](./regex-automata/) contains a crate that
exposes all of the internal matching engines used by the `regex` crate. The
idea is that the `regex` crate exposes a simple API for 99% of use cases, but
`regex-automata` exposes oodles of customizable behaviors.

[Documentation for `regex-automata`.](https://docs.rs/regex-automata)


### Usage: a regular expression parser

This repository contains the `regex-syntax` crate, which provides a well-tested
regular expression parser, abstract syntax and a high-level intermediate
representation for convenient analysis. It provides no facilities for
compilation or execution. This may be useful if you're implementing your own
regex engine or otherwise need to do analysis on the syntax of a regular
expression. It is otherwise not recommended for general use.

[Documentation for `regex-syntax`.](https://docs.rs/regex-syntax)


### Crate features

This crate comes with several features that permit tweaking the trade-off
between binary size, compilation time and runtime performance. Users of this
crate can selectively disable Unicode tables, or disable any of a variety of
optimizations performed by this crate.

When all of these features are disabled, runtime match performance may be much
worse, but if you're matching on short strings, or if high performance isn't
necessary, then such a configuration is perfectly serviceable. To disable
all such features, use the following `Cargo.toml` dependency configuration:

```toml
[dependencies.regex]
version = "1.3"
default-features = false
# Unless you have a specific reason not to, it's good sense to enable standard
# library support. It enables several optimizations and avoids spin locks. It
# also shouldn't meaningfully impact compile times or binary size.
features = ["std"]
```

This will reduce the dependency tree of `regex` down to two crates:
`regex-syntax` and `regex-automata`.

The full set of features one can disable are
[in the "Crate features" section of the documentation](https://docs.rs/regex/1.*/#crate-features).


### Performance

One of the goals of this crate is for the regex engine to be "fast." While that
is a somewhat nebulous goal, it is usually interpreted in one of two ways.
First, it means that all searches take worst case `O(m * n)` time, where
`m` is proportional to `len(regex)` and `n` is proportional to `len(haystack)`.
Second, it means that even aside from the time complexity constraint, regex
searches are "fast" in practice.

While the first interpretation is pretty unambiguous, the second one remains
nebulous. While nebulous, it guides this crate's architecture and the sorts of
trade-offs it makes. For example, here are some general architectural
statements that follow as a result of the goal to be "fast":

* When given the choice between faster regex searches and faster _Rust compile
  times_, this crate will generally choose faster regex searches.
* When given the choice between faster regex searches and faster _regex compile
  times_, this crate will generally choose faster regex searches. That is, it is
  generally acceptable for `Regex::new` to get a little slower if it means that
  searches get faster. (This is a somewhat delicate balance to strike, because
  the speed of `Regex::new` needs to remain somewhat reasonable. But this is why
  one should avoid re-compiling the same regex over and over again.)
* When given the choice between faster regex searches and simpler API
  design, this crate will generally choose faster regex searches. For example,
  if one didn't care about performance, we could likely get rid of both of
  the `Regex::is_match` and `Regex::find` APIs and instead just rely on
  `Regex::captures`.

There are perhaps more ways that being "fast" influences things.

While this repository used to provide its own benchmark suite, it has since
been moved to [rebar](https://github.com/BurntSushi/rebar). The benchmarks are
quite extensive, and there are many more than what is shown in rebar's README
(which is just limited to a "curated" set meant to compare performance between
regex engines). To run all of this crate's benchmarks, first start by cloning
and installing `rebar`:

```text
$ git clone https://github.com/BurntSushi/rebar
$ cd rebar
$ cargo install --path ./
```

Then build the benchmark harness for just this crate:

```text
$ rebar build -e '^rust/regex$'
```

Run all benchmarks for this crate as tests (each benchmark is executed once to
ensure it works):

```text
$ rebar measure -e '^rust/regex$' -t
```

Record measurements for all benchmarks and save them to a CSV file:

```text
$ rebar measure -e '^rust/regex$' | tee results.csv
```

Explore benchmark timings:

```text
$ rebar cmp results.csv
```

See the `rebar` documentation for more details on how it works and how to
compare results with other regex engines.


### Hacking

The `regex` crate is, for the most part, a pretty thin wrapper around the
[`meta::Regex`](https://docs.rs/regex-automata/latest/regex_automata/meta/struct.Regex.html)
from the
[`regex-automata` crate](https://docs.rs/regex-automata/latest/regex_automata/).
Therefore, if you're looking to work on the internals of this crate, you'll
likely either want to look in `regex-syntax` (for parsing) or `regex-automata`
(for construction of finite automata and the search routines).

My [blog on regex internals](https://blog.burntsushi.net/regex-internals/)
goes into more depth.


### Minimum Rust version policy

This crate's minimum supported `rustc` version is `1.65.0`.

The policy is that the minimum Rust version required to use this crate can be
increased in minor version updates. For example, if regex 1.0 requires Rust
1.20.0, then regex 1.0.z for all values of `z` will also require Rust 1.20.0 or
newer. However, regex 1.y for `y > 0` may require a newer minimum version of
Rust.


### License

This project is licensed under either of

* Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
  https://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or
  https://opensource.org/licenses/MIT)

at your option.

The data in `regex-syntax/src/unicode_tables/` is licensed under the Unicode
License Agreement
([LICENSE-UNICODE](https://www.unicode.org/copyright.html#License)).