utf8-ranges
===========
This crate converts contiguous ranges of Unicode scalar values to UTF-8 byte
ranges. This is useful when constructing byte-based automata from Unicode.
Stated differently, this lets one embed UTF-8 decoding as part of one's
automaton.

[![Linux build status](https://api.travis-ci.org/BurntSushi/utf8-ranges.png)](https://travis-ci.org/BurntSushi/utf8-ranges)
[![](http://meritbadge.herokuapp.com/utf8-ranges)](https://crates.io/crates/utf8-ranges)

Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org).

### Documentation

[http://burntsushi.net/rustdoc/utf8_ranges/](http://burntsushi.net/rustdoc/utf8_ranges/)

### Example

This shows how to convert a range of scalar values (e.g., the Basic Multilingual
Plane) to a sequence of byte-based character classes.

```rust
extern crate utf8_ranges;

use utf8_ranges::Utf8Sequences;

fn main() {
    // Each item is one sequence of byte ranges; together they cover
    // the UTF-8 encodings of U+0000 through U+FFFF.
    for range in Utf8Sequences::new('\u{0}', '\u{FFFF}') {
        println!("{:?}", range);
    }
}
```

The output:

```
[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
```
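The boundaries in this output line up with the UTF-8 encodings of the code points at the edges of each encoded length and on either side of the surrogate gap. The following is a minimal sketch that prints those encodings using only the standard library; `utf8_bytes` and `EDGES` are hypothetical names chosen for illustration and are not part of this crate.

```rust
/// Encode one scalar value as UTF-8 and return its bytes.
/// (Illustrative helper, not part of utf8-ranges.)
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Code points at the edges of the ranges shown above.
const EDGES: [char; 8] = [
    '\u{0}',    // first 1-byte code point
    '\u{7F}',   // last 1-byte code point
    '\u{80}',   // first 2-byte code point
    '\u{7FF}',  // last 2-byte code point
    '\u{800}',  // first 3-byte code point
    '\u{D7FF}', // last scalar value before the surrogate gap
    '\u{E000}', // first scalar value after the surrogate gap
    '\u{FFFF}', // last code point of the Basic Multilingual Plane
];

fn main() {
    for &c in EDGES.iter() {
        // Print the code point followed by its encoded bytes in hex.
        println!("U+{:04X} -> {:02X?}", c as u32, utf8_bytes(c));
    }
}
```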

These ranges can then be used to build an automaton. Namely:

1. Any sequence of bytes matches either exactly one of the sequences of
   ranges or none of them.
2. Every matching sequence of bytes is guaranteed to be valid UTF-8. (Erroneous
   encodings of surrogate codepoints in UTF-8 cannot match any of the byte
   ranges above.)
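The two properties above can be spot-checked with a small matcher. This is a sketch under stated assumptions, not how the crate or any automaton library works internally: the table is transcribed by hand from the output shown earlier, and `ByteRange`, `BMP_SEQUENCES`, `matches`, and `match_count` are hypothetical names that do not exist in utf8-ranges. It uses only the standard library.

```rust
/// An inclusive byte range such as `[80-BF]`. (Illustrative type.)
type ByteRange = (u8, u8);

/// The six sequences from the example output, transcribed by hand.
const BMP_SEQUENCES: &[&[ByteRange]] = &[
    &[(0x00, 0x7F)],
    &[(0xC2, 0xDF), (0x80, 0xBF)],
    &[(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    &[(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    &[(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    &[(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
];

/// True if `bytes` is as long as `seq` and each byte lies in its range.
fn matches(seq: &[ByteRange], bytes: &[u8]) -> bool {
    seq.len() == bytes.len()
        && seq.iter().zip(bytes).all(|(&(lo, hi), &b)| lo <= b && b <= hi)
}

/// Number of sequences in the table that match `bytes` in full.
fn match_count(bytes: &[u8]) -> usize {
    BMP_SEQUENCES.iter().filter(|&&seq| matches(seq, bytes)).count()
}

fn main() {
    // Property 1, checked over the range used in the example: the encoding
    // of every scalar value in U+0000..=U+FFFF matches exactly one sequence.
    for cp in 0u32..=0xFFFF {
        // `char::from_u32` returns None for surrogates, which are skipped.
        if let Some(c) = char::from_u32(cp) {
            let mut buf = [0u8; 4];
            let encoded = c.encode_utf8(&mut buf).as_bytes();
            assert_eq!(match_count(encoded), 1);
        }
    }

    // Property 2, spot-checked at the corners of each sequence only:
    // the lowest and highest byte strings it accepts are valid UTF-8.
    for seq in BMP_SEQUENCES {
        let lowest: Vec<u8> = seq.iter().map(|&(lo, _)| lo).collect();
        let highest: Vec<u8> = seq.iter().map(|&(_, hi)| hi).collect();
        assert!(std::str::from_utf8(&lowest).is_ok());
        assert!(std::str::from_utf8(&highest).is_ok());
    }

    // An erroneous encoding of the surrogate U+D800 matches nothing,
    // and the standard library rejects it as well.
    let surrogate: [u8; 3] = [0xED, 0xA0, 0x80];
    assert_eq!(match_count(&surrogate), 0);
    assert!(std::str::from_utf8(&surrogate).is_err());

    // An overlong two-byte encoding of '/' matches nothing either.
    assert_eq!(match_count(&[0xC0, 0xAF]), 0);

    println!("all checks passed");
}
```

A real automaton would turn each sequence into a chain of states with one transition per byte range instead of scanning a table, but the accepted byte strings would be the same.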