utf8-ranges =========== This crate converts contiguous ranges of Unicode scalar values to UTF-8 byte ranges. This is useful when constructing byte based automata from Unicode. Stated differently, this lets one embed UTF-8 decoding as part of one's automaton. [![Linux build status](https://api.travis-ci.org/BurntSushi/utf8-ranges.png)](https://travis-ci.org/BurntSushi/utf8-ranges) [![](http://meritbadge.herokuapp.com/utf8-ranges)](https://crates.io/crates/walkdir) Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org). ### Documentation [http://burntsushi.net/rustdoc/utf8_ranges/](http://burntsushi.net/rustdoc/utf8_ranges/) ### Example This shows how to convert a scalar value range (e.g., the basic multilingual plane) to a sequence of byte based character classes. ```rust extern crate utf8_ranges; use utf8_ranges::Utf8Sequences; fn main() { for range in Utf8Sequences::new('\u{0}', '\u{FFFF}') { println!("{:?}", range); } } ``` The output: ``` [0-7F] [C2-DF][80-BF] [E0][A0-BF][80-BF] [E1-EC][80-BF][80-BF] [ED][80-9F][80-BF] [EE-EF][80-BF][80-BF] ``` These ranges can then be used to build an automaton. Namely: 1. Every arbitrary sequence of bytes matches exactly one of the sequences of ranges or none of them. 2. Every match sequence of bytes is guaranteed to be valid UTF-8. (Erroneous encodings of surrogate codepoints in UTF-8 cannot match any of the byte ranges above.)