Crate regex + + [−] + + [src]
+This crate provides a native implementation of regular expressions that is +heavily based on RE2 both in syntax and in implementation. Notably, +backreferences and arbitrary lookahead/lookbehind assertions are not +provided. In return, regular expression searching provided by this package +has excellent worst-case performance. The specific syntax supported is +documented further down.
+ +This crate's documentation provides some simple examples, describes Unicode
+support and exhaustively lists the supported syntax. For more specific
+details on the API, please see the documentation for the Regex
type.
Usage
+This crate is on crates.io and can be
+used by adding regex
to your dependencies in your project's Cargo.toml
.
[dependencies]
+regex = "0.1.8"
+
+
+and this to your crate root:
++extern crate regex; ++ +
First example: find a date
+General use of regular expressions in this package involves compiling an +expression and then using it to search, split or replace text. For example, +to confirm that some text resembles a date:
++use regex::Regex; +let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap(); +assert!(re.is_match("2014-01-01")); ++ +
Notice the use of the ^
and $
anchors. In this crate, every expression
+is executed with an implicit .*?
at the beginning and end, which allows
+it to match anywhere in the text. Anchors can be used to ensure that the
+full text matches an expression.
This example also demonstrates the utility of
+raw strings
+in Rust, which
+are just like regular strings except they are prefixed with an r
and do
+not process any escape sequences. For example, "\\d"
is the same
+expression as r"\d"
.
The regex!
macro
+Rust's compile-time meta-programming facilities provide a way to write a
+regex!
macro which compiles regular expressions when your program
+compiles. Said differently, if you only use regex!
to build regular
+expressions in your program, then your program cannot compile with an
+invalid regular expression. Moreover, the regex!
macro compiles the
+given expression to native Rust code, which ideally makes it faster.
+Unfortunately (or fortunately), the dynamic implementation has had a lot
+more optimization work put it into it currently, so it is faster than
+the regex!
macro in most cases.
To use the regex!
macro, you must enable the plugin
feature and import
+the regex_macros
crate as a syntax extension:
+#![feature(plugin)] +#![plugin(regex_macros)] +extern crate regex; + +fn main() { + let re = regex!(r"^\d{4}-\d{2}-\d{2}$"); + assert!(re.is_match("2014-01-01")); +} ++ +
There are a few things worth mentioning about using the regex!
macro.
+Firstly, the regex!
macro only accepts string literals.
+Secondly, the regex
crate must be linked with the name regex
since
+the generated code depends on finding symbols in the regex
crate.
One downside of using the regex!
macro is that it can increase the
+size of your program's binary since it generates specialized Rust code.
+The extra size probably won't be significant for a small number of
+expressions, but 100+ calls to regex!
will probably result in a
+noticeably bigger binary.
NOTE: This is implemented using a compiler plugin, which is not
+available on the Rust 1.0 beta/stable channels. Therefore, you'll only
+be able to use regex!
on the nightlies.
Example: iterating over capture groups
+This crate provides convenient iterators for matching an expression +repeatedly against a search string to find successive non-overlapping +matches. For example, to find all dates in a string and be able to access +them by their component pieces:
++let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap(); +let text = "2012-03-14, 2013-01-01 and 2014-07-05"; +for cap in re.captures_iter(text) { + println!("Month: {} Day: {} Year: {}", + cap.at(2).unwrap_or(""), cap.at(3).unwrap_or(""), + cap.at(1).unwrap_or("")); +} +// Output: +// Month: 03 Day: 14 Year: 2012 +// Month: 01 Day: 01 Year: 2013 +// Month: 07 Day: 05 Year: 2014 ++ +
Notice that the year is in the capture group indexed at 1
. This is
+because the entire match is stored in the capture group at index 0
.
Example: replacement with named capture groups
+Building on the previous example, perhaps we'd like to rearrange the date +formats. This can be done with text replacement. But to make the code +clearer, we can name our capture groups and use those names as variables +in our replacement text:
++let re = Regex::new(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})").unwrap(); +let before = "2012-03-14, 2013-01-01 and 2014-07-05"; +let after = re.replace_all(before, "$m/$d/$y"); +assert_eq!(after, "03/14/2012, 01/01/2013 and 07/05/2014"); ++ +
The replace
methods are actually polymorphic in the replacement, which
+provides more flexibility than is seen here. (See the documentation for
+Regex::replace
for more details.)
Note that if your regex gets complicated, you can use the x
flag to
+enable insigificant whitespace mode, which also lets you write comments:
+let re = Regex::new(r"(?x) + (?P<y>\d{4}) # the year + - + (?P<m>\d{2}) # the month + - + (?P<d>\d{2}) # the day +").unwrap(); +let before = "2012-03-14, 2013-01-01 and 2014-07-05"; +let after = re.replace_all(before, "$m/$d/$y"); +assert_eq!(after, "03/14/2012, 01/01/2013 and 07/05/2014"); ++ +
Pay for what you use
+With respect to searching text with a regular expression, there are three +questions that can be asked:
+ +-
+
- Does the text match this expression? +
- If so, where does it match? +
- Where are the submatches? +
Generally speaking, this crate could provide a function to answer only #3, +which would subsume #1 and #2 automatically. However, it can be +significantly more expensive to compute the location of submatches, so it's +best not to do it if you don't need to.
+ +Therefore, only use what you need. For example, don't use find
if you
+only need to test if an expression matches a string. (Use is_match
+instead.)
Unicode
+This implementation executes regular expressions only on sequences of +Unicode scalar values while exposing match locations as byte indices into +the search string.
+ +Currently, only simple case folding is supported. Namely, when matching +case-insensitively, the characters are first mapped using the +simple case folding +mapping.
+ +Regular expressions themselves are also only interpreted as a sequence +of Unicode scalar values. This means you can use Unicode characters +directly in your expression:
++let re = Regex::new(r"(?i)Δ+").unwrap(); +assert_eq!(re.find("ΔδΔ"), Some((0, 6))); ++ +
Finally, Unicode general categories and scripts are available as character +classes. For example, you can match a sequence of numerals, Greek or +Cherokee letters:
++let re = Regex::new(r"[\pN\p{Greek}\p{Cherokee}]+").unwrap(); +assert_eq!(re.find("abcΔᎠβⅠᏴγδⅡxyz"), Some((3, 23))); ++ +
Syntax
+The syntax supported in this crate is almost in an exact correspondence +with the syntax supported by RE2. It is documented below.
+ +Note that the regular expression parser and abstract syntax are exposed in
+a separate crate,
+regex-syntax
.
Matching one character
++. any character except new line (includes new line with s flag) +[xyz] A character class matching either x, y or z. +[^xyz] A character class matching any character except x, y and z. +[a-z] A character class matching any character in range a-z. +\d digit (\p{Nd}) +\D not digit +[:alpha:] ASCII character class ([A-Za-z]) +[:^alpha:] Negated ASCII character class ([^A-Za-z]) +\pN One-letter name Unicode character class +\p{Greek} Unicode character class (general category or script) +\PN Negated one-letter name Unicode character class +\P{Greek} negated Unicode character class (general category or script) ++ +
Any named character class may appear inside a bracketed [...]
character
+class. For example, [\p{Greek}\pN]
matches any Greek or numeral
+character.
Composites
++xy concatenation (x followed by y) +x|y alternation (x or y, prefer x) ++ +
Repetitions
++x* zero or more of x (greedy) +x+ one or more of x (greedy) +x? zero or one of x (greedy) +x*? zero or more of x (ungreedy) +x+? one or more of x (ungreedy) +x?? zero or one of x (ungreedy) +x{n,m} at least n x and at most m x (greedy) +x{n,} at least n x (greedy) +x{n} exactly n x +x{n,m}? at least n x and at most m x (ungreedy) +x{n,}? at least n x (ungreedy) +x{n}? exactly n x ++ +
Empty matches
++^ the beginning of text (or start-of-line with multi-line mode) +$ the end of text (or end-of-line with multi-line mode) +\A only the beginning of text (even with multi-line mode enabled) +\z only the end of text (even with multi-line mode enabled) +\b a Unicode word boundary (\w on one side and \W, \A, or \z on other) +\B not a Unicode word boundary ++ +
Grouping and flags
++(exp) numbered capture group (indexed by opening parenthesis) +(?P<name>exp) named (also numbered) capture group (allowed chars: [_0-9a-zA-Z]) +(?:exp) non-capturing group +(?flags) set flags within current group +(?flags:exp) set flags for exp (non-capturing) ++ +
Flags are each a single character. For example, (?x)
sets the flag x
+and (?-x)
clears the flag x
. Multiple flags can be set or cleared at
+the same time: (?xy)
sets both the x
and y
flags and (?x-y)
sets
+the x
flag and clears the y
flag.
All flags are by default disabled. They are:
+ ++i case-insensitive +m multi-line mode: ^ and $ match begin/end of line +s allow . to match \n +U swap the meaning of x* and x*? +x ignore whitespace and allow line comments (starting with `#`) ++ +
Here's an example that matches case-insensitively for only part of the +expression:
++let re = Regex::new(r"(?i)a+(?-i)b+").unwrap(); +let cap = re.captures("AaAaAbbBBBb").unwrap(); +assert_eq!(cap.at(0), Some("AaAaAbb")); ++ +
Notice that the a+
matches either a
or A
, but the b+
only matches
+b
.
Escape sequences
++\* literal *, works for any punctuation character: \.+*?()|[]{}^$ +\a bell (\x07) +\f form feed (\x0C) +\t horizontal tab +\n new line +\r carriage return +\v vertical tab (\x0B) +\123 octal character code (up to three digits) +\x7F hex character code (exactly two digits) +\x{10FFFF} any hex character code corresponding to a Unicode code point ++ +
Perl character classes (Unicode friendly)
+These classes are based on the definitions provided in +UTS#18:
+ ++\d digit (\p{Nd}) +\D not digit +\s whitespace (\p{White_Space}) +\S not whitespace +\w word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control}) +\W not word character ++ +
ASCII character classes
++[:alnum:] alphanumeric ([0-9A-Za-z]) +[:alpha:] alphabetic ([A-Za-z]) +[:ascii:] ASCII ([\x00-\x7F]) +[:blank:] blank ([\t ]) +[:cntrl:] control ([\x00-\x1F\x7F]) +[:digit:] digits ([0-9]) +[:graph:] graphical ([!-~]) +[:lower:] lower case ([a-z]) +[:print:] printable ([ -~]) +[:punct:] punctuation ([!-/:-@[-`{-~]) +[:space:] whitespace ([\t\n\v\f\r ]) +[:upper:] upper case ([A-Z]) +[:word:] word characters ([0-9A-Za-z_]) +[:xdigit:] hex digit ([0-9A-Fa-f]) ++ +
Untrusted input
+This crate can handle both untrusted regular expressions and untrusted +search text.
+ +Untrusted regular expressions are handled by capping the size of a compiled
+regular expression. (See Regex::with_size_limit
.) Without this, it would
+be trivial for an attacker to exhaust your system's memory with expressions
+like a{100}{100}{100}
.
Untrusted search text is allowed because the matching engine(s) in this
+crate have time complexity O(mn)
(with m ~ regex
and n ~ search text
), which means there's no way to cause exponential blow-up like with
+some other regular expression engines. (We pay for this by disallowing
+features like arbitrary look-ahead and back-references.)
Structs
+Captures | +
+ Captures represents a group of captured strings for a single match. + + |
+
FindCaptures | +
+ An iterator that yields all non-overlapping capture groups matching a +particular regular expression. + + |
+
FindMatches | +
+ An iterator over all non-overlapping matches for a particular string. + + |
+
NoExpand | +
+ NoExpand indicates literal string replacement. + + |
+
RegexSplits | +
+ Yields all substrings delimited by a regular expression match. + + |
+
RegexSplitsN | +
+ Yields at most |
+
SubCaptures | +
+ An iterator over capture groups for a particular match of a regular +expression. + + |
+
SubCapturesNamed | +
+ An Iterator over named capture groups as a tuple with the group +name and the value. + + |
+
SubCapturesPos | +
+ An iterator over capture group positions for a particular match of a +regular expression. + + |
+
Enums
+Error | +
+ An error that occurred during parsing or compiling a regular expression. + + |
+
Regex | +
+ A compiled regular expression + + |
+
Traits
+Replacer | +
+ Replacer describes types that can be used to replace matches in a string. + + |
+
Functions
+is_match | +
+ Tests if the given regular expression matches somewhere in the text given. + + |
+
quote | +
+ Escapes all regular expression meta characters in |
+