Crate regex

This crate provides a native implementation of regular expressions that is heavily based on RE2 both in syntax and in implementation. Notably, backreferences and arbitrary lookahead/lookbehind assertions are not provided. In return, regular expression searching provided by this package has excellent worst-case performance. The specific syntax supported is documented further down.

This crate's documentation provides some simple examples, describes Unicode support and exhaustively lists the supported syntax. For more specific details on the API, please see the documentation for the Regex type.

Usage

This crate is on crates.io and can be used by adding regex to your dependencies in your project's Cargo.toml.

[dependencies]
regex = "0.1.8"

and this to your crate root:

extern crate regex;

First example: find a date

General use of regular expressions in this package involves compiling an expression and then using it to search, split or replace text. For example, to confirm that some text resembles a date:

use regex::Regex;
let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap();
assert!(re.is_match("2014-01-01"));

Notice the use of the ^ and $ anchors. In this crate, every expression is executed with an implicit .*? at the beginning and end, which allows it to match anywhere in the text. Anchors can be used to ensure that the full text matches an expression.

This example also demonstrates the utility of raw strings in Rust, which are just like regular strings except they are prefixed with an r and do not process any escape sequences. For example, "\\d" is the same expression as r"\d".
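The raw-string point can be checked with the standard library alone. The sketch below compares the two spellings of the date pattern from the first example; the function names `escaped_pattern` and `raw_pattern` are made up for this illustration and are not part of the crate.

```rust
/// The date pattern written as an ordinary string literal: every
/// backslash meant for the regex engine has to be doubled.
fn escaped_pattern() -> &'static str {
    "^\\d{4}-\\d{2}-\\d{2}$"
}

/// The same pattern as a raw string literal: backslashes are taken as-is.
fn raw_pattern() -> &'static str {
    r"^\d{4}-\d{2}-\d{2}$"
}

fn main() {
    // Both literals denote exactly the same sequence of characters.
    assert_eq!(escaped_pattern(), raw_pattern());
    // "\\d" is two characters long: a backslash followed by `d`.
    assert_eq!("\\d", r"\d");
    assert_eq!(r"\d".len(), 2);
    println!("{}", raw_pattern());
}
```

Since the regex engine only ever sees the resulting characters, either spelling compiles to the same expression; the raw form is simply easier to read.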

The `regex!` macro

Rust's compile-time meta-programming facilities provide a way to write a regex! macro which compiles regular expressions when your program compiles. Said differently, if you only use regex! to build regular expressions in your program, then your program cannot compile with an invalid regular expression. Moreover, the regex! macro compiles the given expression to native Rust code, which ideally makes it faster. Unfortunately (or fortunately), the dynamic implementation has currently had a lot more optimization work put into it, so it is faster than the regex! macro in most cases.

To use the regex! macro, you must enable the plugin feature and import the regex_macros crate as a syntax extension:

#![feature(plugin)]
#![plugin(regex_macros)]
extern crate regex;

fn main() {
    let re = regex!(r"^\d{4}-\d{2}-\d{2}$");
    assert!(re.is_match("2014-01-01"));
}

There are a few things worth mentioning about using the regex! macro. Firstly, the regex! macro only accepts string literals. Secondly, the regex crate must be linked with the name regex since the generated code depends on finding symbols in the regex crate.

One downside of using the regex! macro is that it can increase the size of your program's binary since it generates specialized Rust code. The extra size probably won't be significant for a small number of expressions, but 100+ calls to regex! will probably result in a noticeably bigger binary.

NOTE: This is implemented using a compiler plugin, which is not available on the Rust 1.0 beta/stable channels. Therefore, you'll only be able to use regex! on the nightlies.

Example: iterating over capture groups

This crate provides convenient iterators for matching an expression repeatedly against a search string to find successive non-overlapping matches. For example, to find all dates in a string and be able to access them by their component pieces:

let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
let text = "2012-03-14, 2013-01-01 and 2014-07-05";
for cap in re.captures_iter(text) {
    println!("Month: {} Day: {} Year: {}",
             cap.at(2).unwrap_or(""), cap.at(3).unwrap_or(""),
             cap.at(1).unwrap_or(""));
}
// Output:
// Month: 03 Day: 14 Year: 2012
// Month: 01 Day: 01 Year: 2013
// Month: 07 Day: 05 Year: 2014

Notice that the year is in the capture group indexed at 1. This is because the entire match is stored in the capture group at index 0.

Example: replacement with named capture groups

Building on the previous example, perhaps we'd like to rearrange the date formats. This can be done with text replacement. But to make the code clearer, we can name our capture groups and use those names as variables in our replacement text:

let re = Regex::new(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})").unwrap();
let before = "2012-03-14, 2013-01-01 and 2014-07-05";
let after = re.replace_all(before, "$m/$d/$y");
assert_eq!(after, "03/14/2012, 01/01/2013 and 07/05/2014");

The replace methods are actually polymorphic in the replacement, which provides more flexibility than is seen here. (See the documentation for Regex::replace for more details.)

Note that if your regex gets complicated, you can use the x flag to enable insignificant whitespace mode, which also lets you write comments:

let re = Regex::new(r"(?x)
  (?P<y>\d{4}) # the year
  -
  (?P<m>\d{2}) # the month
  -
  (?P<d>\d{2}) # the day
").unwrap();
let before = "2012-03-14, 2013-01-01 and 2014-07-05";
let after = re.replace_all(before, "$m/$d/$y");
assert_eq!(after, "03/14/2012, 01/01/2013 and 07/05/2014");

Pay for what you use

With respect to searching text with a regular expression, there are three questions that can be asked:

1. Does the text match this expression?
2. If so, where does it match?
3. Where are the submatches?

Generally speaking, this crate could provide a function to answer only #3, which would subsume #1 and #2 automatically. However, it can be significantly more expensive to compute the location of submatches, so it's best not to do it if you don't need to.

Therefore, only use what you need. For example, don't use find if you only need to test if an expression matches a string. (Use is_match instead.)

Unicode

This implementation executes regular expressions only on sequences of Unicode scalar values while exposing match locations as byte indices into the search string.

Currently, only simple case folding is supported. Namely, when matching case-insensitively, the characters are first mapped using the simple case folding mapping.

Regular expressions themselves are also only interpreted as a sequence of Unicode scalar values. This means you can use Unicode characters directly in your expression:

let re = Regex::new(r"(?i)Δ+").unwrap();
assert_eq!(re.find("ΔδΔ"), Some((0, 6)));

Finally, Unicode general categories and scripts are available as character classes. For example, you can match a sequence of numerals, Greek or Cherokee letters:

let re = Regex::new(r"[\pN\p{Greek}\p{Cherokee}]+").unwrap();
assert_eq!(re.find("abcΔᎠβⅠᏴγδⅡxyz"), Some((3, 23)));
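The match positions in the two examples of this section are byte indices, not counts of scalar values. The following standard-library-only sketch reproduces the arithmetic behind (0, 6) and (3, 23); the helper `byte_offset_of_char` is a hypothetical name introduced here, not something the crate provides.

```rust
/// Returns the byte offset at which the `n`-th Unicode scalar value of
/// `text` starts. When `n` equals the number of scalar values, the result
/// is `text.len()`; larger values of `n` yield `None`.
fn byte_offset_of_char(text: &str, n: usize) -> Option<usize> {
    text.char_indices()
        .map(|(byte_idx, _)| byte_idx)
        .chain(std::iter::once(text.len()))
        .nth(n)
}

fn main() {
    // "ΔδΔ" is three scalar values, each encoded as two bytes in UTF-8,
    // so a match over the whole string spans bytes 0..6.
    let greek = "ΔδΔ";
    assert_eq!(greek.chars().count(), 3);
    assert_eq!(greek.len(), 6);

    // In the second example the match starts after "abc" (scalar value 3)
    // and ends just before "x" (scalar value 11).
    let text = "abcΔᎠβⅠᏴγδⅡxyz";
    assert_eq!(byte_offset_of_char(text, 3), Some(3));
    assert_eq!(byte_offset_of_char(text, 11), Some(23));

    // Byte offsets can be used to slice the original string directly.
    assert_eq!(&text[3..23], "ΔᎠβⅠᏴγδⅡ");
}
```

Because the reported offsets always fall on scalar-value boundaries, slicing the search string with them does not panic.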

Syntax

The syntax supported in this crate corresponds almost exactly to the syntax supported by RE2. It is documented below.

Note that the regular expression parser and abstract syntax are exposed in a separate crate, regex-syntax.

Matching one character

.           any character except new line (includes new line with s flag)
[xyz]       A character class matching either x, y or z.
[^xyz]      A character class matching any character except x, y and z.
[a-z]       A character class matching any character in range a-z.
\d          digit (\p{Nd})
\D          not digit
[:alpha:]   ASCII character class ([A-Za-z])
[:^alpha:]  Negated ASCII character class ([^A-Za-z])
\pN         One-letter name Unicode character class
\p{Greek}   Unicode character class (general category or script)
\PN         Negated one-letter name Unicode character class
\P{Greek}   negated Unicode character class (general category or script)

Any named character class may appear inside a bracketed [...] character class. For example, [\p{Greek}\pN] matches any Greek or numeral character.

Composites

xy    concatenation (x followed by y)
x|y   alternation (x or y, prefer x)

Repetitions

x*        zero or more of x (greedy)
x+        one or more of x (greedy)
x?        zero or one of x (greedy)
x*?       zero or more of x (ungreedy)
x+?       one or more of x (ungreedy)
x??       zero or one of x (ungreedy)
x{n,m}    at least n x and at most m x (greedy)
x{n,}     at least n x (greedy)
x{n}      exactly n x
x{n,m}?   at least n x and at most m x (ungreedy)
x{n,}?    at least n x (ungreedy)
x{n}?     exactly n x

Empty matches

^     the beginning of text (or start-of-line with multi-line mode)
$     the end of text (or end-of-line with multi-line mode)
\A    only the beginning of text (even with multi-line mode enabled)
\z    only the end of text (even with multi-line mode enabled)
\b    a Unicode word boundary (\w on one side and \W, \A, or \z on other)
\B    not a Unicode word boundary

Grouping and flags

(exp)          numbered capture group (indexed by opening parenthesis)
(?P<name>exp)  named (also numbered) capture group (allowed chars: [_0-9a-zA-Z])
(?:exp)        non-capturing group
(?flags)       set flags within current group
(?flags:exp)   set flags for exp (non-capturing)

Flags are each a single character. For example, (?x) sets the flag x and (?-x) clears the flag x. Multiple flags can be set or cleared at the same time: (?xy) sets both the x and y flags and (?x-y) sets the x flag and clears the y flag.

All flags are disabled by default. They are:

i     case-insensitive
m     multi-line mode: ^ and $ match begin/end of line
s     allow . to match \n
U     swap the meaning of x* and x*?
x     ignore whitespace and allow line comments (starting with `#`)
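To make the set/clear notation concrete, here is a minimal sketch that splits the inside of a (?flags) group into the flags it sets and the flags it clears. It is not the crate's parser: `parse_flags` and `KNOWN_FLAGS` are hypothetical names, and the handling of edge cases (for instance a trailing `-`) is an assumption of this example.

```rust
/// The flag characters listed in the table above.
const KNOWN_FLAGS: &str = "imsUx";

/// Splits a flag specification such as "i-s" into (set, cleared).
/// Flags before the `-` are set, flags after it are cleared.
/// Returns `None` for an unknown flag or a second `-`.
fn parse_flags(spec: &str) -> Option<(Vec<char>, Vec<char>)> {
    let mut set = Vec::new();
    let mut cleared = Vec::new();
    let mut negated = false;
    for c in spec.chars() {
        match c {
            // The first `-` switches from setting to clearing.
            '-' if !negated => negated = true,
            c if KNOWN_FLAGS.contains(c) => {
                if negated { cleared.push(c) } else { set.push(c) }
            }
            // Unknown flag, or a repeated `-`.
            _ => return None,
        }
    }
    Some((set, cleared))
}

fn main() {
    // (?is) sets both flags; (?i-s) sets i and clears s.
    assert_eq!(parse_flags("is"), Some((vec!['i', 's'], vec![])));
    assert_eq!(parse_flags("i-s"), Some((vec!['i'], vec!['s'])));
    // (?-x) only clears.
    assert_eq!(parse_flags("-x"), Some((vec![], vec!['x'])));
}
```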

Here's an example that matches case-insensitively for only part of the +expression:

let re = Regex::new(r"(?i)a+(?-i)b+").unwrap();
let cap = re.captures("AaAaAbbBBBb").unwrap();
assert_eq!(cap.at(0), Some("AaAaAbb"));

Notice that the a+ matches either a or A, but the b+ only matches b.

Escape sequences

\*         literal *, works for any punctuation character: \.+*?()|[]{}^$
\a         bell (\x07)
\f         form feed (\x0C)
\t         horizontal tab
\n         new line
\r         carriage return
\v         vertical tab (\x0B)
\123       octal character code (up to three digits)
\x7F       hex character code (exactly two digits)
\x{10FFFF} any hex character code corresponding to a Unicode code point

Perl character classes (Unicode friendly)

These classes are based on the definitions provided in UTS#18:

\d     digit (\p{Nd})
\D     not digit
\s     whitespace (\p{White_Space})
\S     not whitespace
\w     word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
\W     not word character

ASCII character classes

[:alnum:]    alphanumeric ([0-9A-Za-z])
[:alpha:]    alphabetic ([A-Za-z])
[:ascii:]    ASCII ([\x00-\x7F])
[:blank:]    blank ([\t ])
[:cntrl:]    control ([\x00-\x1F\x7F])
[:digit:]    digits ([0-9])
[:graph:]    graphical ([!-~])
[:lower:]    lower case ([a-z])
[:print:]    printable ([ -~])
[:punct:]    punctuation ([!-/:-@[-`{-~])
[:space:]    whitespace ([\t\n\v\f\r ])
[:upper:]    upper case ([A-Z])
[:word:]     word characters ([0-9A-Za-z_])
[:xdigit:]   hex digit ([0-9A-Fa-f])
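The ranges in this table can be transcribed directly into Rust range patterns. The sketch below does so for a handful of the classes using only the standard library; `in_ascii_class` is a hypothetical helper written for this illustration and says nothing about how the crate represents classes internally.

```rust
/// Tests `c` against some of the ASCII classes from the table above,
/// using the same ranges. Returns `None` for class names not covered here.
fn in_ascii_class(name: &str, c: char) -> Option<bool> {
    let hit = match name {
        "alnum" => matches!(c, '0'..='9' | 'A'..='Z' | 'a'..='z'),
        "blank" => matches!(c, '\t' | ' '),
        "cntrl" => matches!(c, '\x00'..='\x1F' | '\x7F'),
        "graph" => matches!(c, '!'..='~'),
        "punct" => matches!(c, '!'..='/' | ':'..='@' | '['..='`' | '{'..='~'),
        // \v is \x0B and \f is \x0C.
        "space" => matches!(c, '\t' | '\n' | '\x0B' | '\x0C' | '\r' | ' '),
        "word" => matches!(c, '0'..='9' | 'A'..='Z' | 'a'..='z' | '_'),
        "xdigit" => matches!(c, '0'..='9' | 'A'..='F' | 'a'..='f'),
        _ => return None,
    };
    Some(hit)
}

fn main() {
    assert_eq!(in_ascii_class("punct", '~'), Some(true));
    assert_eq!(in_ascii_class("word", '_'), Some(true));
    assert_eq!(in_ascii_class("xdigit", 'g'), Some(false));
    // [:space:] includes vertical tab, whereas the standard library's
    // `char::is_ascii_whitespace` does not.
    assert_eq!(in_ascii_class("space", '\x0B'), Some(true));
    assert!(!'\x0B'.is_ascii_whitespace());
}
```

The last two assertions highlight one place where a table entry and a similarly named standard-library predicate disagree, so the two should not be assumed interchangeable.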

Untrusted input

This crate can handle both untrusted regular expressions and untrusted search text.

Untrusted regular expressions are handled by capping the size of a compiled regular expression. (See Regex::with_size_limit.) Without this, it would be trivial for an attacker to exhaust your system's memory with expressions like a{100}{100}{100}.
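The danger of a{100}{100}{100} comes from multiplication: each nested counted repetition multiplies the number of copies of the inner expression. The sketch below works through that arithmetic. Both function names are hypothetical, and the guard is only an analogy: the crate's limit applies to the size of the compiled program, not to this count.

```rust
/// Number of copies of the innermost atom after expanding nested counted
/// repetitions, e.g. [100, 100, 100] for a{100}{100}{100}.
/// Returns `None` if the product overflows a u64.
fn expanded_copies(counts: &[u64]) -> Option<u64> {
    counts.iter().try_fold(1u64, |acc, &n| acc.checked_mul(n))
}

/// Hypothetical guard: accepts a pattern only if its expansion stays
/// within `limit` copies. Overflow is treated as exceeding the limit.
fn within_limit(counts: &[u64], limit: u64) -> bool {
    expanded_copies(counts).map_or(false, |n| n <= limit)
}

fn main() {
    // Three nested {100} repetitions expand to a million copies of `a`.
    assert_eq!(expanded_copies(&[100, 100, 100]), Some(1_000_000));
    assert!(!within_limit(&[100, 100, 100], 10_000));
    assert!(within_limit(&[10, 10], 10_000));
}
```

A pattern of only seventeen characters therefore asks for a million copies of its atom, which is why some cap on compiled size is needed when patterns come from untrusted sources.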

Untrusted search text is allowed because the matching engine(s) in this crate have time complexity O(mn), where m is proportional to the size of the regex and n is proportional to the size of the search text. This means there's no way to cause exponential blow-up like with some other regular expression engines. (We pay for this by disallowing features like arbitrary look-ahead and back-references.)
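To give some intuition for the O(mn) bound, here is a toy matcher that tracks a set of pattern positions instead of backtracking. It is a sketch under strong assumptions, not the crate's engine: it supports only literal characters, `.` and a trailing `*` on a single character, it always matches the full text, and every name in it is made up for this example.

```rust
/// One pattern element: a literal (or `.` when `ch` is None), optionally starred.
#[derive(Clone, Copy)]
struct Item {
    ch: Option<char>,
    star: bool,
}

/// Parses the toy syntax: each character may be followed by one `*`.
fn parse(pattern: &str) -> Vec<Item> {
    let mut items = Vec::new();
    let mut chars = pattern.chars().peekable();
    while let Some(c) = chars.next() {
        let star = chars.peek() == Some(&'*');
        if star {
            chars.next();
        }
        items.push(Item { ch: if c == '.' { None } else { Some(c) }, star });
    }
    items
}

/// Marks position `i` as active, plus every position reachable from it by
/// skipping starred items (which may match zero characters).
fn add_state(items: &[Item], set: &mut [bool], mut i: usize) {
    while !set[i] {
        set[i] = true;
        if i < items.len() && items[i].star {
            i += 1;
        } else {
            break;
        }
    }
}

/// Returns true if `pattern` matches all of `text`. Each input character
/// costs O(m) work for m pattern items, so the total is O(mn).
fn full_match(pattern: &str, text: &str) -> bool {
    let items = parse(pattern);
    let m = items.len();
    let mut current = vec![false; m + 1];
    add_state(&items, &mut current, 0);
    for c in text.chars() {
        let mut next = vec![false; m + 1];
        for i in 0..m {
            if current[i] && items[i].ch.map_or(true, |p| p == c) {
                // A starred item stays where it is; others advance by one.
                let target = if items[i].star { i } else { i + 1 };
                add_state(&items, &mut next, target);
            }
        }
        current = next;
    }
    current[m]
}

fn main() {
    assert!(full_match("h.llo", "hello"));
    assert!(full_match("ab*c", "abbbc"));
    // A pattern of this shape is a classic worst case for backtracking
    // matchers; here it needs about 30 * 31 position checks.
    let pattern = "a*".repeat(30) + "b";
    let text = "a".repeat(30);
    assert!(!full_match(&pattern, &text));
}
```

The key design choice is that the active set can never hold more than m + 1 positions, so the work per input character is bounded no matter how ambiguous the pattern is.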

Structs

Captures	Captures represents a group of captured strings for a single match.
FindCaptures	An iterator that yields all non-overlapping capture groups matching a particular regular expression.
FindMatches	An iterator over all non-overlapping matches for a particular string.
NoExpand	NoExpand indicates literal string replacement.
RegexSplits	Yields all substrings delimited by a regular expression match.
RegexSplitsN	Yields at most `N` substrings delimited by a regular expression match.
SubCaptures	An iterator over capture groups for a particular match of a regular expression.
SubCapturesNamed	An iterator over named capture groups as a tuple with the group name and the value.
SubCapturesPos	An iterator over capture group positions for a particular match of a regular expression.

Enums

Error	An error that occurred during parsing or compiling a regular expression.
Regex	A compiled regular expression.

Traits

Replacer	Replacer describes types that can be used to replace matches in a string.

Functions