Crate aho_corasick + + [−] + + [src]
+An implementation of the +Aho-Corasick string search algorithm.
+ +The Aho-Corasick algorithm is principally useful when you need to search many +large texts for a fixed (possibly large) set of keywords. In particular, the +Aho-Corasick algorithm preprocesses the set of keywords by constructing a +finite state machine. The search phase is then a quick linear scan through the +text. Each character in the search text causes a state transition in the +automaton. Matches are reported when the automaton enters a match state.
+ +Examples
+The main type exposed by this crate is AcAutomaton
, which can be constructed
+from an iterator of pattern strings:
+use aho_corasick::{Automaton, AcAutomaton}; + +let aut = AcAutomaton::new(vec!["apple", "maple"]); + +// AcAutomaton also implements `FromIterator`: +let aut: AcAutomaton = ["apple", "maple"].iter().cloned().collect(); ++ +
Finding matches can be done with find
:
+use aho_corasick::{Automaton, AcAutomaton, Match}; + +let aut = AcAutomaton::new(vec!["apple", "maple"]); +let mut it = aut.find("I like maple apples."); +assert_eq!(it.next(), Some(Match { + pati: 1, + start: 7, + end: 12, +})); +assert_eq!(it.next(), Some(Match { + pati: 0, + start: 13, + end: 18, +})); +assert_eq!(it.next(), None); ++ +
Use find_overlapping
if you want to report all matches, even if they
+overlap with each other.
+use aho_corasick::{Automaton, AcAutomaton, Match}; + +let aut = AcAutomaton::new(vec!["abc", "a"]); +let matches: Vec<_> = aut.find_overlapping("abc").collect(); +assert_eq!(matches, vec![ + Match { pati: 1, start: 0, end: 1}, Match { pati: 0, start: 0, end: 3 }, +]); + +// Regular `find` will report only one match: +let matches: Vec<_> = aut.find("abc").collect(); +assert_eq!(matches, vec![Match { pati: 1, start: 0, end: 1}]); ++ +
Finally, there are also methods for finding matches on streams. Namely, the +search text does not have to live in memory. It's useful to run this on files +that can't fit into memory:
++use std::fs::File; + +use aho_corasick::{Automaton, AcAutomaton}; + +let aut = AcAutomaton::new(vec!["foo", "bar", "baz"]); +let rdr = File::open("search.txt").unwrap(); +for m in aut.stream_find(rdr) { + let m = m.unwrap(); // could be an IO error + println!("Pattern '{}' matched at: ({}, {})", + aut.pattern(m.pati), m.start, m.end); +} ++ +
There is also stream_find_overlapping
, which is just like find_overlapping
,
+but it operates on streams.
Please see dict-search.rs
in this crate's examples
directory for a more
+complete example. It creates a large automaton from a dictionary and can do a
+streaming match over arbitrarily large data.
Memory usage
+A key aspect of an Aho-Corasick implementation is how the state transitions +are represented. The easiest way to make the automaton fast is to store a +sparse 256-slot map in each state. It maps an input byte to a state index. +This makes the matching loop extremely fast, since it translates to a simple +pointer read.
+ +The problem is that as the automaton accumulates more states, you end up paying
+a 256 * 4
(4
is for the u32
state index) byte penalty for every state
+regardless of how many transitions it has.
To solve this, only states near the root of the automaton have this sparse +map representation. States near the leaves of the automaton use a dense mapping +that requires a linear scan.
+ +(The specific limit currently set is 3
, so that states with a depth less than
+or equal to 3
are less memory efficient. The result is that the memory usage
+of the automaton stops growing rapidly past ~60MB, even for automatons with
+thousands of patterns.)
If you'd like to opt for the less-memory-efficient-but-faster version, then
+you can construct an AcAutomaton
with a Sparse
transition strategy:
+use aho_corasick::{Automaton, AcAutomaton, Match, Sparse}; + +let aut = AcAutomaton::<Sparse>::with_transitions(vec!["abc", "a"]); +let matches: Vec<_> = aut.find("abc").collect(); +assert_eq!(matches, vec![Match { pati: 1, start: 0, end: 1}]); ++
Structs
+AcAutomaton | +
+ An Aho-Corasick finite automaton. + + |
+
Dense | +
+ State transitions that can be stored either sparsely or densely. + + |
+
FullAcAutomaton | +
+ A complete Aho-Corasick automaton. + + |
+
Match | +
+ Records a match in the search text. + + |
+
Matches | +
+ An iterator of non-overlapping matches for in-memory text. + + |
+
MatchesOverlapping | +
+ An iterator of overlapping matches for in-memory text. + + |
+
Sparse | +
+ State transitions that are always sparse. + + |
+
StreamMatches | +
+ An iterator of non-overlapping matches for streaming text. + + |
+
StreamMatchesOverlapping | +
+ An iterator of overlapping matches for streaming text. + + |
+
Traits
+Automaton | +
+ An abstraction over automatons and their corresponding iterators. + + |
+
Transitions | +
+ An abstraction over state transition strategies. + + |
+
Type Definitions
+StateIdx | +
+ The integer type used for the state index. + + |
+