Crate aho_corasick [−] [src]
An implementation of the Aho-Corasick string search algorithm.
The Aho-Corasick algorithm is principally useful when you need to search many large texts for a fixed (possibly large) set of keywords. In particular, the Aho-Corasick algorithm preprocesses the set of keywords by constructing a finite state machine. The search phase is then a quick linear scan through the text. Each character in the search text causes a state transition in the automaton. Matches are reported when the automaton enters a match state.
Examples
The main type exposed by this crate is AcAutomaton
, which can be constructed
from an iterator of pattern strings:
use aho_corasick::{Automaton, AcAutomaton}; let aut = AcAutomaton::new(vec!["apple", "maple"]); // AcAutomaton also implements `FromIterator`: let aut: AcAutomaton = ["apple", "maple"].iter().cloned().collect();
Finding matches can be done with find
:
use aho_corasick::{Automaton, AcAutomaton, Match}; let aut = AcAutomaton::new(vec!["apple", "maple"]); let mut it = aut.find("I like maple apples."); assert_eq!(it.next(), Some(Match { pati: 1, start: 7, end: 12, })); assert_eq!(it.next(), Some(Match { pati: 0, start: 13, end: 18, })); assert_eq!(it.next(), None);
Use find_overlapping
if you want to report all matches, even if they
overlap with each other.
use aho_corasick::{Automaton, AcAutomaton, Match}; let aut = AcAutomaton::new(vec!["abc", "a"]); let matches: Vec<_> = aut.find_overlapping("abc").collect(); assert_eq!(matches, vec![ Match { pati: 1, start: 0, end: 1}, Match { pati: 0, start: 0, end: 3 }, ]); // Regular `find` will report only one match: let matches: Vec<_> = aut.find("abc").collect(); assert_eq!(matches, vec![Match { pati: 1, start: 0, end: 1}]);
Finally, there are also methods for finding matches on streams. Namely, the search text does not have to live in memory. It's useful to run this on files that can't fit into memory:
use std::fs::File; use aho_corasick::{Automaton, AcAutomaton}; let aut = AcAutomaton::new(vec!["foo", "bar", "baz"]); let rdr = File::open("search.txt").unwrap(); for m in aut.stream_find(rdr) { let m = m.unwrap(); // could be an IO error println!("Pattern '{}' matched at: ({}, {})", aut.pattern(m.pati), m.start, m.end); }
There is also stream_find_overlapping
, which is just like find_overlapping
,
but it operates on streams.
Please see dict-search.rs
in this crate's examples
directory for a more
complete example. It creates a large automaton from a dictionary and can do a
streaming match over arbitrarily large data.
Memory usage
A key aspect of an Aho-Corasick implementation is how the state transitions are represented. The easiest way to make the automaton fast is to store a sparse 256-slot map in each state. It maps an input byte to a state index. This makes the matching loop extremely fast, since it translates to a simple pointer read.
The problem is that as the automaton accumulates more states, you end up paying
a 256 * 4
(4
is for the u32
state index) byte penalty for every state
regardless of how many transitions it has.
To solve this, only states near the root of the automaton have this sparse map representation. States near the leaves of the automaton use a dense mapping that requires a linear scan.
(The specific limit currently set is 3
, so that states with a depth less than
or equal to 3
are less memory efficient. The result is that the memory usage
of the automaton stops growing rapidly past ~60MB, even for automatons with
thousands of patterns.)
If you'd like to opt for the less-memory-efficient-but-faster version, then
you can construct an AcAutomaton
with a Sparse
transition strategy:
use aho_corasick::{Automaton, AcAutomaton, Match, Sparse}; let aut = AcAutomaton::<Sparse>::with_transitions(vec!["abc", "a"]); let matches: Vec<_> = aut.find("abc").collect(); assert_eq!(matches, vec![Match { pati: 1, start: 0, end: 1}]);
Structs
AcAutomaton |
An Aho-Corasick finite automaton. |
Dense |
State transitions that can be stored either sparsely or densely. |
FullAcAutomaton |
A complete Aho-Corasick automaton. |
Match |
Records a match in the search text. |
Matches |
An iterator of non-overlapping matches for in-memory text. |
MatchesOverlapping |
An iterator of overlapping matches for in-memory text. |
Sparse |
State transitions that are always sparse. |
StreamMatches |
An iterator of non-overlapping matches for streaming text. |
StreamMatchesOverlapping |
An iterator of overlapping matches for streaming text. |
Traits
Automaton |
An abstraction over automatons and their corresponding iterators. |
Transitions |
An abstraction over state transition strategies. |
Type Definitions
StateIdx |
The integer type used for the state index. |