Crate aho_corasick [] [src]

An implementation of the Aho-Corasick string search algorithm.

The Aho-Corasick algorithm is principally useful when you need to search many large texts for a fixed (possibly large) set of keywords. In particular, the Aho-Corasick algorithm preprocesses the set of keywords by constructing a finite state machine. The search phase is then a quick linear scan through the text. Each character in the search text causes a state transition in the automaton. Matches are reported when the automaton enters a match state.

Examples

The main type exposed by this crate is AcAutomaton, which can be constructed from an iterator of pattern strings:

use aho_corasick::{Automaton, AcAutomaton};

let aut = AcAutomaton::new(vec!["apple", "maple"]);

// AcAutomaton also implements `FromIterator`:
let aut: AcAutomaton = ["apple", "maple"].iter().cloned().collect();

Finding matches can be done with find:

use aho_corasick::{Automaton, AcAutomaton, Match};

let aut = AcAutomaton::new(vec!["apple", "maple"]);
let mut it = aut.find("I like maple apples.");
assert_eq!(it.next(), Some(Match {
    pati: 1,
    start: 7,
    end: 12,
}));
assert_eq!(it.next(), Some(Match {
    pati: 0,
    start: 13,
    end: 18,
}));
assert_eq!(it.next(), None);

Use find_overlapping if you want to report all matches, even if they overlap with each other.

use aho_corasick::{Automaton, AcAutomaton, Match};

let aut = AcAutomaton::new(vec!["abc", "a"]);
let matches: Vec<_> = aut.find_overlapping("abc").collect();
assert_eq!(matches, vec![
    Match { pati: 1, start: 0, end: 1}, Match { pati: 0, start: 0, end: 3 },
]);

// Regular `find` will report only one match:
let matches: Vec<_> = aut.find("abc").collect();
assert_eq!(matches, vec![Match { pati: 1, start: 0, end: 1}]);

Finally, there are also methods for finding matches on streams. Namely, the search text does not have to live in memory. It's useful to run this on files that can't fit into memory:

use std::fs::File;

use aho_corasick::{Automaton, AcAutomaton};

let aut = AcAutomaton::new(vec!["foo", "bar", "baz"]);
let rdr = File::open("search.txt").unwrap();
for m in aut.stream_find(rdr) {
    let m = m.unwrap(); // could be an IO error
    println!("Pattern '{}' matched at: ({}, {})",
             aut.pattern(m.pati), m.start, m.end);
}

There is also stream_find_overlapping, which is just like find_overlapping, but it operates on streams.

Please see dict-search.rs in this crate's examples directory for a more complete example. It creates a large automaton from a dictionary and can do a streaming match over arbitrarily large data.

Memory usage

A key aspect of an Aho-Corasick implementation is how the state transitions are represented. The easiest way to make the automaton fast is to store a sparse 256-slot map in each state. It maps an input byte to a state index. This makes the matching loop extremely fast, since it translates to a simple pointer read.

The problem is that as the automaton accumulates more states, you end up paying a 256 * 4 (4 is for the u32 state index) byte penalty for every state regardless of how many transitions it has.

To solve this, only states near the root of the automaton have this sparse map representation. States near the leaves of the automaton use a dense mapping that requires a linear scan.

(The specific limit currently set is 3, so that states with a depth less than or equal to 3 are less memory efficient. The result is that the memory usage of the automaton stops growing rapidly past ~60MB, even for automatons with thousands of patterns.)

If you'd like to opt for the less-memory-efficient-but-faster version, then you can construct an AcAutomaton with a Sparse transition strategy:

use aho_corasick::{Automaton, AcAutomaton, Match, Sparse};

let aut = AcAutomaton::<Sparse>::with_transitions(vec!["abc", "a"]);
let matches: Vec<_> = aut.find("abc").collect();
assert_eq!(matches, vec![Match { pati: 1, start: 0, end: 1}]);

Structs

AcAutomaton

An Aho-Corasick finite automaton.

Dense

State transitions that can be stored either sparsely or densely.

FullAcAutomaton

A complete Aho-Corasick automaton.

Match

Records a match in the search text.

Matches

An iterator of non-overlapping matches for in-memory text.

MatchesOverlapping

An iterator of overlapping matches for in-memory text.

Sparse

State transitions that are always sparse.

StreamMatches

An iterator of non-overlapping matches for streaming text.

StreamMatchesOverlapping

An iterator of overlapping matches for streaming text.

Traits

Automaton

An abstraction over automatons and their corresponding iterators.

Transitions

An abstraction over state transition strategies.

Type Definitions

StateIdx

The integer type used for the state index.