cargo : aho-corasick @ 1.1.4
PE Patrick Elsen signed 2026-05-28 published 2026-05-28

DESIGN.md

482 lines · markdown

This document describes the internal design of this crate, which is an objectlesson in what happens when you take a fairly simple old algorithm likeAho-Corasick and make it fast and production ready.The target audience of this document is Rust programmers that have somefamiliarity with string searching, however, one does not need to know theAho-Corasick algorithm in order to read this (it is explained below). Oneshould, however, know what a trie is. (If you don't, go read its Wikipediaarticle.)The center-piece of this crate is an implementation of Aho-Corasick. On itsown, Aho-Corasick isn't that complicated. The complex pieces come from thedifferent variants of Aho-Corasick implemented in this crate. Specifically,they are:* Aho-Corasick as a noncontiguous NFA. States have their transitions  represented sparsely, and each state puts its transitions in its own separate  allocation. Hence the same "noncontiguous."* Aho-Corasick as a contiguous NFA. This NFA uses a single allocation to  represent the transitions of all states. That is, transitions are laid out  contiguously in memory. Moreover, states near the starting state are  represented densely, such that finding the next state ID takes a constant  number of instructions.* Aho-Corasick as a DFA. In this case, all states are represented densely in  a transition table that uses one allocation.* Supporting "standard" match semantics, along with its overlapping variant,  in addition to leftmost-first and leftmost-longest semantics. The "standard"  semantics are typically what you see in a textbook description of  Aho-Corasick. However, Aho-Corasick is also useful as an optimization in  regex engines, which often use leftmost-first or leftmost-longest semantics.  Thus, it is useful to implement those semantics here. The "standard" and  "leftmost" search algorithms are subtly different, and also require slightly  different construction algorithms.* Support for ASCII case insensitive matching.* Support for accelerating searches when the patterns all start with a small  number of fixed bytes. Or alternatively, when the  patterns all contain a  small number of rare bytes. (Searching for these bytes uses SIMD vectorized  code courtesy of `memchr`.)* Transparent support for alternative SIMD vectorized search routines for  smaller number of literals, such as the Teddy algorithm. We called these  "packed" search routines because they use SIMD. They can often be an order of  magnitude faster than just Aho-Corasick, but don't scale as well.* Support for searching streams. This can reuse most of the underlying code,  but does require careful buffering support.* Support for anchored searches, which permit efficient "is prefix" checks for  a large number of patterns.When you combine all of this together along with trying to make everything asfast as possible, what you end up with is enitrely too much code with too much`unsafe`. Alas, I was not smart enough to figure out how to reduce it. Instead,we will explain it.# BasicsThe fundamental problem this crate is trying to solve is to determine theoccurrences of possibly many patterns in a haystack. The naive way to solvethis is to look for a match for each pattern at each position in the haystack:    for i in 0..haystack.len():      for p in patterns.iter():        if haystack[i..].starts_with(p.bytes()):          return Match(p.id(), i, i + p.bytes().len())Those four lines are effectively all this crate does. The problem with thosefour lines is that they are very slow, especially when you're searching for alarge number of patterns.While there are many different algorithms available to solve this, a popularone is Aho-Corasick. It's a common solution because it's not too hard toimplement, scales quite well even when searching for thousands of patterns andis generally pretty fast. Aho-Corasick does well here because, regardless ofthe number of patterns you're searching for, it always visits each byte in thehaystack exactly once. This means, generally speaking, adding more patterns toan Aho-Corasick automaton does not make it slower. (Strictly speaking, however,this is not true, since a larger automaton will make less effective use of theCPU's cache.)Aho-Corasick can be succinctly described as a trie with state transitionsbetween some of the nodes that efficiently instruct the search algorithm totry matching alternative keys in the trie. The trick is that these statetransitions are arranged such that each byte of input needs to be inspectedonly once. These state transitions are typically called "failure transitions,"because they instruct the searcher (the thing traversing the automaton whilereading from the haystack) what to do when a byte in the haystack does notcorrespond to a valid transition in the current state of the trie.More formally, a failure transition points to a state in the automaton that maylead to a match whose prefix is a proper suffix of the path traversed throughthe trie so far. (If no such proper suffix exists, then the failure transitionpoints back to the start state of the trie, effectively restarting the search.)This is perhaps simpler to explain pictorally. For example, let's say we builtan Aho-Corasick automaton with the following patterns: 'abcd' and 'cef'. Thetrie looks like this:         a - S1 - b - S2 - c - S3 - d - S4*        /    S0 - c - S5 - e - S6 - f - S7*where states marked with a `*` are match states (meaning, the search algorithmshould stop and report a match to the caller).So given this trie, it should be somewhat straight-forward to see how it canbe used to determine whether any particular haystack *starts* with either`abcd` or `cef`. It's easy to express this in code:    fn has_prefix(trie: &Trie, haystack: &[u8]) -> bool {      let mut state_id = trie.start();      // If the empty pattern is in trie, then state_id is a match state.      if trie.is_match(state_id) {        return true;      }      for (i, &b) in haystack.iter().enumerate() {        state_id = match trie.next_state(state_id, b) {          Some(id) => id,          // If there was no transition for this state and byte, then we know          // the haystack does not start with one of the patterns in our trie.          None => return false,        };        if trie.is_match(state_id) {          return true;        }      }      false    }And that's pretty much it. All we do is move through the trie starting with thebytes at the beginning of the haystack. If we find ourselves in a positionwhere we can't move, or if we've looked through the entire haystack withoutseeing a match state, then we know the haystack does not start with any of thepatterns in the trie.The meat of the Aho-Corasick algorithm is in how we add failure transitions toour trie to keep searching efficient. Specifically, it permits us to not onlycheck whether a haystack *starts* with any one of a number of patterns, butrather, whether the haystack contains any of a number of patterns *anywhere* inthe haystack.As mentioned before, failure transitions connect a proper suffix of the pathtraversed through the trie before, with a path that leads to a match that has aprefix corresponding to that proper suffix. So in our case, for patterns `abcd`and `cef`, with a haystack `abcef`, we want to transition to state `S5` (fromthe diagram above) from `S3` upon seeing that the byte following `c` is not`d`. Namely, the proper suffix in this example is `c`, which is a prefix of`cef`. So the modified diagram looks like this:         a - S1 - b - S2 - c - S3 - d - S4*        /                      /       /       ----------------      /       /    S0 - c - S5 - e - S6 - f - S7*One thing that isn't shown in this diagram is that *all* states have a failuretransition, but only `S3` has a *non-trivial* failure transition. That is, allother states have a failure transition back to the start state. So if ourhaystack was `abzabcd`, then the searcher would transition back to `S0` afterseeing `z`, which effectively restarts the search. (Because there is no patternin our trie that has a prefix of `bz` or `z`.)The code for traversing this *automaton* or *finite state machine* (it is nolonger just a trie) is not that much different from the `has_prefix` codeabove:    fn contains(fsm: &FiniteStateMachine, haystack: &[u8]) -> bool {      let mut state_id = fsm.start();      // If the empty pattern is in fsm, then state_id is a match state.      if fsm.is_match(state_id) {        return true;      }      for (i, &b) in haystack.iter().enumerate() {        // While the diagram above doesn't show this, we may wind up needing        // to follow multiple failure transitions before we land on a state        // in which we can advance. Therefore, when searching for the next        // state, we need to loop until we don't see a failure transition.        //        // This loop terminates because the start state has no empty        // transitions. Every transition from the start state either points to        // another state, or loops back to the start state.        loop {          match fsm.next_state(state_id, b) {            Some(id) => {              state_id = id;              break;            }            // Unlike our code above, if there was no transition for this            // state, then we don't quit. Instead, we look for this state's            // failure transition and follow that instead.            None => {              state_id = fsm.next_fail_state(state_id);            }          };        }        if fsm.is_match(state_id) {          return true;        }      }      false    }Other than the complication around traversing failure transitions, this codeis still roughly "traverse the automaton with bytes from the haystack, and quitwhen a match is seen."And that concludes our section on the basics. While we didn't go deep into howthe automaton is built (see `src/nfa/noncontiguous.rs`, which has detailedcomments about that), the basic structure of Aho-Corasick should be reasonablyclear.# NFAs and DFAsThere are generally two types of finite automata: non-deterministic finiteautomata (NFA) and deterministic finite automata (DFA). The difference betweenthem is, principally, that an NFA can be in multiple states at once. This istypically accomplished by things called _epsilon_ transitions, where one couldmove to a new state without consuming any bytes from the input. (The othermechanism by which NFAs can be in more than one state is where the same byte ina particular state transitions to multiple distinct states.) In contrast, a DFAcan only ever be in one state at a time. A DFA has no epsilon transitions, andfor any given state, a byte transitions to at most one other state.By this formulation, the Aho-Corasick automaton described in the previoussection is an NFA. This is because failure transitions are, effectively,epsilon transitions. That is, whenever the automaton is in state `S`, it isactually in the set of states that are reachable by recursively followingfailure transitions from `S` until you reach the start state. (This meansthat, for example, the start state is always active since the start state isreachable via failure transitions from any state in the automaton.)NFAs have a lot of nice properties. They tend to be easier to construct, andalso tend to use less memory. However, their primary downside is that they aretypically slower to execute a search with. For example, the code above showinghow to search with an Aho-Corasick automaton needs to potentially iteratethrough many failure transitions for every byte of input. While this is afairly small amount of overhead, this can add up, especially if the automatonhas a lot of overlapping patterns with a lot of failure transitions.A DFA's search code, by contrast, looks like this:    fn contains(dfa: &DFA, haystack: &[u8]) -> bool {      let mut state_id = dfa.start();      // If the empty pattern is in dfa, then state_id is a match state.      if dfa.is_match(state_id) {        return true;      }      for (i, &b) in haystack.iter().enumerate() {        // An Aho-Corasick DFA *never* has a missing state that requires        // failure transitions to be followed. One byte of input advances the        // automaton by one state. Always.        state_id = dfa.next_state(state_id, b);        if dfa.is_match(state_id) {          return true;        }      }      false    }The search logic here is much simpler than for the NFA, and this tends totranslate into significant performance benefits as well, since there's a lotless work being done for each byte in the haystack. How is this accomplished?It's done by pre-following all failure transitions for all states for all bytesin the alphabet, and then building a single state transition table. Buildingthis DFA can be much more costly than building the NFA, and use much morememory, but the better performance can be worth it.Users of this crate can actually choose between using one of two possible NFAs(noncontiguous or contiguous) or a DFA. By default, a contiguous NFA is used,in most circumstances, but if the number of patterns is small enough a DFA willbe used. A contiguous NFA is chosen because it uses orders of magnitude lessmemory than a DFA, takes only a little longer to build than a noncontiguousNFA and usually gets pretty close to the search speed of a DFA. (Callers canoverride this automatic selection via the `AhoCorasickBuilder::start_kind`configuration.)# More DFA tricksAs described in the previous section, one of the downsides of using a DFAis that it uses more memory and can take longer to build. One small way ofmitigating these concerns is to map the alphabet used by the automaton intoa smaller space. Typically, the alphabet of a DFA has 256 elements in it:one element for each possible value that fits into a byte. However, in manycases, one does not need the full alphabet. For example, if all patterns in anAho-Corasick automaton are ASCII letters, then this only uses up 52 distinctbytes. As far as the automaton is concerned, the rest of the 204 bytes areindistinguishable from one another: they will never disrciminate between amatch or a non-match. Therefore, in cases like that, the alphabet can be shrunkto just 53 elements. One for each ASCII letter, and then another to serve as aplaceholder for every other unused byte.In practice, this library doesn't quite compute the optimal set of equivalenceclasses, but it's close enough in most cases. The key idea is that this thenallows the transition table for the DFA to be potentially much smaller. Thedownside of doing this, however, is that since the transition table is definedin terms of this smaller alphabet space, every byte in the haystack must bere-mapped to this smaller space. This requires an additional 256-byte table.In practice, this can lead to a small search time hit, but it can be difficultto measure. Moreover, it can sometimes lead to faster search times for biggerautomata, since it could be difference between more parts of the automatonstaying in the CPU cache or not.One other trick for DFAs employed by this crate is the notion of premultiplyingstate identifiers. Specifically, the normal way to compute the next transitionin a DFA is via the following (assuming that the transition table is laid outsequentially in memory, in row-major order, where the rows are states):    next_state_id = dfa.transitions[current_state_id * 256 + current_byte]However, since the value `256` is a fixed constant, we can actually premultiplythe state identifiers in the table when we build the table initially. Then, thenext transition computation simply becomes:    next_state_id = dfa.transitions[current_state_id + current_byte]This doesn't seem like much, but when this is being executed for every byte ofinput that you're searching, saving that extra multiplication instruction canadd up.The same optimization works even when equivalence classes are enabled, asdescribed above. The only difference is that the premultiplication is by thetotal number of equivalence classes instead of 256.There isn't much downside to premultiplying state identifiers, other than itimposes a smaller limit on the total number of states in the DFA. Namely, withpremultiplied state identifiers, you run out of room in your state identifierrepresentation more rapidly than if the identifiers are just state indices.Both equivalence classes and premultiplication are always enabled. There is a`AhoCorasickBuilder::byte_classes` configuration, but disabling this just makesit so there are always 256 equivalence classes, i.e., every class correspondsto precisely one byte. When it's disabled, the equivalence class map itself isstill used. The purpose of disabling it is when one is debugging the underlyingautomaton. It can be easier to comprehend when it uses actual byte values forits transitions instead of equivalence classes.# Match semanticsOne of the more interesting things about this implementation of Aho-Corasickthat (as far as this author knows) separates it from other implementations, isthat it natively supports leftmost-first and leftmost-longest match semantics.Briefly, match semantics refer to the decision procedure by which searchingwill disambiguate matches when there are multiple to choose from:* **standard** match semantics emits matches as soon as they are detected by  the automaton. This is typically equivalent to the textbook non-overlapping  formulation of Aho-Corasick.* **leftmost-first** match semantics means that 1) the next match is the match  starting at the leftmost position and 2) among multiple matches starting at  the same leftmost position, the match corresponding to the pattern provided  first by the caller is reported.* **leftmost-longest** is like leftmost-first, except when there are multiple  matches starting at the same leftmost position, the pattern corresponding to  the longest match is returned.(The crate API documentation discusses these differences, with examples, inmore depth on the `MatchKind` type.)The reason why supporting these match semantics is important is because itgives the user more control over the match procedure. For example,leftmost-first permits users to implement match priority by simply putting thehigher priority patterns first. Leftmost-longest, on the other hand, permitsfinding the longest possible match, which might be useful when trying to findwords matching a dictionary. Additionally, regex engines often want to useAho-Corasick as an optimization when searching for an alternation of literals.In order to preserve correct match semantics, regex engines typically can't usethe standard textbook definition directly, since regex engines will implementeither leftmost-first (Perl-like) or leftmost-longest (POSIX) match semantics.Supporting leftmost semantics requires a couple key changes:* Constructing the Aho-Corasick automaton changes a bit in both how the trie is  constructed and how failure transitions are found. Namely, only a subset  of the failure transitions are added. Specifically, only the failure  transitions that either do not occur after a match or do occur after a match  but preserve that match are kept. (More details on this can be found in  `src/nfa/noncontiguous.rs`.)* The search algorithm changes slightly. Since we are looking for the leftmost  match, we cannot quit as soon as a match is detected. Instead, after a match  is detected, we must keep searching until either the end of the input or  until a dead state is seen. (Dead states are not used for standard match  semantics. Dead states mean that searching should stop after a match has been  found.)Most other implementations of Aho-Corasick do support leftmost match semantics,but they do it with more overhead at search time, or even worse, with a queueof matches and sophisticated hijinks to disambiguate the matches. While ourconstruction algorithm becomes a bit more complicated, the correct matchsemantics fall out from the structure of the automaton itself.# Overlapping matchesOne of the nice properties of an Aho-Corasick automaton is that it can reportall possible matches, even when they overlap with one another. In this mode,the match semantics don't matter, since all possible matches are reported.Overlapping searches work just like regular searches, except the stateidentifier at which the previous search left off is carried over to the nextsearch, so that it can pick up where it left off. If there are additionalmatches at that state, then they are reported before resuming the search.Enabling leftmost-first or leftmost-longest match semantics causes theautomaton to use a subset of all failure transitions, which means thatoverlapping searches cannot be used. Therefore, if leftmost match semantics areused, attempting to do an overlapping search will return an error (or panicwhen using the infallible APIs). Thus, to get overlapping searches, the callermust use the default standard match semantics. This behavior was chosen becausethere are only two alternatives, which were deemed worse:* Compile two automatons internally, one for standard semantics and one for  the semantics requested by the caller (if not standard).* Create a new type, distinct from the `AhoCorasick` type, which has different  capabilities based on the configuration options.The first is untenable because of the amount of memory used by the automaton.The second increases the complexity of the API too much by adding too manytypes that do similar things. It is conceptually much simpler to keep allsearching isolated to a single type.# Stream searchingSince Aho-Corasick is an automaton, it is possible to do partial searches onpartial parts of the haystack, and then resume that search on subsequent piecesof the haystack. This is useful when the haystack you're trying to search isnot stored contiguously in memory, or if one does not want to read the entirehaystack into memory at once.Currently, only standard semantics are supported for stream searching. This issome of the more complicated code in this crate, and is something I would verymuch like to improve. In particular, it currently has the restriction that itmust buffer at least enough of the haystack in memory in order to fit thelongest possible match. The difficulty in getting stream searching right isthat the implementation choices (such as the buffer size) often impact what theAPI looks like and what it's allowed to do.# PrefiltersIn some cases, Aho-Corasick is not the fastest way to find matches containingmultiple patterns. Sometimes, the search can be accelerated using highlyoptimized SIMD routines. For example, consider searching the followingpatterns:    Sherlock    Moriarty    WatsonIt is plausible that it would be much faster to quickly look for occurrences ofthe leading bytes, `S`, `M` or `W`, before trying to start searching via theautomaton. Indeed, this is exactly what this crate will do.When there are more than three distinct starting bytes, then this crate willlook for three distinct bytes occurring at any position in the patterns, whilepreferring bytes that are heuristically determined to be rare over others. Forexample:    Abuzz    Sanchez    Vasquez    Topaz    WaltzHere, we have more than 3 distinct starting bytes, but all of the patternscontain `z`, which is typically a rare byte. In this case, the prefilter willscan for `z`, back up a bit, and then execute the Aho-Corasick automaton.If all of that fails, then a packed multiple substring algorithm will beattempted. Currently, the only algorithm available for this is Teddy, but moremay be added in the future. Teddy is unlike the above prefilters in that itconfirms its own matches, so when Teddy is active, it might not be necessaryfor Aho-Corasick to run at all. However, the current Teddy implementationonly works in `x86_64` when SSSE3 or AVX2 are available or in `aarch64`(using NEON), and moreover, only works _well_ when there are a small numberof patterns (say, less than 100). Teddy also requires the haystack to be of acertain length (more than 16-34 bytes). When the haystack is shorter than that,Rabin-Karp is used instead. (See `src/packed/rabinkarp.rs`.)There is a more thorough description of Teddy at[`src/packed/teddy/README.md`](src/packed/teddy/README.md).