Yeah 10 million regexes is huuuuge. Aho-Corasick can barely handle 10 million literals.
Future work is definitely trying to make the regex engine scale better with more patterns. Currently, it will crap out well before 10 million regexes, and I'm not sure that goal is actually attainable. But it can certainly be better than what it is today.
And of course, Hyperscan is really the gold standard here as far as multi-pattern searching goes. I'm not sure how well it will handle 10 million patterns though.
Yeah, it's basically a competitor to Hyperscan without the proprietary Intel bits.
Quick notes on scaling past X regexes in a set:
1. Don’t make a single RegexSet. Create multiple internally.
2. Ideally colocate similar patterns together.
3. Use the immutability of a RegexSet to your advantage by generating an inverse index of “must match” portions of each regex.
4. When you get an input to match, first check the inverse index for a “may match set”, then only execute the internal automata that may possibly match on the input.
The inverse index can be as complicated (FSTs, or all of Lucene) or as simple (a hash map) as you’d like it to be. The better it filters the regexes in the set, the more performance and scale you end up with. Then tune the sizes of the internal automata to find a Goldilocks size for your use case for extra perf.
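The steps above can be sketched in a few dozen lines. This is a hedged, dependency-free illustration, not the real regex crate API: each "pattern" is reduced to a required literal plus a substring test standing in for a real automaton, and `ShardedSet`, `shard_size`, and the index layout are all names invented for this example. In production each shard would wrap a `regex::RegexSet`, and the prefilter would be an Aho-Corasick or FST scan rather than a loop over keys.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical sketch of a sharded pattern set with a "must match"
/// literal prefilter. Not the regex crate's API.
struct ShardedSet {
    // Each shard holds a bounded number of patterns; in real code,
    // each shard would be its own RegexSet/automaton.
    shards: Vec<Vec<String>>,
    // Inverse index: required literal -> (shard, index within shard).
    index: HashMap<String, Vec<(usize, usize)>>,
}

impl ShardedSet {
    /// `patterns` is a list of (required_literal, full_pattern) pairs.
    fn new(patterns: &[(&str, &str)], shard_size: usize) -> Self {
        let mut shards: Vec<Vec<String>> = Vec::new();
        let mut index: HashMap<String, Vec<(usize, usize)>> = HashMap::new();
        for (i, (lit, pat)) in patterns.iter().enumerate() {
            let shard = i / shard_size;
            if shards.len() <= shard {
                shards.push(Vec::new());
            }
            let idx = shards[shard].len();
            shards[shard].push(pat.to_string());
            index.entry(lit.to_string()).or_default().push((shard, idx));
        }
        ShardedSet { shards, index }
    }

    /// Return the matching patterns. Only patterns whose required
    /// literal occurs in the haystack are actually executed.
    fn matches(&self, haystack: &str) -> Vec<String> {
        // Step 1: consult the inverse index for the "may match set".
        // (A real prefilter would scan the haystack once with
        // Aho-Corasick instead of iterating over every key.)
        let mut may_match: HashSet<(usize, usize)> = HashSet::new();
        for (lit, ids) in &self.index {
            if haystack.contains(lit.as_str()) {
                may_match.extend(ids.iter().copied());
            }
        }
        // Step 2: run only the candidate patterns (stand-in: a
        // substring test where a real engine would run the automaton).
        let mut out: Vec<String> = may_match
            .into_iter()
            .filter(|&(s, i)| haystack.contains(self.shards[s][i].as_str()))
            .map(|(s, i)| self.shards[s][i].clone())
            .collect();
        out.sort();
        out
    }
}
```

With, say, three patterns sharded two at a time, `matches("say foobar and quux")` would run only the two candidates whose literals appear in the input, skipping the rest of the set entirely.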
Right. I'll also note that this strategy probably assumes something about the search task itself, and may not work for all possible search operations. Think about what happens when your haystack is GBs big for example.