Yeah 10 million regexes is huuuuge. Aho-Corasick can barely handle 10 million literals.
Future work is definitely trying to make the regex engine scale better with more patterns. Currently, it will crap out well before 10 million regexes, and I'm not sure that goal is actually attainable. But it can certainly be better than what it is today.
And of course, Hyperscan is really the gold standard here as far as multi-pattern searching goes. I'm not sure how well it will handle 10 million patterns though.
Yeah, it's basically a competitor to Hyperscan without the proprietary Intel bits.
Quick notes on scaling past X regexes in a set:
1. Don’t make a single RegexSet. Create multiple internally.
2. Ideally colocate similar patterns together.
3. Use the immutability of a RegexSet to your advantage by generating an inverse index of “must match” portions of each regex.
4. When you get an input to match, first check the inverse index for a “may match set”, then only execute the internal automata that may possibly match on the input.
The inverse index can be as complicated (FSTs, or all of Lucene) or as simple (a hash map) as you’d like it to be. The better it filters the regexes in the set, the more performance and scale you end up with. Then tune the sizes of the internal automata to find a Goldilocks size for your use case for extra perf.
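The steps above can be sketched in a few dozen lines. This is a hedged, dependency-free illustration, not the real regex crate API: each "pattern" is reduced to a required literal plus a substring test standing in for a real automaton, and `ShardedSet`, `shard_size`, and the index layout are all names invented for this example. In production each shard would wrap a `regex::RegexSet`, and the prefilter would be an Aho-Corasick or FST scan rather than a loop over keys.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical sketch of a sharded pattern set with a "must match"
/// literal prefilter. Not the regex crate's API.
struct ShardedSet {
    // Each shard holds a bounded number of patterns; in real code,
    // each shard would be its own RegexSet/automaton.
    shards: Vec<Vec<String>>,
    // Inverse index: required literal -> (shard, index within shard).
    index: HashMap<String, Vec<(usize, usize)>>,
}

impl ShardedSet {
    /// `patterns` is a list of (required_literal, full_pattern) pairs.
    fn new(patterns: &[(&str, &str)], shard_size: usize) -> Self {
        let mut shards: Vec<Vec<String>> = Vec::new();
        let mut index: HashMap<String, Vec<(usize, usize)>> = HashMap::new();
        for (i, (lit, pat)) in patterns.iter().enumerate() {
            let shard = i / shard_size;
            if shards.len() <= shard {
                shards.push(Vec::new());
            }
            let idx = shards[shard].len();
            shards[shard].push(pat.to_string());
            index.entry(lit.to_string()).or_default().push((shard, idx));
        }
        ShardedSet { shards, index }
    }

    /// Return the matching patterns. Only patterns whose required
    /// literal occurs in the haystack are actually executed.
    fn matches(&self, haystack: &str) -> Vec<String> {
        // Step 1: consult the inverse index for the "may match set".
        // (A real prefilter would scan the haystack once with
        // Aho-Corasick instead of iterating over every key.)
        let mut may_match: HashSet<(usize, usize)> = HashSet::new();
        for (lit, ids) in &self.index {
            if haystack.contains(lit.as_str()) {
                may_match.extend(ids.iter().copied());
            }
        }
        // Step 2: run only the candidate patterns (stand-in: a
        // substring test where a real engine would run the automaton).
        let mut out: Vec<String> = may_match
            .into_iter()
            .filter(|&(s, i)| haystack.contains(self.shards[s][i].as_str()))
            .map(|(s, i)| self.shards[s][i].clone())
            .collect();
        out.sort();
        out
    }
}
```

With, say, three patterns sharded two at a time, `matches("say foobar and quux")` would run only the two candidates whose literals appear in the input, skipping the rest of the set entirely.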
Right. I'll also note that this strategy probably assumes something about the search task itself, and may not work for all possible search operations. Think about what happens when your haystack is GBs big for example.