I just posted a big Quamina PR representing months of work, brought on by the addition of a small basic regular-expression feature. This ramble doesn’t exactly have a smooth story arc but I’m posting it anyhow because I know there are people out there interested in state-machine engineering and they are my people.
As far as I can tell, a couple of the problems I’m trying to solve haven’t been addressed before, at least not by anyone who published their findings. Partly because of that, I’m starting to wonder if all these disorderly Quamina postings might be worked into a small book or monograph or something. State machines are really freaking useful software constructs! So yeah, this is a war story not an essay, but if you like finite automata you’ll likely be interested in bits of it.
The story thus far ·
Prior to beginning work on Regular Expressions, I’d already wired shell-style “*” wildcards into Quamina, which forced me to start working with NFAs and ε-transitions. The implementation wasn’t crushingly difficult, and the performance was… OK-ish.
Which leads me to The Benchmark From Hell.
I wondered how the wildcard functionality would work under heavy stress, so I pulled in a list of 12,959 five-letter strings from the Wordle source code, and inserted a “*” at a random position in each. Here are the first ten:
aalii* *aargh aar*ti abaca* a*baci a*back ab*acs ab*aft abak*a
I created an NFA for each and merged them together as described here. Building and merging the automata were plenty fast enough, and the merged NFA had 46,424 states, which felt reasonable. Matching strings against it ran at under ten thousand per second, which is pretty poor given that Quamina can do a million or two per second on patterns encoded in a DFA.
But, I thought, still reasonably usable.
The cursed “?” ·
Last year, my slow grind through the regexp features had led me to the zero-or-one quantifier “?”. The state machine for these things is not rocket science; there’s a discussion with pictures in my recent Epsilon Wrangling.
So I implemented that and fired off the unit tests, most of which I didn’t have to write, and they all failed. Not a surprise I guess.
It turned out that the way I’d implemented ε-transitions for the wildcards was partially wrong, as in it worked for the tight-loop state-to-itself ε-transitions, but not for more general-purpose things like “?” requires.
In fact, it turns out that merging NFAs is hard (DFAs are easy), and I found precious little help online. Thompson’s construction does give an answer: Make an otherwise-empty state with two ε-transitions, one to each of the automata, and it’ll do the right thing. Let’s call that a “splice state”. It’s easy to implement, so I did. Splicing is hardly “merging” in the Quamina sense, but still.
Unfortunately, the performance was hideously bad, just a few matches per second while pegging the CPU. A glance at the final NFA was sobering; endless chains of splice states, some thousands long.
At this point I became very unhappy and got stalled for months dealing with real-life issues while this problem lurked at the back of my mind, growling for attention occasionally.
Eventually I let the growler out of the cave and started to think through approaches. But first…
Worth solving? · Is it, really? What sane person is going to want to search for the union of thousands of regular expressions in general or wild-carded strings in particular?
I didn’t think about this problem at all, because of my experience with Quamina’s parent, Ruler. When it became popular among several AWS and Amazon teams, people sometimes found it useful to match the union of not just thousands but a million or more different patterns. When you write software that anyone actually uses, don’t expect the people using it to share your opinions on what is and isn’t reasonable. So I wasn’t going to get any mental peace until I cracked this nut.
I eventually decided that three approaches were worth trying:
1. Figure out a way really to merge, not just splice, the wildcarded patterns, to produce a simpler automaton.
2. Optimize the NFA-traversal code path.
3. Any NFA can be transformed into a DFA, says computer-science theory. So do that, because Quamina is really fast at DFA-based matching.
Nfa2Dfa · I ended up doing all of these things and haven’t entirely given up on any of them. The most intellectually-elegant was the transform-to-DFA approach, because if I did that, I could remove the fairly-complex NFA-traversal logic from Quamina.
It turns out that the Net is rich with textbook extracts and YouTubes and slide-shows about how to do the NFA-to-DFA conversion. It ended up being quite a pleasing little chunk of code, only a couple hundred lines.
The bad news: Converting each individual wildcard NFA to a DFA was amazingly fast, but then as I merged them in one by one, the number of automaton states started increasing explosively and the process slowed down so much that I never had the patience to let it finish. Finite-automata theory warns that this can happen, but it’s hard to characterize the cases where it does. I guess this is one of them.
Having said that, I haven’t discarded the nfa2Dfa code, because perhaps I ought to offer a Quamina option to apply this if you have some collection of patterns that you want to run really super fast and are willing to wait for a while for the transformation process to complete. Also, I may have missed opportunities to optimize the conversion; maybe it’s making more states than it needs to?
Faster NFA traversal · Recently in Epsilon wrangling I described how NFA traversal has to work, relying heavily on implementing a thing called an ε-closure.
So I profiled the traversal process and discovered, unsurprisingly, that most of the time was going into memory allocation while computing those ε-closures. So now Quamina has an ε-closure cache and will only compute each one once.
This helped a lot but not nearly enough, and the profiler was still telling me the pain was in Go’s allocation and garbage-collection machinery. Whittling away at this kind of stuff is not rocket science. The standard Go trick I’ve seen over and over is to keep all your data in slices, keep re-using them, chopping them back to [:0] for each request. After a while they’ll have grown to the point where all the operations are just copying bytes around, no allocation required.
Which also helped, but the speed wasn’t close to what I wanted.
Merging wildcard automata · I coded multiple ways to do this, and they kept failing. But I eventually found a way to build those automata so that any two of them, or any one of them and a DFA, can be merged and generate dramatically fewer ε-transition chains. I’m not going to write this up here for two reasons: First of all, it’s not that interesting, and second, I worry that I may have to change the approach further as I go on implementing new regexp operators.
In particular, at one point I was looking at the code while it wasn’t working, and I could see that if I added a particular conditional it would work, but I couldn’t think of a principled reason to do it. Obviously I’ll have to sort this out eventually. In the meantime, if you’re the sort of, um, special person who is now burning with curiosity, check out my branch from that PR and have a look at the spinout type.
Anyhow, I added that conditional even though it puzzled me a bit, and now you can add wildcard patterns to Quamina at 80K/second, and my 12.9K wildcards generate an NFA with almost 70K states, which can scan events at almost 400K/second. And that’s good enough to ship the “?” feature.
By the way, I tried feeding that 70K-state automaton to the DFA converter, and gave up after it’d burned an hour of CPU and grown to occupy many GB of RAM.
Next steps ·
Add “+” and “*”, and really hope I don’t have to redesign the NFA machinery again.
Also, figure out the explanation for that puzzling if statement.
And I should say… · Despite the very narrow, not to say obsessive, focus of this series, I’ve gotten a few bits and pieces of positive feedback. So there are a few people out there who care about this stuff. To all of you, thanks.
From: Doug Cutting (Jul 22 2025, at 18:45)
You should talk to Ron Kaplan. He & Lauri Karttunen did amazing things with finite state automata in the 80s and 90s. I used their libraries at Xerox to determinize, minimize, union, intersect, etc. finite state transducers composed of large lexicons & rule sets.
From: Andreas Kupries (Jul 23 2025, at 04:00)
> Also, I may have missed opportunities to optimize
> the conversion; maybe it’s making more states than it needs to?
Add e-closure caches to the conversion process ?
Multi-stage conversion ?
IOW convert each individual NFA to DFA, i.e make it epsilon-free, and then compute the union of these with a simple epsilon construction (new start state with e-transitions to the start states of the DFAs). This should limit the number and size of the e-closures in both stages, obviating the need for e-closure caches. Especially in the second stage it would need e-closures only once, for the closures of the new start state per possible character.
Also, maybe, compute unions of DFA pairs (binary tree like) until the single final union is left ? Maybe keeps the size of the conversion structures under better control ?