In December of 1996 I released a piece of software called Lark, which was the world’s first XML Processor (as the term is defined in the XML Specification). It was successful, but I stopped maintaining it in 1998 because lots of other smart people, and some big companies like Microsoft, were shipping perfectly good processors. I never quite open-sourced it, holding back one clever bit in the moronic idea that I could make money out of Lark somehow. The magic sauce is a finite state machine that can be used to parse XML 1.0. Recently, someone out there needed one of those, so I thought I’d publish it, with some commentary on Lark’s construction and an amusing anecdote about the name. I doubt there are more than twelve people on the planet who care about this kind of parsing arcana. [Rick Jelliffe has upgraded the machine].

Why “Lark”? · Lauren and I went to Australia in late 1996 to visit her mother and to get married, which we did on November 30th. Forty-eight hours later, Lauren twisted her knee badly enough that she was pretty well confined to a sofa for the rest of our Australian vacation.

So I broke out my computer and finished the work I’d already started on my XML processor, and decided to call it Lark for Lauren’s Right Knee.

How Lark Worked · Lark was a pure deterministic finite automaton (DFA) parser, with a little teeny state stack. Some of its transitions were labeled with named “events” that would provoke the parser to do something if, for example, it had just recognized a start tag or whatever.

DFA-driven parsers are a common enough design pattern, although I think Lark is the only example in the XML space. There are well-known parser generators such as yacc, GNU bison, and javacc, usually used in combination with lexical scanners such as flex so that you can write your grammar in terms of tokens, not characters. Also, they handle LALR languages, so the parsing technique is quite a bit richer than a pure state machine.

I thought I had a better idea. The grammar of XML is simple enough, and the syntax characters few enough, that I figured I could just write down the state machine by hand. So that’s what I did, inventing a special-purpose DFA-description language along the way.

Then I had a file called Lark.jin, which was really a Java program that used the state machine to parse XML. The transition “events” in the machine were mapped to case labels in a huge switch construct. Then there was a horrible, horrible Perl program that read Lark.jin and the automaton, generated the DFA tables in Java syntax, inserted them into the code, and produced Lark.java, which you actually compiled to make the parser.

So while Java doesn’t have a preprocessor, Lark did, which made quite a few things easier.
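To give the flavor of what came out the other end, here’s a minimal sketch of the generated parse loop and its big switch. The representation is invented (the table names, event constants, and class name are made up, and the little state stack is left out); only the EndGI and ReportSTag event names are real, and they show up in the automaton excerpt further down.

final class GeneratedParserSketch {
    // Filled in by the preprocessor from the automaton description.
    static final short[][] NEXT_STATE = new short[227][128];
    static final byte[][] EVENT = new byte[227][128];

    // Codes for the !Event labels in the automaton.
    static final int EV_NONE = 0, EV_END_GI = 1, EV_REPORT_STAG = 2;

    private int state = 0; // the start state

    void parse(char[] buf, int len) {
        for (int i = 0; i < len; i++) {
            int c = buf[i] < 128 ? buf[i] : 127; // non-ASCII is handled via class masks, described below
            switch (EVENT[state][c]) {           // the "huge switch construct"
                case EV_END_GI:      endGI();          break; // stash the element-type name
                case EV_REPORT_STAG: reportStartTag(); break; // hand the start-tag to the client
                default:                               break; // most transitions carry no event
            }
            state = NEXT_STATE[state][c];
        }
    }

    private void endGI() { /* collect the name characters */ }
    private void reportStartTag() { /* call back through the API */ }
}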

There were a lot of tricks; some of the state transitions weren’t on characters but on XML character classes such as NameChar and so on. This made the automaton easier to write, and in fact, to keep the class files small, the character-class transitions persisted into the Java form, and the real DFA was built at startup time. These days, quick startup might be more important than .class file size.
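A hedged sketch of that startup-time expansion, with an invented compact representation and made-up class bits, might look something like this: each compact transition is keyed either by a literal character or by a character-class bitmask, and the class-keyed ones get spread across every matching column so the inner parse loop stays a plain array lookup.

import java.util.List;

final class DfaBuilderSketch {
    // Hypothetical class bits; charClass[c] holds the bitmask for character c.
    static final int NAME_START = 0x01, NAME_CHAR = 0x02, SPACE = 0x04;

    /** One compact transition: classMask == 0 means "keyed by a literal character". */
    record Compact(int state, char literal, int classMask, short target) {}

    static short[][] expand(List<Compact> compact, byte[] charClass, int numStates) {
        short[][] dfa = new short[numStates][128];
        // Expand the class-keyed transitions first ...
        for (Compact t : compact) {
            if (t.classMask() != 0) {
                for (char c = 0; c < 128; c++) {
                    if ((charClass[c] & t.classMask()) != 0) {
                        dfa[t.state()][c] = t.target();   // e.g. every NameChar
                    }
                }
            }
        }
        // ... then the literal ones, so a specific character like '>' wins.
        for (Compact t : compact) {
            if (t.classMask() == 0) {
                dfa[t.state()][t.literal()] = t.target();
            }
        }
        return dfa;
    }
}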

What Was Good · It was damn fast. James Clark managed to hand-craft a Java-language XML parser called XP that was a little faster than Lark, but he did that by clever I/O buffering, and I was determined to leapfrog him by improving my I/O.

This was before the time of standardized XML APIs, but Lark had a stream API that influenced SAX, and a DOM-like tree API; both worked just fine. Lark is one of very few parsers ever to have survived the billion laughs attack.

Lark was put into production in quite a few deployments, and the flow of bug reports slowed to a trickle. Then in 1998 I noticed that IBM and Microsoft and BEA and everyone else were building XML Processors, so I decided that it wasn’t worthwhile maintaining mine.

What Was Bad · I never got around to teaching it namespaces, which means it wouldn’t be real useful today.

It had one serious bug that would have been real work to fix, and since nobody ever encountered it in practice, I kept putting it off and never did. If you had an internal parsed entity reference in an attribute value and the replacement text included the attribute delimiter (' or "), it would scream and claim you had a busted XML document.
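For instance, a document along these lines (a made-up example) is perfectly well-formed, because a quote arriving via an entity’s replacement text doesn’t terminate the attribute value, but Lark would have rejected it:

<!DOCTYPE doc [
  <!ENTITY note "you're welcome">
]>
<doc greeting='&note;'/>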

That Automaton · What happened was, Rick Jelliffe, who is a Good Person, was looking for an FSM for XML and I eventually noticed, and so I sent him mine.

There’s no reason whatsoever to keep it a secret: here it is. Be warned: it’s ugly.

Fortunately, there were only 227 states and 8732 transitions, so the state number fit into a byte; that and the associated event index packed into a short. To make things even tighter, the transitions were only keyed by characters up to 127, as in 7-bit ASCII. Characters higher than that can’t be XML syntax characters, so we’re only interested in whether they fall into classes like NameChar and NameStartChar and so on. A 64K byte[] array takes care of that, each byte having a class bitmask.

As a result of all this jiggery-pokery, the DFA ends up, believe it or not, constituting a short[227][128].
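In code, the packing and the class lookup come down to something like this sketch; the bit layout and class codes are guesses for illustration, not Lark’s actual encoding.

final class PackedDfaSketch {
    static final short[][] DFA = new short[227][128];    // the whole automaton
    static final byte[] CHAR_CLASS = new byte[0x10000];  // class bitmask per character

    static final int NAME_START = 0x01, NAME_CHAR = 0x02; // guessed bit assignments

    static int nextState(short packed)  { return packed & 0xff; }         // low byte
    static int eventIndex(short packed) { return (packed >> 8) & 0xff; }  // high byte

    /** Can this (possibly non-ASCII) character appear inside an XML Name? */
    static boolean isNameChar(char c) {
        return (CHAR_CLASS[c] & (NAME_START | NAME_CHAR)) != 0;
    }
}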

Here’s a typical chunk of the automaton:

1. # in Start tag GI
2. State StagGI BustedMarkup {in element type}
3. T $NameC StagGI
4. T $S InStag !EndGI
5. T > InDoc !EndGI !ReportSTag
6. T / EmptyClose !EndGI

This state, called StagGI, is the state where we’re actually reading the name of a tag; we got here by seeing a < followed by a NameStart character.
Line 1 is a comment.
In line 2 we name the state and set up error reporting: the next token names another state to fall back into in case of error, and the curly braces hold some text to help build an error message.
Line 3 says that if we see a valid XML Name character, we just stay in this state.
Line 4 says that if we see an XML space character, we move to state InStag and process an EndGI event, which would stash the characters in the start tag.
And so on.
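For what it’s worth, here’s a hedged sketch of reading that description format, roughly the front half of what the horrible Perl program did before it generated the Java tables. The class and record names are invented, it assumes the numerals in the listing above are just labels for the prose, and it ignores the fallback state and error-message text.

import java.util.ArrayList;
import java.util.List;

final class AutomatonReaderSketch {
    record Transition(String state, String on, String next, List<String> events) {}

    static List<Transition> read(List<String> lines) {
        List<Transition> out = new ArrayList<>();
        String current = null;
        for (String raw : lines) {
            String line = raw.strip();
            if (line.isEmpty() || line.startsWith("#")) continue;   // comments
            String[] tok = line.split("\\s+");
            if (tok[0].equals("State")) {
                current = tok[1];       // fallback state and {message} ignored here
            } else if (tok[0].equals("T")) {
                String on = tok[1];     // a literal character, or a class like $NameC
                String next = tok[2];
                List<String> events = new ArrayList<>();
                for (int i = 3; i < tok.length; i++) {
                    if (tok[i].startsWith("!")) events.add(tok[i].substring(1));
                }
                out.add(new Transition(current, on, next, events));
            }
        }
        return out;
    }
}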

Other Hackery · An early cut of Lark used String and StringBuffer objects to hold all the bits and pieces of the XML. This might be a viable strategy today, but in 1996’s Java it was painfully slow. So the code goes to heroic lengths to live in the land of character arrays at all times, making Strings only when a client program asks for one through the API. The performance difference was mind-boggling.
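The discipline amounted to something like this little sketch (the names are invented): append into a growable char[] and only manufacture a String at the moment an API caller asks for one.

final class CharBufSketch {
    private char[] buf = new char[256];
    private int len = 0;

    void append(char[] src, int off, int count) {
        if (len + count > buf.length) {
            char[] bigger = new char[Math.max(buf.length * 2, len + count)];
            System.arraycopy(buf, 0, bigger, 0, len);
            buf = bigger;
        }
        System.arraycopy(src, off, buf, len, count);
        len += count;
    }

    void reset() { len = 0; }

    /** Only pay for object creation when a caller really wants a String. */
    String asString() { return new String(buf, 0, len); }
}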

An Evil Idea · If you look at the automaton, and the Lark code, at least half—I’d bet three quarters—is there to deal with parsing the DTD and then dealing with entity wrangling. A whole bunch more is there to support DOM-building and walking.

I bet if I went through and simply removed support for anything coming out of the <!DOCTYPE>, including all entity processing, then discarded the DOM stuff, then added namespace support and SAX and StAX APIs, it would be less than half its current size. Then if I reworked the I/O, knowing what I know now and stealing some tricks that James Clark uses in expat, I bet it would be the fastest Java XML parser on the planet for XML docs without a DOCTYPE, by a wide margin. It’s hard to beat a DFA.

And it would still be fully XML 1.0 compliant. Because (snicker) this is Java, and your basic core Java now includes an XML parser, so I could simply instrument Larkette to buffer the prologue and, if it saw a DOCTYPE with an internal subset, defer to Java’s built-in parser.
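Something like this hypothetical front end would do; the Larkette class doesn’t exist, and the DOCTYPE sniffing here is a crude stand-in for real prologue handling.

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

final class LarketteFrontEnd {
    void parse(byte[] doc, DefaultHandler handler) throws Exception {
        if (prologueHasInternalSubset(doc)) {
            // Punt: let the JDK's bundled parser deal with the DTD and entities.
            SAXParser p = SAXParserFactory.newInstance().newSAXParser();
            p.parse(new ByteArrayInputStream(doc), handler);
        } else {
            // Fast path: the hypothetical DTD-less DFA parser.
            // new Larkette().parse(doc, handler);
        }
    }

    // Crude stand-in: a DOCTYPE declaration with a '[' before its closing
    // '>' has an internal subset. Assumes a UTF-8-ish encoding.
    private boolean prologueHasInternalSubset(byte[] doc) {
        String head = new String(doc, 0, Math.min(doc.length, 4096),
                StandardCharsets.UTF_8);
        int dt = head.indexOf("<!DOCTYPE");
        if (dt < 0) return false;
        int close = head.indexOf('>', dt);
        int bracket = head.indexOf('[', dt);
        return bracket >= 0 && (close < 0 || bracket < close);
    }
}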

I’ll probably never do it. But the thought brings a smile to my face.

