Recently it became obvious that the Visual Net data-prep and index-build subsystems needed refactoring, and I took on the job. So I’ve been up to my elbows in heavy C coding for a week now—my first such excursion this millennium. Herewith some extremely technical low-level notes on the subject, probably not of interest to non-professionals, except perhaps for a paragraph on the world-view of the aging coder. There is some discussion of XML and scaling issues.

Greybeard Coding · Every time I take up the cudgels to do some real development, I have to wonder if this is the last time I’ll be doing this; many programmers, one way or another, are not cutting code any more after twenty years or so. While I can see the attractiveness of getting paid to have opinions and meetings and leadership skills, I’m still hooked on the feeling of watching the piles of executable abstractions grow higher and take useful form, step by very small step, at my command.

I’m about as good a programmer as I was a decade or two ago. I’m no longer as strong, no longer willing to stay up till three to get past some stupid time-dependent bug, and happily encumbered by family and so on. But my bag of tricks is large, and somewhere in the dusty coding cupboards is a perhaps-not-perfect but known-to-work solution to most of what I run across.

This doesn’t mean that I don’t make stupid mistakes. Much of one day last week was spent writing a tree-builder that produced subtly wrong results; then I looked at it with fresh eyes and saw that the recursion was backward. I must have been thinking about something else for four hours. I can only assume this happens to other people too.

On C · I recall one of the basic tutorials for those new to Unix some twenty years ago, which contained the immortal line “For any serious programming, you pretty well have to use C.” This is no longer true, except when it is; I note that a lot of important pieces of infrastructure are still written in C, and I don’t think it’s going away any time soon.

For most things I’d much rather use Java or equivalent, or Python or equivalent, but sometimes you just have to wrangle shared-memory data structures hundreds of megabytes in size, not waste any memory, and count your microseconds. The obvious alternative was C++, but reasons of aesthetic revulsion aside, the case for it wasn’t strong enough to be noticeable.

On Object Orientation · Just because you need to write in C doesn’t mean you can’t be O-O. At the end of the day, a Java Object is really a void *, and after a while, it’s easy to fall into a pattern where all your code is grouped into modules that smell like classes; each routine either is a constructor that returns a void *, or takes one of those void * thingies as its first argument. You have to cook your own package-like naming scheme, but no biggie.

If there’s anyone out there who’s still writing application-level C code but isn’t doing it this way, I recommend giving it some serious thought.
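To make the pattern concrete, here is a minimal sketch of one such class-like module. The FieldList name, its functions, and the FieldList_ prefix standing in for a package name are all made up for illustration; they are not from the Visual Net code.

```c
#include <stdlib.h>
#include <string.h>

/* Private layout; callers only ever hold a void * to it. */
struct field_list {
    char **names;
    size_t count;
    size_t capacity;
};

/* Constructor: returns an opaque handle, or NULL if allocation fails. */
void *FieldList_new(void)
{
    return calloc(1, sizeof(struct field_list));
}

/* Method: append a copy of name; returns 0 on success, -1 on failure. */
int FieldList_add(void *self, const char *name)
{
    struct field_list *fl = self;
    if (fl->count == fl->capacity) {
        size_t cap = fl->capacity ? fl->capacity * 2 : 8;
        char **grown = realloc(fl->names, cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        fl->names = grown;
        fl->capacity = cap;
    }
    size_t len = strlen(name) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return -1;
    memcpy(copy, name, len);
    fl->names[fl->count++] = copy;
    return 0;
}

/* Method: number of names stored so far. */
size_t FieldList_count(void *self)
{
    struct field_list *fl = self;
    return fl->count;
}

/* Method: the i-th name, or NULL if i is out of range. */
const char *FieldList_get(void *self, size_t i)
{
    struct field_list *fl = self;
    return i < fl->count ? fl->names[i] : NULL;
}

/* Destructor: releases the copies and the handle itself. */
void FieldList_free(void *self)
{
    struct field_list *fl = self;
    if (fl == NULL)
        return;
    for (size_t i = 0; i < fl->count; i++)
        free(fl->names[i]);
    free(fl->names);
    free(fl);
}
```

Every routine is either the constructor or takes the handle as its first argument, so a caller never needs to see the struct definition.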

On Processing Big Files · I’ve spent a whole lot of my career processing really big input files, going back to the 570MB Oxford English Dictionary in the Eighties (that was really big then). I gave a lecture on the first ever Perl Whirl entitled Perl, XML, and Really Big Data which passed on some of the lessons. I may reproduce that here on ongoing sometime, but here’s one important lesson for free.

Suppose that you have to read ten million lines of text, each of which begins with a number, and count how many begin with an odd number. In Perl for the sake of brevity:

my $odd = 0;
while (<STDIN>)
{
  s/ .*$//;             # keep only the leading number
  $odd++ if ($_ % 2);
}
print "Odd lines: $odd\n";

Looks fine, right? So, you fire it up against your great big file and settle back to wait. After three or four minutes, you start to get nervous; did you get that regex right? Did you screw up the loop somehow? But you don’t want to interrupt it, you’ve already invested in this run. But if it’s off the rails, you don’t want to wait too long to find out.

Here’s Uncle Tim’s first rule of writing code to process big files: Put In Progress Reports. Like so:

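my $odd = 0;
my $lines = 0;
while (<STDIN>)
{
  s/ .*$//;             # keep only the leading number
  $odd++ if ($_ % 2);
  # report after counting, so the figure includes the current line
  print STDERR "$lines lines, $odd odd\n" unless (++$lines % 10000);
}
print "Odd lines: $odd\n";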

This will greatly help both your productivity and your sanity.

Expat · This is still pretty well the state of the art for XML parsing in C. We had originally used Xerces but it was too big and too complicated and we had trouble shaking out some weird bugs. Expat is just excellent, on performance grounds if nothing else. I was shaking this program down on a middle-aged 750MHz Linux box, and have also been debugging on my 550MHz Powerbook, running Expat (with fairly simple event handlers) over really big data files, and the CPU usage never gets up over 60%; it’s so efficient that the performance is I/O-limited. I like being I/O-limited.

It’s going to be interesting to see what happens when I run this on a modern multi-GHz production server with really fast disks.

OS X Weirdness · One of the advantages of having a Mac is that I can work on my server-side code in a self-contained way here on the laptop. Yes, but there are some distinctly surprising things in the view from down here in the C-language trenches:

  • I can’t seem to do a read(2) of more than 8192 bytes against a pipe. Huh?

  • The C compiler is incredibly sloppy by default; modern GCC on Linux is helpfully pedantic about conditionals and possibly-uninitialized variables and function prototypes and so on. With this thing I can say foo(a); and then a couple of lines later foo(x,b,c);, with the types of a and x being wildly different, and hear nothing from the compiler. The checking is probably there; I just haven’t figured out the right option incantations.

  • The -pg option is making some distinctly weird stuff happen, causing breakage that I can’t reproduce in my test suite and can’t track down in the debugger either. Oh well, I can profile on Linux.
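On the pipe point: POSIX allows read(2) to hand back fewer bytes than you asked for, and pipes are where that shows up most often. Assuming the symptom is short reads rather than an outright error, the usual remedy is to loop until the buffer is full or the writer closes. A minimal sketch, with read_fully being a made-up helper name:

```c
#include <errno.h>
#include <unistd.h>

/*
 * Keep calling read(2) until `want` bytes have arrived or EOF is hit.
 * Returns the number of bytes read (less than `want` only at EOF),
 * or -1 on a real error.
 */
ssize_t read_fully(int fd, void *buf, size_t want)
{
    char *p = buf;
    size_t got = 0;

    while (got < want) {
        ssize_t n = read(fd, p + got, want - got);
        if (n == 0)
            break;                  /* EOF: the writer closed its end */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal, retry */
            return -1;
        }
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```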

An Interesting Optimization Problem · What this program does is read an XML file that is either a from-scratch description of the database Visual Net is mapping, or describes some deltas to it. In the first case, performance is vital because the input files are potentially huge; many gigabytes is not uncommon. In the second case, this is an interactive transaction and humans are waiting for the results of the deltas. Either way, performance is critical.

At the moment, it’s the first area that’s giving me the performance challenges. Visual Net is very fast because it’s mostly just compiled code traversing in-memory data structures, and thus so is update. But pulling the data out of the monster XML streams and building those structures can be challenging. Here’s one little part of the problem, which XML aficionados will find amusing.

One of the elements in the input stream represents a customer’s data object, which can come with an arbitrary number of named metadata fields. We handle this with a <metadata> element; the fields show up either as attributes or as child elements, almost always as attributes. In one of the databases we’ve done recently, there are 29 fields for each incoming data object.

So here’s the problem. When Expat gives you a start-tag event and an array of name/value pairs, and for each attribute you have to look through your list of 29 field definitions to figure out where this one goes, and you’re processing twenty million records or so, the profiler tells you that looking through that list starts to loom very large in your processing time figures.

What would you do?

I ended up computing a little automaton; it takes 35 states to recognize the 29 distinct possible attribute names, rarely needing to look at more than three characters. Made a huge difference.
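For the curious, here is a rough sketch of one way to build that kind of recognizer; it is not the actual Visual Net code. It grows a trie only as deep as needed to tell the names apart, then confirms the remaining characters with a single strcmp. The struct matcher type, the function names, the 128-state cap, and the ASCII-only restriction are all assumptions made for the sketch.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_STATES 128   /* arbitrary cap for this sketch */
#define ALPHABET   128   /* ASCII names only */

struct matcher {
    const char **names;              /* borrowed from the caller, not copied */
    int nstates;
    int field[MAX_STATES];           /* field index if one name owns the state, else -1 */
    int next[MAX_STATES][ALPHABET];  /* 0 means no transition; state 0 is the root */
};

/* Allocate a state; returns 0 (never a valid new state) when the table is full. */
static int new_state(struct matcher *m, int field)
{
    if (m->nstates == MAX_STATES)
        return 0;
    m->field[m->nstates] = field;
    return m->nstates++;
}

/* Add names[i], pushing an existing leaf one character deeper on a clash. */
static int matcher_insert(struct matcher *m, int i)
{
    const char *name = m->names[i];
    int state = 0;
    for (size_t d = 0; ; d++) {
        unsigned char c = (unsigned char)name[d];   /* includes the final '\0' */
        int t = m->next[state][c];
        if (t == 0) {                   /* unused edge: this prefix is unique */
            t = new_state(m, i);
            if (t == 0)
                return -1;
            m->next[state][c] = t;
            return 0;
        }
        if (c == '\0')
            return -1;                  /* duplicate name */
        int k = m->field[t];
        if (k >= 0) {                   /* leaf owned by names[k]: split it */
            int leaf = new_state(m, k);
            if (leaf == 0)
                return -1;
            m->field[t] = -1;
            m->next[t][(unsigned char)m->names[k][d + 1]] = leaf;
        }
        state = t;
    }
}

/* Build a matcher over n names; returns NULL on bad input or overflow. */
struct matcher *matcher_new(const char **names, int n)
{
    struct matcher *m = calloc(1, sizeof *m);
    if (m == NULL)
        return NULL;
    m->names = names;
    m->nstates = 1;                     /* the root */
    m->field[0] = -1;
    for (int i = 0; i < n; i++) {
        int ok = 1;
        for (const char *p = names[i]; *p; p++)
            if ((unsigned char)*p >= ALPHABET)
                ok = 0;
        if (!ok || matcher_insert(m, i) != 0) {
            free(m);
            return NULL;
        }
    }
    return m;
}

/* Returns the index of name in the original list, or -1 if it is unknown. */
int matcher_lookup(const struct matcher *m, const char *name)
{
    int state = 0;
    for (size_t d = 0; ; d++) {
        unsigned char c = (unsigned char)name[d];
        if (c >= ALPHABET)
            return -1;
        int t = m->next[state][c];
        if (t == 0)
            return -1;
        int k = m->field[t];
        if (c == '\0')
            return k;                   /* matched through the terminator */
        if (k >= 0)                     /* unique prefix: verify the tail */
            return strcmp(m->names[k] + d + 1, name + d + 1) == 0 ? k : -1;
        state = t;
    }
}
```

The state count depends entirely on how much the names share prefixes, so there is no reason to expect this sketch to land on the same number of states for the same field list.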

I wonder what all those XML deserialization packages I see advertised out there do? I wonder how they’d perform if asked to read 25 million records?

Understanding the Problem · This work has reinforced my conviction that you never really understand the problem until you’ve written some of the code. We’d worked out in advance how this thing I’m writing was supposed to interface to the rest of the system, then on day three I had to go back and say “This isn’t gonna work, here’s why” and we made some changes.

I assume there must be people out there who are smart enough to spec out an interface without having written any code and get it right; but I also assume that they’re few and far between, and normal people like you and me shouldn’t count on this sort of virtuosity.


September 01, 2003