WF II: Erlang Blues

This is the second progress report from the Wide Finder Project, and a follow-on from the first, Erlang Ho! The one thing that Erlang does right is so important that I’d like to get past its problems. So far I can’t.

[Update: Several people wrote “show me the data!” Well, OK then. Ten thousand lines of logfile, a little over two meg, may be found at www.tbray.org/tmp/o10k.ap. Anyone who can generate Erlang code that can read this stuff and parse out the ongoing fragment fetches at a remotely competitive speed will get me interested in Erlang again.] [Update II: See the comments: the problem seems to be io:get_line (which is a serious problem, by the way). At first glance, the solutions work for the case when you can read the whole file into memory before you start to process any of it. Looks like I’ll be spending a bit more quality time with E.]
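For readers wondering what the whole-file approach from Update II might look like, here is a minimal sketch. It is not any commenter's actual code. The module name wf_slurp and its function names are made up for illustration, and it assumes a reasonably recent Erlang/OTP (binary:split/3 with the trim_all option) and a logfile small enough to fit in memory. The matching rule is the same one used by process_match in the post.

```erlang
-module(wf_slurp).
-export([count_file/1, count_lines/1]).

%% Read the entire file into one binary, split it into lines,
%% and count the lines that are ongoing fragment fetches.
count_file(Path) ->
    {ok, Bin} = file:read_file(Path),
    count_lines(binary:split(Bin, <<"\n">>, [global])).

%% Sum the per-line match results (each is 0 or 1).
count_lines(Lines) ->
    lists:foldl(fun(Line, Count) -> Count + match_line(Line) end, 0, Lines).

%% The request URI is the seventh space-separated field of a log line.
%% trim_all drops empty fields, similar to what string:tokens/2 does.
match_line(Line) ->
    case binary:split(Line, <<" ">>, [global, trim_all]) of
        [_, _, _, _, _, _, Uri | _] -> process_match(Uri);
        _ -> 0
    end.

%% 1 if the URI starts with "/ongoing/When/" and the rest has no ".".
process_match(<<"/ongoing/When/", Trailer/binary>>) ->
    case binary:match(Trailer, <<".">>) of
        nomatch -> 1;
        _ -> 0
    end;
process_match(_) -> 0.
```

With the sample data saved locally, wf_slurp:count_file("o10k.ap") would return the number of fragment fetches. Everything stays a binary here, so there is no per-line conversion between lists and binaries.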

Interactive Irritation · If you type erl at your shell prompt, you’re in Erlang. There are differences between what you can have in a file and what you can type at the prompt. And when you type control-D, nothing happens; to get out, you have to interrupt it with control-C or equivalent. This is wrong.

If you type erlc foo.erl it’ll compile into foo.beam, but you can’t just type erl foo.beam to run it, you have to add a bunch of stupid pettifogging options. This is wrong.
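For the record, the options in question are usually something like the ones in the comment below. This is a sketch only: foo is the module name from the paragraph above, and start/0 is a function name picked for illustration.

```erlang
%% foo.erl: a hypothetical module, only to illustrate running compiled code.
%%
%% Compile it, then run it without the interactive shell:
%%   erlc foo.erl
%%   erl -noshell -s foo start -s init stop
-module(foo).
-export([start/0]).

%% "-s foo start" calls foo:start/0 once the VM is up;
%% "-s init stop" then shuts the VM down so the command returns.
start() ->
    io:format("hello from foo~n").
```

Leave off -s init stop and the VM keeps running after start/0 returns, which is probably not what you want from a command-line script.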

Slow Basics · Yesterday, I reported on Erlang’s appalling regexp performance. Someone using the handle “Masklinn” suggested using pattern-matching with what Erlang calls “binaries” instead. So I did. Let’s leave the add-’em-up part of my problem out, and zero in on the problem of getting Erlang to read the 1,167,948 lines of logfile and select the 105,099 that are fetches of ongoing fragments.

I coded it up using patterns per Masklinn’s suggestion and it was really much better, burning only 56.44 CPU seconds on my MacBook. The code looks like this:

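%% Score a URI as 1 if it starts with "/ongoing/When/" and the rest
%% contains no "."; otherwise score it as 0.
process_match(<< "/ongoing/When/", Trailer/binary >>) ->
    Last = binary_to_list(Trailer),
    case lists:member($., Last) of
        true -> 0;
        false -> 1
    end;
process_match(_) -> 0.

%% Read lines until eof, adding up the scores. The URI is the seventh
%% space-separated token of each logfile line.
scan_line(eof, _, Count) -> Count;
scan_line(Line, File, Count) ->
    Uri = list_to_binary(lists:nth(7, string:tokens(Line, " "))),
    NewCount = Count + process_match(Uri),
    scan_line(io:get_line(File, ''), File, NewCount).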

At this point I smelled a rat; in particular, an I/O rat. So I ripped out all the fragment-recognition crap and measured how long it takes Erlang to read the lines:

scan_line(eof, _, Count) -> Count;
scan_line(_, File, Count) ->
    scan_line(io:get_line(File, ''), File, Count + 1).

That was better: a mere 34.165 CPU seconds.
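The snippet above leaves out the code that opens the file and starts the loop. A self-contained version, with one way of measuring CPU time from inside Erlang, might look like the sketch below. The module name wf_time, the function names, and the error clause are additions for illustration, and statistics(runtime) is just one way to get CPU milliseconds; it is not necessarily how the numbers quoted here were measured.

```erlang
-module(wf_time).
-export([count_lines/1, timed_count/1]).

%% Open the file and count its lines one at a time with io:get_line/2.
count_lines(Path) ->
    {ok, File} = file:open(Path, [read]),
    Count = scan_line(io:get_line(File, ''), File, 0),
    ok = file:close(File),
    Count.

scan_line(eof, _, Count) -> Count;
%% Added clause: fail on a read error instead of counting it as a line.
scan_line({error, Reason}, _, _) -> erlang:error(Reason);
scan_line(_, File, Count) ->
    scan_line(io:get_line(File, ''), File, Count + 1).

%% Return {LineCount, CpuMilliseconds} for one pass over the file.
timed_count(Path) ->
    statistics(runtime),                  % reset the since-last-call counter
    Count = count_lines(Path),
    {_, CpuMs} = statistics(runtime),     % CPU ms used since the reset
    {Count, CpuMs}.
```

Calling wf_time:timed_count("o10k.ap") on the sample data would give the line count and the CPU milliseconds spent purely on reading lines.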

Except for, on the same MacBook, my simple little Ruby program read the lines, parsed out the fragment fetches, and did all the totaling and sorting in, let’s see... 3.036 CPU seconds (3.47 seconds elapsed).

Hold On a Second · Unlike Ruby, Erlang is highly concurrent. So if I ran it on an 8-core machine, the Ruby would run at the same speed, but the Erlang ought to go faster. Except that parallelizing line-at-a-time I/O from a file would be hard. And even if you could, and it went eight times as fast, and even leaving out all the matching and adding, that’s 34.165 / 8 ≈ 4.27 CPU seconds against Ruby’s 3.036: still nearly half again as slow as Ruby. This is wrong.

Dear Erlang I: · I like you. Really, I do. But until you can read lines of text out of a file and do basic pattern-matching against them acceptably fast (which most people would say is faster than Ruby), you’re stuck in a niche; you’re a thought experiment and a consciousness-raiser and an engineering showpiece, but you’re not a general-purpose tool. Sorry.

I’m done with beating my head against Erlang for now. If someone can show me how to make it read and pattern-match at a remotely competitive speed, I’ll actually be able to look at that nice concurrency stuff; and I’d like to.

Dear Erlang II: · Thank you for helping me think about the Wide Finder problem. As a result of these days together, I think I know what the ideal solution would look like. I suspect it’s not out there yet, but I bet I can recognize it when I see it. More on that later.

Contributions


From: Evan (Sep 22 2007, at 21:53)

For what it's worth, and I'll admit that it's been a while since I checked it out, Termite Scheme is a Scheme implementation which is meant to have Erlangesque concurrency features. It's built on Gambit Scheme, which is meant to be one of the faster Schemes. It lives here: http://toute.ca/. It was done as a thesis project, so it might not be current or well supported.

Keeping with the insect theme, there was something called Mosquito Lisp, done by some people called (appropriately enough) Ephemeral Security. It appears to be gone, though, now that I've gone looking for the link. It's http://ephsec.squarespace.com/mosquito-lisp/ for what it's worth. Perhaps it'll be back up soon.

I'm following this one closely. Something that goes more or less seamlessly from simple scripts to 'enterprise class' concurrent systems is something that I've been waiting on for a good long while now.