This is the second progress report from the Wide Finder Project, and a follow-on from the first, Erlang Ho! The one thing that Erlang does right is so important that I’d like to get past its problems. So far I can’t.
[Update: Several people wrote “show me the data!” Well, OK then. Ten thousand lines of logfile, a little over two meg, may be found at www.tbray.org/tmp/o10k.ap. Anyone who can generate Erlang code that can read this stuff and parse out the ongoing fetches at a remotely competitive speed will get me interested in Erlang.]
[Update II: See the comments: the problem seems to be io:get_line (which is a serious problem, by the way). At first glance, the solutions only work for the case when you can read the whole file into memory before you start to process any of it. Looks like I’ll be spending a bit more quality time with E.]
Interactive Irritation ·
If you type
erl at your shell prompt, you’re in Erlang. There
are differences between what you can have in a file and
what you can type at the prompt. And when you type control-D, nothing
happens; to get out, you have to interrupt it with control-C or
equivalent. This is wrong.
If you type
erlc foo.erl it’ll compile into
foo.beam, but you can’t just type
erl foo.beam to run it, you have to add a bunch of
stupid pettifogging options. This is wrong.
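For the record, the incantation looks something like the following. This is a sketch, not taken from the original post: it assumes a module foo in foo.erl that exports a zero-argument main/0 (the module and function names are placeholders).

```shell
# Compile foo.erl to foo.beam (module/function names are placeholders).
erlc foo.erl
# Run foo:main() without dropping into the interactive shell,
# then tell the VM to shut down when it's done.
erl -noshell -s foo main -s init stop
```

Compare that with `ruby foo.rb`.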
Slow Basics · Yesterday, I reported on Erlang’s appalling regexp performance. Someone using the handle “Masklinn” suggested using pattern-matching with what Erlang calls “binaries” instead. So I did. Let’s leave the add-’em-up part of my problem out, and zero in on the problem of getting Erlang to read the 1,167,948 lines of logfile and select the 105,099 that are fetches of ongoing fragments.
I coded it up using patterns per Masklinn’s suggestion and it was really much better, burning only 56.44 CPU seconds on my MacBook. The code looks like this:
process_match(<< "/ongoing/When/", Trailer/binary >>) ->
    Last = binary_to_list(Trailer),
    case lists:member($., Last) of
        true -> 0;
        false -> 1
    end;
process_match(_) -> 0.

scan_line(eof, _, Count) -> Count;
scan_line(Line, File, Count) ->
    Uri = list_to_binary(lists:nth(7, string:tokens(Line, " "))),
    NewCount = Count + process_match(Uri),
    scan_line(io:get_line(File, ''), File, NewCount).
At this point I smelled a rat; in particular, an I/O rat. So I ripped out all the fragment-recognition crap and measured how long it takes Erlang to read the lines:
scan_line(eof, _, Count) -> Count;
scan_line(_, File, Count) ->
    scan_line(io:get_line(File, ''), File, Count + 1).
That was better: a mere 34.165 CPU seconds.
Except for, on the same MacBook, my simple little Ruby program read the lines, parsed out the fragment fetches, and did all the totaling and sorting in, let’s see... 3.036 CPU seconds (3.47 seconds elapsed).
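For comparison, here’s a sketch of the Ruby side of the job. This is not my actual program, and the helper names (fragment_trailer, top_fragments) are made up for illustration; but the logic mirrors the Erlang above: take the seventh whitespace-separated field of each logfile line, keep it if it starts with /ongoing/When/ and the trailer contains no dot (a dot would indicate an image or other auxiliary file rather than a fragment).

```ruby
# Sketch of a Wide Finder-style fragment counter. Not the original
# program; names are illustrative.

PREFIX = "/ongoing/When/"

# Return the fragment trailer for a logfile line, or nil if the line
# isn't a fetch of an ongoing fragment. The request URI is the 7th
# whitespace-separated field of an Apache combined-log line.
def fragment_trailer(line)
  uri = line.split(" ")[6]
  return nil unless uri && uri.start_with?(PREFIX)
  trailer = uri[PREFIX.length..-1]
  trailer.include?(".") ? nil : trailer
end

# Tally fragment fetches from any enumerable of lines
# (e.g. File.foreach("o10k.ap")) and return the top n as
# [trailer, count] pairs, most-fetched first.
def top_fragments(lines, n = 10)
  counts = Hash.new(0)
  lines.each do |line|
    t = fragment_trailer(line)
    counts[t] += 1 if t
  end
  counts.sort_by { |_, c| -c }.first(n)
end
```

Run it against the o10k.ap sample with `top_fragments(File.foreach("o10k.ap"))`.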
Hold On a Second · Unlike Ruby, Erlang is highly concurrent. So if I ran it on an 8-core machine, the Ruby would run at the same speed, but the Erlang ought to go faster. Except that parallelizing line-at-a-time I/O from a file would be hard. And even if you could, and went eight times as fast, and even leaving out all the matching and adding, you’d still be half again as slow as Ruby. This is wrong.
Dear Erlang I · I like you. Really, I do. But until you can read lines of text out of a file and do basic pattern-matching against them acceptably fast (which most people would say is faster than Ruby), you’re stuck in a niche; you’re a thought experiment and a consciousness-raiser and an engineering showpiece, but you’re not a general-purpose tool. Sorry.
I’m done with beating my head against Erlang for now. If someone can show me how to make it read and pattern-match at a remotely competitive speed, I’ll actually be able to look at that nice concurrency stuff; and I’d like to.
Dear Erlang II · Thank you for helping me think about the Wide Finder problem. As a result of these days together, I think I know what the ideal solution would look like. I suspect it’s not out there yet, but I bet I can recognize it when I see it. More on that later.