WF I: Erlang Ho!

This is the first progress report from the Wide Finder Project. Erlang is the obvious candidate for a Wide Finder implementation. It may be decades old but it’s the new hotness, it’s got a PragBook (@Amazon), I hear heavy breathing from serious software geeks whichever way I listen. So, let’s give it a whirl. [Warning: Long and detailed, but the conclusion comes first.]

Conclusion · Based on a few days’ bashing, I think that if I were starting out on a networked application that needed to run well on many-core systems and I wanted to be really really sure would never go down, Erlang might be just the ticket.

But as for the Wide Finder problem: Erlang, as it ships today, doesn’t work. It’s got slow-ish file I/O and ridiculously, unusably slow regular expression processing. There may be work-arounds.

Leaping In · Joe Armstrong’s book is no Camel or Pickaxe, but it’s OK. Armstrong tends to lead with theory, introducing the language formalisms then showing how you might use them. The coverage is a little thin; there are lots of big important pieces of the library that are entirely bypassed, and lots of omitted pieces in those that are covered. Whatever; if you’re an experienced programmer, you’ll be able to to use this as a bootstrap, and there are decent resources online.

I raised my eyebrows a bit at the near-complete absence of any discussion of text processing, regular expressions, and so on. Also at Erlang’s quaint notions about strings and characters; see Sam Ruby’s ASCII, ISO-8859-1, UCS, and Erlang. Well, the language reflects the book.

The rest of this piece is basically my working notes from the last few days’ work. I’m sure the source-code snippets will make real Erlang experts blanch in horror. That’s OK, I’m probably reasonably typical of an experienced developer picking this stuff up, so my goofs are likely usefully diagnostic.

Looping · Erlang doesn’t want to; it wants you to use tail-recursion instead. Here’s how I process all the lines out of a file:

scan_line(io:get_line(File, ''), File, Pattern, Counter).

scan_line(eof, _, _, _) -> 0;
scan_line(Line, File, Pattern, Counter) ->
    process_match(Line, regexp:match(Line, Pattern), Counter),
    scan_line(io:get_line(File, ''), File, Pattern, Counter).

I suppose this is OK, except for the tail recursion is really a glorified goto statement, and after a while it starts to feel like an elaborate fake-recursive lie.

Counting · In most computer languages, the way you count things is by incrementing a variable. In the Ruby program that the Wide Finders are trying to replicate, it’s like this:

    counts[$1] += 1

Except for, in Erlang, Variables aren’t variable, so you Can’t Do That. I thought of two ways to count things. Every Erlang process has a process dictionary, i.e. a hashtable or content-addressable store or whatever, with get() and put() calls. This is the kind of thing that the Ruby code above depends on; the square brackets are a dictionary lookup. The problem is, Joe Armstrong severely disses the use of the dictionary; you get the idea that if you use it, you’re the kind of person who sniffs little girls’ bicycle seats for thrills.

There’s also a hashtable package called ets, of which Joe expresses less disapproval.

Another way to count things, which I suspect would make Erlinguists happier, is to spawn a process for each thing you need to count, containing code like so:

counter_loop(Count) ->
    receive
	{ incr } ->
	    counter_loop(Count + 1);
	{ report, To } ->
	    To ! { count, Count },
	    counter_loop(Count)
    end.

Then you talk to it with code like this:

incr(Counter) ->
    Counter ! { incr }.

find_count(Counter) ->
    Counter ! { report, self() },
    receive
	{ count, Count } ->
	    Count
    end.

In my logfile, there is a small number of thousands of unique URIs to count, and the Erlang posse loves chest-beating about their process mojo, so this may be a good fit. Of course, you need some extra infrastructure to spawn the counter processes and map each unique URI to the process that’s counting it, but I built that and it seems to work OK.

Performance · Reviewing: that little Ruby program could process a week’s worth of data, a million and change lines, maybe a quarter of a gig, in about 7½ seconds of CPU, 13½ seconds elapsed. Perl was about twice as fast.

My first cut in Erlang, based on the per-process dictionary, took around eight minutes of CPU, and kept one of my MacBook’s Core Duo processors pegged at 97% while it was running. Ouch!

Switching to the ets version made no difference, nor did switching to my lightweight multi-process counter as illustrated above.

Profiling · Erlang comes with no less than three profilers, fairly thinly documented. I ended up using something called fprof. It’s a pig. I put it to work on a 100K-line subset of the test data, and it ground my computer to a halt and exhibited no progress in ten minutes, I killed that and started again with a 10K-line (2M of data) subset. It was still pretty bad; then I noticed that the profile trace was up over 8G in size and growing fast. So I tried again on only a thousand lines (195K), and the profile data was a mere 900M.

Eventually, the profiler revealed the awful truth: my program was spending 90% of its time in regexp:match and another 9% in io:get_line. I put my amazing powers of deduction to work and considered the hypothesis that Erlang regular expressions are slow. I had previously noticed this comment in the implementation:

%% Note that we interpret the syntax tree of a regular expression
%% directly instead of converting it to an NFA and then interpreting
%% that. This method seems to go significantly faster.

Well, a single-digit number of seconds with Google turned up lots of evidence, as for example from Gaspar Chilingarov in Implementatoin of regular expressions in pure Erlang [sic]: “In you tried to use regexp module from base erlang distribution or gregexp from jungerl, you noticed, that they work quite slow and are not usable to real text extraction tasks.”

There are alternatives: Gaspar, quoted above, provides one, and there has been at least one interface to the Posix regex libraries.

So I’ll check these out, because I’d really like to get past this basic stuff and investigate whether, as seems likely, I can get past Erlang’s reluctance to read, parse, and count, and make use of its allegedly-excellent concurrency.

Postscript · And I should say: I’ve really enjoyed having to learn a new way to think about computer programming.

Contributions

Comment feed for ongoing:

From: Alex (Sep 21 2007, at 17:42)

I may be grievously wrong, but I would think the idiomatic way to accumulate your counter would be to simply pass it as an argument to your function. The function takes Count, and the recursive call passes Count + 1. The version of Count received on the eof branch will be the total.

[link]

From: Clayton Wheeler (Sep 21 2007, at 18:29)

I'm no Erlang expert, but I've spent some time with it here and there. A more comfortable way, I think, to approach something like your looping construct in Erlang is to write a generic each_line function (shades of Ruby, I guess) to call a fun on every line of a file. Then you've got your loop termination and line reading logic separated from whatever you happen to do with each line. I like the pattern matching on function heads, but for things like this a case statement sometimes seems clearer, too. Like maybe so:

each_line(File, Fun) ->

case io:get_line(File, '') of

eof ->

done;

Line ->

Fun(Line),

each_line(File, Fun)

end.

process_file(File, Pattern, Counter) ->

each_line(File,

fun(Line) ->

process_match(Line,

regexp:match(Line, Pattern),

Counter)

end).

I think I agree about the general weakness of Erlang's text processing. I've been working on a network traffic analyzer in Erlang (and a parallel version in Ruby); my hand-built Erlang HTTP parser works OK, but I wouldn't call it a thing of beauty, either.

It's a bit ironic, since Erlang's bit syntax for doing pattern matching on binary data is wonderful; it just doesn't do variable-length string matching.

[link]

From: anders pearson (Sep 21 2007, at 20:33)

Alex and Clayton are right about the idiomatic way to count. That's how you do it in functional languages in general; it's not Erlang specific.

Yeah, the regexp engine is dog slow. Spending years as a diehard Perl programmer and then transitioning to being a primarily Python programmer has taught me though that regexps aren't as necessary as I used to think they were. I still do a lot of text processing in Python but I very rarely use regexps these days. In Perl and Ruby, they're right there in front of you so it feels natural to reach for them for any text munging problem. When I started with Python I would get annoyed that I had to import a library to do regexps but eventually I discovered that 90% of the time I could get the task done with just the built in string methods very efficiently and clearly. Erlang is less text processing oriented than even Python but it is *very* list oriented. There are a lot of built in functions and libraries for operating on lists and text in Erlang is lists. If you can figure out how to translate what you were doing with regexps to list operations, you're likely to see a performance improvement and the code will probably be more idiomatic.

Also, wrt looping, make sure you're comfortable with list comprehensions and the lists module. Often, what can be done with an awkward feeling tail recursive function can be neatly expressed as a list comprehension. Again, Erlang as a functional language is heavily oriented towards operating on lists. Anytime you can figure out a way to express your problem as list processing, you're probably closer to getting an efficient, idiomatic solution. Erlang give you list comprehensions, lists:map/2, lists:foldr/3, lists:foreach/2, lists:filter/2 and those are some really strong hammers in a skilled hand. Once the problem is expressed that way it also lends itself directly to parallelizing with a MapReduce type approach.

[link]

From: Masklinn (Sep 22 2007, at 02:18)

Tim, from your next post, the format of your log (or more precisely of the log files you want to extract) seems extremely regular and mostly fixed in size (only the last item has a non-fixed size).

Thus, I think you should use Erlang's pattern matching (instead of regexps) on each line of the file, either lists pattern matching or binary pattern matching (binary would probably be faster, and it gives you a much better control on the item sizes you're trying to match).

Reading each line and then matching them against a pattern along the lines of

<<"GET /ongoing/When/", _Count:3/binary, "x/", Year:4/binary, "/", Month:2/binary, "/", Day:2/binary, "/", Title/binary>> = LogFile.

Then you can just use Year, Month, Day and Title, or convert them to list-based strings first (via binary_to_list).

Of course it probably only works if the file is in ascii or an ascii-compatible encoding (such as utf-8).

[link]

From: Thomas Lackner (Sep 22 2007, at 12:05)

That's unfortunate that you didn't have a pleasant first encounter with Erlang. Perhaps it would be an interesting exercise to post 100kb of your datafile and your Erlang code, and people from the community take a crack at optimizing it?

[link]

From: Elliotte Rusty Harold (Sep 24 2007, at 06:37)

No loops? Can't change variables? Tim, doesn't this sound at least a little bit familiar to you?

Just pretend you're working in XSLT but with a syntax more suited to traditional numeric computation, and you'll be fine. :-)

[link]

From: Sean Barrett (Sep 24 2007, at 20:39)

"[Erlang] may be decades old but it’s the new hotness, it’s got a PragBook (@Amazon), I hear heavy breathing from serious software geeks whichever way I listen."

For those of us who don't follow whatever blogs you're reading, is there a link or two to explain why it's hot? I've missed this entirely.

If it has to do with multicore, I advise typing 'erlang faq' into google, press 'i feel lucky', and then check out item 1.4 ("what sort of problems is Erlang not particularly suitable for"). At least according to that entry, rewriting an app in Erlang to try to leverage multiple cores is not likely to be a big win, at least not until we have a _lot_ of cores. (To put it in terms of that faq entry, a small (constant) number of cores is just driving down your non-big-O constants... that Erlang is making bigger than other languages.)

[link]

From: Rornan (Sep 25 2007, at 08:31)

Tim,

If you had anything at all to do with creating XSLT, then you have no right at all to comment on any other language deficency ever ever again.

[link]

ongoing

What this is ·

Truth · Biz · Tech

author · Dad
colophon · rights

September 21, 2007
· Technology (90 fragments)
· · Concurrency (76 more)
· · Erlang (4 more)

By Tim Bray.

The opinions expressed here
are my own, and no other party
necessarily agrees with them.

A full disclosure of my
professional interests is
on the author page.

I’m on Mastodon!