Back in May I wrote a little command-line utility called Topfew (GitHub). It was fun to write, and faster than the shell incantation it replaced. Dirkjan Ochtman dropped in a comment noting that he’d written Topfew along the same lines but in Rust (GitHub) and that it was 2.86 times faster; at GitHub, the README now says that with a few optimizations it’s now 6.7x faster. I found this puzzling and annoying so I did some optimization too, encountering surprises along the way. You already know whether you’re the kind of person who wants to read the rest of this.

Reminder: What topfew does is replace the sort | uniq -c | sort -rn | head pipeline that you use to do things like find the most popular API call or API caller by plowing through a logfile.

Initial numbers · I ran these tests against a sample of the ongoing Apache access_log, around 2.2GB containing 8,699,657 lines. In each case, I looked at field number 7, which for most entries gives you the URL being fetched. Each number is the average of three runs following a warm-up run. (The variation between runs was very minor.)

                Elapsed    User  System  vs Rust
Rust               4.96    4.45    0.50     1.0
sort/uniq/sort   142.80  149.66    1.29   28.80
topfew 1.0        27.20   28.31    0.54    5.49

Yep, the Rust code (Surprise #1) measures over five times faster than the Go. In this task, the computation is trivial and the whole thing should be pretty well 100% IO-limited. Granted, the Mac I’m running this on has 32G of RAM so after the first run the code is almost certainly coming straight out of RAM cache.

But still, the data access should dominate the compute. I’ve heard plenty of talk about how fast Rust code runs, so I’d be unsurprised if its user CPU was smaller. But elapsed time!?

What Dirkjan did · I glanced over his code. Gripe: Compared to other modern languages, I find Rust suffers from a significant readability gap, and for me, that’s a big deal. But I understand the appeal of memory-safety and performance.

Anyhow, Dirkjan, with some help from others, had added a few optimizations:

  1. Changed the regex from \s+ to [ \t].

  2. The algorithm has a simple hash-map of keys to occurrence counts. I’d originally done this as just string-to-number. He made it string-to-pointer-to-number, so after you find it in the hash table, you don’t have to update the hash-map, you just increment the pointed-to value.

  3. Broke the file up into segments and read them in separate threads, in an attempt to parallelize the I/O.

I’d already thought of #3 based on my intuition that this task is I/O-bound, but not #2, even though it’s obvious.

What I did · In his comment, Dirkjan pointed out correctly that I’d built my binary “naively with ‘go build’”. So I went looking for how to make builds that are super-optimized for production and (Surprise #2) there aren’t any. (There’s a way to turn optimizations off to get more deterministic behavior.) Now that I think about it I guess this is good; make all the optimizations safe and just do them, don’t make the developer ask for them.

(Alternatively, maybe there are techniques for optimizing the build, but some combination of poor documentation or me not being smart enough led to me not finding them.)

Next, I tried switching in the simpler regex, as in Dirkjan’s optimization #1.

                Elapsed    User  System  vs Rust
Rust               4.96    4.45    0.50     1.0
sort/uniq/sort   142.80  149.66    1.29   28.80
topfew 1.0        27.20   28.31    0.54    5.49
"[ \t]"           25.03   26.18    0.53    5.05

That’s a little better. Hmm, now I’m feeling regex anxiety.

Breakthrough · At this point I fat-fingered one of the runs to select on the first rather than the seventh field and that really made topfew run a lot faster. Which strengthened my growing suspicion that I was spending a lot of my time selecting the field out of the record.

At this point I googled “golang regexp slow” and got lots of hits; there is indeed a widely expressed opinion that Go’s regexp implementation is far from the fastest. Your attitude to this probably depends on your attitude toward using regular expressions at all, particularly in performance-critical code.

So I tossed out the regexp and hand-wrangled a primitive brute-force field finder. Which by the way runs on byte-array slices rather than strings, thus skipping at least one conversion step. The code ain’t pretty but passed the unit tests and look what happened!

                Elapsed    User  System  vs Rust
Rust               4.96    4.45    0.50     1.0
sort/uniq/sort   142.80  149.66    1.29   28.80
topfew 1.0        27.20   28.31    0.54    5.49
"[ \t]"           25.03   26.18    0.53    5.05
no regexp          4.23    4.83    0.66    0.85

There are a few things to note here. Most obviously (Surprise #3), the Go implementation now has less elapsed time. So yeah, the regexp performance was sufficiently bad to make this process CPU-limited.

Less obviously, in the Rust implementation the user CPU time is less than the elapsed; user+system CPU are almost identical to elapsed, all of which suggests it’s really I/O limited. Whereas in all the different Go permutations, the user CPU exceeds the elapsed. So there’s some concurrency happening in there somewhere even though my code was all single-threaded.

I’m wondering if Go’s sequential file-reading code is doing multi-threaded buffer-shuffling voodoo. It seems highly likely, since that could explain the Go implementation’s smaller elapsed time on what seems like an I/O-limited problem.

[Update: Dirkjan, after reading this, also introduced regex avoidance, but reports that it didn’t appear to speed up the program. Interesting.]

Non-breakthroughs · At this point I was pretty happy, but not ready to give up, so I implemented Dirkjan’s optimization #2 - storing pointers to counts to allow increments without hash-table updates.

                Elapsed    User  System  vs Rust
Rust               4.96    4.45    0.50     1.0
sort/uniq/sort   142.80  149.66    1.29   28.80
topfew 1.0        27.20   28.31    0.54    5.49
"[ \t]"           25.03   26.18    0.53    5.05
no regexp          4.23    4.83    0.66    0.85
hash pointers      4.16    5.10    0.65    0.84

Um, well, that wasn’t (Surprise #4) what I expected. Avoiding almost nine million hash-table updates had almost no observable effect on latency, while slightly increasing user CPU. At one level, since there’s evidence the code is limited by I/O throughput, the lack of significant change in elapsed time shouldn’t be too surprising. The increase in user CPU is surprising, though; possibly it’s just a measurement anomaly?

Well, I thought, if I’m I/O-limited in the filesystem, let’s be a little more modern and instead of reading the file, let’s just pull it into virtual memory with mmap(2). So it should be the VM paging code getting the data, and everyone knows that’s faster, right?

                Elapsed    User  System  vs Rust
Rust               4.96    4.45    0.50     1.0
sort/uniq/sort   142.80  149.66    1.29   28.80
topfew 1.0        27.20   28.31    0.54    5.49
"[ \t]"           25.03   26.18    0.53    5.05
no regexp          4.23    4.83    0.66    0.85
hash pointers      4.16    5.10    0.65    0.84
mmap               7.17    8.33    0.62    1.45

So by my math, that’s (Surprise #5) 72% slower. I am actually surprised, because if the paging code just calls through to the filesystem, it ought to be a pretty thin layer and not slow things down that much. I have three hypotheses: First of all, Go’s runtime maybe does some sort of super-intelligent buffer wrangling to make the important special case of sequential filesystem I/O run fast. Second, the Go mmap library I picked more or less at random is not that great. Third, the underlying MacOS mmap(2) implementation might be the choke point. More than one of these could be true.

A future research task is to spin up a muscular EC2 instance and run a few of these comparisons there to see how a more typical server-side Linux box would fare.

Parallel? · Finally, I thought about Dirkjan’s parallel I/O implementation. But I decided not to go there, for two reasons. First of all, I didn’t think it would actually be real-world useful, because it requires having a file to seek around in, and most times I’m running this kind of a pipe I’ve got a grep or something in front of it to pull out the lines I’m interested in. Second, the Rust program that does this is already acting I/O limited, while the Go program that doesn’t is getting there faster. So it seems unlikely there’s much upside.

But hey, Go makes concurrency easy, so I refactored slightly. First, I wrapped a mutex around the hash-table access. Then I put the code that extracts the key and calls the counter in a goroutine, throwing a high-fanout high-TPS concurrency problem at the Go runtime without much prior thought. Here’s what happened.

                Elapsed    User  System  vs Rust
Rust               4.96    4.45    0.50     1.0
sort/uniq/sort   142.80  149.66    1.29   28.80
topfew 1.0        27.20   28.31    0.54    5.49
"[ \t]"           25.03   26.18    0.53    5.05
no regexp          4.23    4.83    0.66    0.85
hash pointers      4.16    5.10    0.65    0.84
goroutines!        6.53   57.84    7.68    1.32

(Note: I turned the mmap off before I turned the goroutines on.)

I’m going to say Surprise #6 at this point even though this was a dumbass brute-force move that I probably wouldn’t consider doing in a normal state of mind. But wow, look at that user time; my Mac claims to be an 8-core machine but I was getting more than 8 times as many CPU-seconds as clock seconds. Uh huh. Pity those poor little mutexes.

I guess a saner approach would be to call runtime.NumCPU() and limit the concurrency to that value, probably by using channels. But once again, given that this smells I/O-limited, the pain and overhead of introducing parallel processing seem unlikely to be cost-effective.

Or, maybe I’m just too lazy.

Lessons · First of all, Rust is not magically and universally faster than Go. Second, using regular expressions can lead to nasty surprises (but we already knew that, didn’t we? DIDN’T WE?!) Third, efficient file handling matters.

Thanks to Dirkjan! I’ll be interested to see if someone else finds a way to push this further forward. And I’ll report back any Linux testing results.



Contributions


From: Pete Forman (Jul 02 2020, at 01:56)

The Cloudflare outage shows the damage a regular expression can wreak.

https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/


From: Thomas Jung (Jul 02 2020, at 05:03)

I ran the Rust implementation https://github.com/djc/topfew-rs on a Graviton 64 core ARM m6g.16xlarge EC2 instance and a 96 core Intel Xeon r5dn.24xlarge EC2 instance. The performance for both is comparable with roughly 12 GB/sec. This is with a 20 GB input file read from tempfs. For comparison that's roughly 5x faster than the fastest available SSD today. A 16 core Graviton machine would be more than capable of saturating an SSD.


From: Thomas Kupper (Jul 02 2020, at 05:37)

Hi Tim,

I'm a rather passive follower of your blog and this comment got not much to say, except that your SSL cert expired today, 2020-07-02 :).

Have a great day

Thomas


From: Dominic Mitchell (Jul 02 2020, at 09:35)

The amount of work the Go garbage collector does is proportional to the number of pointers in use. This means that the usual advice is to reduce the number. However, this normally applies in long-lived server daemons. For this use case, I suspect you're not seeing much GC, so it probably doesn't make much difference.

As to replacing regex for performance reasons: I discovered this the hard way. At work I run a small service that matches filenames against those in our version control system, using glob-like matches. For v0, we translated the globs to regexes, and it was slow. Translating the globs into prefix+suffix matching of string literals resulted in a couple of orders of magnitude speedup.

(while the Go regexp package may come in for criticism in this article, I note that it does have the pleasant property of running in bounded time: https://golang.org/pkg/regexp/; rsc@ has a nice article on the idea: https://swtch.com/~rsc/regexp/regexp1.html)


From: anon (Jul 02 2020, at 14:36)

If this is purely I/O bound, wouldn't that mean that the system column would take 4 seconds as well? I mean how does this compare to "time cat logfile >/dev/null"? If it's the same, then it's probably IO bound but I suspect it isn't.

I wouldn't say the computation is trivial. This solution does a lot of string copying, dynamic memory management, hash value calculations. I feel (but I'm not sure how) one could gain even more performance through a bunch of low level tricks.


From: ypsu (Jul 03 2020, at 02:26)

Ok, so I managed to rewrite this in C to see if there are still some possible wins here. In my little tests this is about 2x faster than your solution (and it still doesn't look IO bounded when the input is on tmpfs). But I don't really have a good testcase so I might be measuring wrong. And I used linux (it might not compile on mac at all). If you ever come across a linux machine, could you please benchmark my solution too on your 2 GB testfile? Thank you!

For brevity reasons I didn't implement cmdline arguments so just change the constants at the top. (And I apologize if it turns out to be buggy. C is hard. Let me know and I fix the bugs.)

Here's the solution: https://gist.github.com/ypsu/8e658a8721d3032dee44b9017681b0c6


From: Martin Probst (Jul 06 2020, at 02:09)

Re mmap, my (possibly flawed) understanding is that mmap'ing a file is useful if said file does not fit into memory.

mmap allows random access to the file's contents without incurring the memory use, which gives a nice programming abstraction. Additionally, the OS can skip one copy for readonly mmap'ed files. But at XX GB/s memory bandwidth that's likely invisible compared to your program's processing (it's probably the ~0.5s spent in system, compared to 10x the time spent in user space).

At the same time, accessing mmap'ed files causes page faults for the parts not yet mapped in, which will cause context switches and require MMU shenanigans, so there is a bunch of management overhead. Just reading an entire file into one large []byte is one more copy (but copying 2 GB at 30 GB/s is not that expensive), but might compare favourably to that.


July 01, 2020