In preparation for my mod_atom presentation at ApacheCon next week, I’ve been running some performance tests with the help of a gaggle of client machines rustled up by some good people in the Sun engineering-lab group. The first numbers are trickling in, and I’m a bit at a loss both about what to measure and about how to evaluate the results. Is 180 POSTs/second on a T2000 good?
[Update: Make that 320/second; have some better data presented in a table.]

What mod_atom Does · When you POST an Atom Entry, mod_atom parses it with Expat, finds a home for it in the filesystem, munges the XML slightly to make sure the dates and ID are right, persists it with Genx, repeats the exercise to produce an HTML rendition, touches a couple of empty flag files, and sends the entry down the pipe to the client. It doesn’t regenerate the collection files or public-facing feeds; that’s what the flag files are for. When you GET a collection or feed, mod_atom checks whether there’s a new flag file and, if so, regenerates the collection or feed on demand.
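
For the curious, here’s roughly the shape of that POST path in C. It’s a sketch, not the module’s actual innards: the helper functions and the entry-file naming are invented for illustration, and the real code lives inside Apache’s request machinery rather than plain stdio.

  /* Hedged sketch of the POST path; helper names and file layout are
     hypothetical, not mod_atom's real internals. */
  #include <stdio.h>
  #include <string.h>

  typedef struct entry entry;                    /* parsed Atom Entry, opaque here */

  extern entry *parse_entry(const char *body, size_t len);     /* Expat underneath */
  extern void   fix_id_and_dates(entry *e);                    /* server owns atom:id, updated, published */
  extern int    write_atom(const entry *e, const char *path);  /* serialized with Genx */
  extern int    write_html(const entry *e, const char *path);  /* same entry, HTML rendition */

  static int handle_post(const char *body, size_t len, const char *pub_dir)
  {
      entry *e = parse_entry(body, len);
      if (e == NULL)
          return 400;                            /* not well-formed; reject */

      fix_id_and_dates(e);

      char atom_path[1024], html_path[1024], flag_path[1024];
      /* the "1234" entry name and the flag-file name are made up for this sketch */
      snprintf(atom_path, sizeof atom_path, "%s/1234.atom", pub_dir);
      snprintf(html_path, sizeof html_path, "%s/1234.html", pub_dir);
      snprintf(flag_path, sizeof flag_path, "%s/feed-is-stale", pub_dir);

      if (write_atom(e, atom_path) != 0 || write_html(e, html_path) != 0)
          return 500;

      /* touch the flag file; feeds and collections regenerate lazily on GET */
      FILE *flag = fopen(flag_path, "w");
      if (flag != NULL)
          fclose(flag);

      return 201;                                /* Created; entry echoed back to the client */
  }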

mod_atom does essentially no mutexing, with the single exception of the scenario where you’re doing a PUT on an existing resource and have to take a lock while computing the ETag and checking the If-Match header. All the concurrency is pushed down into the filesystem.
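
Here’s what that one locked path might look like, using an ordinary POSIX record lock on the entry file; again, this is a sketch under assumptions, with made-up helpers for the ETag computation and the rewrite, not the module’s real code.

  /* Sketch of the PUT-on-existing-resource path; everything except the
     fcntl() locking is a hypothetical stand-in. */
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  extern int compute_etag(int fd, char *etag, size_t len);   /* e.g. hash of the current bytes */
  extern int rewrite_entry(int fd, const char *body, size_t len);

  static int handle_put(const char *entry_path, const char *if_match,
                        const char *body, size_t len)
  {
      int fd = open(entry_path, O_RDWR);
      if (fd < 0)
          return 404;

      struct flock fl;
      memset(&fl, 0, sizeof fl);
      fl.l_type = F_WRLCK;                  /* exclusive lock on the whole file */
      fl.l_whence = SEEK_SET;
      if (fcntl(fd, F_SETLKW, &fl) < 0) {   /* block until we own the entry */
          close(fd);
          return 500;
      }

      char etag[64];
      compute_etag(fd, etag, sizeof etag);
      if (if_match == NULL || strcmp(if_match, etag) != 0) {
          close(fd);                        /* closing drops the lock */
          return 412;                       /* Precondition Failed */
      }

      int status = (rewrite_entry(fd, body, len) == 0) ? 200 : 500;
      close(fd);
      return status;
  }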

mod_atom is a module that gets loaded into Apache 2.2.<some-recent-build>, with no particular SPARC or Solaris optimizations aside from whatever ./configure gets me.

Initial Tests · I have some shell and Ruby scripts that use the Ape codebase to synthesize entries full of random selections from /usr/dict/words and shoot them at an alleged AtomPub server as fast as possible. Well, as fast as possible for Ruby code.

The testing framework runs on N machines; on each, it fires off ten subprocesses, each of which creates a new publication and shoots a hundred entries into it, so each machine contributes a thousand POSTs spread across ten different pubs. HTML-creation is enabled, so each POST requires the creation of three files: the .atom, the .html, and a book-keeping timestamp file.

The Atom Entries being posted average a little over 600 bytes in length.
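
To make the harness concrete, here’s a rough C/libcurl equivalent of what one client subprocess does; the real thing is Ruby on top of the Ape code, and the collection URL and entry text below are placeholders.

  /* One client subprocess, roughly: POST a hundred small Atom Entries at a
     collection as fast as possible. URL and entry text are made up. */
  #include <stdio.h>
  #include <curl/curl.h>

  #define COLLECTION_URL "http://t2000.example.com/pub/1"   /* hypothetical */
  #define N_POSTS 100

  int main(void)
  {
      curl_global_init(CURL_GLOBAL_ALL);

      struct curl_slist *headers = NULL;
      headers = curl_slist_append(headers,
          "Content-Type: application/atom+xml;type=entry");

      char body[2048];
      long status = 0;

      for (int i = 0; i < N_POSTS; i++) {
          /* the real harness fills title and content with random picks
             from /usr/dict/words, averaging ~600 bytes per entry */
          snprintf(body, sizeof body,
                   "<entry xmlns='http://www.w3.org/2005/Atom'>"
                   "<title>test entry %d</title>"
                   "<author><name>load-tester</name></author>"
                   "<content>placeholder words go here</content>"
                   "</entry>", i);

          CURL *curl = curl_easy_init();     /* a real load generator would reuse the handle */
          curl_easy_setopt(curl, CURLOPT_URL, COLLECTION_URL);
          curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
          curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

          if (curl_easy_perform(curl) == CURLE_OK) {
              curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &status);
              if (status != 201)
                  fprintf(stderr, "POST %d: unexpected status %ld\n", i, status);
          }
          curl_easy_cleanup(curl);
      }

      curl_slist_free_all(headers);
      curl_global_cleanup();
      return 0;
  }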

Server · It’s the same box I set up to run the Wide Finder 2 work: an 8-core 32-thread T2000 with 32G of RAM and an unexotic disk setup. Remember, each individual thread is pretty slow, but there are lots and lots of ’em.

Early Results · [Update: This whole section replaced after I did some re-engineering brought on by bug-fixing brought on by interoperating with MarsEdit.]

In the following table, the seconds are elapsed wall-clock time.

  Servers   Procs   POSTs   Seconds Elapsed   %Idle   POSTs/second
     1        10     1000        ~10.8         85         ~90
     2        20     2000        ~12.5         72        ~160
     3        30     3000        ~13.5         53        ~220
     4        40     4000        ~14.2         46        ~280
     5        50     5000        ~15.4         56        ~320

This is interesting. It’s very pleasing that as you add parallel client loads, the server pretty well just soaks it up with massively sublinear slowdown. Given that I haven’t really profiled this thing yet nor have I maxed the server out, the eventual throughput on this class of box for this particular benchmark mode is probably way north of the 320 you see here.

However, it’s not quite as good as it looks. Judging by the output of vmstat and friends, in that final 5-clients run, the idle time was up because something in the system was beginning to thrash a little. Mind you, it didn’t seem to hurt the throughput.

Considerations · The nature of the load a mod_atom server might see is really horribly complex to describe. I can think of the following ways to characterize it, all more or less orthogonal to each other:

  1. The total arrival rate of the transactions.

  2. The arrival rate on a per-publication basis; i.e. whether twenty transactions a second in total are run against one publication, or spread across a dozen, or a hundred.

  3. The relative proportion of GET, POST, PUT, and DELETE requests.

  4. The proportion of GETs that are for collections or feeds as opposed to entries, and the proportion of those that are parametrized to support feed paging.

  5. The average size of the incoming Atom Entries.

  6. The proportion of XHTML, HTML, and plain-text in incoming Atom Entries.

  7. The average size of media objects that are POSTed.

  8. The relative proportion of Atom Entries and media objects that are posted.

  9. The proportion of errors (malformed transactions, AtomPub protocol errors) in the input stream.

  10. The set of configuration parameters applied to the underlying Apache server. (People who’ve been there are shaking their heads in sympathy at this point.)

What Next? · Well, I started with POSTs because my intuition about the code told me that POSTing an Atom Entry was the most complex among the code paths that you might reasonably expect to experience high request volume. I’m a little disappointed with 180 requests/second, but I’m also nonplussed by the server spending nearly half its time in kernel mode. Clearly some profiling is called for, and I’d sort of expect to find some low-hanging fruit out there.

I guess the next step is to start building out a matrix containing a reasonable range of values for each of the items in the list above, and characterizing the performance at as many sane points as I can.

Questions? · If anybody’s made it this far, I’d welcome input on which dimensions of the performance space would be most interesting to exercise.

More broadly, I’m interested in what a “good” number would be. 180 new POSTs per second just doesn’t seem that great to me, especially considering that this is filesystem-backed and there’s no database getting in the way. Of course, it’s perfectly possible that my intuition about the relative performance of filesystems and databases is completely wrong. I’m starting to think of Drizzle and CouchDB... of course, no profiling yet.

Oh, and by the way, as a side effect I’m building a framework for performance testing of AtomPub server implementations.



Contributions

From: Shawn Ferry (Nov 03 2008, at 04:18)

Have you looked at the performance numbers using a ramdisk or tmpfs?

From: Brent Rockwood (Nov 03 2008, at 07:45)

I always wondered if it would be reasonable to not pre-generate the HTML, given that most modern browsers are capable of client-side XSL transformation, and that a good portion of your GETs are going to be for the XML anyway. I would think that would save you a fair bit on the POST end, and maybe something on the GET end, especially if caches are involved.

From: Brent Rockwood (Nov 03 2008, at 10:03)

I'm also very curious to know how many threads are running at one time. I don't have any experience writing Apache modules but, in other frameworks, I've seen situations where the throttling is sub-optimal and you end up spending half your time thread-switching. It shows up as heavy on kernel-mode time.

Just a shot in the dark...

From: Erik Engbrecht (Nov 03 2008, at 17:12)

You might want to reduce the thread/process count a bit. My observation is that not all T2000 cycles are created equal, because there are only 8 real processing cores, but CPU metrics are counted as if there were 32. Fully utilizing the CPU is a fuzzy term that rather depends on the type of load, and basically there is a point where increased utilization actually represents decreased work accomplished. Judging by your high system percentage, I think you are in that range.

I'd try setting the number of worker threads at, say, 8, 32, and 128, and then doing sort of a binary search for an optimum number.

From: Silvio Almeida (Nov 03 2008, at 19:34)

In a college stadium, 18,000 stupid minds with pen and paper, all of them benching hard all day on insightful notes, every buddy is given 1.67 minutes for every POST. It doesn't seem that bad, then; you get 5 million POSTs by the end of the day.

From: Lennon (Nov 06 2008, at 08:47)

180 POSTs per second actually isn't bad. The last time I benchmarked a new Postgres install on a reasonably capable x86 box (quad core, 6 GB RAM, hardware SATA RAID), I only saw about 200-250 transactions per second from pgbench.

It would be interesting to compare this to CouchDB. From my limited experiments, CouchDB doesn't seem terribly fast, but it does seem to do a good job of scaling across many cores and storage nodes.

From: David Magda (Nov 08 2008, at 09:54)

Have you used DTrace to profile what's going on at all?

Brendan Gregg's DTrace Tools has a lot of useful pre-written scripts to get you started:

http://www.brendangregg.com/dtrace.html

The dapptrace and dappprof scripts may be able to help tell you where the code is spending its time:

http://www.brendangregg.com/DTrace/dapptrace

http://www.brendangregg.com/DTrace/dappprof
