This ongoing fragment describes how to match and compare numbers using a finite automaton, which involves transforming them into strings with the right lexical properties. My hope is that there are at least twelve people in the world who are interested in the intersection of numeric representation and finite automata.

Background · (Feel free to skip this part if you already know about Quamina.)

This is yet another entry in the Quamina Diary series of blog posts. Quamina is a Go-language library that allows you to compile a bunch of “Patterns” together and, when presented with “events”, i.e. JSON data blobs, informs you which (if any) of the Patterns match each event, at a speed which is high (often millions/second) and only weakly related to the number of Patterns in any Quamina instance.

Quamina was inspired by AWS Event Ruler (“Ruler” for short), a package I helped develop while at AWS that has since been open-sourced. (Thanks, AWS!) By “based on” I mean “does a subset of the same things compatibly, with a design that is quite different, in interesting ways”. Quamina is also a fruitful source of software geekery for me to write about here, which I enjoy.

The problem · Suppose you want to match records for boxes whose height is 20cm. A sample of such a record, with most fields removed:

{
  "box": {
    "dimensions": {
      "width": 100,
      "height": 20,

(Much omitted.)

A Quamina Pattern designed to match those records would look like this:

{
  "box": {
    "dimensions": { 
      "height": [ 20 ]
    }
  } 
}

All good so far. But what if, due to some upstream computer program or inventory admin, a message showed up like so?

{
  "box": {
    "dimensions": {
      "width": 100.0,
      "height": 20.0,

Up until my last PR landed, Quamina didn’t know that “20” and “20.0” and “2.0e1” were the same quantity; it knew how to compare strings to other strings and that was all. Which was unsatisfactory. And a problem which had been solved years ago (partly by me) in Ruler.

Question · Pause a moment and ask yourself: How would you write a finite automaton which would Do The Right Thing with numbers? I’m not going to claim that the way Ruler and Quamina do it is optimal, but it’s worked pretty well for years and processed trillions of events with no breakages I know of.

Our answer: normalize the numbers into fixed-sized strings whose lexical ordering is that of the numbers they represent. Code first:

func qNumFromFloat(f float64) (qNumber, error) {
	if f < -FiveBillion || f > FiveBillion {
		return nil, errors.New("value must be between -5e9 and +5e9 inclusive")
	}
	value := uint64(TenE6 * (FiveBillion + f))
	return toHexStringSkippingFirstByte(value), nil
}
    

Constraints · Quamina requires, for numeric matching to work properly, that:

  1. The numbers be between -/+5×109, inclusive.

  2. The numbers have no more than five digits to the right of the decimal point.

You’ll notice that the code above enforces the first condition but not the second. We’ll get to that.

Effects · So, what that code is doing is:

  1. Adding 5.0e9 to the number so it’s in the range 0 … 10.0e9.

  2. Multiplying by 106 to push the five-digit fractional part to the left of the decimal point, preserving its sixth digit (if any) so the rounding in the next step works.

  3. Converting it from a float64 into a uint64.

  4. Turning that into a big-endian 14-byte hex string.

So any number that meets the constraints above is represented as 14 hex digits whose lexical order is consistent with the underlying numbers. “20”, “20.0” and “2.0e1” are all “11C3793911AD00”. Which means that this Pattern will do what reasonable people expect:

{"box": { "dimensions": { "height": [ 20 ] } } }

More formally · There are 1015 numbers that meet the constraints described above. This process maps them into hex strings. The first three and their mappings are:

-5,000,000,000, -4,999,999,999.99999, -4,999,999,999.99998
00000000000000,       00000000000009,       00000000000014

And the last three:

4,999,999,999.99998, 4,999,999,999.99999,  5,000,000,000
     2386F26FC0FFEC,      2386F26FC0FFF6, 2386F26FC10000

Less formally · This includes “most” numbers that are used in practice, including prices, occurrence counts, size measurements, and so on.

Examples of numbers that do not meet these criteria include AWS account numbers, some telephone numbers, and cryptographic keys/signatures. For those, Quamina just preserves the digits, whatever they may be, and in fact, this also usually ends up doing what people expect.

Could we do better? · I think so. To start with, hex digits are an inefficient way to represent bits; there are many other options.

The current hex approach hasn’t changed since a very early version of Ruler because it’s never been a pain point.

Speaking of Ruler, they recently landed a PR that lets them have 6 fractional digits as opposed to Quamina’s 5, simply by using decimal rather than binary arithmetic. It’s fast, too! This was made easier by the fact that Java has BigDecimal built in, while Go doesn’t. There are good open-source options out there, but I am extremely reluctant to accept dependencies in a package as low-level as Quamina. I don’t think matching more one more fractional digit justifies the cost of a dependency.

“Q numbers?” · In Ruler, the data type is ComparableNumber. In Quamina I toyed with comparableNumber and canonicalNumber but neither is quite right and both produced excessively long and ugly variable and function names. So I decided to call them “Q numbers”, where Q is for Quamina. The code became noticeably more readable.

While the set of Rationals is called “Q”, the only other significant use of the phrase “Q number” is some weird old thing out of Texas Instruments.

Practicalities · The code to enforce the constraints and do the conversion isn’t that cheap. When I first naively dropped it in, I saw a nasty performance regression. Code optimization helped, but I realized then that it’s really important not to convert an incoming field that happens to be a JSON number into a Q number unless you know, first, that it meets the constraints and second, that the finite automaton we’re trying to match has a Pattern with a numerical match that also met the Q number constraints.

The one moderately clever stroke here relies on the fact that Quamina has its own JSON parser, because Reasons. The parser obviously has to step its way through numbers, and it’s easy enough there to notice where the syntax characters like “.” and “e” are and cheaply figure out if the decimal is the right size.

Conclusion · Quamina now knows numbers. It’s a little slower to match a 14-digit Q than a string like “20”, but finite automata are fast and anyhow, being right matters.



Contributions

Comment feed for ongoing:Comments feed

From: Arne Hormann (Jul 11 2024, at 02:08)

You are converting floating point to fixed point stored in a uint64.

But if you *start* with float64, you are potentially losing data by adding before multiplying and by using decimal.

Another way:

First prepare the conversion to fixed point, but do so in binary: multiply by (1<<20); this still provides 6 digits of decimal accuracy after the separator (1024 * 1024), but it will happen using the exponent and will not yet truncate the value, you still have 52 unmodified bits of the original value. No conversion, no truncation.

The original float64 uses 1 bit for the sign and 11 bits for the exponent. That's why you can skip the first byte, right? It provides no value in the fixed point format. I think you can even drop the first nibble - but that won't help much and makes decoding weirder confusing.

Anyway, next comes the new bias. Which is (1<<55) if you intend to stay at 14 byte hex or (1<<51) if you want to go down to 13 byte hex. Add it to your float. Then convert it to uint64, then convert that to hex.

I didn't try it and am not super sure about the value for the bias, but this should work. If you do this for your 14 byte hex approach, you'll arrive at a 56 bit fixed point digit, 36 bits before the separator followed by 20 bits after. Meaning a range of -(1<<35)..(1<<35). (1<<35) is about 35 billion.

[link]

From: Chris (Jul 11 2024, at 20:23)

It might be considered amusing to use the latin word "qua" as your prefix for quaNumber which is indeed a hex string variable acting as a number.

[link]

From: Arne Hormann (Jul 11 2024, at 21:33)

My last comment is still okay, but I found a vastly better solution - directly convert the float64 with minor required changes and the least amount of detours to a comparable string: https://go.dev/play/p/gCzCrd3X0w1

[link]

From: Tim Bray (Jul 12 2024, at 08:20)

Hey @ArneHormann, feel like doing a PR?

[link]

author · Dad
colophon · rights
picture of the day
July 09, 2024
· Technology (90 fragments)
· · Quamina Diary (9 more)

By .

The opinions expressed here
are my own, and no other party
necessarily agrees with them.

A full disclosure of my
professional interests is
on the author page.

I’m on Mastodon!