# writing an amino code-generator, #2: it talks!
it talks, it walks, and it shreds the benchmarks, being up to 25 times faster than its amino (reflect) counterpart.
as promised, here's the continuation of my tales of my development on tomino (here's the first, if you missed it). as always, i started off very optimistic on where i could get to in a few days; and indeed, the first couple of days of development have given me enough satisfaction to keep me going. but now, it's time to actually work towards implementing most of the features to get it to an mvp; and it's taking a small while, especially as bursts of development naturally have to alternate with personal life and the commitments of real work - when working on a weekend side-project.
## where i'm at
just about a week ago, i wrote the first compatibility test: it marshals the same struct with both amino and tomino, and checks whether tomino spits out the same bytes as amino. there's been a bunch of fixes involved, obviously, but it's safe to say that tomino is correctly generating binary data from at least some structs.
i'm currently lacking two important data types to correctly encode:
- packed repeated values: these are arrays of scalar values, like `[]int`, which can be encoded more efficiently on the wire.
- interfaces: these will be a bit tricky, especially as my idea for the implementation is to have this work entirely without reflection, and only with a handful of type assertions - ie., at compile-time, we know the full set of types that can be used in that interface value, and generate our code based on that.
that said, the rest of the supported values are encoded (mostly?) correctly: all integers, floats, structs, pointers, `[]byte` and `[N]byte`, and arrays/slices of unpacked values (like structs).
i'm writing the generated code to be efficient, and simple. simplicity is harder, especially as we're talking about generated code, which is synonymous with "ugly". and i agree: even after gofumpt, it's not great. but i've taken some care in adding at least some "nice" debugging features, like pointing out the field numbers, and "decomposing" the tag if we can do it for free, as follows:
```go
b = append(b, (1<<3)|2 /* 0x0a */, byte(len(msg.Scheme)))
```
go will calculate the expression at compile time, so instead of just writing `0x0a`, we can "expand it out" to `(1<<3)|2`, ie. field number `1` with record type `2` (there's a small, WIP reference in the readme on what this refers to).
one thing i will try to maintain as a requirement is to be entirely 0-dependency, counting the standard library as a dependency, too. if you're wondering why: a lot of packages in the go standard library actually turn out to be quite the blobs of code; like `strconv`, `strings`, `time`, or `unicode`. by making sure we don't import any of these, we can easily have this code ported over into tinygo, or even the gnovm itself. (note: there's an import on `unsafe`, but it's only used to get the bits of floating points - in gno this could be changed to `math.Float64bits`. and, `unsafe` is actually part of go's spec, so i don't fully consider it a dependency :P).
plus, it's a fun challenge. it can also help us with...
## the benchmarks
i've fallen into the trap of re-inventing a wheel and then obsessing over the benchmarks once before. but what's incredible is that i'm enough of an idiot to have fallen into that trap again. what can i say. performance is addicting. nothing hacks my primate brain more than seeing the number of `ns/op` go down until it's barely even optimizable anymore. the fact that we're talking about such a narrowed-down problem, outside any real-world scenario, makes it even better. it's like a programmer's candy crush. as i'm writing this, i just finished spending a saturday afternoon obsessing over an optimization on this unfinished project, just because i couldn't put off the thought of "i wonder how much i can improve it if i...". what can i say.
anyway! the benchmarks look good. here's a summary of the findings on the readme. in the best-case scenario, the marshalers vastly outperform amino's, not having to deal with a lot of reflection, and using very dumb code that the compiler can easily optimize. the improvement factors vary depending on the input, and at the end of the day these benchmarks are not final, because there will likely be steps involved in making this work for gno's current codebase; which is the real place where i can actually test this against a real-world scenario.
```
goos: linux
goarch: amd64
pkg: github.com/thehowl/tomino/tests
cpu: AMD Ryzen 7 7840U w/ Radeon 780M Graphics
BenchmarkMarshalers/tomino/bytes_1_000            7047403     144.5 ns/op     1088 B/op     2 allocs/op
BenchmarkMarshalers/tomino/bytes_1_000_000          14998     71262 ns/op  1007680 B/op     2 allocs/op
BenchmarkMarshalers/tomino/empty                 43339066     27.72 ns/op       64 B/op     1 allocs/op
BenchmarkMarshalers/tomino/fixed                 42246316     28.32 ns/op       64 B/op     1 allocs/op
BenchmarkMarshalers/tomino/ptr_-1337             40327893     29.70 ns/op       64 B/op     1 allocs/op
BenchmarkMarshalers/tomino/slice                 24488661     43.95 ns/op       64 B/op     1 allocs/op
BenchmarkMarshalers/tomino/time_duration         24913022     41.78 ns/op       64 B/op     1 allocs/op
BenchmarkMarshalers/tomino_prealloc/bytes_1_000  69462626     17.07 ns/op        0 B/op     0 allocs/op
BenchmarkMarshalers/tomino_prealloc/bytes_1_000_000 15192     71695 ns/op  1007616 B/op     1 allocs/op
BenchmarkMarshalers/tomino_prealloc/empty       195221311     6.091 ns/op        0 B/op     0 allocs/op
BenchmarkMarshalers/tomino_prealloc/fixed       176251285     6.773 ns/op        0 B/op     0 allocs/op
BenchmarkMarshalers/tomino_prealloc/ptr_-1337   147158512     8.137 ns/op        0 B/op     0 allocs/op
BenchmarkMarshalers/tomino_prealloc/slice        47009396     21.80 ns/op        0 B/op     0 allocs/op
BenchmarkMarshalers/tomino_prealloc/time_duration 59409352    20.14 ns/op        0 B/op     0 allocs/op
BenchmarkMarshalers/amino/bytes_1_000             1319538     891.0 ns/op     2512 B/op    16 allocs/op
BenchmarkMarshalers/amino/bytes_1_000_000            8067    143878 ns/op  2015696 B/op    16 allocs/op
BenchmarkMarshalers/amino/empty                   2431149     495.1 ns/op      432 B/op    12 allocs/op
BenchmarkMarshalers/amino/fixed                   2108006     563.4 ns/op      528 B/op    15 allocs/op
BenchmarkMarshalers/amino/ptr_-1337               2021954     575.1 ns/op      528 B/op    15 allocs/op
BenchmarkMarshalers/amino/slice                   1005511      1182 ns/op     1168 B/op    35 allocs/op
BenchmarkMarshalers/amino/time_duration           1615058     739.3 ns/op      752 B/op    21 allocs/op
PASS
ok  	github.com/thehowl/tomino/tests	38.245s
```
i've had to coerce my brain into not trying to overengineer this, as the code is so much in flux anyway. but seeing these numbers, and the comparison against the amino baseline, is literal heroin for my brain.
## more code generation
one step i'm considering, which we'll need in order to have this as a "drop-in" replacement for amino, is to add a couple more "systems":
- one that can automatically "ingest" types to be processed, based on the statically parsed initialization calls to `amino.RegisterPackage`.
  - this is useful in general, as it allows us to use this in place of the current mechanism, where specific names of types are passed to `tomgen` (the tomino code generator) for generation. it works for now, but i want this to work even if we don't use it in gno for creating clients in other languages. this way we can do static analysis on gno's codebase and just generate client marshalers out of the box.
- one that can create "translator methods" from the types registered with amino, to their counterparts as tomino messages.
  - ie., `tomgen` always generates a new go type to encode the data, rather than using the existing ones. there are a variety of reasons for this; partly, it's so that you can use the resulting generated code without importing the entire tm2 codebase (as is the case for amino). but it's also to prove the concept that the IR (and thus, resulting code) is completely separate from the original source, and doesn't need to be intertwined.
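as a rough illustration of the first system, here's a sketch using `go/ast`, with a made-up source snippet; the real implementation would presumably walk entire packages rather than a single string:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// countRegisterPackageCalls parses a go source file and counts calls
// of the form amino.RegisterPackage(...) - the anchor a tool could
// use to discover which types to process.
func countRegisterPackageCalls(filename, src string) int {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		panic(err)
	}
	found := 0
	ast.Inspect(f, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		if id, ok := sel.X.(*ast.Ident); ok && id.Name == "amino" && sel.Sel.Name == "RegisterPackage" {
			found++
		}
		return true
	})
	return found
}

func main() {
	// a made-up source snippet with one call to detect.
	src := `package foo

import "github.com/gnolang/gno/tm2/pkg/amino"

var Package = amino.RegisterPackage(amino.NewPackage("foo", "f", nil))
`
	fmt.Println("RegisterPackage calls found:", countRegisterPackageCalls("foo.go", src))
}
```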
the last one is a bit tricky, and one where i haven't figured out the implementation details. i was trying to tell you my current idea as i was writing this, but then i scrapped it and came up with something better, and then scrapped that idea once again. i can talk about it another time.
the job of the translator will mostly be to convert types, in what should mostly be "cheap" go conversions; but also to add some validation that exists in amino, and which i haven't added to the tomino-generated code.
for instance, amino has special handling for `time.Duration` and `time.Time`. but in tomino, the IR generator just converts them to structs; ie. how they're eventually encoded over the wire. this simplifies writing un/marshalers quite a bit. but it means the generated code is currently missing the validation amino performs for these fields; and we cannot directly use the types - so that's where the translator will come in. it should allow us to completely "close the circle" - and allow for an easy transition, if we wanted to do it in the gno codebase.
another big, important part will be to add a json marshaler and unmarshaler, in order to work with the so-called "amino json". nothing that wasn't done before. this'll come after binary parsing though.
## meta
meta side note: you may have noticed i've been a bit slower with the blog posts here as well. it's been a busy week, and i didn't seem to get a full two straight hours to do gno software development; let alone to write a blog post, which is not that high on the list of priorities.
i'll be experimenting to see where i can strike a balance in how many posts i can write per week, so i can schedule them up-front. my life is a perpetual pendulum between saying "fuck it, we ball" and publishing as a post three paragraphs i've written over the course of 30 minutes, and then spending a good, scattered 3 hours writing up this post. i'm trying to be less of a perfectionist, but it's hard. look, i've just written 1,763 words. that's not what i meant with an "informal bite". it is what it is. all i wanted to say is: hopefully see you soon, and at some point i'll hopefully have struck the right balance in the frequency of these posts that works for me!