# writing an amino code-generator, #2: it talks!
it talks, it walks, and it shreds the benchmarks, being up to 25 times faster than its amino (reflect) counterpart.
as promised, here's the continuation of my tales of my development on tomino (here's the first, if you missed it). as always, i started off very optimistic on where i could get to in a few days; and indeed, the first couple of days of development have given me enough satisfaction to keep me going. but now, it's time to actually work towards implementing most of the features to get it to an mvp; and it's taking a small while, especially as bursts of development naturally have to alternate with personal life and the commitments of real work - when working on a weekend side-project.
## where i'm at
just about a week ago, i wrote the first compatibility test: it marshals the same struct with both amino and tomino, and checks whether tomino spits out the same bytes as amino. there's been a bunch of fixes involved, obviously, but it's safe to say that tomino is correctly generating binary data from at least some structs.
i'm currently lacking two important data types to correctly encode:
- packed repeated values: these are arrays of scalar values, like `[]int`, which can be encoded more efficiently on the wire.
- interfaces: these will be a bit tricky, especially as my idea for the implementation is to have this work entirely without reflection, and only with a handful of type assertions - ie., at compile-time, we know the full set of types that can be used in that interface value, and generate our code based on that.
that said, the rest of the supported values are encoded (mostly?) correctly: all integers, floats, structs, pointers, `[]byte` and `[N]byte`, and arrays/slices of unpacked values (like structs).
i'm writing the generated code to be efficient, and simple. simplicity is harder, especially as we're talking about generated code, which is synonymous with "ugly". and i agree: even after gofumpt, it's not great. but i've taken some care in adding at least some "nice" debugging features, like pointing out the field numbers, and "decomposing" the tag if we can do it for free, as follows:
```go
b = append(b, (1<<3)|2 /* 0x0a */, byte(len(msg.Scheme)))
```
go will calculate the expression at compile time, so instead of just writing `0x0a`, we can "expand it out" to `(1<<3)|2`, ie. field number `1` with record type `2` (there's a small, WIP reference in the readme on what this refers to).
one thing i will try to maintain as a requirement is to be entirely 0-dependency, counting the standard library as a dependency, too. if you're wondering why: a lot of packages in the go standard library actually turn out to be quite the blobs of code; like `strconv`, `strings`, `time`, or `unicode`. by making sure we don't import any of these, we can easily have this code ported over into tinygo, or even the gnovm itself. (note: there's an import on `unsafe`, but it's only used to get the bits of floating points - in gno this could be changed to `math.Float64bits`. and, `unsafe` is actually part of go's spec, so i don't fully consider it a dependency :P).
plus, it's a fun challenge. it can also help us with...
## the benchmarks
i've fallen into the trap of re-inventing a wheel and then obsessing over the benchmarks once before. but what's incredible is that i'm enough of an idiot to have fallen into that trap again. what can i say. performance is addicting. nothing hacks my primate brain more than seeing the number of `ns/op` go down until it's barely even optimizable anymore. the fact that we're talking about such a narrowed-down problem, outside any real-world scenario, makes it even better. it's like a programmer's candy crush. as i'm writing this, i just finished spending a saturday afternoon obsessing over an optimization on this unfinished project, just because i couldn't put off the thought of "i wonder how much i can improve it if i...". what can i say.
anyway! the benchmarks look good. here's a summary of the findings on the readme. in the best-case scenario, the marshalers vastly outperform amino's, not having to deal with a lot of reflection, and using very dumb code that the compiler can easily optimize. the improvement factors vary depending on the input, and at the end of the day these benchmarks are not final, because there will likely be steps involved in making this work for gno's current codebase; which is the real place where i can actually test this against a real-world scenario.
```
goos: linux
goarch: amd64
pkg: github.com/thehowl/tomino/tests
cpu: AMD Ryzen 7 7840U w/ Radeon 780M Graphics
BenchmarkMarshalers/tomino/bytes_1_000            7047403     144.5 ns/op     1088 B/op     2 allocs/op
BenchmarkMarshalers/tomino/bytes_1_000_000          14998     71262 ns/op  1007680 B/op     2 allocs/op
BenchmarkMarshalers/tomino/empty                 43339066     27.72 ns/op       64 B/op     1 allocs/op
BenchmarkMarshalers/tomino/fixed                 42246316     28.32 ns/op       64 B/op     1 allocs/op
BenchmarkMarshalers/tomino/ptr_-1337             40327893     29.70 ns/op       64 B/op     1 allocs/op
BenchmarkMarshalers/tomino/slice                 24488661     43.95 ns/op       64 B/op     1 allocs/op
BenchmarkMarshalers/tomino/time_duration         24913022     41.78 ns/op       64 B/op     1 allocs/op
BenchmarkMarshalers/tomino_prealloc/bytes_1_000  69462626     17.07 ns/op        0 B/op     0 allocs/op
BenchmarkMarshalers/tomino_prealloc/bytes_1_000_000 15192     71695 ns/op  1007616 B/op     1 allocs/op
BenchmarkMarshalers/tomino_prealloc/empty       195221311     6.091 ns/op        0 B/op     0 allocs/op
BenchmarkMarshalers/tomino_prealloc/fixed       176251285     6.773 ns/op        0 B/op     0 allocs/op
BenchmarkMarshalers/tomino_prealloc/ptr_-1337   147158512     8.137 ns/op        0 B/op     0 allocs/op
BenchmarkMarshalers/tomino_prealloc/slice        47009396     21.80 ns/op        0 B/op     0 allocs/op
BenchmarkMarshalers/tomino_prealloc/time_duration 59409352    20.14 ns/op        0 B/op     0 allocs/op
BenchmarkMarshalers/amino/bytes_1_000             1319538     891.0 ns/op     2512 B/op    16 allocs/op
BenchmarkMarshalers/amino/bytes_1_000_000            8067    143878 ns/op  2015696 B/op    16 allocs/op
BenchmarkMarshalers/amino/empty                   2431149     495.1 ns/op      432 B/op    12 allocs/op
BenchmarkMarshalers/amino/fixed                   2108006     563.4 ns/op      528 B/op    15 allocs/op
BenchmarkMarshalers/amino/ptr_-1337               2021954     575.1 ns/op      528 B/op    15 allocs/op
BenchmarkMarshalers/amino/slice                   1005511      1182 ns/op     1168 B/op    35 allocs/op
BenchmarkMarshalers/amino/time_duration           1615058     739.3 ns/op      752 B/op    21 allocs/op
PASS
ok  	github.com/thehowl/tomino/tests	38.245s
```
i've had to coerce my brain into not trying to overengineer this, as the code is so much in flux anyway. but seeing these numbers, and the comparison against the amino baseline, is literal heroin for my brain.
## more code generation
one step i'm considering, which we'll need in order to have this as a "drop-in" replacement for amino, is to add a couple more "systems":
- one that can automatically "ingest" types to be processed, based on the statically parsed initialization calls to `amino.RegisterPackage`.
  - this is useful in general, as it allows us to use this in place of the current mechanism, where specific names of types are passed to `tomgen` (the tomino code generator) for generation. it works for now, but i want this to work even if we don't use it in gno for creating clients in other languages. this way we can do static analysis on gno's codebase and just generate client marshalers out of the box.
- one that can create "translator methods" from the types registered with amino, to their counterparts as tomino messages.
  - ie., `tomgen` always generates a new go type to encode the data, rather than using the existing ones. there are a variety of reasons for this; partly, it's so that you can use the resulting generated code without importing the entire tm2 codebase (as is the case for amino). but it's also to prove the concept that the IR (and thus, resulting code) is completely separate from the original source, and doesn't need to be intertwined.
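as a rough illustration of the first system, here's a sketch using `go/ast`, with a made-up source snippet; the real implementation would presumably walk entire packages rather than a single string:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// countRegisterPackageCalls parses a go source file and counts calls
// of the form amino.RegisterPackage(...) - the anchor a tool could
// use to discover which types to process.
func countRegisterPackageCalls(filename, src string) int {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		panic(err)
	}
	found := 0
	ast.Inspect(f, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		if id, ok := sel.X.(*ast.Ident); ok && id.Name == "amino" && sel.Sel.Name == "RegisterPackage" {
			found++
		}
		return true
	})
	return found
}

func main() {
	// a made-up source snippet with one call to detect.
	src := `package foo

import "github.com/gnolang/gno/tm2/pkg/amino"

var Package = amino.RegisterPackage(amino.NewPackage("foo", "f", nil))
`
	fmt.Println("RegisterPackage calls found:", countRegisterPackageCalls("foo.go", src))
}
```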
the last one is a bit tricky, and one where i haven't figured out the implementation details. i was trying to tell you my current idea as i was writing this, but then i scrapped it and came up with something better, and then scrapped that idea once again. i can talk about it another time.
the job of the translator will mostly be to convert types, in what should mostly be "cheap" go conversions; but also to add some validation that exists in amino, and which i haven't added to the tomino-generated code.
for instance, amino has special handling for `time.Duration` and `time.Time`. but in tomino, the IR generator just converts them to structs; ie. how they're eventually encoded over the wire. this simplifies writing un/marshalers quite a bit. but it means the generated code is currently missing the validation amino performs for these fields; and we cannot directly use the types - so that's where the translator will come in. it should allow us to completely "close the circle" - and allow for an easy transition, if we wanted to do it in the gno codebase.
another big, important part will be to add a json marshaler and unmarshaler, in order to work with the so-called "amino json". nothing that wasn't done before. this'll come after binary parsing though.
## meta
meta side note: you may have noticed i've been a bit slower with the blog posts here as well. it's been a busy week, and i didn't seem to get a full two straight hours to do gno software development; let alone to write a blog post, which is not that high on the list of priorities.
i'll be experimenting to see where i can strike a balance in how many posts i can write per week, so i can schedule them up-front. my life is a perpetual pendulum between saying "fuck it, we ball" and publishing as a post three paragraphs i've written over the course of 30 minutes, and then spending a good, scattered 3 hours writing up this post. i'm trying to be less of a perfectionist, but it's hard. look, i've just written 1,763 words. that's not what i meant with an "informal bite". it is what it is. all i wanted to say is: hopefully see you soon, and at some point i'll hopefully have struck the right balance in the frequency of these posts that works for me!