test smart: coverage is a measure, not a target

code coverage stats are pointless if the underlying tests make no attempt at being intelligent. the point is preventing bugs and creating maintainability, not adding inconvenience.

do you know about goodhart's law? it is generally summarized as follows:

when a measure becomes a target, it ceases to be a good measure

we see this everywhere throughout our world, especially as a source of inefficiency in large organizations, like corporations and governments. especially when measures (now targets) get tied to monetary incentives.

say, for instance, that you wanted to measure the effectiveness of a university by measuring the % of students who eventually graduate. leaving aside the obvious flaws with this approach (say, that the same group of 100 students would receive vastly different grades if they took similar exams at different universities), you could attempt to use this as a measure to compare universities against one another.

if you were to publish this ranking, that would already be problematic, because the universities would change their behaviour as a direct consequence of being "observed" and ranked. but it gets even worse if you tie the % of graduating students to public funding. what's the worst that could happen?

that no student ever fails an exam or fails to graduate (except in extreme cases), because that would look bad in the stats and earn the university less funding. the university's incentive is no longer teaching and producing as many good {professionals,scientists,researchers,...} as possible; it is spitting out as many diplomas as fast as possible. incentives matter.

this is, in my opinion, a problem that maps closely onto how we test our code. testing needs to be done in a way that is smart and understands why we test in the first place; rather than blindly covering more test cases because number go up and that looks good on a github readme.

code coverage is a measure that shows, in general, how many branches of our code get touched. it is a good measure in principle: if the tests are intelligent to begin with, it is a good way to make sure that every possible scenario in our code is exercised at least once. but good coverage is not the goal; good tests which prevent bugs and make our codebases more maintainable are.

i originally wanted to make a post showing a variety of examples, but that's likely to take some time and i don't have all the examples at my fingertips. so instead, i'll start "test smart" as a series and gradually show you some other examples as they come up. the anti-example of today:

exact re-implementations

this is a test that was proposed for testing the String() method:

func TestKindString(t *testing.T) {
	t.Parallel()

	testTable := []struct {
		desc     string
		in       Kind
		expected string
	}{
		{"InvalidKind", 0, "InvalidKind"},
		{"BoolKind", 1, "BoolKind"},
		{"StringKind", 2, "StringKind"},
		{"IntKind", 3, "IntKind"},
		{"Int8Kind", 4, "Int8Kind"},
		{"Int16Kind", 5, "Int16Kind"},
		{"Int32Kind", 6, "Int32Kind"},
		{"Int64Kind", 7, "Int64Kind"},
		{"UintKind", 8, "UintKind"},
		{"Uint8Kind", 9, "Uint8Kind"},
		{"Uint16Kind", 10, "Uint16Kind"},
		{"Uint32Kind", 11, "Uint32Kind"},
		{"Uint64Kind", 12, "Uint64Kind"},
		{"Float32Kind", 13, "Float32Kind"},
		{"Float64Kind", 14, "Float64Kind"},
		{"BigintKind", 15, "BigintKind"},
		{"BigdecKind", 16, "BigdecKind"},
		{"ArrayKind", 17, "ArrayKind"},
		{"SliceKind", 18, "SliceKind"},
		{"PointerKind", 19, "PointerKind"},
		{"StructKind", 20, "StructKind"},
		{"PackageKind", 21, "PackageKind"},
		{"InterfaceKind", 22, "InterfaceKind"},
		{"ChanKind", 23, "ChanKind"},
		{"FuncKind", 24, "FuncKind"},
		{"MapKind", 25, "MapKind"},
		{"TypeKind", 26, "TypeKind"},
		{"BlockKind", 27, "BlockKind"},
		{"TupleKind", 28, "TupleKind"},
		{"RefTypeKind", 29, "RefTypeKind"},
		{"Kind(30)", 30, "Kind(30)"},
		{"Kind(31)", 31, "Kind(31)"},
		{"Kind(32)", 32, "Kind(32)"},
	}

	for _, tt := range testTable {
		tt := tt
		t.Run(tt.desc, func(t *testing.T) {
			t.Parallel()

			actual := tt.in.String()
			assert.Equal(t, tt.expected, actual)
		})
	}
}
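
for context, here's a simplified sketch of what stringer generates for a type like this (cut down to three kinds here; the real generated file covers all of them, with the same one-big-string-plus-index-table shape):

```go
package main

import (
	"fmt"
	"strconv"
)

// Kind is a cut-down stand-in for the real type (three values instead of ~30).
type Kind uint

const (
	InvalidKind Kind = iota
	BoolKind
	StringKind
)

// stringer emits one concatenated string plus an index table into it.
const _Kind_name = "InvalidKindBoolKindStringKind"

var _Kind_index = [...]uint8{0, 11, 19, 29}

func (i Kind) String() string {
	// values past the table fall back to the "Kind(N)" form.
	if i >= Kind(len(_Kind_index)-1) {
		return "Kind(" + strconv.FormatInt(int64(i), 10) + ")"
	}
	return _Kind_name[_Kind_index[i]:_Kind_index[i+1]]
}

func main() {
	fmt.Println(BoolKind, Kind(42)) // slices _Kind_name; falls back past the table
}
```

the proposed test above is, line by line, a manual re-statement of exactly this table. it can never catch a bug in the generator's output; it can only go stale.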

this test is terrible: it makes life considerably more miserable for anyone who wants to modify Kind, the underlying type whose String() method is being tested. aside from re-generating code with the stringer utility, you also have to update this test, which is NOT code-generated.

it also adds no significant information, and does not help in catching bad changes or in protecting common use cases.

a good test here is generally one which verifies that the generated content is still up to date. as an example, we have this in our build_template CI workflow:

- name: Check generated files are up to date
  working-directory: ${{ inputs.modulepath }}
  run: |
    go generate -x ./...
    if [ "$(git status -s)" != "" ]; then
      echo "command 'go generate' creates files that differ from the git tree; please run 'go generate' and commit:"
      git status -s
      exit 1
    fi

the responsibility to test the code here lies with stringer (or, more generally, with the code generator). we only need to check that the author of a PR re-generated their code if necessary.
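
(for reference, the generation itself is wired up with a directive next to the type definition, something along these lines:)

```go
// Running `go generate ./...` re-runs stringer and rewrites kind_string.go,
// which the CI check above then diffs against the committed tree.
//go:generate stringer -type=Kind
type Kind uint
```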

note: there can be good tests which check whether the stringer result has changed. one example is when the string is used somewhere else for parsing, and that parser should be updated in lockstep; another is when a change to the result risks creating a backward incompatibility. in that case, this is an acceptable test:

func TestKindStringHash(t *testing.T) {
	const want = "aac0e7c422619b641e647f1d1956e2394946388d1497d08dbb35854dbabdf9a1"

	hash := sha256.Sum256([]byte(fmt.Sprintf("%v%s", _Kind_index, _Kind_name)))
	res := fmt.Sprintf("%x", hash)

	assert.Equal(t,
		want, res,
		"Kind.String() changed. @/my/parser/parser.go should be changed accordingly, so that it can parse correctly. "+
			"When done, update TestKindStringHash with this hash: %v", res)
}

though, of course, the best test is one which feeds various results of the Stringer into the parser and verifies that they round-trip.
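
a minimal, self-contained sketch of that round-trip idea, with a simplified Kind and a hypothetical ParseKind standing in for whatever the real parser package exposes:

```go
package main

import (
	"fmt"
	"strconv"
)

// Simplified Kind again; the real test would range over every real value.
type Kind uint

const (
	InvalidKind Kind = iota
	BoolKind
	StringKind
	numKinds // sentinel, not a real kind
)

func (i Kind) String() string {
	switch i {
	case InvalidKind:
		return "InvalidKind"
	case BoolKind:
		return "BoolKind"
	case StringKind:
		return "StringKind"
	}
	return "Kind(" + strconv.FormatUint(uint64(i), 10) + ")"
}

// ParseKind is hypothetical: a stand-in for the real parser's entry point.
func ParseKind(s string) (Kind, error) {
	for k := InvalidKind; k < numKinds; k++ {
		if k.String() == s {
			return k, nil
		}
	}
	return InvalidKind, fmt.Errorf("unknown kind %q", s)
}

func main() {
	// The actual check: every Kind must survive String() -> ParseKind.
	for k := InvalidKind; k < numKinds; k++ {
		got, err := ParseKind(k.String())
		if err != nil || got != k {
			panic(fmt.Sprintf("round trip failed for %v", k))
		}
	}
	fmt.Println("all kinds round-trip")
}
```

unlike the re-implementation at the top of this post, this keeps working unchanged when new kinds are added, and it actually exercises the relationship we care about: the printer and the parser agreeing.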
