Known Good Testing

By: Bryan Helmkamp
February 20, 2014

Automatically Validating Millions of Data Points

At Code Climate, we feel it’s critical to deliver dependable and accurate static analysis results. To do so, we employ a variety of quality assurance techniques, including unit tests, acceptance tests, manual testing and incremental rollouts. They are all valuable, but we still had too much risk of introducing hard-to-detect bugs. To fill the gap, we’ve added a new tactic to our arsenal: known good testing.

"Known good" testing refers to capturing the result of a process, and then comparing future runs against the saved or known good version to discover unexpected changes. For us, that means running full Code Climate analyses of a number of open source repos, and validating every aspect of the result. We only started doing this last week, but it’s already caught some hard-to-detect bugs that we otherwise may not have discovered until code hit production.

Why known good testing?

Known good testing is common when working with legacy code. Rather than trying to specify all of the logical paths through an untested module, you can feed it a varied set of inputs and capture the outputs as automated regression tests. There’s no guarantee the outputs are correct in this case, but at least you can be sure they don’t change (which, in some systems, is even more important).
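
For example, a minimal harness along these lines might record outputs on the first run and compare against them afterwards. None of this code is from the post; legacy_price_for, the inputs, and the snapshot path are invented names used purely for illustration:

require "json"

# Hypothetical legacy function whose behavior we want to pin down; the name,
# inputs, and snapshot path are all invented for illustration.
INPUTS = [1, 2, 10, 999]
SNAPSHOT = "spec/fixtures/legacy_price_for.json"

if File.exist?(SNAPSHOT)
  # Subsequent runs: compare current outputs against the saved known good version.
  expected = JSON.parse(File.read(SNAPSHOT))
  INPUTS.each do |input|
    actual = legacy_price_for(input)
    if actual != expected[input.to_s]
      raise "Output changed for #{input}: got #{actual.inspect}, expected #{expected[input.to_s].inspect}"
    end
  end
else
  # First run: record whatever the code does today as the known good version.
  outputs = INPUTS.each_with_object({}) do |input, hash|
    hash[input.to_s] = legacy_price_for(input)
  end
  File.write(SNAPSHOT, JSON.pretty_generate(outputs))
end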

For us, given that we have a relatively reliable and comprehensive set of unit tests for our analysis code, the situation is a bit different. In short, we find known good testing valuable because of three key factors:

  • The inputs and outputs of our analysis are extremely detailed. There are a huge number of syntactical structures in Ruby, and we derive a ton of information from them.
  • Our analysis depends on external code that we do not control but do want to update from time to time (e.g. RubyParser).
  • We are extremely sensitive to any changes in results. For example, even a tiny variance in our detection of complex methods across the 20k repositories we analyze would ripple into changes of class ratings, resulting in incorrect metrics being delivered to our customers.

These add up to mean that traditional unit and acceptance testing is necessary but not sufficient. We use unit and acceptance tests to provide faster results and more localized detection of regressions, but we use our known good suite (nicknamed Krylon) to sanity check our results against a dozen or so repositories before deploying changes.

How to implement known good testing

The high-level plan is pretty straightforward:

  1. Choose (or randomly generate, using a known seed) a set of inputs for your module or program.
  2. Run the inputs through a known-good version of the system, persisting the output.
  3. When testing a change, run the same inputs through the new version of the system and flag any output variation.
  4. For each variation, have a human determine whether or not the change is expected and desirable. If it is, update the persisted known good records.

The devil is in the details, of course. In particular, if the outputs of your system are non-trivial (in our case, a set of MongoDB documents spanning multiple collections), persisting them was a little tricky. We could keep them in MongoDB, of course, but that would not make them as accessible to humans (and tools like diff and GitHub) as a plain-text format like JSON would. So I wrote a little bit of code to dump records out as JSON:

dir = "krylon/#{slug}"
repo_id = Repo.create!(url: "git://github.com/#{slug}")
run_analysis(repo_id)
FileUtils.mkdir_p(dir)

%w[smells constants etc.].each do |coll|
  File.open("#{dir}/#{coll}.json", "w") do |f|
    docs = db[coll].find(repo_id: repo_id).map do |doc|
      round_floats(doc.except(*ignored_fields))
    end

    sorted_docs = JSON.parse(docs.sort_by(&:to_json).to_json)
    f.puts JSON.pretty_generate(sorted_docs)
  end
end

Then there is the matter of comparing the results of a test run against the known good version. Ruby has a lot of built-in functionality that makes this relatively easy, but it took a few tries to get a harness set up properly. We ended up with something like this:

dir = "krylon/#{slug}"
repo_id = Repo.create!(url: "git://github.com/#{slug}")
run_analysis(repo_id)

%w[smells constants etc.].each do |coll|
  actual_docs = db[coll].find(repo_id: repo_id).to_a
  expected_docs = JSON.parse(File.read("#{dir}/#{coll}.json"))

  actual_docs.each do |actual|
    actual = JSON.parse(actual.to_json).except(*ignored_fields)

    if (index = expected_docs.index(actual))
      # Delete the match so it can only match one time
      expected_docs.delete_at(index)
    else
      puts "Unable to find match:"
      puts JSON.pretty_generate(JSON.parse(actual.to_json))
      puts
      puts "Expected:"
      puts JSON.pretty_generate(JSON.parse(expected_docs.to_json))
      raise
    end
  end

  if expected_docs.empty?
    puts "    PASS #{coll} (#{actual_docs.count} docs)"
  else
    puts "Expected not empty after search. Remaining:"
    puts JSON.pretty_generate(JSON.parse(expected_docs.to_json))
    raise
  end
end

All of this is invoked by a couple of Rake tasks.
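
The task definitions aren’t reproduced in the post, so the sketch below is only a guess at their shape: the task names come from the post, but the Krylon module (a hypothetical wrapper around the snippets above) and the Rails-style :environment dependency are assumptions.

# Rakefile sketch -- only the task names come from the post; the Krylon
# module and the Rails-style :environment dependency are assumptions.
namespace :krylon do
  desc "Analyze the sample repos and persist the results as the known good version"
  task save: :environment do
    Krylon.save
  end

  desc "Analyze the sample repos and compare the results against the known good version"
  task validate: :environment do
    Krylon.validate
  end
end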

Our CI system runs the rake krylon:validate task. If it fails, someone on the Code Climate team reviews the results, and either fixes an issue or uses rake krylon:save to update the known good version.

Gotchas

In building Krylon, we ran into a few issues. They were all pretty simple to fix, but I’ll list them here to hopefully save someone some time:

  • Floats – Floating point numbers cannot be reliably compared using the equality operator. We took the approach of rounding them to two decimal places (see the sketch after this list), and that has been working so far.
  • Timestamps – Fields like created_at and updated_at will vary every time your code runs. We just exclude them.
  • Record IDs – Same as above.
  • Non-deterministic ordering of hash keys and arrays – This took a bit more time to track down. Sometimes Code Climate would generate hashes or arrays whose order was undefined and variable. We had two choices: update the Krylon validation code to allow this, or make the production code deterministic. We went with making the production code deterministic with respect to order because that was simpler.
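
Neither round_floats nor ignored_fields is shown in the post, so the helpers below are only a sketch of what they might look like; the exact field names and the two-decimal precision are assumptions based on the gotchas above.

# Assumed helpers -- neither appears in the post; field names and the
# two-decimal precision are guesses based on the gotchas above.
def ignored_fields
  %w[_id created_at updated_at]
end

# Recursively round every Float in a nested hash/array structure so values
# compare equal across runs and JSON round trips.
def round_floats(value, precision = 2)
  case value
  when Float
    value.round(precision)
  when Hash
    value.each_with_object({}) { |(key, val), hash| hash[key] = round_floats(val, precision) }
  when Array
    value.map { |val| round_floats(val, precision) }
  else
    value
  end
end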

Wrapping up

Known good testing is not a substitute for unit tests and acceptance tests. However, it can be a valuable tool in your toolbox for dealing with legacy systems, as well as certain specialized cases. It’s a fancy name, but implementing a basic system took less than a day and began yielding benefits right away. Like us, you can start with something simple and rough, and iterate on it down the road.