Gold Master Testing


Automatically Validating Millions of Data Points


Eyelids closed, gold sun shines on / The world’s coated in the gold Krylon. –Macklemore and Ryan Lewis, featuring Eighty4 Fly

At Code Climate, we feel it’s critical to deliver dependable and accurate static analysis results. To do so, we employ a variety of quality assurance techniques, including unit tests, acceptance tests, manual testing, and incremental rollouts. All of these are valuable, but they still left too much risk of introducing hard-to-detect bugs. To fill the gap, we’ve added a new tactic to our arsenal: gold master testing.

Gold master testing refers to capturing the result of a process, and then comparing future runs against the saved “gold master” (or known good) version to discover unexpected changes. For us, that means running full Code Climate analyses of a number of open source repos, and validating every aspect of the result. We only started doing this last week, but it’s already caught some hard-to-detect bugs that we otherwise may not have discovered until code hit production.

Why gold master testing?

Gold master testing is common when working with legacy code. Rather than trying to specify all of the logical paths through an untested module, you can feed it a varied set of inputs and turn the captured outputs into automatically verified tests. There’s no guarantee the outputs are correct in this case, but at least you can be sure they don’t change (which, in some systems, is even more important).
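As a minimal illustration of that legacy-code workflow (this is a sketch, not Code Climate’s code — the pricing function and file name are invented for the example), a characterization harness can be as small as: run fixed inputs through the function, save the outputs the first time, and compare on every run after that.

```ruby
require "json"
require "tmpdir"

# Hypothetical legacy function whose exact rules are undocumented.
def legacy_price(quantity, unit_cost)
  total = quantity * unit_cost
  total -= total * 0.1 if quantity > 10 # mystery bulk discount
  total.round(2)
end

# Run a fixed set of inputs through the function and compare the outputs
# against the saved gold master, capturing it on the first run.
def check_gold_master(path)
  actual = (1..20).map { |q| legacy_price(q, 9.99) }

  if File.exist?(path)
    JSON.parse(File.read(path)) == actual ? :pass : :fail
  else
    File.write(path, JSON.generate(actual))
    :saved
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "gold_master.json")
  check_gold_master(path) # first run captures the gold master
  check_gold_master(path) # later runs compare against it
end
```

Nothing here asserts the prices are *right* — only that they stop changing without someone noticing.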

For us, given that we have a relatively reliable and comprehensive set of unit tests for our analysis code, the situation is a bit different. In short, we find gold master testing valuable because of three key factors:

  • The inputs to and outputs from our analysis are extremely detailed. There are a huge number of syntactic structures in Ruby, and we derive a ton of information from them.
  • Our analysis depends on external code that we do not control but do want to update from time to time (e.g. RubyParser).
  • We are extremely sensitive to any changes in results. For example, even a tiny variance in our detection of complex methods across the 20k repositories we analyze would ripple into changes of class ratings, resulting in incorrect metrics being delivered to our customers.

These add up to mean that traditional unit and acceptance testing is necessary but not sufficient. We use unit and acceptance tests to provide faster results and more localized detection of regressions, but we use our gold master suite (nicknamed Krylon) to sanity check our results against a dozen or so repositories before deploying changes.

How to implement gold master testing

The high level plan is pretty straightforward:

  1. Choose (or randomly generate, using a known seed) a set of inputs for your module or program.
  2. Run the inputs through a known-good version of the system, persisting the output.
  3. When testing a change, run the same inputs through the new version of the system and flag any output variation.
  4. For each variation, have a human determine whether or not the change is expected and desirable. If it is, update the persisted gold master records.

The devil is in the details, of course. In particular, if the outputs of your system are non-trivial (in our case, a set of MongoDB documents spanning multiple collections), persisting them can be a little tricky. We could keep them in MongoDB, of course, but that would not make them as accessible to humans (and to tools like diff and GitHub) as a plain-text format like JSON would. So I wrote a little bit of code to dump records out as JSON:

dir = "krylon/#{slug}"
repo_id = Repo.create!(url: "git://#{slug}").id

%w[smells constants etc.].each do |coll|"#{dir}/#{coll}.json", "w") do |f|
    docs = db[coll].find(repo_id: repo_id).map do |doc|
      # Drop fields that vary on every run (IDs, timestamps)
      doc.except(*ignored_fields)
    end

    sorted_docs = JSON.parse(docs.sort_by(&:to_json).to_json)
    f.puts JSON.pretty_generate(sorted_docs)
  end
end

Then there is the matter of comparing the results of a test run against the gold master. Ruby has a lot of built-in functionality that makes this relatively easy, but it took a few tries to get a harness set up properly. We ended up with something like this:

dir = "krylon/#{slug}"
repo_id = Repo.create!(url: "git://#{slug}").id

%w[smells constants etc.].each do |coll|
  actual_docs = db[coll].find(repo_id: repo_id).to_a
  expected_docs = JSON.parse("#{dir}/#{coll}.json"))

  actual_docs.each do |actual|
    actual = JSON.parse(actual.to_json).except(*ignored_fields)

    if (index = expected_docs.index(actual))
      # Delete the match so it can only match one time
      expected_docs.delete_at(index)
    else
      puts "Unable to find match:"
      puts JSON.pretty_generate(JSON.parse(actual.to_json))
      puts "Expected:"
      puts JSON.pretty_generate(JSON.parse(expected_docs.to_json))
    end
  end

  if expected_docs.empty?
    puts "    PASS #{coll} (#{actual_docs.count} docs)"
  else
    puts "Expected not empty after search. Remaining:"
    puts JSON.pretty_generate(JSON.parse(expected_docs.to_json))
  end
end

All of this is invoked by a couple of Rake tasks:

rake krylon:save[brynary/rack-test] # Save the results to disk
rake krylon:validate[brynary/rack-test] # Validate against the gold master

Our CI system runs the rake krylon:validate task. If it fails, someone on the Code Climate team reviews the results, and either fixes an issue or uses rake krylon:save to update the gold master.


In building Krylon, we ran into a few issues. They were all pretty simple to fix, but I’ll list them here to hopefully save someone some time:

  • Floats – Floating point numbers cannot be reliably compared using the equality operator. We took the approach of rounding them to two decimal places, and that has been working so far.
  • Timestamps – Fields like created_at and updated_at will vary every time your code runs. We just exclude them.
  • Record IDs – Same as above.
  • Non-deterministic ordering of hash keys and arrays – This took a bit more time to track down. Sometimes Code Climate would generate hashes or arrays, but the order of those data structures was undefined and variable. We had two choices: update the Krylon validation code to allow this, or make them deterministic. We went with updating the production code to be deterministic with respect to order because it was simple.

Wrapping up

Gold master testing is not a substitute for unit tests and acceptance tests. However, it can be a valuable tool in your toolbox for dealing with legacy systems, as well as certain specialized cases. It’s a fancy name, but implementing a basic system took less than a day and began yielding benefits right away. Like us, you can start with something simple and rough, and iterate it down the road.
