Goto Fail, Heartbleed, and Unit Testing Culture

Two computer security flaws were discovered in early 2014: Apple’s “goto fail” bug and OpenSSL’s “Heartbleed” bug. Both bugs had the potential for widespread and severe security failures, the full extent of which we may never know. Given their severity, it is important for the software development profession to reflect on how they could have been prevented so we can improve our ability to prevent these kinds of bugs in the future. This article considers the role unit testing could play, showing how unit tests, and more importantly a unit testing culture, could have identified these particular bugs. It goes on to look at the costs and benefits of such a culture and describes how such a culture was instilled at Google.

12 May 2014

Music student, semi-retired programmer, and former Googler.

We are publishing this article in installments. This first installment looks at applying unit testing culture to the "goto fail" bug. Later installments will consider the Heartbleed bug, the trade-offs around unit testing, and the experience of instilling a unit-testing culture at Google.

At the beginning of 2014 the security of the Internet was rocked by two serious flaws: Apple’s “goto fail” bug (CVE-2014-1266) and OpenSSL’s “Heartbleed” bug (CVE-2014-0160). Both were vulnerabilities in the Secure Sockets Layer technology upon which the majority of secure communications on the Internet relies. These bugs are as instructive as they were devastating: They were rooted in the same programmer optimism, overconfidence, and haste that strike projects of all sizes and domains.

These bugs arouse my passion because I've seen and lived the benefits of unit testing, and this strongly-imprinted experience compels me to reflect on how unit testing approaches could prevent defects as high-impact and high-profile as these SSL bugs. Unit testing is the process of looking for chunks of code that make for convenient “units” to which to apply automated Unit Tests, small programs designed to verify low-level implementation details and detect coding errors early. The nature of the defects inspired me to write my own proof-of-concept unit tests to reproduce the errors and verify their fixes. I wrote these tests to validate my intuition, and to demonstrate to others how unit tests could have detected these defects early and without heroic effort.

Writing unit tests produces benefits beyond detecting low-level coding errors. In this article, I explore the question of whether unit testing could have helped prevent the "goto fail" and Heartbleed bugs. In doing so, I hope to establish a compelling case for the adoption of unit testing as part of everyday development, so that the experience of Self Testing Code becomes universal. I offer my insights in the hope that they may help avoid similar failures in the future, in the spirit of a postmortem or project retrospective. My experience doesn't mean I'm owed deference based on mah authoritah, but I hope I've earned an audience with those open-minded enough to have a discussion on the merits of the argument.

Many popular and technical media stories have run with explanations of how these defects happened, why they slipped past existing safeguards before being so widely deployed, and what should be done to prevent such bugs from happening again. It troubles me that most of these analyses fall back on facile excuses that miss the mark, and promote resigned acceptance of such defects due to the ever-increasing complexity of modern software systems. It is as though the software industry at large, as well as the public that depends on it, is anxious to accept such failures as inevitable fate, the price we pay for the modern conveniences that technology affords us. It's the easiest possible explanation that allows us to make sense of a bad situation and move on as a society.

I don't accept such defects as inevitable. Rather, we must seize this opportunity to reflect on how we developers can do far better than rely on fate, or more funding, or any number of external factors to prevent security vulnerabilities or other high-impact defects caused by low-level coding errors. Bugs will happen, but neither software developers nor the public should be satisfied with that as a response to defects this colossal in scope. Deep, genuine reflection is difficult, and encounters a lot of resistance, as it calls on developers to accept responsibility for their human limitations—which is often a challenge to the very self-image of a programmer. That makes it all the more important to dive deeply into these two bugs in particular, to search for genuine solutions and to avoid setting a dangerous precedent: If everything in the short-term turns out OK in the wake of "goto fail" and Heartbleed, then why bother changing anything about how software development is currently done?


goto fail

The “goto fail” bug first shipped to iPhones, iPads, and AppleTVs in September 2012 with iOS 6.0, persisted into iOS 7.0 and OS X Mavericks, and was not fixed until February 2014—seventeen months after it was introduced. A short-circuit skipping the final step of the SSL/TLS handshake algorithm left users vulnerable to a man-in-the-middle attack, whereby a malicious system relaying traffic between an affected system and another system could present the illusion of a secure connection using false credentials, and subsequently intercept all communications between the other two systems.

The bug earned its name from this now infamous code snippet:

if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
    goto fail;
    goto fail;

Some have argued that all goto statements are bad, based on Edsger Dijkstra's famous essay A Case against the GO TO Statement, summarized by the popular axiom "Goto considered harmful". However, the goto fail statement embodies an idiom C programmers will recognize. Such statements pass control immediately to the end of a function in the case of unrecoverable errors, as it is at the end of a function where memory or other resources allocated by the function are properly released. Other languages have built-in support for such “abortion clauses”, as Dijkstra called them in the conclusion to his essay: destructors in C++; try/catch/finally in Java; defer()/panic()/recover() in Go; try/except/finally and the with statement in Python. In C, there is no essential problem with or confusion surrounding the use of goto in this context. In other words, goto is not considered harmful here.
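For readers who don't write C every day, here is a minimal sketch of the idiom (illustrative code of mine, not anything from Secure Transport):

#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch of the goto-cleanup idiom: every unrecoverable error
 * jumps to a single exit point where resources are released, no matter how
 * far the function got before failing. */
static int ProcessFile(const char *path) {
  int err = 0;
  FILE *f = NULL;
  char *buf = NULL;

  if ((f = fopen(path, "rb")) == NULL) {
    err = -1;
    goto fail;
  }
  if ((buf = malloc(4096)) == NULL) {
    err = -1;
    goto fail;
  }
  /* ... read from f into buf and do the real work ... */

fail:
  free(buf);  /* free(NULL) is a harmless no-op */
  if (f != NULL)
    fclose(f);
  return err;
}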

C programmers will also immediately recognize that the first goto fail statement is bound to the result of the if statement preceding it, but the second goto fail is not: The matching indentation of the two statements bears no significance in C, as surrounding curly braces are required to bind more than one statement to an if condition. If the first goto fail is not executed, the second one certainly will be. This means that the final step of the handshake algorithm will never be executed, but any exchange successfully passing this point will always produce a successful return value even if the final verification step would have failed. More plainly: The algorithm gets short-circuited by the extra goto fail statement.
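Rewriting the snippet with explicit braces shows the structure the compiler actually sees:

if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0) {
    goto fail;
}
goto fail;  /* unconditional: executed whenever the check above passes */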

Some have claimed that a coding style requiring the use of curly braces for all if statements or enabling unreachable-code compiler warnings could have helped. However, there are deeper problems with the code which unit testing could help to resolve.
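For example, clang can flag the statements following the unconditional jump when unreachable-code warnings are switched on. Note the hedges: this warning is not part of -Wall, gcc has long accepted the same flag without acting on it, and exact behavior varies by compiler version:

$ clang -Wunreachable-code -c sslKeyExchange.c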

How Could Unit Testing Have Helped?

While looking for “units” to which to apply “unit” tests, the entire block of code containing the buggy algorithm, with its cluster of conditional logic, leaps out as such a unit (from the SSLVerifySignedServerKeyExchange() function in version 55471 of Apple’s Secure Transport library):

if ((err = ReadyHash(&SSLHashSHA1, &hashCtx)) != 0)
    goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &clientRandom)) != 0)
    goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
    goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
    goto fail;
    goto fail;
if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
    goto fail;

Such a block of code is easier to test when extracted into its own function. Extracting chunks of code like this is habitual practice for people writing unit tests, and can help get parts of an existing code base under test a piece at a time. Looking closely at the variables and data types used in the algorithm makes it clear that this block of code is hashing the parameters of the handshake. By looking up the type of SSLHashSHA1, we can also see that it is an instance of a HashReference “jump table”, a structure containing function pointers that enables C programmers to implement virtual function-like behavior (i.e. substitutability and run-time polymorphism). We can extract this operation into a function with a name signifying its intent:

static OSStatus
HashHandshake(const HashReference* hashRef, SSLBuffer* clientRandom,
    SSLBuffer* serverRandom, SSLBuffer* exchangeParams,
    SSLBuffer* hashOut) {
  SSLBuffer hashCtx;
  OSStatus err = 0;
  hashCtx.data = 0;
  if ((err = ReadyHash(hashRef, &hashCtx)) != 0)
    goto fail;
  if ((err = hashRef->update(&hashCtx, clientRandom)) != 0)
    goto fail;
  if ((err = hashRef->update(&hashCtx, serverRandom)) != 0)
    goto fail;
  if ((err = hashRef->update(&hashCtx, exchangeParams)) != 0)
    goto fail;
  err = hashRef->final(&hashCtx, hashOut);
fail:
  SSLFreeBuffer(&hashCtx);
  return err;
}
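As an aside, a HashReference-style jump table is nothing more than a struct of function pointers. The following is a hypothetical sketch of mine, with fields and signatures simplified to match the extracted HashHandshake() above; Apple's actual definition in the Secure Transport sources has more members:

/* Hypothetical, simplified sketch of a jump table like HashReference. */
typedef struct {
  OSStatus (*init)(SSLBuffer *digestCtx);
  OSStatus (*update)(SSLBuffer *digestCtx, const SSLBuffer *data);
  OSStatus (*final)(SSLBuffer *digestCtx, SSLBuffer *digest);
} HashReference;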

Now the series of statements comprising the previously buggy algorithm can be replaced by:

if ((err = HashHandshake(&SSLHashSHA1, &clientRandom, &serverRandom,
     &signedParams, &hashOut)) != 0) {
  goto fail;
}

This function is more easily understood in isolation. Faced with a self-contained function like this, a programmer can begin to focus on the external effects of the code, considering questions such as:

  • What is the contract fulfilled by the code under test?
  • What preconditions are required, and how are they enforced?
  • What postconditions are guaranteed?
  • What example inputs trigger different behaviors?
  • What set of tests will trigger each behavior and validate each guarantee?

In the case of HashHandshake(), the contract can be described as: Five steps, all must pass. Success or failure is propagated to the caller by the return value. The HashReference is expected to respond correctly to the series of calls; whether it makes use of any functions or data beyond that passed in by HashHandshake() is an implementation detail opaque to HashHandshake() itself.
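One way to pin that contract down is to record it where the next maintainer will see it, as a header comment on the extracted function. The wording below is mine, a sketch rather than a comment from the original sources:

/* Computes the handshake hash over clientRandom, serverRandom, and
 * exchangeParams using the supplied HashReference, writing the digest to
 * hashOut.
 *
 * Preconditions:  hashRef and its init/update/final members are valid, and
 *                 the buffer arguments are in the states those functions
 *                 expect.
 * Postconditions: returns 0 and fills hashOut only if all five steps
 *                 succeed; otherwise returns the first error encountered.
 *                 The hash context is freed on every path. */
static OSStatus
HashHandshake(const HashReference* hashRef, SSLBuffer* clientRandom,
    SSLBuffer* serverRandom, SSLBuffer* exchangeParams, SSLBuffer* hashOut);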

For an algorithm this straightforward, the test cases will rather closely "mirror" the implementation: One success case, five failure cases. For higher-level or more complex operations, such close "mirroring" can make for brittle tests and should generally be avoided. This is especially important to keep in mind when using mocks or other Test Doubles to test code in isolation from its collaborators.

Regardless of the scope of the code under test, it's critical to exhaustively test failure cases to the extent possible. It is tempting to test that the code does what it should do and leave it at that, but it's arguably even more important to test that it doesn't do what it shouldn't do.

Proof-of-Concept Unit Test

Although C is not an object-oriented programming language, the existing code for this algorithm exhibits a clearly object-oriented design that makes for easy unit testing once the code is extracted into its own function. The tls_digest_test.c proof-of-concept unit test shows how a HashReference stub can be used to effectively cover every path through the extracted HashHandshake() algorithm. The actual test cases look like this:

static int TestHandshakeSuccess() {
  HashHandshakeTestFixture fixture = SetUp(__func__);
  fixture.expected = SUCCESS;
  return ExecuteHandshake(fixture);
}

static int TestHandshakeInitFailure() {
  HashHandshakeTestFixture fixture = SetUp(__func__);
  fixture.expected = INIT_FAILURE;
  fixture.ref.init = HashHandshakeTestFailInit;
  return ExecuteHandshake(fixture);
}

static int TestHandshakeUpdateClientFailure() {
  HashHandshakeTestFixture fixture = SetUp(__func__);
  fixture.expected = UPDATE_CLIENT_FAILURE;
  fixture.client = FAIL_ON_EVALUATION(UPDATE_CLIENT_FAILURE);
  return ExecuteHandshake(fixture);
}

static int TestHandshakeUpdateServerFailure() {
  HashHandshakeTestFixture fixture = SetUp(__func__);
  fixture.expected = UPDATE_SERVER_FAILURE;
  fixture.server = FAIL_ON_EVALUATION(UPDATE_SERVER_FAILURE);
  return ExecuteHandshake(fixture);
}

static int TestHandshakeUpdateParamsFailure() {
  HashHandshakeTestFixture fixture = SetUp(__func__);
  fixture.expected = UPDATE_PARAMS_FAILURE;
  fixture.params = FAIL_ON_EVALUATION(UPDATE_PARAMS_FAILURE);
  return ExecuteHandshake(fixture);
}

static int TestHandshakeFinalFailure() {
  HashHandshakeTestFixture fixture = SetUp(__func__);
  fixture.expected = FINAL_FAILURE;
  fixture.ref.final = HashHandshakeTestFailFinal;
  return ExecuteHandshake(fixture);
}

The HashHandshakeTestFixture holds all the variables needed as input to the code under test and to check the expected outcome:

typedef struct
{
    HashReference ref;
    SSLBuffer *client;
    SSLBuffer *server;
    SSLBuffer *params;
    SSLBuffer *output;
    const char *test_case_name;
    enum HandshakeResult expected;
} HashHandshakeTestFixture;

SetUp() initializes all of the members of a HashHandshakeTestFixture to default values; each test case overrides only those members pertinent to that specific test case:

static HashHandshakeTestFixture SetUp(const char *test_case_name) {
  HashHandshakeTestFixture fixture;
  memset(&fixture, 0, sizeof(fixture));
  fixture.ref = SSLHashNull;
  fixture.ref.update = HashHandshakeTestUpdate;
  fixture.test_case_name = test_case_name;
  return fixture;
}
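SetUp() installs HashHandshakeTestUpdate as the default update function. The stubs and sentinels behind the fixture can be just as small. What follows is a hypothetical sketch of mine: the names match the test cases above, but the sentinel-buffer mechanics are an assumption, not a transcription of tls_digest_test.c:

/* Hypothetical sketch, not a transcription of tls_digest_test.c. Each
 * failure mode doubles as the OSStatus returned by the corresponding stub,
 * so SUCCESS must remain zero to match a successful HashHandshake() result. */
enum HandshakeResult {
  SUCCESS = 0,
  INIT_FAILURE,
  UPDATE_CLIENT_FAILURE,
  UPDATE_SERVER_FAILURE,
  UPDATE_PARAMS_FAILURE,
  FINAL_FAILURE,
  NUM_HANDSHAKE_RESULTS
};

/* Sentinel buffers: FAIL_ON_EVALUATION(f) yields a distinct SSLBuffer* for
 * failure mode f, and the stub update function returns f when handed that
 * sentinel. */
static SSLBuffer fail_buffers[NUM_HANDSHAKE_RESULTS];
#define FAIL_ON_EVALUATION(failure) (&fail_buffers[failure])

static OSStatus HashHandshakeTestUpdate(SSLBuffer *ctx, const SSLBuffer *data) {
  int i;
  for (i = 0; i != NUM_HANDSHAKE_RESULTS; ++i) {
    if (data == &fail_buffers[i])
      return (OSStatus)i;
  }
  return SUCCESS;
}

/* Overrides ref.init; ReadyHash() is assumed to propagate its result. */
static OSStatus HashHandshakeTestFailInit(SSLBuffer *digestCtx) {
  return INIT_FAILURE;
}

static OSStatus HashHandshakeTestFailFinal(SSLBuffer *digestCtx,
                                           SSLBuffer *digest) {
  return FINAL_FAILURE;
}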

ExecuteHandshake() executes the HashHandshake() function and evaluates the outcome, printing an error message and returning an error value if the outcome differs from what was expected:

/* Executes the handshake and returns zero if the result matches expected, one
 * otherwise. */
static int ExecuteHandshake(HashHandshakeTestFixture fixture) {
  const enum HandshakeResult actual = HashHandshake(
      &fixture.ref, fixture.client, fixture.server, fixture.params,
      fixture.output);

  if (actual != fixture.expected) {
    printf("%s failed: expected %s, received %s\n", fixture.test_case_name,
           HandshakeResultString(fixture.expected),
           HandshakeResultString(actual));
    return 1;
  }
  return 0;
}

Adding a duplicate goto fail statement anywhere in the HashHandshake() algorithm before the final() call will cause the test to fail.

This test was written without a testing framework, to demonstrate that an effective test can be written using tools already in use by a project. Even without referencing a standard framework, the explanation in the preceding paragraphs should prove relatively easy to follow: Well-organized test cases using well-organized objects and functions with well-chosen names mean that if a test fails, you can usually diagnose the failure from the information in the test case alone without digging through the full implementation of the test program. Testing frameworks can help in writing tests more efficiently, but are not a prerequisite for writing well-organized, thorough unit tests.
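Without a framework, the test runner itself can be a dozen lines. A minimal sketch in the same spirit (the real tls_digest_test.c may organize this differently) simply sums the failures and uses the count to set the exit status:

int main(void) {
  /* Each test returns 0 on success and 1 on failure, so the sum is the
   * number of failing cases; a nonzero exit status fails the build. */
  int failed = TestHandshakeSuccess() +
               TestHandshakeInitFailure() +
               TestHandshakeUpdateClientFailure() +
               TestHandshakeUpdateServerFailure() +
               TestHandshakeUpdateParamsFailure() +
               TestHandshakeFinalFailure();
  if (failed != 0)
    printf("%d test(s) failed\n", failed);
  return failed != 0;
}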

Writing a set of tests to exercise this function is straightforward because now we’re thinking about concrete examples rather than conditions. Furthermore, the tests act as a double-check: It’s easy to make a mistake with conditional logic, accidentally reversing one test in the chain; but when you write tests you are stating the behavior twice, once with examples, once with logic. You have to make the same mistake in two different representations for a bug to get through.

It’s likely that the programmer who wrote this algorithm the first time did execute the program to check for errors in the new code. Most programmers will run a program with some sample inputs to verify that it’s doing what they think it should do. The problem is that these runs are often ephemeral and thrown away once the code is working; an automated test captures those runs as a permanent double-check.

That permanent double-check is important here: We don’t know exactly how that rogue second goto fail got into the code; a likely reason is that it was the result of a large merge operation. When merging a branch into the mainline, large differences can result. Even if a merge compiles, it can still introduce errors. Inspecting such merge differences can be time-consuming, tedious, and error-prone, even for experienced developers. In this case the automated double-check provided by unit tests provides a fast and painstaking (yet painless!) code review, in the sense that the tests will likely catch potential merge errors before a human inspects the merged code. It’s unlikely the original author introduced the "goto fail" bug into the code, but a suite of tests doesn’t just help you find your own mistakes: It helps reveal mistakes made by programmers far into the future.

In the case of the “goto fail” bug, the unit testing habit of looking for, and extracting, testable functions has a second benefit.

Déjà Vu All Over Again

A copy of the same algorithm with a different HashReference instance appears immediately above the buggy algorithm in the same function. In total, the algorithm appears six different times in the same file (sslKeyExchange.c from Security-55471):

  • twice in SSLVerifySignedServerKeyExchange(), which contained the bug
  • once in SSLVerifySignedServerKeyExchangeTls12()
  • twice in SSLSignServerKeyExchange()
  • once in SSLSignServerKeyExchangeTls12()

The updated version of sslKeyExchange.c in Security-55471.14 has eliminated the duplicated goto fail statement from SSLVerifySignedServerKeyExchange(), but the duplicated algorithm remains.

Code duplication is a Code Smell that is known to increase the likelihood of software errors. It is also apparent from the function names above that there’s more duplication besides that of the core handshake algorithm. This cut-and-paste code reuse also supports the hypothesis that the bug might have been caused by a large merge operation, as duplicate code increases the available “code surface” during merges and compounds the potential for undetected merge errors.

Unit testing introduces pressure to minimize copy/paste, because the copy/pasted code also has to be unit tested. That pressure could have ensured that only one copy of this algorithm existed, since a single copy would have been easier to test. A unit test could have easily verified that this algorithm was correct, merge or no, and could have prevented the “goto fail” bug from being written in the first place.
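To make that concrete, the two copies inside SSLVerifySignedServerKeyExchange() (one hashing with MD5, one with SHA-1) would reduce to two calls once HashHandshake() is extracted. This is a sketch: the real call sites differ in how their output buffers are set up, and hashOutMD5/hashOutSHA1 are stand-in names of mine:

if ((err = HashHandshake(&SSLHashMD5, &clientRandom, &serverRandom,
     &signedParams, &hashOutMD5)) != 0)
  goto fail;
if ((err = HashHandshake(&SSLHashSHA1, &clientRandom, &serverRandom,
     &signedParams, &hashOutSHA1)) != 0)
  goto fail;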

Also, the Security-55471 version of ssl_regressions.h, which appears to list a number of SSL regression tests for this library, remains unchanged in Security-55471.14. The only substantial difference between the two versions of the library is the deletion of the goto fail statement itself, with no added tests or eliminated duplication:

$ curl -O http://opensource.apple.com/tarballs/Security/Security-55471.tar.gz
$ curl -O http://opensource.apple.com/tarballs/Security/Security-55471.14.tar.gz
$ for f in Security-55471{,.14}.tar.gz; do gzip -dc $f | tar xf - ; done
# Since diff on OS X doesn't have a --no-dereference option:
$ find Security-55471* -type l | xargs rm
$ diff -uNr Security-55471{,.14}/libsecurity_ssl
diff -uNr Security-55471/libsecurity_ssl/lib/sslKeyExchange.c Security-55471.14/libsecurity_ssl/lib/sslKeyExchange.c
--- Security-55471/libsecurity_ssl/lib/sslKeyExchange.c	2013-08-09 20:41:07.000000000 -0400
+++ Security-55471.14/libsecurity_ssl/lib/sslKeyExchange.c	2014-02-06 22:55:54.000000000 -0500
@@ -628,7 +628,6 @@
         goto fail;
     if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
         goto fail;
-        goto fail;
     if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
         goto fail;

diff -uNr Security-55471/libsecurity_ssl/regressions/ssl-43-ciphers.c Security-55471.14/libsecurity_ssl/regressions/ssl-43-ciphers.c
--- Security-55471/libsecurity_ssl/regressions/ssl-43-ciphers.c	2013-10-11 17:56:44.000000000 -0400
+++ Security-55471.14/libsecurity_ssl/regressions/ssl-43-ciphers.c	2014-03-12 19:30:14.000000000 -0400
@@ -85,7 +85,7 @@
     { OPENSSL_SERVER, 4000, 0, false}, //openssl s_server w/o client side auth
     { GNUTLS_SERVER, 5000, 1, false}, // gnutls-serv w/o client side auth
     { "www.mikestoolbox.org", 442, 2, false}, // mike's  w/o client side auth
-//    { "tls.secg.org", 40022, 3, false}, // secg ecc server w/o client side
auth - This server generate DH params we dont support.
+//    { "tls.secg.org", 40022, 3, false}, // secg ecc server w/o client side
auth

     { OPENSSL_SERVER, 4010, 0, true}, //openssl s_server w/ client side auth
     { GNUTLS_SERVER, 5010, 1, true}, // gnutls-serv w/ client side auth

Cultural Implications

The presence of six separate copies of the same algorithm clearly indicates that this bug was not due to a one-time programmer error: This was a pattern. This is evidence of a development culture that tolerates duplicated, untested code.

I've never worked at Apple, nor do I know any Apple developers. I don't know exactly what the company-wide development culture is there, and whether this code is representative or exceptional. Even if this code is the exception rather than the norm, it's still unacceptable. It doesn't matter to me, as someone whose privacy and security might've been violated by this coding error, what the circumstances were "excusing" this particular error or what the rest of the culture looks like. I want to see greater accountability for such mistakes. Not shaming, not condemnation, etc., but accountability and the due diligence that follows it. That's the deeper strategy for how we prevent the next "goto fail" from happening.

I do know that development cultures can change. Bugs such as these give us an occasion to reflect upon our own development cultures, if unit testing is not already a vital part, and begin to appreciate why unit testing is such an important development practice. I'll discuss my experience with changing a development culture in detail in a later section of this article, and offer advice for how to effect change in other development cultures, from a single team to an entire company.

My proof-of-concept unit test for "goto fail" may be easy to dismiss as a one-off test written with 20/20 hindsight. I would rather it appear as an example of the kind of accessible unit testing approach that development teams everywhere can apply to existing code, right now, to avoid similarly embarrassing (and potentially catastrophic) bugs. A development culture that values unit testing and whose members work to improve their craft will produce tests that will most likely catch programming errors exactly like "goto fail" long before they have a chance to impact any users.

In the next installment, I'll take a look at the Heartbleed bug to examine how unit testing might've been applied in that context. Follow the RSS feed or Martin's twitter stream to know when it appears.


Significant Revisions

12 May 2014: published first installment