Goto Fail, Heartbleed, and Unit Testing Culture
Two computer security flaws were discovered in early 2014: Apple’s “goto fail” bug and OpenSSL’s “Heartbleed” bug. Both bugs had the potential for widespread and severe security failures, the full extent of which we may never know. Given their severity, it is important for the software development profession to reflect on how they could have been prevented so we can improve our ability to prevent these kinds of bugs in the future. This article considers the role unit testing could play, showing how unit tests, and more importantly a unit testing culture, could have identified these particular bugs. It goes on to look at the costs and benefits of such a culture and describes how such a culture was instilled at Google.
12 May 2014
We are publishing this article in installments. This first installment looks at applying unit testing culture to the "goto fail" bug. Later installments will consider the Heartbleed bug, the trade-offs around unit testing, and the experience of instilling a unit-testing culture at Google.
At the beginning of 2014 the security of the Internet was rocked by two serious flaws: Apple’s “goto fail” bug (CVE-2014-1266) and OpenSSL’s “Heartbleed” bug (CVE-2014-0160). Both were vulnerabilities in the Secure Sockets Layer technology upon which the majority of secure communications on the Internet relies. These bugs are as instructive as they were devastating: They were rooted in the same programmer optimism, overconfidence, and haste that strike projects of all sizes and domains.
These bugs arouse my passion because I've seen and lived the benefits of unit testing, and this strongly-imprinted experience compels me to reflect on how unit testing approaches could prevent defects as high-impact and high-profile as these SSL bugs. Unit testing is the process of looking for chunks of code that make for convenient “units” to which to apply automated Unit Tests, small programs designed to verify low-level implementation details and detect coding errors early. The nature of the defects inspired me to write my own proof-of-concept unit tests to reproduce the errors and verify their fixes. I wrote these tests to validate my intuition, and to demonstrate to others how unit tests could have detected these defects early and without heroic effort.
Writing unit tests produces benefits beyond detecting low-level
coding errors. In this article, I explore the question of whether
unit testing could have helped prevent the "goto fail" and
Heartbleed bugs. In doing so, I hope to establish a compelling
case for the adoption of unit testing as part of everyday
development, so that the experience of Self Testing Code becomes universal. I offer my insights in
the hope that they may help avoid similar failures in the future,
in the spirit of a postmortem or
project retrospective. My
experience doesn't mean I'm owed deference based on mah authoritah, but I hope I've earned an
audience with those open-minded enough to have a discussion on the
merits of the argument.
Many popular and technical media stories have run with explanations of how these defects happened, why they slipped past existing safeguards before being so widely deployed, and what should be done to prevent such bugs from happening again. It troubles me that most of these analyses fall back on facile excuses that miss the mark, and promote resigned acceptance of such defects due to the ever-increasing complexity of modern software systems. It is as though the software industry at large, as well as the public that depends on it, is anxious to accept such failures as inevitable fate, the price we pay for the modern conveniences that technology affords us. It's the easiest possible explanation that allows us to make sense of a bad situation and move on as a society.
I don't accept such defects as inevitable. Rather, we must seize this opportunity to reflect on how we developers can do far better than rely on fate, or more funding, or any number of external factors to prevent security vulnerabilities or other high-impact defects caused by low-level coding errors. Bugs will happen, but neither software developers nor the public should be satisfied with that as a response to defects this colossal in scope. Deep, genuine reflection is difficult, and encounters a lot of resistance, as it calls on developers to accept responsibility for their human limitations—which is often a challenge to the very self-image of a programmer. That makes it all the more important to dive deeply into these two bugs in particular, to search for genuine solutions and to avoid setting a dangerous precedent: If everything in the short-term turns out OK in the wake of "goto fail" and Heartbleed, then why bother changing anything about how software development is currently done?
goto fail
The “goto fail” bug first shipped to iPhones, iPads, and AppleTVs in September 2012, appeared in iOS 7.0 and OS X Mavericks, and was not fixed until February 2014—seventeen months after it was introduced. A short-circuit skipping the final step of the SSL/TLS handshake algorithm left users vulnerable to a man-in-the-middle attack, whereby a malicious system relaying traffic between an affected system and another system could present the illusion of a secure connection using false credentials, and subsequently intercept all communications between the other two systems.
The bug earned its name from this now infamous code snippet:
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0) goto fail; goto fail;
Some have argued that all goto statements are bad, based on
Edsger Dijkstra's famous essay A Case against
the GO TO Statement, summarized by the popular axiom "Goto
considered harmful". However, the
goto fail
statement embodies an idiom C programmers
will recognize. Such statements pass control immediately to the
end of a function in the case of unrecoverable errors, as it is
at the end of a function where memory or other resources
allocated by the function are properly released. Other languages
have built-in support for such “abortion clauses”, as Dijkstra
called them in the conclusion to his essay: destructors in C++;
try/catch/finally
in Java;
defer()/panic()/recover()
in Go;
try/except/finally
and the with
statement in Python. In C, there is no essential problem with or
confusion surrounding the use of goto
in this
context. In other words, goto
is not considered
harmful here.
C programmers will also immediately recognize that the first
goto fail
statement is bound to the result of the
if
statement preceding it, but the second goto
fail
is not: The matching indentation of the two statements
bears no significance in C, as surrounding curly braces are required
to bind more than one statement to an if
condition. If
the first goto fail
is not executed, the second one
certainly will be. This means that the final step of the handshake
algorithm will never be executed, but any exchange successfully
passing this point will always produce a successful return value even
if the final verification step would have failed. More plainly: The
algorithm gets short-circuited by the extra goto fail
statement.
Some have claimed that a coding style requiring the use of
curly braces for all if
statements or enabling
unreachable-code compiler warnings could have helped. However, there
are deeper problems with the code which unit testing could help to
resolve.
How Could Unit Testing Have Helped?
While looking for “units” to which to apply “unit” tests,
the entire block of code containing the buggy algorithm, with
its cluster of conditional logic, leaps out as such a unit
(from the SSLVerifySignedServerKeyExchange()
function
in version 55471 of Apple’s Secure
Transport library):
if ((err = ReadyHash(&SSLHashSHA1, &hashCtx)) != 0) goto fail; if ((err = SSLHashSHA1.update(&hashCtx, &clientRandom)) != 0) goto fail; if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0) goto fail; if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0) goto fail; goto fail; if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0) goto fail;
Such a block of code is easier to test when extracted into its
own function. Extracting chunks of code like this is habitual practice
for people writing unit tests, and can help get parts of an existing
code base under test a piece at a time. Looking closely at the
variables and data types used in the algorithm make it clear that
this block of code is performing a handshake on the hashes. By looking
up the type of SSLHashSHA1
, we can also see that it is an
instance of a HashReference
“jump
table”, a structure containing function pointers that enables C
programmers to implement virtual function-like behavior (i.e.
substitutability and run-time polymorphism).
We can extract this operation into a function with a name signifying
its intent:
static OSStatus HashHandshake(const HashReference* hashRef, SSLBuffer* clientRandom, SSLBuffer* serverRandom, SSLBuffer* exchangeParams, SSLBuffer* hashOut) { SSLBuffer hashCtx; OSStatus err = 0; hashCtx.data = 0; if ((err = ReadyHash(hashRef, &hashCtx)) != 0) goto fail; if ((err = hashRef->update(&hashCtx, clientRandom)) != 0) goto fail; if ((err = hashRef->update(&hashCtx, serverRandom)) != 0) goto fail; if ((err = hashRef->update(&hashCtx, exchangeParams)) != 0) goto fail; err = hashRef->final(&hashCtx, hashOut); fail: SSLFreeBuffer(&hashCtx); return err; }
Now the series of statements comprising the previously buggy algorithm can be replaced by:
if ((err = HashHandshake(&SSLHashSHA1, &clientRandom, &serverRandom, &signedParams, &hashOut)) != 0) { goto fail; }
This function is more easily understood in isolation. Faced with such a self-contained function like this, a programmer can begin to focus on the external effects of the code, considering questions such as:
- What is the contract fulfilled by the code under test?
- What preconditions are required, and how are they enforced?
- What postconditions are guaranteed?
- What example inputs trigger different behaviors?
- What set of tests will trigger each behavior and validate each guarantee?
In the case of HashHandshake()
, the contract can be
described as: Five steps, all must pass. Success or failure is
propagated to the caller by the return value. The
HashReference
is expected to respond correctly to the
series of calls; whether it makes use of any functions or data beyond
that passed in by HashHandshake()
is an implementation
detail opaque to HashHandshake()
itself.
For an algorithm this straightforward, the test cases will rather closely "mirror" the implemetation: One success case, five failure cases. For higher-level or more complex operations, such close "mirroring" can make for brittle tests and should generally be avoided. This is especially important to keep in mind when using mocks or other Test Doubles to test code in isolation from its collaborators.
It's arguably even more important to test that the code doesn't do what it shouldn't do.
Regardless of the scope of the code under test, it's critical to exhaustively test failure cases to the extent possible. It is tempting to test that the code does what it should do and leave it at that, but it's arguably even more important to test that it doesn't do what it shouldn't do.
Proof-of-Concept Unit Test
Despite the fact that C is not an object-oriented programming
language, the existing code for this algorithm exhibits a clearly
object-oriented design that actually makes for easy unit testing once
the code is extracted into its own function. The
tls_digest_test.c
proof-of-concept unit test shows how a HashReference
stub can be used to effectively cover every path through the
extracted HashHandshake()
algorithm. The actual test
cases look like this:
static int TestHandshakeSuccess() { HashHandshakeTestFixture fixture = SetUp(__func__); fixture.expected = SUCCESS; return ExecuteHandshake(fixture); } static int TestHandshakeInitFailure() { HashHandshakeTestFixture fixture = SetUp(__func__); fixture.expected = INIT_FAILURE; fixture.ref.init = HashHandshakeTestFailInit; return ExecuteHandshake(fixture); } static int TestHandshakeUpdateClientFailure() { HashHandshakeTestFixture fixture = SetUp(__func__); fixture.expected = UPDATE_CLIENT_FAILURE; fixture.client = FAIL_ON_EVALUATION(UPDATE_CLIENT_FAILURE); return ExecuteHandshake(fixture); } static int TestHandshakeUpdateServerFailure() { HashHandshakeTestFixture fixture = SetUp(__func__); fixture.expected = UPDATE_SERVER_FAILURE; fixture.server = FAIL_ON_EVALUATION(UPDATE_SERVER_FAILURE); return ExecuteHandshake(fixture); } static int TestHandshakeUpdateParamsFailure() { HashHandshakeTestFixture fixture = SetUp(__func__); fixture.expected = UPDATE_PARAMS_FAILURE; fixture.params = FAIL_ON_EVALUATION(UPDATE_PARAMS_FAILURE); return ExecuteHandshake(fixture); } static int TestHandshakeFinalFailure() { HashHandshakeTestFixture fixture = SetUp(__func__); fixture.expected = FINAL_FAILURE; fixture.ref.final = HashHandshakeTestFailFinal; return ExecuteHandshake(fixture); }
The HashHandshakeTestFixture
holds all the variables
needed as input to the code under test and to check the expected
outcome:
typedef struct { HashReference ref; SSLBuffer *client; SSLBuffer *server; SSLBuffer *params; SSLBuffer *output; const char *test_case_name; enum HandshakeResult expected; } HashHandshakeTestFixture;
SetUp()
initializes all of the members of a
HashHandshakeTestFixture
to default values; each test
case overrides only those members pertinent to that specific test
case:
static HashHandshakeTestFixture SetUp(const char *test_case_name) { HashHandshakeTestFixture fixture; memset(&fixture, 0, sizeof(fixture)); fixture.ref = SSLHashNull; fixture.ref.update = HashHandshakeTestUpdate; fixture.test_case_name = test_case_name; return fixture; }
ExecuteHandshake()
executes the
HashHandshake()
function and evaluates the outcome,
printing an error message and returning an error value if the outcome
differs from what was expected:
/* Executes the handshake and returns zero if the result matches expected, one * otherwise. */ static int ExecuteHandshake(HashHandshakeTestFixture fixture) { const enum HandshakeResult actual = HashHandshake( &fixture.ref, fixture.client, fixture.server, fixture.params, fixture.output); if (actual != fixture.expected) { printf("%s failed: expected %s, received %s\n", fixture.test_case_name, HandshakeResultString(fixture.expected), HandshakeResultString(actual)); return 1; } return 0; }
Adding a duplicate goto fail
statement anywhere in the
HashHandshake()
algorithm before the final()
call will cause the test to fail.
This test was written without a testing framework, to demonstrate that an effective test can be written using tools already in-use by a project. Even without referencing a standard framework, the explanation in the preceding paragraph should prove relatively easy to follow: Well-organized test cases using well-organized objects and functions with well-chosen names mean that if a test fails, you can usually diagnose the failure from the information in the test case alone without digging through the full implementation of the test program. Testing frameworks can help in writing tests more efficiently, but are not a prerequisite for writing well-organized, thorough unit tests.
Writing a set of tests to exercise this function is straightforward because now we’re thinking about concrete examples rather than conditions. Furthermore, the tests act as a double-check: It’s easy to make a mistake with conditional logic, accidentally reversing one test in the chain; but when you write tests you are stating the behavior twice, once with examples, once with logic. You have to make the same mistake in two different representations for a bug to get through.
It’s likely that the programmer who wrote this algorithm the first time did execute the program to check for errors in the new code. Most programmers will run a program with some sample inputs to verify that it’s doing what they think it should do. The problem is that these runs are often ephemeral and thrown away once the code is working; an automated test captures those runs as a permanent double-check.
That permanent double-check is important here: We don’t know
exactly how that rogue second goto fail
got into the
code; a likely reason is that it was
the result of a large merge operation.
When merging a branch into the mainline, large differences can
result. Even if a merge compiles, it can still introduce
errors. Inspecting such merge differences can be time-consuming,
tedious, and error-prone, even for experienced developers. In this
case the automated double-check provided by unit tests provides a fast
and painstaking (yet painless!) code review, in the sense that the
tests will likely catch potential merge errors before a human
inspects the merged code. It’s unlikely the original author
introduced the "goto fail" bug into the code, but a suite of tests
doesn’t just help you find your own mistakes: It helps reveal
mistakes made by programmers far into the future.
In the case of the “goto fail” bug, the unit testing habit of looking for, and extracting, testable functions has a second benefit.
Déjà Vu All Over Again
A copy of the same algorithm with a different
HashReference
instance appears immediately above the
buggy algorithm in the same function. In total, the algorithm appears
six different times in the same file
(sslKeyExchange.c from
Security-55471):
- twice in
SSLVerifySignedServerKeyExchange()
, which contained the bug - once in
SSLVerifySignedServerKeyExchangeTls12()
- twice in
SSLSignServerKeyExchange()
- once in
SSLSignServerKeyExchangeTls12()
The updated version of
sslKeyExchange.c in Security-55471.14 has eliminated the
duplicated goto fail
statement from
SSLVerifySignedServerKeyExchange()
, but the duplicated
algorithm remains.
Code duplication is a Code Smell that is known to increase the likelihood of software errors. It is also apparent from the function names above that there’s more duplication besides that of the core handshake algorithm. This cut-and-paste code reuse also supports the hypothesis that the bug might have been caused by a large merge operation, as duplicate code increases the available “code surface” during merges and compounds the potential for undetected merge errors.
Unit testing introduces pressure to minimize copy/paste, because the copy/pasted code also has to be unit tested. It could have ensured that only one copy of this algorithm existed since it would have been easier to test. A unit test could have easily verified that this algorithm was correct, merge or no, and could have prevented the “goto fail” bug from being written in the first place.
Also, the Security-55471 version of
ssl_regressions.h, which appears to list a number of SSL
regression tests for this library, remains unchanged in the
Security-55471.14 version of
ssl_regressions.h. The only substantial difference between the
two versions of the library is the deletion of the goto
fail
statement itself, with no added tests or eliminated
duplication:
$ curl -O http://opensource.apple.com/tarballs/Security/Security-55471.tar.gz $ curl -O http://opensource.apple.com/tarballs/Security/Security-55471.14.tar.gz $ for f in Security-55471{,.14}.tar.gz; do gzip -dc $f | tar xf - ; done # Since diff on OS X doesn't have a --no-dereference option: $ find Security-55471* -type l | xargs rm $ diff -uNr Security-55471{,.14}/libsecurity_ssl diff -uNr Security-55470/libsecurity_ssl/lib/sslKeyExchange.c Security-55471.14/libsecurity_ssl/lib/sslKeyExchange.c --- Security-55471/libsecurity_ssl/lib/sslKeyExchange.c 2013-08-09 20:41:07.000000000 -0400 +++ Security-55471.14/libsecurity_ssl/lib/sslKeyExchange.c 2014-02-06 22:55:54.000000000 -0500 @@ -628,7 +628,6 @@ goto fail; if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0) goto fail; - goto fail; if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0) goto fail; diff -uNr Security-55471/libsecurity_ssl/regressions/ssl-43-ciphers.c Security-55471.14/libsecurity_ssl/regressions/ssl-43-ciphers.c --- Security-55471/libsecurity_ssl/regressions/ssl-43-ciphers.c 2013-10-11 17:56:44.000000000 -0400 +++ Security-55471.14/libsecurity_ssl/regressions/ssl-43-ciphers.c 2014-03-12 19:30:14.000000000 -0400 @@ -85,7 +85,7 @@ { OPENSSL_SERVER, 4000, 0, false}, //openssl s_server w/o client side auth { GNUTLS_SERVER, 5000, 1, false}, // gnutls-serv w/o client side auth { "www.mikestoolbox.org", 442, 2, false}, // mike's w/o client side auth -// { "tls.secg.org", 40022, 3, false}, // secg ecc server w/o client side auth - This server generate DH params we dont support. +// { "tls.secg.org", 40022, 3, false}, // secg ecc server w/o client side auth { OPENSSL_SERVER, 4010, 0, true}, //openssl s_server w/ client side auth { GNUTLS_SERVER, 5010, 1, true}, // gnutls-serv w/ client side auth
Cultural Implications
The presence of six separate copies of the same algorithm clearly indicates that this bug was not due to a one-time programmer error: This was a pattern. This is evidence of a development culture that tolerates duplicated, untested code.
I've never worked at Apple, nor do I know any Apple developers. I don't know exactly what the company-wide development culture is there, and whether this code is representative or exceptional. Even if this code is the exception rather than the norm, it's still unacceptable. It doesn't matter to me, as someone whose privacy and security might've been violated by this coding error, what the circumstances were "excusing" this particular error or what the rest of the culture looks like. I want to see greater accountability for such mistakes. Not shaming, not condemnation, etc., but accountability and the due diligence that follows it. That's the deeper strategy for how we prevent the next "goto fail" from happening.
I do know that development cultures can change. Bugs such as these give us an occasion to reflect upon our own development cultures, if unit testing is not already a vital part, and begin to appreciate why unit testing is such an important development practice. I'll discuss my experience with changing a development culture in detail in a later section of this article, and offer advice for how to effect change in other development cultures, from a single team to an entire company.
My proof-of-concept unit test for "goto fail" may be easy to dismiss as a one-off test written with 20/20 hindsight. I would rather it appear as an example of the kind of accessible unit testing approach that development teams everywhere can apply to existing code, right now, to avoid similarly embarrassing (and potentially catastrophic) bugs. A development culture that values unit testing and whose members work to improve their craft will produce tests that will most likely catch programming errors exactly like "goto fail" long before they have a chance to impact any users.
In the next installment, I'll take a look at the Heartbleed bug to examine how unit testing might've been applied in that context. Follow the RSS feed or Martin's twitter stream to know when it appears.