New Functional Testing in etcd

May 14, 2015 · By Yicheng Qin

Today we are discussing the new fault-injecting functional testing framework built to test etcd, which can deploy a cluster, inject failures, and continuously check the cluster for correctness.

For context, etcd is an open source, distributed, consistent key-value store. It is a core component of CoreOS software that facilitates safe automatic updates, coordinates work scheduled to hosts, and sets up overlay networking for containers. Because of its core position in the stack, its correctness and availability are critical, which is why the etcd team has built the functional testing framework.

Since writing the framework, we have run it continuously for the last two months, and etcd has proven robust under many kinds of harsh failure scenarios. The framework has also helped us identify a few potential bugs and improvements that we've fixed in newer releases; read on for more info.

Functional Testing

etcd’s functional test suite tests the functionality of an etcd cluster with a focus on failure-resistance under heavy usage.

The main workflow of the functional test suite is straightforward:

  1. It sets up a new etcd cluster and injects a failure into the cluster. A failure is some unexpected situation that may happen in the cluster, e.g., a machine stops working or the network goes down.
  2. It repairs the failure and expects the etcd cluster to recover within a short amount of time (usually one minute).
  3. It waits for the etcd cluster to become fully consistent and make progress.
  4. It starts the next round of failure injection.

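A minimal sketch of one round of this loop, using the same cluster and failure types that appear in the walkthrough later in this post (the real loop lives in etcd-tester and also handles archiving and cluster recreation), looks roughly like:

// runRound injects each failure in turn, repairs it, and waits for the
// cluster to become healthy again before moving on. Simplified sketch.
func runRound(c *cluster, failures []failure, round int) error {
  for _, f := range failures {
    // 1. inject an unexpected situation into the cluster
    if err := f.Inject(c, round); err != nil {
      return err
    }
    // 2. repair it and expect the cluster to recover shortly
    if err := f.Recover(c, round); err != nil {
      return err
    }
    // 3. wait until the cluster is consistent and making progress
    if err := c.WaitHealth(); err != nil {
      return err
    }
  }
  // 4. the caller starts the next round of failure injection
  return nil
}
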
Meanwhile, the framework makes continuous write requests to the etcd cluster to simulate heavy workloads. As a result, there are constantly hundreds of write requests queued, intentionally causing a heavy burden on the etcd cluster.

If the running cluster cannot recover from a failure, the functional testing framework archives the cluster state and starts the next round of testing on a new etcd cluster. When archiving, the process logs and data directories of each etcd member are saved into a separate directory, which can be inspected and debugged later.

Basic Architecture

etcd's functional test suite has two components: etcd-agent and etcd-tester. etcd-agent runs on every etcd node and etcd-tester is a single controller of the test.

etcd-agent is a daemon on each machine. It can start, stop, restart, isolate and terminate an etcd process. The agent exposes these functionalities via RPC.

etcd-tester utilizes all etcd-agents to control the cluster and simulate various test cases. For example, it starts a three-member cluster by sending three start-RPC calls to three different etcd-agents. It then forces one of the members to fail by sending a stop-RPC call to the member’s etcd-agent.
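
A rough sketch of this split, using Go's standard net/rpc package with hypothetical method names and paths (the real etcd-agent defines its own RPC service), might look like:

package main

import (
  "log"
  "net"
  "net/rpc"
  "os/exec"
)

// Agent wraps control of one local etcd process and exposes it over RPC.
// The method set and the hard-coded path/port here are illustrative.
type Agent struct {
  etcdPath string
  cmd      *exec.Cmd
}

// Start launches etcd with the given arguments.
func (a *Agent) Start(args []string, _ *struct{}) error {
  a.cmd = exec.Command(a.etcdPath, args...)
  return a.cmd.Start()
}

// Stop kills the running etcd process, if any.
func (a *Agent) Stop(_ struct{}, _ *struct{}) error {
  if a.cmd == nil || a.cmd.Process == nil {
    return nil
  }
  return a.cmd.Process.Kill()
}

func main() {
  rpc.Register(&Agent{etcdPath: "/usr/local/bin/etcd"})
  l, err := net.Listen("tcp", ":9027")
  if err != nil {
    log.Fatal(err)
  }
  rpc.Accept(l) // serve Start/Stop calls from etcd-tester
}

In this sketch, etcd-tester would dial each agent (rpc.Dial("tcp", ip+":9027")) and drive it with calls like client.Call("Agent.Stop", struct{}{}, &struct{}{}).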

[Figure: etcd functional testing architecture]

While etcd-tester uses etcd-agent to control etcd externally, it also connects directly to etcd members and makes HTTP requests to simulate user traffic, including setting a range of keys and checking member health.
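
For instance, a per-member health probe can be as small as a quorum read against the v2 keys API. A minimal sketch (the member URL is illustrative, and etcd-tester's real checks are more thorough):

package check

import (
  "net/http"
  "time"
)

// healthy reports whether an etcd member answers a quorum read within a
// second, e.g. healthy("http://10.240.0.2:2379").
func healthy(member string) bool {
  c := &http.Client{Timeout: time.Second}
  resp, err := c.Get(member + "/v2/keys/?quorum=true")
  if err != nil {
    return false
  }
  resp.Body.Close()
  return resp.StatusCode == http.StatusOK
}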

Internal Testing Suite

The internal functional testing setup is built upon four n1-highcpu-2 virtual machines on Google Compute Engine. Each machine has 2 virtual cores, 1.8GB of memory and a 200GB standard persistent disk. Three machines have etcd-agent running as a daemon, while the fourth machine runs etcd-tester as the controller.

Currently we simulate six major failures, covering the most common cases that etcd may encounter in real life:

  1. kill all members
    • the whole data center experiences an outage, and the etcd cluster in the data center is killed
  2. kill the majority of the cluster
    • part of the data center experiences an outage, and the etcd cluster loses quorum
  3. kill one member
    • a single machine needs to be upgraded or maintained
  4. kill one member for a significant time and expect it to recover from an incoming snapshot
    • a single machine is down due to hardware failure, and requires manual repair
  5. isolate one member
    • the network interface on a single machine is broken
  6. isolate all members
    • the router or switch in the data center is broken

Meanwhile, 250k 100-byte keys are written into the etcd cluster continuously, which means we’re storing about 25MB of data in the cluster.
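
A stresser with these parameters can be sketched as a pool of goroutines issuing PUTs against the v2 keys API. The key count and key size below mirror the -stress-key-count and -stress-key-size flags shown later in this post; everything else (endpoint, key prefix, connection count) is illustrative:

package stress

import (
  "fmt"
  "math/rand"
  "net/http"
  "net/url"
  "strings"
)

// Stress launches conns goroutines that keep overwriting random
// keySize-byte values under keyCount distinct keys, e.g.
// Stress("http://10.240.0.2:2379", 250000, 100, 100).
// A simplified stand-in for etcd-tester's stresser.
func Stress(endpoint string, keyCount, keySize, conns int) {
  for i := 0; i < conns; i++ {
    go func() {
      for {
        key := fmt.Sprintf("foo%d", rand.Intn(keyCount))
        body := url.Values{"value": {randString(keySize)}}.Encode()
        req, err := http.NewRequest("PUT", endpoint+"/v2/keys/"+key, strings.NewReader(body))
        if err != nil {
          continue
        }
        req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
          continue // members may be down mid-failure; keep pushing load
        }
        resp.Body.Close()
      }
    }()
  }
}

// randString builds a random lowercase value of length n.
func randString(n int) string {
  b := make([]byte, n)
  for i := range b {
    b[i] = 'a' + byte(rand.Intn(26))
  }
  return string(b)
}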

Discovering Potential Bugs

This test suite has helped us discover potential bugs and areas to improve. In one discovery, we found that when a leader is helping a follower catch up with the progress of the cluster, there was a slight possibility that memory and CPU usage could grow without bound. After digging into the logs, it turned out that the leader was repeatedly sending 50MB snapshot messages, overloading its transport module. To fix the issue, we designed flow control for snapshot messages, which solved the resource explosion.
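
The actual fix lives in etcd's raft transport; conceptually, the leader stops queueing a new snapshot message for a follower while a previous one is still in flight. A tiny sketch of that idea (not etcd's real code):

package transport

import "sync"

// snapshotGate allows at most one snapshot message in flight per
// follower, so a slow follower cannot make the leader buffer an
// unbounded number of ~50MB messages. Conceptual sketch only.
type snapshotGate struct {
  mu       sync.Mutex
  inFlight bool
}

// tryAcquire reports whether a new snapshot may be sent right now.
func (g *snapshotGate) tryAcquire() bool {
  g.mu.Lock()
  defer g.mu.Unlock()
  if g.inFlight {
    return false // skip for now; raft will retry the snapshot later
  }
  g.inFlight = true
  return true
}

// release marks the in-flight snapshot as delivered (or failed).
func (g *snapshotGate) release() {
  g.mu.Lock()
  g.inFlight = false
  g.mu.Unlock()
}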

Another example is the automatic WAL repair feature coming in v2.1.0. To protect data integrity, etcd intentionally refuses to restart if the last entry in the underlying WAL was only half-written, which can happen if the process is killed or the disk fills up. We found this happening occasionally (about once per hundred rounds) in functional testing, and in this case it is safe, and far easier for the administrator, to remove the torn entry automatically and recover the member from the rest of the cluster. This functionality has been merged into the master branch and will be released in v2.1.0.
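
Conceptually the repair is simple: scan the log, and if only the very last record is torn, truncate the file back to the end of the last intact record. A rough sketch of that idea (the record format and helper below are illustrative, not etcd's WAL code):

package walrepair

import (
  "io"
  "os"
)

// repairTail reads records from the log with readRecord (which returns
// the number of bytes a record occupied, or io.EOF at a clean end) and
// truncates the file after the last intact record if the tail is torn.
func repairTail(path string, readRecord func(io.Reader) (int64, error)) error {
  f, err := os.OpenFile(path, os.O_RDWR, 0600)
  if err != nil {
    return err
  }
  defer f.Close()

  var good int64 // offset just past the last intact record
  for {
    n, err := readRecord(f)
    if err == io.EOF {
      return nil // every record decoded cleanly; nothing to repair
    }
    if err != nil {
      // a half-written record at the tail: cut it off and move on
      return f.Truncate(good)
    }
    good += n
  }
}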

After several weeks of running and debugging, the etcd cluster has survived several thousand consecutive rounds of all six failures. Having held up under this serious testing, etcd has shown itself to be strong and working quite well.

Diving into the Code

Build and Run

etcd-agent can be built via

$ go build github.com/coreos/etcd/tools/functional-tester/etcd-agent

and etcd-tester via

$ go build github.com/coreos/etcd/tools/functional-tester/etcd-tester

Run etcd-agent binary on machine{1,2,3}:

$ ./etcd-agent --etcd-path=$ETCD_BIN_PATH

Run etcd-tester binary on machine4:

$ ./etcd-tester -agent-endpoints="$MACHINE1_IP:9027,$MACHINE2_IP:9027,$MACHINE3_IP:9027" -limit=3 -stress-key-count=250000 -stress-key-size=100

etcd-tester then runs 3 rounds of all failures against the 3-member cluster on machines 1, 2, and 3.

Add a new failure

Let's go through the process of adding failureKillOne, a failure that kills one member and recovers it afterwards. First, define how to inject and recover from the failure:

type failureKillOne struct {
  description
}

func newFailureKillOne() *failureKillOne {
  return &failureKillOne{
    // detailed description of the failure
    description: "kill one member",
  }
}

func (f *failureKillOne) Inject(c *cluster, round int) error {
  // round robin on all members
  i := round % c.Size
  // ask its agent to stop etcd
  return c.Agents[i].Stop()
}

func (f *failureKillOne) Recover(c *cluster, round int) error {
  i := round % c.Size
  // ask its agent to restart etcd
  if _, err := c.Agents[i].Restart(); err != nil {
    return err
  }
  // wait for recovery done
  return c.WaitHealth()
}
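
The embedded description presumably supplies a human-readable Desc; together with Inject and Recover, the type satisfies a failure interface roughly like this (a sketch; see etcd-tester for the real definition):

type failure interface {
  // Inject makes the failure happen in the running cluster.
  Inject(c *cluster, round int) error
  // Recover repairs the failure and waits until the cluster is healthy.
  Recover(c *cluster, round int) error
  // Desc returns a human-readable description of the failure.
  Desc() string
}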

Then we add it to the tester's failure list:

  t := &tester{
    failures: []failure{
      newFailureKillOne(),
    },
    cluster: c,
    limit:   *limit,
  }

Done.

As you can see, the framework is simple but already fairly powerful. We are looking forward to having you join the etcd test party!

Future Plans

The framework is still under active development, and more failure cases and checks will be added.

Random network partitions, network delays and runtime reconfigurations are some classic failure cases that the framework does not yet cover. Another interesting idea we plan to explore is a cascading failure case that injects multiple failure cases at the same time.

On the recovery side, adding more checks that all members hold a consistent view of the keyspace is a good starting point for further exploration.
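
One possible shape for such a check is to read the same key from every member and compare the responses. A minimal sketch (member URLs and key are illustrative):

package consistency

import (
  "bytes"
  "io/ioutil"
  "net/http"
)

// sameView fetches one key from every member and reports whether all
// responses match byte-for-byte.
func sameView(members []string, key string) (bool, error) {
  var first []byte
  for i, m := range members {
    resp, err := http.Get(m + "/v2/keys/" + key)
    if err != nil {
      return false, err
    }
    body, err := ioutil.ReadAll(resp.Body)
    resp.Body.Close()
    if err != nil {
      return false, err
    }
    if i == 0 {
      first = body
    } else if !bytes.Equal(body, first) {
      return false, nil
    }
  }
  return true, nil
}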

The internal testing cluster runs 24/7, and our etcd cluster works perfectly under the current failure set. The etcd team is making its best effort to guarantee etcd's correctness, and hopes to provide users with the most robust consensus store possible.

Follow-up plans for more specific and harsher tests are on our TODO list. This framework is good at imitating real-life scenarios, but it cannot exercise fine-grained control over lower-level system and hardware behavior. Future testing approaches may use simulated networks and disks to tackle those failure simulations.

We will keep enhancing testing strength and coverage by adding more failure cases and checks to the framework. Pull requests to the framework are welcome!

Acknowledgement

We are running our testing cluster on GCE. Thanks to Google for providing the testing environment.