List of resources on testing distributed systems curated by Andrey Satarin (@asatarin).
Contents
- Overview of testing approaches
- Specific approaches in different distributed systems
- Amazon Web Services
- Netflix
- Datastax (Cassandra)
- ScyllaDB
- VoltDB
- MemSQL
- CockroachLabs (CockroachDB)
- PingCap (TiDB)
- MongoDB
- Cloudera
- FoundationDB
- Sendence
- Microsoft
- Dropbox
- Atomix Copycat
- Onyx
- Druid.io
- Salesforce
- SQLite
- InfluxDB
- Shopify
- Confluent (Kafka)
- Elastic (Elastic Search)
- Tools
Overview of testing approaches
Research Papers
- Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems — great overview of how even simple testing can help a lot; you just need the right focus
- What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems — study of actual bugs in different popular distributed systems (Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper and Flume)
- TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems — comprehensive taxonomy of bugs in distributed systems (Cassandra, Hadoop MapReduce, HBase, ZooKeeper)
- Redundancy does not imply fault tolerance: analysis of distributed storage reactions to single errors and corruptions — study of several distributed systems (Redis, ZooKeeper, MongoDB, Cassandra, Kafka, RethinkDB) on how fault tolerant they are to data corruption and read/write errors
- An empirical study on the correctness of formally verified distributed systems — study of bugs in formally verified distributed systems
- The Case for Limping-Hardware Tolerant Clouds — research on the effect of limping hardware on the performance of distributed systems (aka limplock); see also the great blog post by Dan Luu on a similar topic, Distributed systems: when limping hardware is worse than dead hardware
- Early detection of configuration errors to reduce failure damage — why and how to test configuration files of your system
Technologies for Testing Distributed Systems by Colin Scott
Colin Scott shares his viewpoint from academia on testing distributed systems.
- Technologies for Testing Distributed Systems, Part I
- See also post Distributed Systems Testing: The Lost World by Crista Lopes
Testing in a Distributed World by Ines Sombra (RICON 2014)
Great overview of techniques for testing distributed systems. Unfortunately, the video of this talk is lost. Additional materials can be found in this GitHub repo.
Resilience In Complex Adaptive Systems
These materials are not directly related to testing distributed systems, but they greatly contribute to general understanding of such systems.
- Velocity NY 2013: Richard Cook, “Resilience In Complex Adaptive Systems”
- Velocity 2012: Richard Cook, “How Complex Systems Fail”
- How Complex Systems Fail
Jepsen
State-of-the-art approach to testing stateful distributed systems.
- Jepsen Analyses — most recent Jepsen analyses of different distributed systems
- Jepsen Talks — talks by Kyle Kingsbury at various conferences
- Aphyr’s Jepsen posts — older Jepsen analyses on Kyle Kingsbury’s (Aphyr) personal site
- Jepsen Talks on GitHub — slides from Jepsen talks given before 2015
- Kyle Kingsbury on InfoQ
- Call me maybe: Jepsen and flaky networks — talk on Jepsen, not by Kyle
- Jepsen is used by Microsoft CosmosDB — the founder of Azure CosmosDB confirms that they are using Jepsen
Some notable Jepsen analyses:
- Jepsen: CockroachDB beta-20160829
- Jepsen: VoltDB 6.3
- Jepsen: RethinkDB 2.2.3 reconfiguration
- Jepsen: RethinkDB 2.1.5
Jepsen is used by CockroachDB, VoltDB, Cassandra, ScyllaDB and others.
Formal Methods
- Comparisons of Alloy and Spin
- Verdi: Formally Verifying Distributed Systems
- Verdi — A framework for formally verifying distributed systems implementations in Coq
- Network Semantics for Verifying Distributed Systems
- Proving that Android’s, Java’s and Python’s sorting algorithm is broken (and showing how to fix it) — using formal verification to find a bug in TimSort sorting algorithm
- Proving JDK’s Dual Pivot Quicksort Correct — analyzing the quicksort implementation in Java
- The verification of a distributed system by Caitie McCaffrey; see also the podcast and talk on InfoQ.com, the accompanying materials on GitHub, and a slide deck
See also section on Amazon Web Services.
Lineage-driven Fault Injection
Netflix adopted lineage-driven fault injection techniques for testing microservices.
Chaos Engineering
- Principles of Chaos Engineering
- Free Chaos Engineering book by Netflix engineers
- A curated list of awesome Chaos Engineering resources
Netflix pioneered the chaos engineering discipline.
Fuzzing
- Fuzzing Raft for Fun and Publication
- DNS parser, meet Go fuzzer
- Fuzz Testing with afl-fuzz (American Fuzzy Lop)
- Randomized testing for Go, and a talk on this tool: GopherCon 2015: Dmitry Vyukov — Go Dynamic Tools
- Simple guided fuzzing for libraries using LLVM’s new libFuzzer
- LibFuzzer – a library for coverage-guided fuzz testing
- How Heartbleed could’ve been found — example of how fuzzing could have been used to find the famous Heartbleed vulnerability
- Combining AFL and QuickCheck for Directed Fuzzing by Dan Luu
Game Days
Performance and Benchmarking
- Your Load Generator Is Probably Lying To You
- Everything You Know About Latency Is Wrong — great overview of Gil Tene’s “How NOT to Measure Latency” talk
- “How NOT to Measure Latency” by Gil Tene
- “Benchmarking: You’re Doing It Wrong” by Aysylu Greenberg
See also benchmarking tools.
Misc
Specific approaches in different distributed systems
Amazon Web Services
- The Evolution of Testing Methodology at AWS: From Status Quo to Formal Methods with TLA+
- Use of Formal Methods at Amazon Web Services
- CACM Article “How Amazon Web Services Uses Formal Methods”
- Experience of software engineers using TLA+, PlusCal and TLC
- Debugging Designs by Chris Newcombe; there is also a source bundle
Netflix
Automated failure injection (see also Lineage-driven Fault Injection):
- Monkeys in Lab Coats: Applying Failure Testing Research @Netflix
- “Monkeys in Lab Coats”: Applied Failure Testing Research at Netflix
- Automated Failure Testing
- Automating Failure Testing Research at Internet Scale by P. Alvaro et al.
Random/manual failure injection testing:
- Netflix Simian Army
- Failure Injection Testing
- From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform
- Breaking Bad at Netflix: Building Failure as a Service
- GTAC 2014: I Don’t Test Often … But When I Do, I Test in Production — Netflix’s different testing strategies
See also Chaos Engineering.
Datastax (Cassandra)
- Testing Apache Cassandra with Jepsen
- Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen
- Jepsen Cassandra Testing on Git
- Netflix A STATE OF XEN — CHAOS MONKEY & CASSANDRA from Cassandra Summit 2015
- Testing Apache Cassandra with Jepsen: How to Understand and Produce Safe Distributed Systems by Joel Knighton presented at Devoxx UK 2016
ScyllaDB
ScyllaDB published a series of blog posts on testing:
- Scylla testing part 1: Cassandra compatibility testing
- Scylla testing part 2: Extending Jepsen for testing Scylla
- CharybdeFS: a new fault-injecting filesystem for software testing
- Testing part 4: Distributed tests
- Testing part 5: Longevity testing
- Fault-injecting filesystem cookbook Video from Scylla Summit 2017 on testing
- How We Constantly Try to Bring Scylla to its Knees and slides — overview of different testing types at ScyllaDB
VoltDB
Series of posts on testing at VoltDB:
- How We Test at VoltDB
- Testing at VoltDB: SQLCoverage — describes how they test SQL query functionality using 5 million queries generated from templates, comparing the results against HSQLDB
- Testing VoltDB Against PostgreSQL
- VoltDB 6.4 Passes Official Jepsen Testing — VoltDB hired Kyle Kingsbury (Jepsen) to test their database; they share the results in this post
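The template-based comparison described for SQLCoverage can be illustrated with a minimal sketch. This is not VoltDB's tooling: it assumes Python's built-in sqlite3 as the system under test and a hand-written Python evaluator as the reference (where the post compares against HSQLDB). The table, the template, and all function names are hypothetical.

```python
import itertools
import sqlite3

# Hypothetical fixture: one small table shared by both implementations.
ROWS = [(1, 10), (2, 20), (3, 20), (4, 35)]

# Hypothetical query template; placeholders are filled from small pools.
TEMPLATE = "SELECT id FROM t WHERE v {op} {value} ORDER BY id"
OPS = {"<": lambda a, b: a < b, "=": lambda a, b: a == b, ">": lambda a, b: a > b}
VALUES = [5, 20, 35]


def generate_queries():
    """Expand the template into every (operator, value) combination."""
    for op, value in itertools.product(OPS, VALUES):
        yield op, value, TEMPLATE.format(op=op, value=value)


def reference_result(op, value):
    """Reference implementation: evaluate the predicate in plain Python."""
    return sorted(row_id for row_id, v in ROWS if OPS[op](v, value))


def run_differential_test():
    """Run each generated query on SQLite and collect any disagreements."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", ROWS)
    mismatches = []
    for op, value, sql in generate_queries():
        actual = [row[0] for row in conn.execute(sql)]
        if actual != reference_result(op, value):
            mismatches.append(sql)
    conn.close()
    return mismatches
```

Calling `run_differential_test()` returns the list of queries on which the two implementations disagree; an empty list means they agree on every generated query. Scaling the idea up is mostly a matter of adding templates and value pools.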
Additional resources:
- “All In With Determinism for Performance and Testing in Distributed Systems” by John Hugg and a slide deck Hugg-DeterministicDistributedSystems.pdf
- SelfCheck workload
- TPC-C implementation
MemSQL
- Running MemSQL’s 107 Node Test Infrastructure on CoreOS
- Practical Techniques to Achieve Quality in Large Software Projects
- How to Make a Believable Benchmark
- Building an Infinitely Scalable Testing System — description of internal test system PsyDuck
CockroachLabs (CockroachDB)
- DIY Jepsen Testing CockroachDB — great read about using Jepsen at Cockroach Labs
- CockroachDB Beta Passes Jepsen Testing — CockroachDB tested by Kyle Kingsbury (Jepsen.io)
PingCap (TiDB)
- Use Chaos to test the distributed system linearizability — describes a Jepsen-like framework implemented in Go and used at PingCap to test TiDB
- A test framework for linearizability check with Go — Chaos is a Jepsen-like framework written in Go
- Testing Distributed Systems for Linearizability — linearizability testing library used by the Chaos framework
- Chaos Tools and Techniques for Testing the TiDB Distributed NewSQL Database and the same post on company blog
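As a rough illustration of what a linearizability check decides, here is a brute-force sketch for a single register. It is not the algorithm used by Chaos, Jepsen, or any library listed here; the `Op` type and function names are hypothetical. It tries every ordering of the history, so it is only usable on tiny histories, whereas real checkers use far more efficient search.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Optional


@dataclass(frozen=True)
class Op:
    start: int             # invocation time
    end: int               # completion time
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned


def respects_real_time(order):
    """An op that completed before another was invoked must be ordered first."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.end < earlier.start:
                return False
    return True


def is_valid_register_run(order, initial=None):
    """Replay the ops sequentially; each read must see the latest write."""
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearizable(history):
    """True if some ordering satisfies both real-time and register semantics."""
    return any(
        respects_real_time(order) and is_valid_register_run(order)
        for order in permutations(history)
    )
```

For example, a read that returns an old value after a newer write has already completed is rejected, while the same read overlapping that write in time is accepted.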
MongoDB
- MongoDB’s JavaScript Fuzzer: Creating Chaos (1/2)
- MongoDB’s JavaScript Fuzzer: Harnessing the Havoc (2/2)
Cloudera
- Quality Assurance at Cloudera: Fault Injection and Elastic Partitioning — Cloudera describes their approach to fault injection testing
- Quality Assurance at Cloudera: Highly-Controlled Disk Injection
FoundationDB
Sendence
There is one talk by Sean T. Allen on testing a stream processing system at Sendence:
- Materials on Sean’s blog “CodeMeshIO: How Did I Get Here?”
- Video from QCon NY 2016 on InfoQ
- Video from CodeMeshIO on YouTube
- Presentation on Speakerdeck
Google
- Efficient Exploratory Testing of Concurrent Systems — the authors don’t mention it, but the paper appears to describe testing of Google Omega
- Exploratory Testing Architecture (ETA)
- Paxos Made Live — An Engineering Perspective has a section on testing
- 10 Years of Crashing Google describes some war stories from the Disaster Recovery Testing (DiRT) team at Google
- Testing for Reliability chapter from Google Site Reliability Engineering book
Microsoft
- Uncovering Bugs in Distributed Storage Systems during Testing (not in Production!)
- Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency describes the “Pressure Point Testing” approach used for Azure Cloud Storage
- Inside Azure Search: Chaos Engineering
Dropbox
- Mysteries of Dropbox: Property-Based Testing of a Distributed Synchronization Service — example of how to use QuickCheck to test synchronisation in Dropbox and similar tools (Google Drive)
Atomix Copycat
Onyx
LinkedIn
- Simoorg Failure inducer framework — Failure inducer implemented in Python
- A Deep Dive into Simoorg
- Dynamometer: Scale Testing HDFS on Minimal Hardware with Maximum Fidelity — testing scalability of large Hadoop clusters (namely the NameNode) with just a fraction of the nodes
Druid.io
Salesforce
SQLite
SQLite is not a distributed system by any stretch of the imagination, but it provides a good example of comprehensive testing of a database implementation.
- Finding bugs in SQLite, the easy way — how fuzzing is used to test the SQLite database
- How SQLite Is Tested
InfluxDB
Shopify
- Resiliency Testing with Toxiproxy
- Toxiproxy — A TCP proxy to simulate network and system conditions for chaos and resiliency testing
Confluent (Kafka)
Elastic (Elastic Search)
- Growing a protocol — applying lineage-driven fault injection to test the Elastic Search replication protocol
Tools
Network Simulation
- Comcast - Simulating shitty network connections so you can build better systems
- Muxy - Simulating real-world distributed system failures
- Namazu — Programmable fuzzy scheduler for testing distributed systems
- Toxiproxy — A TCP proxy to simulate network and system conditions for chaos and resiliency testing
- Traffic Control
- Python API for Linux Traffic Control
- Slow tool
- Blockade is a utility for testing network failures and partitions in distributed applications
QuickCheck
- PolyConf 14: Testing the Hard Stuff and Staying Sane / John Hughes
- The Joy of Testing
- John Hughes on InfoQ
- Hansei: Property-based Development of Concurrent Systems
- QuickChecking Poolboy for Fun and Profit — from Basho
- Combining Fault-Injection with Property-Based Testing
- Testing Telecoms Software with Quviq QuickCheck
- Fuzz testing distributed systems with QuickCheck — using QuickCheck to test Raft protocol implementation in Haskell
- Modeling Eventual Consistency Databases with QuickCheck — testing Riak eventual consistency guarantees with QuickCheck
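For readers new to the technique, the core loop of property-based testing can be sketched with the standard library alone. This is not QuickCheck and has no shrinking. It assumes a toy, hypothetical replica state (a grow-only counter stored as a dict from node name to count) and checks the algebraic properties that let replicas converge regardless of merge order.

```python
import random


def merge(a, b):
    """Merge two toy counter states by taking the per-node maximum."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def random_state(rng, nodes=("n1", "n2", "n3")):
    """Generate a random replica state; each node is present with probability 0.7."""
    return {node: rng.randint(0, 5) for node in nodes if rng.random() < 0.7}


def check_merge_properties(trials=500, seed=0):
    """Check commutativity, associativity and idempotence on random inputs."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        a, b, c = (random_state(rng) for _ in range(3))
        assert merge(a, b) == merge(b, a), ("not commutative", a, b)
        assert merge(merge(a, b), c) == merge(a, merge(b, c)), ("not associative", a, b, c)
        assert merge(a, a) == a, ("not idempotent", a)
    return trials
```

A failing assertion carries the generated inputs, which is the hand-rolled counterpart of the counterexample a QuickCheck-style tool would report and then shrink.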
Benchmarking
- OLTP-Bench: An Extensible Testbed for Benchmarking Relational Databases
- OLTP Benchmark Wiki
- OLTP Benchmark on Github
- Py-TPCC
- Netflix Data Benchmark: Benchmarking Cloud Data Stores