Testing Distributed Systems

List of resources on testing distributed systems curated by Andrey Satarin. If you are interested in my other stuff, check out public talks. For any questions or suggestions you can reach out to me on Twitter, Bluesky @asatarin.bsky.social or other platforms.

Table of Contents

Overview of Testing Approaches
Specific Approaches in Different Distributed Systems
- Google
- Amazon Web Services
- Netflix
- Microsoft
- Meta
- FoundationDB
- Cassandra
- ScyllaDB
- Dropbox
- Elastic (Elasticsearch)
- MongoDB
- Confluent (Kafka)
- CockroachLabs (CockroachDB)
- SingleStore
- Twitter
- LinkedIn
- Salesforce
- VoltDB
- PingCap (TiDB)
- Cloudera
- Wallaroo Labs
- YugabyteDB
- FaunaDB
- Shopify
- Hazelcast
- Basho (Riak)
- Etcd
- Red Planet Labs
- Atomix Copycat
- Druid.io
- TigerBeetle
- Convex
- RisingWave
- YDB
- Feldera
- Datadog
- Polar Signals
Single Node Systems
- Concurrency
  - JCStress
  - LinCheck
  - Other
- SQLite
- Sled
- Clickhouse
- MariaDB
Tools

# Overview of Testing Approaches

# Research Papers

# Bugs

What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems — study of actual bugs in different popular distributed systems (Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper and Flume)
TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems — comprehensive taxonomy of bugs in distributed systems (Cassandra, Hadoop MapReduce, HBase, ZooKeeper)
An Empirical Study on Crash Recovery Bugs in Large-Scale Distributed Systems — based on the bug database from the “What Bugs Live in the Cloud?” paper, the researchers focus specifically on crash recovery bugs in Hadoop MapReduce, HBase, Cassandra, and ZooKeeper. There is a review of this paper by Murat Demirbas on his blog.
An empirical study on the correctness of formally verified distributed systems — study of bugs in formally verified distributed systems. The analysis includes Microsoft’s IronFleet distributed key-value store built from a formal model.
What bugs cause cloud production incidents? — research focused on bugs (and their resolution strategies) that actually cause production incidents in large-scale distributed services at Microsoft Azure.

# Testing

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems — great overview of how even simple testing can help a lot; you just need the right focus
Early detection of configuration errors to reduce failure damage — why and how to test configuration files of your system
Why Is Random Testing Effective for Partition Tolerance Bugs? — just what it says in the title: the authors try to explain why random testing (Jepsen) is effective and introduce notions of test coverage relating to network partitions. See also “The Morning Paper” review or the slide deck
FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems — novel approach of systematically exploring interleavings in distributed systems augmented with static analysis and prioritization. This approach is faster than previous techniques and found old and new bugs in several systems (Cassandra, Ethereum Blockchain, Hadoop, Kudu, Raft LogCabin, Spark, ZooKeeper).
Torturing Databases for Fun and Profit — checking ACID guarantees of open source and commercial databases under power loss, additional material
Understanding and Detecting Software Upgrade Failures in Distributed Systems — the paper presents the first study of upgrade failures in distributed systems (Cassandra, HBase, Kafka, Mesos, YARN, ZooKeeper, etc). The authors look at the severity, symptoms, causes and triggers of these failures and summarize the results in a number of findings. They propose two new tools that improve testing by targeting upgrade failures specifically, and apply those tools to a few systems with good results (new bugs and potential bugs found). I gave an overview talk on the paper in September 2022.

# Fault Tolerance

Redundancy does not imply fault tolerance: analysis of distributed storage reactions to single errors and corruptions — study of several distributed systems (Redis, ZooKeeper, MongoDB, Cassandra, Kafka, RethinkDB) on how fault-tolerant they are to data corruption and read/write errors
The Case for Limping-Hardware Tolerant Clouds — research on the effect of limping hardware on the performance of distributed systems (aka limplock). See also a great blog post by Dan Luu on a similar topic, Distributed systems: when limping hardware is worse than dead hardware
Toward a Generic Fault Tolerance Technique for Partial Network Partitioning — overview of network partition failures in various distributed systems (MongoDB, HBase, HDFS, Kafka, RabbitMQ, Elasticsearch, Mesos, etc), common traits among them and strategies to mitigate those failures.
Understanding, Detecting and Localizing Partial Failures in Large System Software — what happens if your system loses some functionality due to a failure, as opposed to a full fail-stop? The authors study how these partial failures manifest in distributed systems (ZooKeeper, Cassandra, HDFS, Mesos) and what triggers them. They propose a runtime approach to detect those failures with mimic-style intrinsic watchdogs and show how these watchdogs could be generated automatically. They managed to reproduce 20 out of 22 real-world partial failures and detect them using intrinsic watchdogs, with great code localization and reaction time within a few seconds. See also the overview talk on the paper.

# Resilience In Complex Adaptive Systems

These materials are not directly related to testing distributed systems, but they greatly contribute to general understanding of such systems.

# Jepsen

State-of-the-art approach to testing stateful distributed systems.

Jepsen Analyses — most recent Jepsen analyses of different distributed systems
Jepsen Talks — talks by Kyle Kingsbury at various conferences
Aphyr’s Jepsen posts — older Jepsen analyses on Kyle Kingsbury’s (Aphyr) personal site
Jepsen Talks on GitHub — Jepsen talks slides before 2015 on GitHub
Kyle Kingsbury on InfoQ
Call me maybe: Jepsen and flaky networks — talk on Jepsen, not by Kyle
Jepsen is used by Microsoft CosmosDB — the founder of Azure CosmosDB confirms that they are using Jepsen
Consistency Models — overview of various consistency models for distributed systems with transactional and non-transactional semantics. This page gives a bird’s-eye view of the guarantees distributed systems might provide, with references for a deep dive.
Maelstrom — a workbench for writing toy implementations of distributed systems. Provides tests and a simple I/O protocol for testing simple implementations of distributed systems written in any language. All testing happens on one node; the network is fully simulated.

Elle, a transactional consistency checker for black-box databases:

Elle: Inferring Isolation Anomalies from Experimental Observations — paper on Elle design by Kyle Kingsbury and Peter Alvaro. You might also check out the overview of the paper from Murat Demirbas or The Morning Paper blog
Elle source code
Black-box Isolation Checking with Elle — talk Kyle gave at CMU DB database seminar describing Elle and results obtained with it
Elle: Finding Isolation Violations in Real-World Databases — keynote by Kyle Kingsbury on Elle at PODC 2021
Elle: Opaque-box Serializability Verification — talk by Kyle Kingsbury and Peter Alvaro on Elle at VLDB 2021

Some notable Jepsen analyses:

Jepsen is used by CockroachDB, VoltDB, Cassandra, ScyllaDB, YDB, MariaDB and others.

# Formal Methods

The verification of a distributed system by Caitie McCaffrey; see also the podcast and talk on InfoQ.com, the accompanying materials on GitHub, and a slide deck
Comparisons of Alloy and Spin
Verdi — A framework for formally verifying distributed systems implementations in Coq
Network Semantics for Verifying Distributed Systems
Proving that Android’s, Java’s and Python’s sorting algorithm is broken (and showing how to fix it) — using formal verification to find a bug in the TimSort sorting algorithm
Proving JDK’s Dual Pivot Quicksort Correct — analyzing the quicksort implementation in Java
Formal Modeling and Analysis of Distributed Systems by Ankush Desai
Gain confidence in system correctness using formal and semi-formal methods by Ankush Desai, presented at BugBash

# TLA+

Designing Distributed Systems in TLA+ by Hillel Wayne, and talk Everything about distributed systems is terrible
Designing distributed systems with TLA+ by Hillel Wayne at Hydra Conference 2020
Distributed systems showdown — TLA+ vs real code by Jack Vanlightly at Hydra Conference 2021. Jack compares two approaches to testing distributed systems — formal verification of the design with TLA+ and testing with Maelstrom / Jepsen — and weighs their pros and cons.
“Workshop: TLA+ in action” by Markus Kuppe in four parts 1, 2, 3, 4 at Hydra Conference 2021
TLA+ Conference is a forum to present case studies, tools, and techniques using TLA+

Companies using TLA+ to verify correctness of algorithms:

Amazon Web Services
PingCap for TiDB
Elastic
MongoDB
CockroachLabs
Microsoft for services in Azure cloud
Confluent for Apache Kafka

# Deterministic Simulation

Pioneered by FoundationDB, the deterministic simulation approach to testing distributed systems has gained popularity in recent years.

“Simulation Testing” by Michael Nygard gives a good introduction to simulation testing
Designing Dope Distributed Systems for Outer Space with High-Fidelity Simulation — talk about using deterministic simulation to test a distributed space telescope, with recommendations on how to move file IO, network, and scheduling out of your program to make it more amenable to simulation.
What’s the big deal about Deterministic Simulation Testing? — Phil Eaton gives an introduction to deterministic simulation testing, discussing basics and challenges.
What if we embraced simulation-driven development? by Pierre Zemb
Deterministic simulation testing - how it works and when to use it — overview of deterministic simulation testing by Antithesis.

More companies and systems adopt deterministic simulation as a primary testing strategy:

FoundationDB
TigerBeetle
Convex
RisingWave
Amazon Web Services uses SimWorld to test Elastic Block Storage control plane
Red Planet Labs
Sled
Polar Signals for FrostDB

Other collections on deterministic simulation testing:

Planet DST by Alex Miller
So, You Want to Learn More About Deterministic Simulation Testing? by Pierre Zemb
Awesome Deterministic Simulation Testing collection by Ivan Yurchenko

# Autonomous Testing

This approach is currently represented by Antithesis — pioneers in autonomous testing, defining the space and the state of the art. Will Wilson (of FoundationDB fame) is one of the founders.

Testing a Single-Node, Single Threaded, Distributed System Written in 1985 by Will Wilson. This is a comprehensive introduction into autonomous testing by using Super Mario Bros. (game) as a testing target. The autonomous testing platform plays the game and achieves remarkable results leveraging simple interface and a straightforward goal. Will does a great job of delivering the talk and it’s fascinating to watch. Copy of the talk video on Vimeo Why Antithesis Works.
Accompanying blog post to the talk above (or vice versa) Is something bugging you? with reasoning behind Antithesis and value proposition and history on FoundationDB. The post introduces Antithesis platform to deliver FoundationDB style deterministic testing with autonomous capabilities to everybody. See comprehensive discussion on Hacker News.
Accelerating developers at MongoDB — case study of using Antithesis at MongoDB
Testing the Ethereum merge — case study of using Antithesis for testing Ethereum
Autonomous Testing and the Future of Software Development — Will Wilson talks about why testing sucks and how to fix it with the new autonomous testing approach, which reduces human involvement in testing. This talk is a precursor to the talks and posts on autonomous testing above.
Torturing Postgres: extreme autonomous testing for distributed architectures — how OrioleDB uses Antithesis to test the database
Chaos Testing Stardog Cluster for Fun and Profit

# Lineage-driven Fault Injection

Netflix adopted lineage-driven fault injection techniques for testing microservices.

# Chaos Engineering

Principles of Chaos Engineering
Free Chaos Engineering book by Netflix engineers
A curated list of awesome Chaos Engineering resources

Netflix pioneered the chaos engineering discipline.

# Fuzzing

There are two flavors of fuzzing. First, randomized concurrency testing, where the ordering of messages is fuzzed:

And input fuzzing, where message contents or user inputs are fuzzed:

DNS parser, meet Go fuzzer
Fuzz Testing with afl-fuzz (American Fuzzy Lop)
Randomized testing for Go and talk on this tool GopherCon 2015: Dmitry Vyukov — Go Dynamic Tools
Simple guided fuzzing for libraries using LLVM's new libFuzzer
LibFuzzer – a library for coverage-guided fuzz testing
How Heartbleed could’ve been found — example of how fuzzing could have been used to find the famous Heartbleed vulnerability
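
To make the input-fuzzing flavor concrete, here is a minimal sketch of a blind mutation fuzzer. It is an illustration only: the target `parse_length_prefixed` is a hypothetical toy parser with a deliberately planted bug, and the loop is not coverage-guided the way afl-fuzz or libFuzzer are. A fixed seed keeps every run reproducible.

```python
import random


def parse_length_prefixed(data: bytes) -> bytes:
    """Hypothetical target: the first byte declares the payload length."""
    if not data:
        raise ValueError("empty input")  # documented, expected rejection
    length = data[0]
    # Planted bug: the declared length is trusted without a bounds check.
    _last = data[length]  # IndexError when length >= len(data)
    return data[1:1 + length]


def mutate(seed_input: bytes, rng: random.Random) -> bytes:
    """Apply one to three random byte-level mutations."""
    data = bytearray(seed_input)
    for _ in range(rng.randint(1, 3)):
        choice = rng.randrange(3)
        if choice == 0 and data:
            data[rng.randrange(len(data))] = rng.randrange(256)  # overwrite a byte
        elif choice == 1:
            data.insert(rng.randint(0, len(data)), rng.randrange(256))  # insert a byte
        elif data:
            del data[rng.randrange(len(data))]  # delete a byte
    return bytes(data)


def fuzz(target, seed_input: bytes, iterations: int = 1000, seed: int = 0):
    """Return (input, exception name) pairs for every unexpected exception."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        candidate = mutate(seed_input, rng)
        try:
            target(candidate)
        except ValueError:
            pass  # the parser rejected the input as documented
        except Exception as exc:  # anything else counts as a crash
            crashes.append((candidate, type(exc).__name__))
    return crashes
```

Calling `fuzz(parse_length_prefixed, b"\x03abc")` collects inputs that trigger the planted `IndexError`; each recorded input can be replayed directly against the parser.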

# Microservices

Amazing and comprehensive overview of different strategies to test systems built with microservices by Cindy Sridharan.

Testing Microservices, the sane way

Series of blog posts specifically on testing in production — best practices, pitfalls, etc:

# Performance and Benchmarking

Your Load Generator Is Probably Lying To You
Everything You Know About Latency Is Wrong — great overview of Gil Tene's “How NOT to Measure Latency” talk
“How NOT to Measure Latency” by Gil Tene
“Benchmarking: You’re Doing It Wrong” by Aysylu Greenberg
Performance Analysis Methodology — approaches developed by Brendan Gregg for analysing performance in systematic fashion

# Misc

Metamorphic Testing — overview of what metamorphic testing is and where it can help. For more details see paper “Metamorphic Testing: A Review of Challenges and Opportunities”.
Testing Distributed Systems for Linearizability — describes linearizability testing tool Porcupine, written in Go.
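
As a rough illustration of what a linearizability check decides, the sketch below brute-forces a history of operations on a single register. The `Op` record and function names are hypothetical, and trying every permutation is only feasible for tiny histories; practical checkers use much more efficient search.

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    kind: str       # "write" or "read"
    value: int      # value written, or value the read returned
    invoke: float   # time the client sent the request
    respond: float  # time the client received the response


def respects_real_time(order) -> bool:
    """An operation that finished before another started must come first."""
    for i, earlier in enumerate(order):
        for later in order[i + 1:]:
            if later.respond < earlier.invoke:
                return False
    return True


def is_valid_register_history(order, initial=0) -> bool:
    """Replay the operations sequentially against a simple register model."""
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearizable(history, initial=0) -> bool:
    """True if some total order is both real-time consistent and valid."""
    return any(
        respects_real_time(order) and is_valid_register_history(order, initial)
        for order in permutations(history)
    )
```

A read that overlaps a write may return either the old or the new value, but a read that starts after the write has completed must return the new one.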

# Testing in a Distributed World

Great overview of techniques for testing distributed systems from a practitioner. The video has aged well and is still an excellent overview of the landscape. Additional materials can be found in this GitHub repo

# Game Days

Sometimes Kill -9 Isn’t Enough

# Technologies for Testing Distributed Systems

Colin Scott shares his viewpoint from academia on testing distributed systems, specifically regression testing for correctness and performance bugs.

Technologies for Testing Distributed Systems, Part I
See also post Distributed Systems Testing: The Lost World by Crista Lopes

# Test Case Reduction

Minimizing Faulty Executions of Distributed Systems — reducing the size of buggy executions to make them easier to understand. 60 minute talk here
Troubleshooting Blackbox SDN Control Software with Minimal Causal Sequences — similar to above, but requires less instrumentation.
Concurrency Debugging with Differential Schedule Projections — find and minimize concurrency bugs using program analysis. Shared memory systems are equivalent to message passing systems, so you can apply the same techniques to distributed systems.
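
One classic way to shrink a failing execution is delta debugging. The sketch below is a simplified, complement-only variant of ddmin, offered as an illustration and not as the algorithm of any paper listed above. The `fails` callback is an assumption: it stands for replaying a subsequence of events and reporting whether the bug still reproduces.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(events: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink events to a sublist where removing any single event hides the bug."""
    events = list(events)
    if not fails(events):
        raise ValueError("the full execution must reproduce the bug")
    granularity = 2  # number of chunks to split the execution into
    while len(events) >= 2:
        chunk_size = -(-len(events) // granularity)  # ceiling division
        chunks = [events[i:i + chunk_size]
                  for i in range(0, len(events), chunk_size)]
        reduced = False
        for i in range(len(chunks)):
            # Try dropping chunk i and replaying everything else.
            complement = [e for j, c in enumerate(chunks) if j != i for e in c]
            if complement and fails(complement):
                events = complement
                granularity = max(granularity - 1, 2)
                reduced = True
                break
        if not reduced:
            if granularity >= len(events):
                break  # already tried removing every single event
            granularity = min(granularity * 2, len(events))
    return events
```

For example, if a bug needs only events 3 and 17 out of 20, `ddmin(list(range(20)), lambda s: 3 in s and 17 in s)` returns `[3, 17]`. In a real distributed system the replay step is the hard part, because dropping events can change which later events are even possible.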

# Specific Approaches in Different Distributed Systems

# Google

Efficient Exploratory Testing of Concurrent Systems — they don’t mention it, but it looks like they describe testing of Google Omega
Exploratory Testing Architecture (ETA)
Paxos Made Live — An Engineering Perspective has a section on testing
10 Years of Crashing Google describes some war stories from the Disaster Recovery Testing (DiRT) team at Google
Testing for Reliability chapter from Google Site Reliability Engineering book
Randomized Testing of Cloud Spanner — overview of randomized testing at Cloud Spanner, including how to scale it to large datasets and high concurrency
How chaos testing adds extra reliability to Spanner’s fault-tolerant design — high level overview of fault injection (chaos) testing in Google Spanner discussing various types of injected faults

# Amazon Web Services

The Evolution of Testing Methodology at AWS: From Status Quo to Formal Methods with TLA+
Use of Formal Methods at Amazon Web Services
CACM Article “How Amazon Web Services Uses Formal Methods”
Debugging Designs by Chris Newcombe; there is also a source bundle
Millions of tiny databases — has a section on testing which describes several approaches: SimWorld simulation resembling the approach used at FoundationDB, use of Jepsen, formal methods, and game days.
Using lightweight formal methods to validate a key-value storage node in Amazon S3 — paper on verifying correctness of a new key-value storage node implementation in S3. They use property-based testing and stateless model checking extensively to balance trade-offs and follow a pragmatic approach. I gave a talk “Formal Methods at Amazon S3” on this paper for a reading group.
Gain confidence in system correctness & resilience with formal methods by Ankush Desai
Fifteen years of formal methods at AWS by Marc Brooker
Proving the correctness of AWS authorization
Systems Correctness Practices at Amazon Web Services

See also formal methods and deterministic simulation sections.

# Netflix

Automated failure injection (see also Lineage-driven Fault Injection):

Random/manual failure injection testing:

Netflix Simian Army
Failure Injection Testing
From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform
Breaking Bad at Netflix: Building Failure as a Service
GTAC 2014: I Don’t Test Often … But When I Do, I Test in Production — Netflix’s different testing strategies

# Microsoft

Asynchronous programming, analysis and testing with state machines — Open source language for building distributed systems. Language is designed with tooling in mind, particularly, automatic exploration of message orderings in order to find bugs.
Uncovering Bugs in Distributed Storage Systems during Testing (not in Production!)
Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency describes the “Pressure Point Testing” approach used for Azure Cloud Storage
Inside Azure Search: Chaos Engineering
TLA+ at Microsoft: 16 Years in Production by David Langworthy — how rejuvenation of TLA+ happened at Microsoft in 2016 and onwards
Formal Methods at Microsoft by Nikolaj Bjørner

See also formal methods section.

# Meta

BellJar: A new framework for testing system recoverability at scale — BellJar is a testing framework focused on answering the question “What service dependencies are required for the service to recover after a large-scale disaster?”. BellJar puts a service in a vacuum environment with only a handful of direct dependencies allow-listed, to verify that recovery procedures succeed under those constraints. It checks those recovery procedures in the CI/CD pipeline, preventing unconstrained growth of the dependency graph and circular dependencies. Based on BellJar tests, one can construct the entire dependency graph of the services, which allows bootstrapping them in the correct order from bottom to top.
Vacuum Testing for Resiliency: Verifying Disaster Recovery in Complex — talk on how BellJar is used at Meta to test recovery of distributed systems
Hermit: Deterministic Linux for Controlled Testing and Software Bug-finding — the first practical deterministic operating system, built as an emulation layer on top of the Linux kernel. Its deterministic execution capability helps with regression and stress testing and allows for systematic diagnostics
https://github.com/facebookexperimental/hermit — code for Hermit
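
The bottom-to-top bootstrap order mentioned for BellJar is a topological ordering of the dependency graph. The sketch below illustrates that idea with Python's standard `graphlib`. The service names and the `bootstrap_order` helper are hypothetical, and this is not BellJar's implementation.

```python
from graphlib import CycleError, TopologicalSorter


def bootstrap_order(requires):
    """Order services so that each one starts after its recovery dependencies.

    `requires` maps a service name to the set of services it needs in order
    to recover (in this sketch, the allow-listed direct dependencies).
    """
    try:
        return list(TopologicalSorter(requires).static_order())
    except CycleError as err:
        # A circular dependency means no valid recovery order exists.
        raise ValueError(f"circular dependency: {err.args[1]}") from err


# Hypothetical services, for illustration only.
services = {
    "web": {"auth", "storage"},
    "auth": {"storage"},
    "storage": set(),
}
order = bootstrap_order(services)
```

Here `order` lists `storage` before `auth` and `auth` before `web`. Running the same check in CI would reject a change that makes `storage` depend on `web`, since that introduces a cycle.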

# FoundationDB

“Testing Distributed Systems w/ Deterministic Simulation” by Will Wilson — talk on FoundationDB simulation testing. Their architecture was built from the ground up to support fully deterministic simulation testing
Simulation and Testing — public overview of FoundationDB simulation testing framework
FoundationDB or: How I Learned to Stop Worrying and Trust the Database by Markus Pilman from Snowflake — updated talk on testing FoundationDB with deterministic simulation. Markus goes into details of what it takes to build deterministic simulation into a database. He mentions that it took two years to build a simulation framework before the FoundationDB team started working on a database.
“Buggify — Testing Distributed Systems with Deterministic Simulation” — Alex Miller, one of the developers at FoundationDB, describes BUGGIFY macros, which help bias simulation tests towards doing dangerous and bug-finding things. This is a good example of cooperation between testing efforts and production code.
“FoundationDB: A Distributed Unbundled Transactional Key Value Store” — SIGMOD 2021 paper on FoundationDB has a very detailed section on simulation testing at FoundationDB with discussions on determinism, test oracles, fault injection and limitations.
“Unlucky Simulation” — talk on using various scheduling strategies (LibFuzzer, random, etc) with simulation testing in FoundationDB
FoundationDB Testing: Past & Present
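
The BUGGIFY idea above can be sketched in a few lines: production code asks a seeded helper whether to take an unusual but legal path, and the helper only ever says yes under simulation. Everything below is an assumption made for illustration (the `Buggify` class, the site name, and both probabilities); it is not FoundationDB's actual macro.

```python
import random


class Buggify:
    """Seeded fault-injection switch that is inert outside simulation."""

    def __init__(self, seed, in_simulation=True,
                 site_enable_prob=0.5, fire_prob=0.25):
        self._rng = random.Random(seed)  # same seed, same decisions
        self._in_simulation = in_simulation
        self._site_enable_prob = site_enable_prob  # hypothetical value
        self._fire_prob = fire_prob                # hypothetical value
        self._site_enabled = {}

    def __call__(self, site: str) -> bool:
        if not self._in_simulation:
            return False  # never misbehave in production
        if site not in self._site_enabled:
            # Decide once per run whether this site is active at all.
            self._site_enabled[site] = self._rng.random() < self._site_enable_prob
        return self._site_enabled[site] and self._rng.random() < self._fire_prob


def send_batch(items, buggify):
    """Production code cooperating with the test harness.

    Under simulation it occasionally sends a tiny batch, which exercises
    the rarely hit partial-send path.
    """
    limit = 1 if buggify("send_batch.tiny_batch") else len(items)
    return items[:limit], items[limit:]
```

Because every decision comes from one seeded generator, a failing run can be replayed by reusing its seed.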

# Cassandra

Testing Apache Cassandra with Jepsen
Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen
Jepsen Cassandra Testing on Git
Netflix A STATE OF XEN — CHAOS MONKEY & CASSANDRA from Cassandra Summit 2015
Testing Apache Cassandra with Jepsen: How to Understand and Produce Safe Distributed Systems by Joel Knighton presented at Devoxx UK 2016
Testing Apache Cassandra 4.0 — quick overview of approaches used to test next major version of Cassandra
Fallout — tool to run distributed tests as a service. It is meant to easily orchestrate cluster creation and testing tools like Jepsen, performance testing tools and others, through extension and by combining them in various ways with environmental conditions. It can run tests either locally or on large-scale clusters.
Cassandra Harry — Fuzz testing / property-based testing tool for Apache Cassandra. Aims to provide reproducible workloads to test correctness of Apache Cassandra.
Fuzz Testing and Verification of Apache Cassandra with “Harry” — talk on Harry fuzz testing tool by Alex Petrov at ApacheCon 2021
Harry, an Open Source Fuzz Testing and Verification Tool for Apache Cassandra by Alex Petrov — blog post about Harry fuzz testing tool for Apache Cassandra and how it helps to find bugs
Garden of Forking Paths — talk by Alex Petrov on property-based testing philosophy and core ideas based on his experience building Cassandra Harry.

# ScyllaDB

ScyllaDB published a series of blog posts on testing:

Scylla testing part 1: Cassandra compatibility testing
Scylla testing part 2: Extending Jepsen for testing Scylla
CharybdeFS: a new fault-injecting filesystem for software testing
Testing part 4: Distributed tests
Testing part 5: Longevity testing
Fault-injecting filesystem cookbook Video from Scylla Summit 2017 on testing
How We Constantly Try to Bring Scylla to its Knees and slides — overview of different testing types at ScyllaDB
Project Gemini: An Open Source Automated Random Testing Suite for Scylla and Cassandra Clusters — random test generator comparing results from a cluster with injected faults against a single node running without faults. Works on top of the CQL API and is suitable for testing any database implementing it. See also the talk on Project Gemini and the open source code
ScyllaDB NoSQL Database Testing — highlights of testing approaches at Scylla

# Dropbox

Mysteries of Dropbox: Property-Based Testing of a Distributed Synchronization Service — example of how to use QuickCheck to test synchronization in Dropbox and similar tools (Google Drive). John Hughes gave a talk on this. See also QuickCheck.
Data Checking at Dropbox — if you have lots of data, you have to verify that it did not suffer from bit rot and protect it against rare bugs (e.g. race conditions) to guarantee long-term durability. This talk explains the intricacies of building data consistency checker(s) at Dropbox scale.
Dropbox’s Exabyte Storage System (aka Magic Pocket) talk by James Cowling — describes a number of strategies to achieve extremely high durability. This includes:
- guard against faulty disks,
- guard against software defects,
- guard against black swan events,
- operational safeguards to reduce blast radius,
- safeguards against deletes with multi stage soft-delete,
- comprehensive testing strategy in-depth with increased scale,
- redundancy across various axes in software and hardware stacks,
- continuous data integrity validation on many levels,
- etc
Testing sync at Dropbox — comprehensive overview of two test frameworks at Dropbox for new sync engine implementation. CanopyCheck — single threaded and fully deterministic randomized testing framework with minimization for synchronization planner component of the engine. The other framework Trinity focuses on concurrency and larger surface area of components. Great discussion on tradeoffs between determinism, strength of test oracles vs width of coverage and size of the system under test.

# Elastic (Elasticsearch)

Growing a protocol — applying lineage driven fault injection to test Elasticsearch replication protocol
Using TLA+ for fun and profit in the development of Elasticsearch by Yannick Welsch — Elasticsearch uses TLA+ to verify correctness of their replication protocol

See also formal methods section.

# MongoDB

MongoDB’s JavaScript Fuzzer: Creating Chaos (1/2)
MongoDB’s JavaScript Fuzzer: Harnessing the Havoc (2/2)
MongoDB’s JavaScript Fuzzer article in ACM Queue
Fixing a MongoDB Replication Protocol Bug with TLA+ by William Schultz — how MongoDB uses formal verification with TLA+ to check correctness of their replication protocol. Describes how replication bugs could have been found with help of formal model.
eXtreme Modelling in Practice — two attempts at MongoDB to check that code conforms to its formal model. Accompanying video eXtreme Modelling in Practice
Formal Verification of a Distributed Dynamic Reconfiguration Protocol — talk on formally verifying MongoDB Raft-based replication reconfiguration protocol with TLAPS. Paper preprint.
Change Point Detection in Software Performance Testing — paper on how MongoDB team automatically detects performance degradations in the presence of noise in continuous integration runs. The paper was presented at ICPE 2020
Conformance Checking at MongoDB: Testing That Our Code Matches Our TLA+ Specs
Design and Modular Verification of Distributed Transactions in MongoDB

See also formal methods section.

# Confluent (Kafka)

Kafka Fault Injection framework
TLA+ specification of the Kafka replication protocol and talk about using TLA+ for hardening Kafka replication protocol

See also formal methods section.

# CockroachLabs (CockroachDB)

DIY Jepsen Testing CockroachDB — great read about using Jepsen at Cockroach Labs
CockroachDB Beta Passes Jepsen Testing — CockroachDB tested by Kyle Kingsbury (Jepsen.io)
Introducing Pebble: A RocksDB Inspired Key-Value Store Written in Go — introduces a new storage engine and includes a thorough discussion of what it takes to properly test a storage engine
ParallelCommits.tla — Formal specification in TLA+ of the parallel commit transaction protocol. See also formal methods.
The importance of being earnestly random: Metamorphic Testing in CockroachDB — blog post talking about metamorphic testing at CockroachLabs to test Pebble storage engine
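
A minimal sketch of the metamorphic idea applied to a storage engine: run the same randomly generated operations under different configurations and require identical visible results, so no separate oracle is needed. `ToyStore` and its `flush_threshold` knob are hypothetical stand-ins, not Pebble's API or CockroachDB's test harness.

```python
import random


class ToyStore:
    """Hypothetical LSM-like store that buffers writes and flushes in batches."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.buffer = {}
        self.flushed = {}

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.flush_threshold:
            self.flushed.update(self.buffer)
            self.buffer.clear()

    def delete(self, key):
        self.put(key, None)  # tombstone

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        return self.flushed.get(key)


def generate_ops(rng, count):
    """Random workload over a small key space to force overwrites."""
    return [(rng.choice(["put", "get", "delete"]),
             rng.randrange(8), rng.randrange(1000)) for _ in range(count)]


def run(ops, flush_threshold):
    """Apply ops and collect everything a client could observe."""
    store = ToyStore(flush_threshold)
    outputs = []
    for kind, key, value in ops:
        if kind == "put":
            store.put(key, value)
        elif kind == "delete":
            store.delete(key)
        else:
            outputs.append(store.get(key))
    return outputs


def metamorphic_check(seed, count=200, thresholds=(1, 2, 5, 64)):
    """The flush threshold is an internal knob, so it must not change results."""
    ops = generate_ops(random.Random(seed), count)
    results = [run(ops, t) for t in thresholds]
    return all(r == results[0] for r in results)
```

A failing seed points at a configuration-dependent bug, and the seed alone is enough to regenerate the workload.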

# SingleStore

Formerly known as MemSQL.

Running SingleStore’s 107 Node Test Infrastructure on CoreOS. See also accompanying talk.
Practical Techniques to Achieve Quality in Large Software Projects
How to Make a Believable Benchmark
Building an Infinitely Scalable Testing System — description of internal test system PsyDuck

# Twitter

# LinkedIn

Simoorg Failure inducer framework — failure inducer implemented in Python
A Deep Dive into Simoorg
Dynamometer: Scale Testing HDFS on Minimal Hardware with Maximum Fidelity — testing scalability of large Hadoop clusters (namely NameNode) with just a fraction of the nodes

# Salesforce

Go Fast and Don’t Break Things: Ensuring Quality in the Cloud

# VoltDB

Series of posts on testing at VoltDB:

How We Test at VoltDB
Testing at VoltDB: SQLCoverage — describes how they test SQL query functionality using 5 million queries generated from templates and comparing results against HSQLDB
Testing VoltDB Against PostgreSQL
VoltDB 6.4 Passes Official Jepsen Testing — VoltDB hired Kyle Kingsbury (Jepsen) to test their database; they share the results in this post
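
The template-and-compare approach described for SQLCoverage can be sketched as follows. This is an illustration under loud assumptions: SQLite stands in for the system under test, a hand-written Python function stands in for the reference database, and the table, template, and constants are made up.

```python
import itertools
import operator
import sqlite3

# Hypothetical fixture data; the NULL row exercises three-valued logic.
ROWS = [(1, 10), (2, 20), (3, 20), (4, None)]
TEMPLATE = "SELECT id FROM t WHERE val {op} {const} ORDER BY id"
OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       ">": operator.gt, ">=": operator.ge}
CONSTS = [5, 10, 20, 25]


def reference(op, const):
    """Reference answer: a comparison with NULL never selects the row."""
    return [id_ for id_, val in ROWS
            if val is not None and OPS[op](val, const)]


def run_sql(conn, op, const):
    """Instantiate the template and run it on the system under test."""
    query = TEMPLATE.format(op=op, const=const)
    return [row[0] for row in conn.execute(query)]


def compare_all():
    """Run every template instantiation and collect disagreements."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", ROWS)
    mismatches = []
    for op, const in itertools.product(OPS, CONSTS):
        expected = reference(op, const)
        actual = run_sql(conn, op, const)
        if expected != actual:
            mismatches.append((op, const, expected, actual))
    conn.close()
    return mismatches
```

An empty result from `compare_all()` means both sides agree on every generated query. Scaling the idea up is mostly a matter of richer templates and a trusted second database as the reference.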

Additional resources:

“All In With Determinism for Performance and Testing in Distributed Systems” by John Hugg and a slide deck Hugg-DeterministicDistributedSystems.pdf

# PingCap (TiDB)

Use Chaos to test the distributed system linearizability — describes a Jepsen-like framework implemented in Go and used at PingCap to test TiDB
A test framework for linearizability check with Go — Chaos is a Jepsen-like framework written in Go, uses Porcupine linearizability checker
Chaos Tools and Techniques for Testing the TiDB Distributed NewSQL Database and the same post on company blog
Official Jepsen report on TiDB 2.1.7 and companion blog post in company blog
Safety First! Common Safety Pitfalls in Distributed Databases Found by Jepsen Tests — overview of Jepsen approach and tests with quick refresher on results for different databases to date
https://github.com/pingcap/tla-plus — formal specification in TLA+ of Raft consensus protocol and implementation of distributed transactions in TiDB
Testing Cloud-Native Databases with Chaos Mesh — talk on Chaos Mesh and how it is used for testing TiDB at PingCap. Blog post with introduction to Chaos Mesh and how it integrates with Kubernetes. See also Chaos Mesh source code and chaos engineering section.

# Cloudera

Quality Assurance at Cloudera: Fault Injection and Elastic Partitioning — Cloudera describes their approach to fault injection testing
Quality Assurance at Cloudera: Highly-Controlled Disk Injection

# Wallaroo Labs

Measuring Correctness of State in a Distributed System — describes the general idea and an implementation of how to test the safety of a distributed stream processing system
Performance testing a low-latency stream processing system — high level overview of what to look at when testing performance of stream processing system
How We Test the Stateful Autoscaling of Our Stream Processing System — advanced safety tests for autoscaling stateful stream processing
All posts on testing from Wallaroo engineering blog

There is also a talk from Sean T. Allen on testing stream processing systems at Wallaroo Labs (ex. Sendence)

# YugabyteDB

Jepsen Testing on YugabyteDB — YugabyteDB describes how they use Jepsen
YugabyteDB 1.1.9 analysis by Kyle Kingsbury — Kyle explores safety of YugabyteDB. Accompanying post in company blog “YugabyteDB 1.2 Passes Jepsen Testing” and “Wrapping Up: Jepsen Test Results for YugabyteDB 1.2 Webinar” post with webinar recording by Kyle and Karthik Ranganathan (Yugabyte CTO).
YugabyteDB 1.3.1 — Jepsen analysis of YugabyteDB support for serializable SQL transactions. Companion blog post on the company website.

# FaunaDB

Verifying Transactional Consistency with Jepsen — results of internal Jepsen testing at FaunaDB
Jepsen: FaunaDB 2.5.4 — official Jepsen test for FaunaDB, write-up in Fauna blog

# Shopify

# Hazelcast

Testing the CP Subsystem with Jepsen — overview of how Jepsen is used to test Hazelcast in-memory data grid CP subsystem

# Basho (Riak)

Testing Eventual Consistency in Riak — how to model an eventually consistent database in QuickCheck and find bugs in its implementation, video available on YouTube
Modeling Eventual Consistency Databases with QuickCheck — another talk on testing Riak eventual consistency guarantees with QuickCheck

# Etcd

Testing distributed systems in Go — overview of failure injection testing for etcd. Or alternative url for the same post.
On the Hunt for Etcd Data Inconsistencies — talk on how Etcd reimplemented Jepsen in Go using their existing test framework as a cluster runner and Porcupine as a linearizability checker

# Red Planet Labs

Where we’re going, we don’t need threads: Simulating Distributed Systems — following FoundationDB steps, Red Planet Labs uses deterministic simulation for testing. Their formula for success is “deterministic simulation = no parallelism + quantized execution + deterministic behavior”.
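
A minimal sketch of that formula, assuming a toy message-passing system: everything runs on one thread (no parallelism), each loop iteration processes exactly one message (quantized execution), and every random choice comes from a single seeded generator (deterministic behavior). The `simulate` function, the node and message model, and the duplicate-message fault are hypothetical and are not taken from the Red Planet Labs implementation.

```python
import random


def simulate(seed, steps=50, n_nodes=3):
    """Run a toy message-passing system; return the trace of events."""
    rng = random.Random(seed)      # the only source of randomness
    clock = 0                      # simulated time, not wall-clock time
    inboxes = {n: [] for n in range(n_nodes)}
    inboxes[0].append(1)           # initial message delivered to node 0
    trace = []
    for _ in range(steps):
        clock += 1                 # one quantum of execution per iteration
        ready = [n for n in range(n_nodes) if inboxes[n]]
        node = rng.choice(ready)   # scheduler decision, driven by the seed
        msg = inboxes[node].pop(0)
        peers = [p for p in range(n_nodes) if p != node]
        target = rng.choice(peers)
        inboxes[target].append(msg + 1)
        duplicated = rng.random() < 0.2   # injected fault: duplicate delivery
        if duplicated:
            inboxes[target].append(msg + 1)
        trace.append((clock, node, target, msg, duplicated))
    return trace
```

Running `simulate` twice with the same seed yields the same trace, so any failure found under a given seed can be replayed exactly.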

See also deterministic simulation section.

# Atomix Copycat

A novel implementation of the Raft consensus algorithm
Jepsen tests for Atomix Copycat — using Jepsen at Atomix

# Druid.io

Architecting Distributed Databases for Failure

# TigerBeetle

Simulation Tests in TigerBeetle — TigerBeetle is a distributed financial accounting database written in the Zig programming language; it uses simulation tests inspired by Dropbox and FoundationDB.
TigerStyle! (Or How To Design Safer Systems in Less Time) by Joran Dirk Greef — talk about TigerStyle, the philosophy behind designing and building distributed systems at TigerBeetle. This style greatly contributed to improving developer productivity, reliability and correctness in TigerBeetle.
A Descent Into the Vᴏ̈ʀᴛᴇx – non-deterministic testing framework to verify real production binaries and supplement any missed coverage from deterministic simulation tests. Feels similar to Jepsen.

See also deterministic simulation section.

# Convex

Convex: Life Without a Backend Team by James Cowling — talks about the architecture and features of Convex. At the end of the talk James covers testing at Convex. They use an approach inspired by QuickCheck and FoundationDB to test end-to-end guarantees with randomized initial state, workload, injected failures and thread interleaving. These tests validate correctness in production, similar to the Dropbox Magic Pocket system on which James worked previously.
Better Testing With Less Code Using Randomization — blog post describing approach Convex uses to develop randomized tests

# RisingWave

In a series of two blog posts, RisingWave team talks about their experience using deterministic simulation for testing distributed SQL-based stream processing platform:

Deterministic Simulation: A New Era of Distributed System Testing
Applying Deterministic Simulation: The RisingWave Story — they talk about a few kinds of tests they built with the simulator (unit, end-to-end, recovery, scaling), and the pros, cons and challenges of this approach.
How Randomized SQL Testing Can Help Detect Bugs?

As a result of this work, they open sourced MadSim — Magical Deterministic Simulator for the Rust language ecosystem.

See also deterministic simulation section.

# YDB

Hardening YDB with Jepsen: Lessons Learned — how Jepsen tests for YDB helped find consistency and other bugs in YDB
jepsen.ydb — code of Jepsen tests for YDB

# Feldera

Correctness at Feldera — overview of correctness approaches used at Feldera, a strongly consistent incremental compute engine, includes:
- machine checked proof of the foundational DBSP algorithm using Lean theorem prover
- differential (shadow) testing of the implementation
- large corpus of tests reused from other SQL systems (MySQL, Postgres, Data Fusion, SQL Logic Tests, etc)
- metamorphic tests with SQLancer
- manually written automatic tests
- fault tolerance, model-based and fuzzing tests for the control plane
Formalization of DBSP — GitHub repository with machine checked proof of the DBSP algorithm using Lean theorem prover
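
The differential (shadow) testing item in the list above can be illustrated with a small sketch: an incremental implementation is driven by random inserts and deletes, and after every change its output is compared with a straightforward full recomputation. Everything here is an assumption for illustration (the `IncrementalGroupSum` class, the grouped-sum view, the `differential_test` helper); it is not Feldera's code or test suite.

```python
import random


class IncrementalGroupSum:
    """Maintains SUM(amount) GROUP BY key under inserts and deletes."""

    def __init__(self):
        self.sums = {}
        self.counts = {}

    def apply(self, key, amount, weight):
        # weight is +1 for an insert and -1 for a delete
        self.sums[key] = self.sums.get(key, 0) + amount * weight
        self.counts[key] = self.counts.get(key, 0) + weight
        if self.counts[key] == 0:  # the group has no rows left
            del self.sums[key]
            del self.counts[key]

    def view(self):
        return dict(self.sums)


def reference(rows):
    """Reference implementation: recompute the view from scratch."""
    result = {}
    for key, amount in rows:
        result[key] = result.get(key, 0) + amount
    return result


def differential_test(seed, n_ops=200):
    """Return True if both implementations agree after every operation."""
    rng = random.Random(seed)
    rows = []
    inc = IncrementalGroupSum()
    for _ in range(n_ops):
        if rows and rng.random() < 0.3:
            key, amount = rows.pop(rng.randrange(len(rows)))
            inc.apply(key, amount, -1)
        else:
            key, amount = rng.choice("abc"), rng.randint(-5, 5)
            rows.append((key, amount))
            inc.apply(key, amount, +1)
        if inc.view() != reference(rows):
            return False
    return True
```

The per-group row count matters: a group whose amounts happen to sum to zero must still appear in the view, and that is the kind of edge case a comparison against a naive reference tends to surface.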

# Datadog

How we use formal modeling, lightweight simulations, and chaos testing to design reliable distributed systems

# Polar Signals

(Mostly) Deterministic Simulation Testing in Go

# Single Node Systems

These examples are not about distributed systems, but they demonstrate concurrency testing and the level of sophistication required in distributed systems.

# Concurrency

Testing concurrent code is one of the challenges in single-node as well as distributed systems. These tools help to test both lock-based and lock-free concurrent code on various platforms.

# JCStress

JCStress — test harness to verify correctness of concurrency support in the JVM, class libraries, and hardware.
Workshop: Java Concurrency Stress (JCStress). Part 1 and Part 2 by Aleksey Shipilëv
JCStress samples showcasing what could be verified with the harness
Java Concurrency Stress Tests presentations by Aleksey Shipilëv on JCStress

# LinCheck

LinCheck — framework for testing concurrent data structures on JVM
How We Test Concurrent Primitives in Kotlin Coroutines from JetBrains blog
Lin-Check: Testing concurrent data structures in Java talk by Nikita Koval
Workshop. Lincheck: Testing concurrency on the JVM (Part 1 and Part 2) by Maria Sokolova

# Other

ThreadSanitizer — data race detection tool for C++
ThreadSanitizer is used under the hood of the Go language race detector

# SQLite

SQLite is not a distributed system by any stretch of the imagination, but provides good example of comprehensive testing of a database implementation.

Finding bugs in SQLite, the easy way — how fuzzing is used in testing the SQLite database
How SQLite Is Tested

# Sled

Sled simulation guide (jepsen-proof engineering) — guide on simulation testing (see FoundationDB) in Sled database
Reliable Systems Series: Model-Based Testing

See also deterministic simulation section.

# Clickhouse

Fuzzing ClickHouse — high level overview of query fuzzing at Clickhouse
Fuzzing Databases is Difficult — discusses the design of BuzzHouse, a new database fuzzer to test ClickHouse
BuzzHouse: Bridging the database fuzzing gap for testing ClickHouse
ClickHouse Testing — documentation on various tests for ClickHouse database and how to contribute more tests

# MariaDB

Isolation level violation testing and debugging in MariaDB

# Tools

# Network Simulation

# QuickCheck

PolyConf 14: Testing the Hard Stuff and Staying Sane / John Hughes
The Joy of Testing
John Hughes on InfoQ
Hansei: Property-based Development of Concurrent Systems
QuickChecking Poolboy for Fun and Profit — from Basho
Combining Fault-Injection with Property-Based Testing
Testing Telecoms Software with Quviq QuickCheck
Fuzz testing distributed systems with QuickCheck — using QuickCheck to test Raft protocol implementation in Haskell

# Benchmarking

# Linkbench

# YCSB