A list of resources on testing distributed systems, curated by Andrey Satarin (@asatarin). If you are interested in my other stuff, check out the talks page. For any questions or suggestions, you can reach out to me on Twitter (@asatarin), Mastodon (https://discuss.systems/@asatarin) or LinkedIn.

# Overview of Testing Approaches

# Research Papers

# Bugs

# Testing

# Fault Tolerance

# Resilience In Complex Adaptive Systems

These materials are not directly related to testing distributed systems, but they greatly contribute to general understanding of such systems.

# Jepsen

State-of-the-art approach to testing stateful distributed systems.

Elle, a transactional consistency checker for black-box databases:

Some notable Jepsen analyses:

Jepsen is used by CockroachDB, VoltDB, Cassandra, ScyllaDB, YDB and others.

# Formal Methods

# TLA+

Companies using TLA+ to verify correctness of algorithms:

# Deterministic Simulation

Pioneered by FoundationDB, the deterministic simulation approach to testing distributed systems has gained popularity in recent years.
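To make the idea concrete, here is a minimal sketch of deterministic simulation. It is not FoundationDB's implementation. The `Simulation` class, the `run_scenario` helper and the toy last-write-wins register are all hypothetical names made up for illustration. The point it tries to show is that when every source of nondeterminism (here, message ordering and message loss) is drawn from a single seeded random number generator, a failing run can be replayed exactly from its seed.

```python
import random


class Simulation:
    """Toy single-threaded simulator: all nondeterminism comes from one seeded RNG."""

    def __init__(self, seed, drop_probability=0.1):
        self.rng = random.Random(seed)
        self.drop_probability = drop_probability
        self.in_flight = []  # messages sent but not yet delivered
        self.trace = []      # ordered log of every scheduling decision

    def send(self, dst, payload):
        self.in_flight.append((dst, payload))

    def run(self, nodes):
        while self.in_flight:
            # The seeded RNG decides which in-flight message is handled next.
            index = self.rng.randrange(len(self.in_flight))
            dst, payload = self.in_flight.pop(index)
            # The same RNG decides whether the simulated network drops it.
            if self.rng.random() < self.drop_probability:
                self.trace.append(("drop", dst, payload))
                continue
            self.trace.append(("deliver", dst, payload))
            nodes[dst](self, payload)
        return self.trace


def run_scenario(seed):
    """Send five writes to a hypothetical last-write-wins register."""
    register = {"value": None}

    def register_node(sim, payload):
        register["value"] = payload  # the last delivered write wins

    sim = Simulation(seed)
    for value in range(5):
        sim.send("register", value)
    trace = sim.run({"register": register_node})
    return trace, register["value"]


# Replaying a seed reproduces the same trace and the same final state.
assert run_scenario(7) == run_scenario(7)
```

Different seeds explore different delivery orders and drops, so a test harness can sweep over many seeds and report only the seed of a failing run. Real systems also have to route clocks, disk I/O and thread scheduling through the simulator, which this sketch leaves out.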

More companies and systems are adopting deterministic simulation as a primary testing strategy:

See also autonomous testing, FoundationDB.

# Autonomous Testing

This approach is currently represented by Antithesis, pioneers in autonomous testing who are defining the space and the state of the art. Will Wilson (of FoundationDB fame) is one of the founders.

See also deterministic simulation, FoundationDB and fuzzing.

# Lineage-driven Fault Injection

Netflix adopted lineage-driven fault injection techniques for testing microservices.

# Chaos Engineering

Netflix pioneered the discipline of chaos engineering.

# Fuzzing

There are two flavors of fuzzing. First, randomized concurrency testing, where the ordering of messages is fuzzed:
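As a rough illustration of fuzzing the ordering, the sketch below randomly interleaves the steps of two clients that each read a shared counter and then write back the value plus one. The scenario and the function names (`random_interleaving`, `run_schedule`, `find_lost_update`) are assumptions made up for this example and are not taken from any tool listed here. Each client's own steps stay in order; only the interleaving between clients is randomized.

```python
import random


def random_interleaving(rng, clients):
    """Merge per-client step lists in random order, keeping each client's own order."""
    queues = [list(steps) for steps in clients]
    schedule = []
    while any(queues):
        queue = rng.choice([q for q in queues if q])
        schedule.append(queue.pop(0))
    return schedule


def run_schedule(schedule):
    """Apply read/write steps to a shared counter and return its final value."""
    counter = 0
    last_read = {}
    for client, op in schedule:
        if op == "read":
            last_read[client] = counter
        else:  # "write": store the previously read value plus one
            counter = last_read[client] + 1
    return counter


def find_lost_update(max_seeds=1000):
    """Try seeds until some interleaving violates the invariant 'counter == 2'."""
    clients = [
        [("a", "read"), ("a", "write")],
        [("b", "read"), ("b", "write")],
    ]
    for seed in range(max_seeds):
        schedule = random_interleaving(random.Random(seed), clients)
        if run_schedule(schedule) != 2:
            return seed, schedule
    return None


print(find_lost_update())
```

Any schedule in which both reads happen before the first write loses an update, so the search returns a seed together with the offending schedule. Because the schedule is derived from the seed alone, the failure can be replayed by reusing that seed.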

Second, input fuzzing, where message contents or user inputs are fuzzed:
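A minimal sketch of fuzzing message contents, under stated assumptions: `decode_frame` is a hypothetical decoder for a made-up wire format (one length byte followed by that many payload bytes) with a deliberately planted bug, and `fuzz` is a naive random-input harness written for this example. Production fuzzers are typically coverage-guided and considerably more capable than this.

```python
import random


def decode_frame(data: bytes) -> bytes:
    """Decode a hypothetical frame: one length byte, then that many payload bytes."""
    length = data[0]  # deliberate bug: raises IndexError on empty input
    payload = data[1:1 + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return payload


def decode_frame_fixed(data: bytes) -> bytes:
    """Same decoder, but empty input is rejected with the documented error."""
    if not data:
        raise ValueError("empty frame")
    return decode_frame(data)


def fuzz(decoder, seed, iterations=1000, max_len=8):
    """Feed random byte strings to the decoder.

    Returns the first input that triggers an unexpected exception, or None.
    """
    rng = random.Random(seed)
    for _ in range(iterations):
        size = rng.randrange(max_len + 1)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            decoder(data)
        except ValueError:
            pass  # malformed input rejected in the documented way
        except Exception:
            return data  # anything else is treated as a bug
    return None


print(fuzz(decode_frame, seed=0))        # the crashing input
print(fuzz(decode_frame_fixed, seed=0))  # None: no unexpected exception found
```

The harness encodes a simple property: the decoder may reject input with `ValueError`, but it must never fail in any other way. Seeding the generator keeps a discovered failure reproducible.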

See also autonomous testing.

# Microservices

An amazing and comprehensive overview of different strategies for testing systems built with microservices, by Cindy Sridharan.

A series of blog posts specifically on testing in production (best practices, pitfalls, etc.):

# Performance and Benchmarking

See also benchmarking tools.

# Misc

# Testing in a Distributed World

A great overview of techniques for testing distributed systems from a practitioner. The video has aged well and is still an excellent overview of the landscape. Additional materials can be found in this GitHub repo.

# Game Days

# Technologies for Testing Distributed Systems

Colin Scott shares his viewpoint from academia on testing distributed systems, specifically regression testing for correctness and performance bugs.

# Test Case Reduction

# Specific Approaches in Different Distributed Systems

# Google

# Amazon Web Services

See also formal methods and deterministic simulation sections.

# Netflix

Automated failure injection (see also Lineage-driven Fault Injection):

Random/manual failure injection testing:

See also chaos engineering and lineage-driven fault injection.

# Microsoft

See also formal methods section.

# Meta

# FoundationDB

See also deterministic simulation and autonomous testing.

# Cassandra

# ScyllaDB

They published a series of blog posts on testing ScyllaDB:

# Dropbox

# Elastic (Elasticsearch)

See also formal methods section.

# MongoDB

See also formal methods section.

# Confluent (Kafka)

See also formal methods section.

# CockroachLabs (CockroachDB)

# SingleStore

Formerly known as MemSQL.

# Twitter

# LinkedIn

# Salesforce

# VoltDB

A series of posts on testing at VoltDB:

Additional resources:

# PingCap (TiDB)

See also formal methods section.

# Cloudera

# Wallaroo Labs

There is also a talk from Sean T. Allen on testing the stream processing system at Wallaroo Labs (formerly Sendence).

# YugabyteDB

# FaunaDB

# Shopify

# Hazelcast

# Basho (Riak)

# Etcd

# Red Planet Labs

See also deterministic simulation section.

# Atomix Copycat

# Druid.io

# TigerBeetle

See also deterministic simulation section.

# Convex

See also QuickCheck, FoundationDB, Dropbox, Jepsen, deterministic simulation.

# RisingWave

In a series of two blog posts, the RisingWave team talks about their experience using deterministic simulation for testing a distributed SQL-based stream processing platform:

As a result of this work, they open sourced MadSim — Magical Deterministic Simulator for the Rust language ecosystem.

See also deterministic simulation section.

# YDB

See also Jepsen.

# Single Node Systems

These examples are not about distributed systems, but they demonstrate concurrency testing and the level of sophistication required in distributed systems.

# Concurrency

Testing concurrent code is one of the challenges in single-node as well as distributed systems. These tools help to test both lock-based and lock-free concurrent code on various platforms.

# JCStress

# LinCheck

# Other

# SQLite

SQLite is not a distributed system by any stretch of the imagination, but it provides a good example of comprehensive testing of a database implementation.

# Sled

See also deterministic simulation section.

# Clickhouse

# Tools

# Network Simulation

# QuickCheck

# Benchmarking

# Linkbench

# YCSB