This is a talk on the paper “Understanding and Detecting Software Upgrade Failures in Distributed Systems”, given for the distributed systems reading group led by Aleksey Charapko.
“Understanding and Detecting Software Upgrade Failures in Distributed Systems” by Yongle Zhang, Junwen Yang, Zhuqi Jin, Utsav Sethi, Kirk Rodrigues, Shan Lu, Ding Yuan. Presented at SOSP 2021.
# Paper Abstract
Upgrade is one of the most disruptive yet unavoidable maintenance tasks that undermine the availability of distributed systems. Any failure during an upgrade is catastrophic, as it further extends the service disruption caused by the upgrade. The increasing adoption of continuous deployment further increases the frequency and burden of the upgrade task. In practice, upgrade failures have caused many of today’s high-profile cloud outages. Unfortunately, there has been little understanding of their characteristics.
This paper presents an in-depth study of 123 real-world upgrade failures that were previously reported by users in 8 widely used distributed systems, shedding light on the severity, root causes, exposing conditions, and fix strategies of upgrade failures. Guided by our study, we have designed a testing framework DUPTester that revealed 20 previously unknown upgrade failures in 4 distributed systems, and applied a series of static checkers DUPChecker that discovered over 800 cross-version data-format incompatibilities that can lead to upgrade failures. DUPChecker has been requested by HBase developers to be integrated into their toolchain.
Download slides (PDF)
- “Understanding and Detecting Software Upgrade Failures in Distributed Systems” paper
- Video from SOSP 2021
- Reference repository for the paper
- DUPTester tool code
- DUPChecker tool code
- “Simple Testing Can Prevent Most Critical Failures” paper
- Curated list of resources on testing distributed systems