Mitigating the Risks of Metastable Failures in Distributed Systems

Date:

Friday, 28 March, 2025 - 15:30 to 16:30

Speaker:

Aleksey Charapko, University of New Hampshire

Venue:

Computer Lab, FW11 and Online (Teams link will appear before the talk)

"Teams link -- click here":https://teams.microsoft.com/l/meetup-join/19%3ameeting_NmQ5YzhmYjUtZGZjYy00NGIzLWEzY2QtZWM2NWNmMzg4NTZh%40thread.v2/0?context=%7b%22Tid%22%3a%2249a50445-bdfa-4b79-ade3-547b4f3986e9%22%2c%22Oid%22%3a%22c74ff4ca-98fe-4b28-9889-e119acc12f30%22%7d

Metastable failures are a class of catastrophic system failures that cause a permanent, self-sustaining overload of the impacted system. The distinguishing characteristics of metastable failures are the initial trigger that temporarily overloads the system and the sustaining effect that kicks in due to such overload and keeps the system in the overloaded state, even after the initial trigger is fixed. Once in this permanently overloaded state, called the metastable failure state, the system is perpetually busy but unable to complete any useful work until drastic manual measures, such as restarting the system, are taken. Metastable failures have led to several prominent cloud outages in recent years.

This seminar explores strategies for mitigating the risks of metastable failures in distributed systems. First, we focus on the practical robustness of algorithms and systems, accounting for the performance cost of fault tolerance and error handling. Then, we look at the importance of identifying and protecting vulnerable components in large distributed systems to tame the sustaining effects and prevent the sustaining mechanisms from developing into a positive feedback loop. Finally, we discuss "metastable failure poisoning" -- a feedback mechanism that spreads the failure across seemingly isolated systems or components.

Bio: Aleksey Charapko is an assistant professor at the University of New Hampshire. He received his Ph.D. from the University at Buffalo, working on consensus algorithms and state machine replication. Now, Aleksey is broadly interested in distributed systems' performance, reliability, and efficiency. Aleksey has received several awards and research grants, most recently an NSF CAREER award for the "metastable failures" research. In addition to his academic endeavors, Aleksey has over a decade of engineering experience ranging from freelance to big tech to consulting.

Seminar series:

Systems Research Group Seminar

