
Department of Computer Science and Technology

Date: Friday, 28 March, 2025 - 15:30 to 16:30
Speaker: Aleksey Charapko, University of New Hampshire
Venue: Computer Lab, FW11 and Online (Teams link will appear before the talk)

"Teams link -- click here":https://teams.microsoft.com/l/meetup-join/19%3ameeting_NmQ5YzhmYjUtZGZjYy00NGIzLWEzY2QtZWM2NWNmMzg4NTZh%40thread.v2/0?context=%7b%22Tid%22%3a%2249a50445-bdfa-4b79-ade3-547b4f3986e9%22%2c%22Oid%22%3a%22c74ff4ca-98fe-4b28-9889-e119acc12f30%22%7d

Metastable failures are a class of catastrophic system failures that cause a permanent, self-sustaining overload of the impacted system. Their distinguishing characteristics are an initial trigger that temporarily overloads the system and a sustaining effect that the overload sets in motion, which keeps the system overloaded even after the initial trigger is fixed. Once in this permanently overloaded state, called the metastable failure state, the system is perpetually busy but unable to complete any useful work until drastic manual measures, such as restarting the system, are taken. Metastable failures have led to several prominent cloud outages in recent years.
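The trigger-plus-sustaining-effect pattern can be illustrated with a toy simulation. The sketch below is not from the talk: it assumes a single FIFO server, clients that resend a request once it has waited `timeout` steps, and a server that still processes the stale copies. The function name `simulate` and all parameter values are hypothetical and chosen only for illustration.

```python
from collections import deque


def simulate(steps=60, capacity=100, new_per_step=80, timeout=5,
             trigger_step=10, trigger_size=1000):
    """Toy model (an assumption, not the speaker's): return goodput per step.

    Goodput counts requests served before their client gave up on them.
    """
    queue = deque()  # FIFO of [arrival_step, remaining_count]
    goodput = []
    for t in range(steps):
        # Requests that have now waited `timeout` steps are resent by
        # their clients; the stale originals stay in the queue.
        retries = sum(n for arrived, n in queue if arrived == t - timeout)
        burst = trigger_size if t == trigger_step else 0  # one-off trigger
        queue.append([t, new_per_step + burst + retries])

        budget, good = capacity, 0
        while budget and queue:
            arrived, count = queue[0]
            served = min(budget, count)
            if t - arrived < timeout:
                good += served  # answered in time: useful work
            # otherwise the capacity is spent on a request nobody awaits
            budget -= served
            if served == count:
                queue.popleft()
            else:
                queue[0][1] = count - served
        goodput.append(good)
    return goodput


if __name__ == "__main__":
    print("small burst, final goodput:", simulate(trigger_size=100)[-1])
    print("large burst, final goodput:", simulate(trigger_size=1000)[-1])
```

In this model the steady load (80 per step) is always below capacity (100 per step). A burst of 100 drains within a few steps and goodput returns to 80, whereas a burst of 1000 pushes queueing delay past the timeout, after which the server spends all its capacity on stale requests and goodput stays at 0 even though the burst happened only once.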

This seminar explores strategies for mitigating the risks of metastable failures in distributed systems. First, we focus on the practical robustness of algorithms and systems, accounting for the performance cost of fault tolerance and error handling. Then, we look at the importance of identifying and protecting vulnerable components in large distributed systems, both to tame the sustaining effects and to keep them from developing into a positive feedback loop. Finally, we discuss "metastable failure poisoning" -- a feedback mechanism that spreads the failure across seemingly isolated systems or components.

Bio: Aleksey Charapko is an assistant professor at the University of New Hampshire. He received his Ph.D. from the University at Buffalo, where he worked on consensus algorithms and state machine replication. Aleksey is now broadly interested in the performance, reliability, and efficiency of distributed systems. He has received several awards and research grants, most recently an NSF CAREER award for his research on metastable failures. In addition to his academic work, Aleksey has over a decade of engineering experience ranging from freelance to big tech to consulting.

Seminar series: Systems Research Group Seminar
