Microreboot – A Technique for Cheap Recovery
George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, Armando Fox
Proc. 6th Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, CA, December 2004
A significant fraction of software failures in large-scale Internet
systems are cured by rebooting, even when the exact failure causes are
unknown. However, rebooting can be expensive, causing nontrivial
service disruption or downtime even when clusters and failover are
employed. In this work we separate process recovery from
data recovery to enable microrebooting -- a fine-grain technique for
surgically recovering faulty application components, without
disturbing the rest of the application.
We evaluate microrebooting in an Internet auction system running on
an application server. Microreboots recover most of the same failures
as full reboots, but do so an order of magnitude faster and result in
an order of magnitude savings in lost work. This cheap form of
recovery engenders a new approach to high availability: microreboots
can be employed at the slightest hint of failure, prior to node
failover in multi-node clusters, even when mistakes in failure
detection are likely; failure and recovery can be masked from end
users through transparent call-level retries; and systems can be
rejuvenated by parts, without ever being shut down.
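As a rough illustration of the call-level retry idea mentioned above, the following minimal Python sketch restarts a single faulty component and then reissues the failed call, so the caller never sees the failure. It is not the authors' implementation: the names `Component`, `ComponentUnavailable`, `call_with_retry`, and `max_retries` are hypothetical, and the sketch assumes that important state lives outside the component, so that restarting it loses no data.

```python
class ComponentUnavailable(Exception):
    """Raised when a call reaches a component that has failed (hypothetical)."""


class Component:
    """Toy stand-in for an application component (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.microreboots = 0

    def microreboot(self):
        # Reinitialize only this component. The sketch assumes that any
        # state worth keeping is held in a separate store, not in here.
        self.microreboots += 1
        self.healthy = True

    def call(self, request):
        if not self.healthy:
            raise ComponentUnavailable(self.name)
        return f"{self.name} handled {request}"


def call_with_retry(component, request, max_retries=2):
    """Mask a component failure by microrebooting it and retrying the call."""
    for attempt in range(max_retries + 1):
        try:
            return component.call(request)
        except ComponentUnavailable:
            if attempt == max_retries:
                # Cheap recovery did not help; a real system would
                # escalate here (for example to a full reboot or failover).
                raise
            component.microreboot()


if __name__ == "__main__":
    bids = Component("bids")
    bids.healthy = False  # simulate a failure of unknown cause
    print(call_with_retry(bids, "place-bid"))
    print("microreboots:", bids.microreboots)
```

Because the recovery step is cheap in this model, the retry loop can afford to trigger it on any sign of trouble and escalate only if the call still fails.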