SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery


Author(s) : Mark D. Hill, Milo M. K. Martin, Daniel J. Sorin, David A. Wood
Publisher : N/A
Publication Date : 2002
ISSN : N/A
Abstract : We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of the state of a shared memory multiprocessor (i.e., processors, memory, and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses "logically atomic" coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes performance overhead by pipelining checkpoint validation with subsequent parallel execution.
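
Illustrative note : The checkpoint/recovery idea in the abstract can be sketched as a toy, single-node model. This is not SafetyNet's hardware design. The class name ToyCheckpointedMemory, its methods, and the per-interval undo log are all assumptions made for illustration. The sketch only shows how several checkpoints can be outstanding at once, how validating a checkpoint lets older logs be discarded while execution continues, and how recovery rolls back to the newest validated checkpoint.

```python
_ABSENT = object()  # sentinel: the address held no value before the write


class ToyCheckpointedMemory:
    """Hypothetical toy model, not the SafetyNet implementation.

    Checkpoint k is the memory state at the start of interval k.
    Each interval keeps an undo log of the values it overwrote.
    """

    def __init__(self):
        self.memory = {}
        self.current = 0          # interval currently executing
        self.validated = 0        # newest checkpoint known to be fault-free
        self.undo_logs = {0: {}}  # interval -> {addr: value before first write}

    def write(self, addr, value):
        log = self.undo_logs[self.current]
        if addr not in log:
            # Log the old value only on the first write in this interval.
            log[addr] = self.memory.get(addr, _ABSENT)
        self.memory[addr] = value

    def take_checkpoint(self):
        # Start a new interval; earlier intervals stay recoverable.
        self.current += 1
        self.undo_logs[self.current] = {}

    def validate(self, checkpoint):
        # Declaring checkpoint k fault-free frees the logs of intervals < k.
        if not self.validated <= checkpoint <= self.current:
            raise ValueError("checkpoint out of range")
        for k in [k for k in self.undo_logs if k < checkpoint]:
            del self.undo_logs[k]
        self.validated = checkpoint

    def recover(self):
        # Undo intervals from newest back to the validated checkpoint.
        for k in range(self.current, self.validated - 1, -1):
            for addr, old in self.undo_logs[k].items():
                if old is _ABSENT:
                    self.memory.pop(addr, None)
                else:
                    self.memory[addr] = old
        self.current = self.validated
        self.undo_logs = {self.validated: {}}


if __name__ == "__main__":
    m = ToyCheckpointedMemory()
    m.write("x", 1)
    m.take_checkpoint()   # checkpoint 1 captures x == 1
    m.write("x", 2)
    m.take_checkpoint()   # checkpoint 2 captures x == 2
    m.write("x", 3)
    m.validate(1)         # validation lags behind execution
    m.recover()           # fault detected: roll back to checkpoint 1
    print(m.memory)
```

In this toy model, validation trails execution by one or more intervals, which loosely mirrors the abstract's point that checkpoint validation is pipelined with continued execution. A real multiprocessor would also have to coordinate checkpoints across nodes and cover coherence permissions, which the sketch leaves out.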