XtreemOS uses an integrated grid checkpointing service (XtreemGCP) to implement migration and fault tolerance. Checkpointing and restarting applications in a grid requires saving and restoring them in a distributed, heterogeneous environment. Such an environment may span millions of grid nodes, which use different system-specific checkpointers to save and restore application and kernel data structures on each node.
Our architecture is open to different checkpointing strategies, which can be adapted to evolving failure situations or changing application requirements. We propose to bridge the gap between grid semantics and system-specific checkpointers by introducing a common kernel checkpointer API that allows different checkpointers to be used in a uniform way. Furthermore, we address other grid-related checkpointing issues, including resource conflicts during restart, security, and checkpoint file management. Although this work was performed within the XtreemOS context, it can also be applied to any other grid middleware or distributed OS. This was joint work with Christine Morin (INRIA, Rennes, France) and was funded by the EU within the XtreemOS project (FP6).
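The idea of a common kernel checkpointer API can be illustrated with a small sketch. This is not the XtreemGCP interface. All names below (KernelCheckpointer, CheckpointImage, InMemoryCheckpointer, checkpoint_job, restart_job) are hypothetical, and the backends are in-memory stand-ins for real system-specific checkpointers. The sketch only shows how a grid-level service could drive heterogeneous checkpointers through one set of calls, assuming a stop / checkpoint / resume / restart sequence.

```python
import copy
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CheckpointImage:
    """Hypothetical record of one saved process group."""
    checkpointer: str  # name of the backend that produced the image
    pgid: int          # identifier of the saved process group
    version: int       # checkpoint number chosen by the grid layer
    payload: dict      # backend-specific state, opaque to the grid layer


class KernelCheckpointer(ABC):
    """Assumed uniform interface that each system-specific checkpointer adapts to."""
    name: str

    @abstractmethod
    def stop(self, pgid: int) -> None: ...

    @abstractmethod
    def checkpoint(self, pgid: int, version: int) -> CheckpointImage: ...

    @abstractmethod
    def resume(self, pgid: int) -> None: ...

    @abstractmethod
    def restart(self, image: CheckpointImage) -> int: ...


class InMemoryCheckpointer(KernelCheckpointer):
    """Toy backend: process state is a dict instead of real kernel structures."""

    def __init__(self, name, processes):
        self.name = name
        self.processes = processes  # pgid -> {"state": dict, "running": bool}

    def stop(self, pgid):
        self.processes[pgid]["running"] = False

    def checkpoint(self, pgid, version):
        proc = self.processes[pgid]
        if proc["running"]:
            raise RuntimeError("process group must be stopped before checkpointing")
        return CheckpointImage(self.name, pgid, version, copy.deepcopy(proc["state"]))

    def resume(self, pgid):
        self.processes[pgid]["running"] = True

    def restart(self, image):
        if image.checkpointer != self.name:
            raise ValueError("image was produced by a different checkpointer")
        self.processes[image.pgid] = {
            "state": copy.deepcopy(image.payload),
            "running": True,
        }
        return image.pgid


def checkpoint_job(units, version):
    """Checkpoint all (checkpointer, pgid) units of a job through the same calls."""
    for cp, pgid in units:
        cp.stop(pgid)
    try:
        return [cp.checkpoint(pgid, version) for cp, pgid in units]
    finally:
        # Resume every unit, whether or not all checkpoints succeeded.
        for cp, pgid in units:
            cp.resume(pgid)


def restart_job(images, backends):
    """Route each image back to the backend that created it."""
    return [backends[img.checkpointer].restart(img) for img in images]
```

A caller would register one adapter per node-local checkpointer and pass (adapter, pgid) pairs to checkpoint_job. The grid layer never inspects the payload, which is what keeps it independent of the underlying checkpointer.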
Overview slides: download
Demo video at Vimeo
Contacts: Dr. John Mehnert-Spahn and Prof. Dr. Michael Schöttner
Can be found here.
D3.3.7 - Prototype with advanced features in AEM, 30.3.2010
D3.3.6 - AEM Prototype, 18.12.2008
"Checkpointing in heterogenen, verteilten Umgebungen" (Checkpointing in heterogeneous, distributed environments), John Mehnert-Spahn, Heinrich-Heine Universität Düsseldorf, July 2010.