Cooperations

RDSM (Reliable Distributed Shared Memory)
Today's PC clusters can be used to provide reliable services and/or to aggregate resources. The former is achieved by periodically saving snapshots of the distributed system state as checkpoints, so that after an error the system rolls back only to the most recent checkpoint rather than to its initial state. The latter can be simplified by using the proven Distributed Shared Memory (DSM) concept advocated by numerous research projects. Introducing reliability for DSM is not new, but previous research has typically been limited to theory and simulations.

In this project we study, design, and implement new checkpointing strategies for real applications running in different DSM environments:
- Kerrighed, IRISA, Rennes, France.
- Plurix, Ulm University, Germany.
- Mome, IRISA, Rennes, France.

The main goal of the cooperation is to compare the checkpointing strategies of these three systems, both qualitatively and quantitatively.

The cooperation is funded within the PROCOPE 2004 program of the German DAAD.

Meetings & Presentations
Meeting 01: Rennes, France, March 4th, 2004:
- "Checkpointing in Kerrighed", David Margery
- "Plurix - A High-Speed DSM OS", Peter Schulthess
- "Persistence of the DSM-System Plurix", Stefan Frenz
- "Plurix & PROCOPE", Michael Schoettner

Meeting 02: Ulm, Germany, September 8th - 9th, 2004:
- "Status Report of the Plurix DSM platform", Peter Schulthess
- "PROCOPE 2004 HA-DSM", Christine Morin
- "Mome, Checkpointing", Yvon Jegou
- "PageServer status", Stefan Frenz
- "Inside Kerrighed", Renaud Lottiaux
- "Plurix graphics status", Markus Fakler
- "Kerrighed running in Ulm", Stefan Frenz

Meeting 03: Rennes, France, May 3rd, 2005:
- "Kerrighed - Status Report", Christine Morin
- "Plurix - Status Report", Michael Schoettner
- "Mome - Status Report", Yvon Jegou
- "A Fault-Tolerant Transparent Data Sharing Service for the Grid", Louis Rilling

Meeting 04: Ulm, Germany, December 15th, 2005:
- "HADSM Goals & Results", Michael Schoettner


last updated: December 19, 2005.