Publications

Optimization of Multi-Level Checkpoint Model with Uncertain Execution Scales

SC 2014
S. Di, L. Bautista-Gomez, F. Cappello

Omnisc’IO: A Grammar-Based Approach to Spatial and Temporal I/O Patterns Prediction

SC 2014
M. Dorier, S. Ibrahim, G. Antoniu, R. Ross

CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination

IEEE IPDPS 2014
M. Dorier, G. Antoniu, R. Ross, D. Kimpe, S. Ibrahim

Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications

IEEE IPDPS 2014
S. Di, S. Bouguerra, L. Bautista Gomez, F. Cappello

Unified Model for Assessing Checkpointing Protocols

To appear in Concurrency and Computation: Practice and Experience, Wiley.
G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra, A. Guermouche, T. Herault, Y. Robert, F. Vivien, D. Zaidouni

Improving Floating Point Compression through Binary Masks

Short paper, Proceedings of IEEE BigData 2013
L. Bautista Gomez, F. Cappello

Damaris/Viz: a Nonintrusive, Adaptable and User-Friendly In Situ Visualization Framework

Proceedings of IEEE LDAV 2013
M. Dorier, R. Sisneros, T. Peterka, G. Antoniu, D. Semeraro

Communication and topology-aware load balancing in Charm++ with TreeMatch

Proceedings of IEEE Cluster 2013
E. Jeannot, E. Meneses-Rojas, G. Mercier, F. Tessier, G. Zheng

SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing

Proceedings of IEEE/ACM SC13
T. Ropars, T. Martsinkevich, A. Guermouche, A. Schiper, F. Cappello

Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism

Proceedings of IEEE/ACM SC13
D. Sheng, Y. Robert, F. Vivien, D. Kondo, C. L. Wang, F. Cappello

Characterizing and Modeling Cloud Applications/Jobs on a Google Data Center

Short paper, Proceedings of ICPP 2013
D. Sheng, D. Kondo, F. Cappello

Multi-criteria checkpointing strategies: optimizing response-time versus resource utilization

Proceedings of Europar 2013
Aurelien Bouteiller, Franck Cappello, Jack Dongarra, Amina Guermouche, Thomas Herault and Yves Robert

Failure prediction for HPC systems and applications: current situation and open issues

International Journal of High Performance Computing Applications, SAGE, 2013
A. Gainaru, F. Cappello, M. Snir, B. Kramer

AI-Ckpt: Leveraging Memory Access Patterns for Adaptive Asynchronous Incremental Checkpointing

To appear in Proceedings of ACM HPDC 2013
B. Nicolae, F. Cappello

BlobCR: Virtual Disk Based Checkpoint-Restart for HPC Applications on IaaS Clouds

To appear in Journal of Parallel and Distributed Computing, 2013
B. Nicolae, F. Cappello

ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance protocols during HPC executions

To appear in Proceedings of CCGRID 2013
M. El Mehdi Diouri, O. Gluck, L. Lefevre, F. Cappello

Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing

Proceedings of IEEE IPDPS 2013
M. S. Bouguerra, A. Gainaru, F. Cappello, L. Bautista Gomez, N. Maruyama and S. Matsuoka

A Framework to Estimate Energy Consumption of Fault Tolerance protocols during HPC executions

Poster, to appear in Proceedings of ACM PPoPP 2013
M. El Mehdi Diouri, O. Gluck, L. Lefevre, F. Cappello

Fault prediction under the microscope: A closer look into HPC systems

Proceedings of IEEE/ACM SC12
A. Gainaru, F. Cappello, Marc Snir, Bill Kramer

Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O

Proceedings of IEEE Cluster 2012
M. Dorier, G. Antoniu, F. Cappello, M. Snir, L. Orf

Hierarchical Clustering Strategies for Fault Tolerance in Large Scale HPC Systems

Proceedings of IEEE Cluster 2012
L. B. Gomez, T. Ropars, N. Maruyama, F. Cappello, S. Matsuoka

A Hierarchical Approach for Load Balancing on Parallel Multi-core Systems

Proceedings of ICPP 2012
L.L. Pilla, C. Pousa Ribeiro, D. Cordeiro, C. Mei, A. Bhatele, P. O. A. Navaux, F. Broquedis, J.-F. Mehaut, L. V. Kale

Scalable Reed-Solomon-based Reliable Local Storage for HPC Applications on IaaS Clouds

Proceedings of Europar 2012
L. Bautista Gomez, B. Nicolae, N. Maruyama, F. Cappello, S. Matsuoka

Energy considerations in Checkpointing and Fault Tolerance protocol

Proceedings of IEEE/IFIP DSN/FTXS 2012
M. El Mehdi Diouri, O. Gluck, L. Lefevre and F. Cappello

Towards Efficient Live Migration of I/O Intensive Workloads: A Transparent Storage Transfer Proposal

Proceedings of ACM HPDC 2012
B. Nicolae, F. Cappello

Hybrid static/dynamic scheduling for already optimized dense matrix factorization

Proceedings of IEEE IPDPS 2012
S. Donfack, L. Grigori, B. Gropp, V. Kale

HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications

Proceedings of IEEE IPDPS 2012
Technical report TR-JLPC-11-05
A. Guermouche, T. Ropars, M. Snir, F. Cappello

Taming of the Shrew: Modeling the Normal and Faulty Behavior of Large-scale HPC Systems

Proceedings of IEEE IPDPS 2012
Technical report TR-JLPC-11-10
A. Gainaru, F. Cappello, B. Kramer

Adaptive Event Prediction Strategy with Dynamic Time Window for Large-Scale HPC Systems

Proceedings of SLAMS 2011 (Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques)
A. Gainaru, F. Cappello, J. Fullop, S. Trausan-Matu, B. Kramer

FTI: high performance Fault Tolerance Interface for hybrid systems

Proceedings of IEEE/ACM SC11
Technical report TR-JLPC-11-09 [pdf]
L. Bautista Gomez, D. Komatitsch, N. Maruyama, S. Tsuboi, F. Cappello, S. Matsuoka, T. Nakamura

Modeling and Tolerating Heterogeneous Failures in Large Parallel Systems

Proceedings of IEEE/ACM SC11
Technical report TR-JLPC-11-08
E. M. Heien, D. Kondo, A. Gainaru, D. Lapine, B. Kramer, F. Cappello

Damaris: Leveraging Multicore Parallelism to Mask I/O Jitter

Technical report TR-JLPC-11-07
M. Dorier, G. Antoniu, F. Cappello, M. Snir, L. Orf

BlobCR: Efficient Checkpoint-Restart for HPC Applications on IaaS Clouds using Virtual Disk Image Snapshots

Proceedings of IEEE/ACM SC11
Technical report TR-JLPC-11-06
B. Nicolae, F. Cappello

Checkpointing strategies for parallel jobs

Proceedings of IEEE/ACM SC11
Technical report TR-JLPC-11-04
M. Bougeret, H. Casanova, M. Rabie, Y. Robert, F. Vivien

Comparing archival policies for Blue Waters

Proceedings of HiPC 2011
Technical report TR-JLPC-11-03
F. Cappello, M. Jacquelin, L. Marchal, Y. Robert and M. Snir

Improving Parallel System Performance with a NUMA-aware Load-Balancer

Technical report TR-JLPC-11-02
L. Pilla, C. Pousa, D. Cordeiro, A. Bhatele, P. Navaux, J-F. Méhaut, L. Kale

On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications

Proceedings of Europar 2011
Technical report TR-JLPC-11-01 [pdf]
T. Ropars, A. Guermouche, B. Ucar, E. Meneses, L. V. Kale, F. Cappello

Optimizing multi-deployment on clouds by means of self-adaptive prefetching

Proceedings of Europar 2011
B. Nicolae, F. Cappello, G. Antoniu

Event log mining tool for large scale HPC systems

Proceedings of Europar 2011
A. Gainaru, F. Cappello, B. Kramer

The International Exascale Software Project roadmap

IJHPCA 25(1): 3-60 (2011)
J. Dongarra, F. Cappello, T. H. Dunning, B. Gropp, S. Kale, B. Kramer, M. Snir, et al.

Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic Message Passing Applications

Proceedings of IPDPS 2011
Technical Report of the INRIA-Illinois Joint Laboratory on Petascale Computing (TR-JLPC-10-03) [pdf]
Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir, Franck Cappello

Preventive Migration vs. Preventive Checkpointing for Extreme Scale Supercomputers

Parallel Processing Letters 21(2): 111-132 (2011)
F. Cappello, H. Casanova, Y. Robert

On Communication Determinism in Parallel HPC Applications

Proceedings of IEEE ICCCN 2010 [pdf]
Franck Cappello, Amina Guermouche, Marc Snir

Checkpointing vs. Migration for Post-Petascale Supercomputers

Proceedings of ICPP 2010
Franck Cappello, Henri Casanova, Yves Robert

Distributed Diskless Checkpoint for Large Scale Systems

Proceedings of IEEE CCGRID 2010
Leonardo Arturo Bautista Gomez, Naoya Maruyama, Franck Cappello, Satoshi Matsuoka

Hierarchical Event Log Organizer

Technical Report of the INRIA-Illinois Joint Laboratory on Petascale Computing (TR-JLPC-10-02)
Ana Gainaru, Franck Cappello, Stephan Trausan-Matu, William Kramer

State of the art on event analysis for large scale computers

Technical Report of the INRIA-Illinois Joint Laboratory on Petascale Computing (TR-JLPC-10-01)
Ana Gainaru, Franck Cappello, Stephan Trausan-Matu

Toward Exascale Resilience

IJHPCA 23(4): 374-388 (2009)
Franck Cappello, Al Geist, Bill Gropp, Laxmikant Kale, Bill Kramer, Marc Snir

Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities

IJHPCA 23(3): 212-226 (2009)
Franck Cappello, INRIA and UIUC

The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community

IJHPCA 23(4): 309-322 (2009)
Jack Dongarra, Pete Beckman, Patrick Aerts, Franck Cappello, Thomas Lippert, Satoshi Matsuoka, Paul Messina, Terry Moore, Rick Stevens, Anne E. Trefethen, Mateo Valero

Revisiting Fault Tolerant Protocols for HPC Applications

Technical Report of the INRIA-Illinois Joint Laboratory on Petascale Computing (TR-JLPC-09-02), submitted
Franck Cappello, INRIA, UIUC; Amina Guermouche, Univ. Paris Sud; Thomas Herault, Univ. Paris Sud, INRIA, UTK, Marc Snir, UIUC

Toward Exascale Resilience

Technical Report of the INRIA-Illinois Joint Laboratory on Petascale Computing (TR-JLPC-09-01) [pdf]
Franck Cappello, INRIA, UIUC; Al Geist, ORNL; Bill Gropp, UIUC; Sanjay Kale, UIUC; Bill Kramer, UIUC; Marc Snir, UIUC