Optimization of Multi-Level Checkpoint Model with Uncertain Execution Scales
SC 2014
S. Di, L. Bautista-Gomez, F. Cappello
Omnisc’IO: A Grammar-Based Approach to Spatial and Temporal I/O Patterns Prediction
SC 2014
M. Dorier, S. Ibrahim, G. Antoniu, R. Ross
CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination
IEEE IPDPS 2014
M. Dorier, G. Antoniu, R. Ross, D. Kimpe, S. Ibrahim
Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications
IEEE IPDPS 2014
S. Di, M. S. Bouguerra, L. Bautista Gomez, F. Cappello
Unified Model for Assessing Checkpointing Protocols
To appear in Concurrency and Computation: Practice and Experience, Wiley.
G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra, A. Guermouche, T. Herault, Y. Robert, F. Vivien, D. Zaidouni
Improving Floating Point Compression through Binary Masks
Short paper, Proceedings of IEEE BigData 2013
L. Bautista Gomez, F. Cappello
Damaris/Viz: a Nonintrusive, Adaptable and User-Friendly In Situ Visualization Framework
Proceedings of IEEE LDAV 2013
M. Dorier, R. Sisneros, T. Peterka, G. Antoniu, D. Semeraro
Communication and topology-aware load balancing in Charm++ with TreeMatch
Proceedings of IEEE Cluster 2013
E. Jeannot, E. Meneses-Rojas, G. Mercier, F. Tessier, G. Zheng
SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing
Proceedings of IEEE/ACM SC13
T. Ropars, T. Martsinkevich, A. Guermouche, A. Schiper, F. Cappello
Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism
Proceedings of IEEE/ACM SC13
D. Sheng, Y. Robert, F. Vivien, D. Kondo, C. L. Wang, F. Cappello
Characterizing and Modeling Cloud Applications/Jobs on a Google Data Center
Short paper, Proceedings of ICPP 2013
D. Sheng, D. Kondo, F. Cappello
Multi-criteria checkpointing strategies: optimizing response-time versus resource utilization
Proceedings of Europar 2013
Aurelien Bouteiller, Franck Cappello, Jack Dongarra, Amina Guermouche, Thomas Herault and Yves Robert
Failure prediction for HPC systems and applications: current situation and open issues
International Journal of High Performance Computing Applications, SAGE, 2013
A. Gainaru, F. Cappello, M. Snir, B. Kramer
AI-Ckpt: Leveraging Memory Access Patterns for Adaptive Asynchronous Incremental Checkpointing
To appear in Proceedings of ACM HPDC 2013
B. Nicolae, F. Cappello
BlobCR: Virtual Disk Based Checkpoint-Restart for HPC Applications on IaaS Clouds
To appear in Journal of Parallel and Distributed Computing, 2013
B. Nicolae, F. Cappello
ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance protocols during HPC executions
To appear in Proceedings of CCGRID 2013
M. El Mehdi Diouri, O. Gluck, L. Lefevre, F. Cappello
Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing
Proceedings of IEEE IPDPS 2013
M. S. Bouguerra, A. Gainaru, F. Cappello, L. Bautista Gomez, N. Maruyama and S. Matsuoka
A Framework to Estimate Energy Consumption of Fault Tolerance protocols during HPC executions
Poster, to appear in Proceedings of ACM PPoPP 2013
M. El Mehdi Diouri, O. Gluck, L. Lefevre, F. Cappello
Fault prediction under the microscope: A closer look into HPC systems
Proceedings of IEEE/ACM SC12
A. Gainaru, F. Cappello, Marc Snir, Bill Kramer
Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O
Proceedings of IEEE Cluster 2012
M. Dorier, G. Antoniu, F. Cappello, M. Snir, L. Orf
Hierarchical Clustering Strategies for Fault Tolerance in Large Scale HPC Systems
Proceedings of IEEE Cluster 2012
L. B. Gomez, T. Ropars, N. Maruyama, F. Cappello, S. Matsuoka
A Hierarchical Approach for Load Balancing on Parallel Multi-core Systems
Proceedings of ICPP 2012
L.L. Pilla, C. Pousa Ribeiro, D. Cordeiro, C. Mei, A. Bhatele, P. O. A. Navaux, F. Broquedis, J.-F. Mehaut, L. V. Kale
Scalable Reed-Solomon-based Reliable Local Storage for HPC Applications on IaaS Clouds
Proceedings of Europar 2012
L. Bautista Gomez, B. Nicolae, N. Maruyama, F. Cappello, S. Matsuoka
Energy considerations in Checkpointing and Fault Tolerance protocols
Proceedings of IEEE/IFIP DSN/FTXS 2012
M. El Mehdi Diouri, O. Gluck, L. Lefevre, F. Cappello
Towards Efficient Live Migration of I/O Intensive Workloads: A Transparent Storage Transfer Proposal
Proceedings of ACM HPDC 2012
B. Nicolae, F. Cappello
Hybrid static/dynamic scheduling for already optimized dense matrix factorization
Proceedings of IEEE IPDPS 2012
S. Donfack, L. Grigori, B. Gropp, V. Kale
HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications
Proceedings of IEEE IPDPS 2012
Technical report TR-JLPC-11-05
A. Guermouche, T. Ropars, M. Snir, F. Cappello
Taming of the Shrew: Modeling the Normal and Faulty Behavior of Large-scale HPC Systems
Proceedings of IEEE IPDPS 2012
Technical report TR-JLPC-11-10
A. Gainaru, F. Cappello, B. Kramer
Adaptive Event Prediction Strategy with Dynamic Time Window for Large-Scale HPC Systems
Proceedings of SLAMS 2011 (Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques)
A. Gainaru, F. Cappello, J. Fullop, S. Trausan-Matu, B. Kramer
FTI: high performance Fault Tolerance Interface for hybrid systems
Proceedings of IEEE/ACM SC11
Technical report TR-JLPC-11-09
L. Bautista Gomez, D. Komatitsch, N. Maruyama, S. Tsuboi, F. Cappello, S. Matsuoka, T. Nakamura
Modeling and Tolerating Heterogeneous Failures in Large Parallel Systems
Proceedings of IEEE/ACM SC11
Technical report TR-JLPC-11-08
E. M. Heien, D. Kondo, A. Gainaru, D. Lapine, B. Kramer, F. Cappello
Damaris: Leveraging Multicore Parallelism to Mask I/O Jitter
Technical report TR-JLPC-11-07
M. Dorier, G. Antoniu, F. Cappello, M. Snir, L. Orf
BlobCR: Efficient Checkpoint-Restart for HPC Applications on IaaS Clouds using Virtual Disk Image Snapshots
Proceedings of IEEE/ACM SC11
Technical report TR-JLPC-11-06
B. Nicolae, F. Cappello
Checkpointing strategies for parallel jobs
Proceedings of IEEE/ACM SC11
Technical report TR-JLPC-11-04
M. Bougeret, H. Casanova, M. Rabie, Y. Robert, F. Vivien
Comparing archival policies for Blue Waters
Proceedings of HIPC 2011
Technical report TR-JLPC-11-03
F. Cappello, M. Jacquelin, L. Marchal, Y. Robert, M. Snir
Improving Parallel System Performance with a NUMA-aware Load-Balancer
Technical report TR-JLPC-11-02
L. Pilla, C. Pousa, D. Cordeiro, A. Bhatele, P. Navaux, J-F. Méhaut, L. Kale
On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications
Proceedings of Europar 2011
Technical report TR-JLPC-11-01
T. Ropars, A. Guermouche, B. Ucar, E. Meneses, L. V. Kale, F. Cappello
Optimizing multi-deployment on clouds by means of self-adaptive prefetching
Proceedings of Europar 2011
B. Nicolae, F. Cappello, G. Antoniu
Event log mining tool for large scale HPC systems
Proceedings of Europar 2011
A. Gainaru, F. Cappello, B. Kramer
The International Exascale Software Project roadmap
IJHPCA 25(1): 3-60 (2011)
J. Dongarra, F. Cappello, T. H. Dunning, B. Gropp, S. Kale, B. Kramer, M. Snir, et al.
Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic Message Passing Applications
Proceedings of IPDPS 2011
Technical Report of the INRIA-Illinois Joint Laboratory on Petascale Computing (TR-JLPC-10-03)
Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir, Franck Cappello
Preventive Migration vs. Preventive Checkpointing for Extreme Scale Supercomputers
Parallel Processing Letters 21(2): 111-132 (2011)
F. Cappello, H. Casanova, Y. Robert
On Communication Determinism in Parallel HPC Applications
Proceedings of IEEE ICCCN 2010
Franck Cappello, Amina Guermouche, Marc Snir
Checkpointing vs. Migration for Post-Petascale Supercomputers
Proceedings of ICPP 2010
Franck Cappello, Henri Casanova, Yves Robert
Distributed Diskless Checkpoint for Large Scale Systems
Proceedings of IEEE CCGRID 2010
Leonardo Arturo Bautista Gomez, Naoya Maruyama, Franck Cappello, Satoshi Matsuoka
Hierarchical Event Log Organizer
Technical Report of the INRIA-Illinois Joint Laboratory on Petascale Computing (TR-JLPC-10-02)
Ana Gainaru, Franck Cappello, Stephan Trausan-Matu, William Kramer
State of the art on event analysis for large scale computers
Technical Report of the INRIA-Illinois Joint Laboratory on Petascale Computing (TR-JLPC-10-01)
Ana Gainaru, Franck Cappello, Stephan Trausan-Matu
Toward Exascale Resilience
IJHPCA 23(4): 374-388 (2009)
Franck Cappello, Al Geist, Bill Gropp, Laxmikant Kale, Bill Kramer, Marc Snir
Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities
IJHPCA 23(3): 212-226 (2009)
Franck Cappello, INRIA and UIUC
The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community
IJHPCA 23(4): 309-322 (2009)
Jack Dongarra, Pete Beckman, Patrick Aerts, Franck Cappello, Thomas Lippert, Satoshi Matsuoka, Paul Messina, Terry Moore, Rick Stevens, Anne E. Trefethen, Mateo Valero
Revisiting Fault Tolerant Protocols for HPC Applications
Technical Report of the INRIA-Illinois Joint Laboratory on Petascale Computing (TR-JLPC-09-02), submitted
Franck Cappello, INRIA, UIUC; Amina Guermouche, Univ. Paris Sud; Thomas Herault, Univ. Paris Sud, INRIA, UTK; Marc Snir, UIUC
Toward Exascale Resilience
Technical Report of the INRIA-Illinois Joint Laboratory on Petascale Computing (TR-JLPC-09-01)
Franck Cappello, INRIA, UIUC; Al Geist, ORNL; Bill Gropp, UIUC; Sanjay Kale, UIUC; Bill Kramer, UIUC; Marc Snir, UIUC