3rd JLESC Workshop Agenda

Note: if you open this page on your smartphone and click on a restaurant address, Google Maps will launch and show you the route (turn-by-turn navigation).

The workshop rooms are on Level E.

Schedule | Speaker | Affiliation | Type of presentation | Title (tentative)
Sunday Jun. 28th
DINNER | 20:00 | Visual restaurant (2 min. walking distance). Directions.
Workshop Day 1: Monday Jun. 29th
Lunch (on your own)
13:00 | Hotel, Level E, near rooms 5, 6 and 7.
Welcome and Introduction
Room: 5-6
14:00 | Franck Cappello | ANL, UIUC and Inria | Background | Welcome, Workshop objectives and organization
Chair: Franck Cappello, ANL
14:10 | Mateo Valero | BSC | Background | BSC Novelties and vision of the collaboration
14:20 | Antoine Petit | Inria | Background | Inria Novelties and vision of the collaboration
14:30 | Bill Kramer | UIUC | Background | UIUC Novelties and vision of the collaboration
14:40 | Marc Snir | ANL | Background | ANL Novelties and vision of the collaboration
14:50 | Thomas Lippert | JSC | Background | JSC Novelties and vision of the collaboration
15:00 | Akira Ukawa | Riken | Background | Riken Novelties and vision of the collaboration
15:10 | Marc Snir | ANL | Keynote 1 | On the Road to Exascale: The next generation of DOE leadership supercomputers
Chair: Marc Snir, ANL
16:10 | Thomas Lippert | JSC | Keynote 2 | Creating the HPC and Data Analytics Infrastructure for the Human Brain Project
16:50 | Akira Ukawa | Riken | Keynote 3 | AICS View toward Exascale
DINNER | 20:00 | Hotel Barcelo Sants, near the meeting rooms
Workshop Day 2: Tuesday Jun. 30th
Plenary session
Applications and mini-apps
Room: 5-6
Chair: Francisco Doblas-Reyes
8:30 | Rob Jacob | ANL | Research | Challenges of modeling the climate system at Exascale
8:55 | Hisashi Yashiro | Riken | Research | Climate modeling towards exascale: the case of NICAM-LETKF
9:20 | Stephane Lanteri | Inria | Research | Development of scalable high order finite element solvers for computational nanophotonics in the context of the C2S@Exa Inria Project Lab
9:45 | Andreas Lintermann | JSC | Research | Recent CFD Research in the SimLab FSE
Chair: Naoya Maruyama, Riken
10:35 | Mohamed Wahib | Riken | Research | Scalable and Automated GPU kernel Transformations in Production Stencil Applications
11:00 | Mariano Vázquez | BSC | Research | Large-scale Simulations for Biomedical Research at Organ Level
11:25 | Philippe Helluy | Inria | Research | A generic Discontinuous Galerkin solver based on OpenCL task graph. Application to electromagnetic compatibility.
Chair: Naoya Maruyama, Riken
11:50 | Open Microphone | ANL | Defining a common objective | Establishing the JLESC Benchmarks
Parallel session 1
Programming models
Room 7
Chair: Jean François Mehaut
14:00 | Hitoshi Murai | Riken | Research | Overview and Future Plan of the XcalableMP Parallel Language
14:25 | Jesus Labarta | BSC | Research
14:50 | Emmanuel Jeannot | Inria | Research | Improving parallel I/O with topology-aware aggregators mapping
15:15 | Pavan Balaji | ANL | Research | Buffer-sharing Techniques in In-Situ Workflows: Challenges and Pitfalls
Chair: Rajeev Thakur, ANL
16:05 | Jean François Mehaut | Inria | Research | CORSE: Compiler Optimizations and Runtime SystEms
16:30 | Wen-mei Hwu | UIUC | Research
16:55 | Sangmin Seo | ANL | Research | Argobots: Lightweight Low-level Threading/Tasking Framework
17:20 | Marta Garcia | BSC | Research | DLB: Dynamic Load Balancing Library
Chair: Rajeev Thakur, ANL
17:45 | Open Microphone | Defining a common objective
Directions: bus service provided.
Parallel session 2
I/O, Big Data, Visualization
Room 5-6
Chair: Bruno Raffin, Inria
14:00 | Robert Sisneros | UIUC | Research | An IDEAL Framework: Recent Work at the Intersection of Big Data and “Big Data”
14:25 | Orcun Yildiz | Inria | Research | Chronos: Failure-Aware Scheduling in Shared Hadoop Clusters
14:50 | Ramon Nou | BSC | Research | Performance Impacts with Reliable Parallel File Systems at Exascale Level
15:15 | Wolfgang Frings | JSC | Research | Task-Local Parallel I/O Support for Parallel Performance Analysis Tools with SIONlib
Chair: Ramon Nou, BSC
16:05 | Jorji Nonaka | Riken | Research | Large-Scale Parallel Image Composition for In Situ Visualization Framework
16:30 | Matthieu Dreher | ANL | Research | Data Model and Data Redistribution for In-Situ Applications with Decaf
16:55 | Bruno Raffin | Inria | Research | In-Situ Processing with FlowVR for Molecular Dynamics Simulations
17:20 | Kate Keahey | ANL | Research | Chameleon: Building a Large-scale Experimental Testbed for Cloud Research
Chair: Gabriel Antoniu, Inria
17:45 | Open Microphone | Defining a common objective
Bus service provided.
Workshop Day 3: Wednesday Jul. 1st
Parallel session 3
Resilience
Room 7
Chair: Osman Unsal, BSC
8:30 | Marc Casas | BSC | Research | Asynchronous algorithms to mitigate fault recoveries and enable approximate computing
8:55 | Hongyang Sun | Inria | Research | Which Verification for Soft Error Detection?
9:20 | Leonardo Bautista Gomez | ANL | Research | Analytic Based Corruption Detection
9:45 | Jon Calhoun | UIUC
Chair: Yves Robert, Inria
10:35 | Omer Subasi | BSC | Research | Efficient Software-based Fault Tolerance for Memory Errors in the Exascale Era
11:00 | Suraj Prabhakaran | JSC | Research | Dynamic Node Replacement and Adaptive Scheduling for Fault Tolerance
11:25 | Atsushi Hori | Riken | Research | Spare Node Substitution
Chair: Franck Cappello, ANL-UIUC
11:50 | Open Microphone | Defining a common objective | Establishing the JLESC Resilience Methodology: failure logs, log analysis tools, SDC injection practices, etc.
Parallel session 4
I/O, Big Data, Visualization
Room 5-6
Chair: Gabriel Antoniu, Inria
8:30 | Francieli Zanon Boito | Inria | Research | I/O Scheduling Algorithm Selection for Parallel File Systems
8:55 | Nicolas Vandenbergen | JSC | Research | Experiences with Blue Gene Active Storage
9:20 | Florin Isaila | ANL | Optimizing data staging based on autotuning, coordination and locality exploitation on large scale supercomputers
9:45 | Ana Queralt | BSC | Persistent data as a first-class citizen in parallel programming models
Chair: Wolfgang Frings, JSC
10:35 | Gabriel Antoniu | Inria | Research | To Overlap or Not to Overlap: Optimizing Incremental MapReduce Computations for On-Demand Data Upload
11:00 | Justin Wozniak | ANL | Research | Swift Parallel Scripting: Novel Features and Applications
11:25 | Rosa M Badia | BSC | Research | Programmability in PyCOMPSs
Chair: Rosa M Badia, BSC
11:50 | Open Microphone | Defining a common objective
Plenary session
Applications and mini-apps
Room: 5-6
Chair: Naoya Maruyama, Riken
13:30 | Gabrielle Allen | UIUC | Research | The Einstein Toolkit: A Community Computational Infrastructure for Relativistic Astrophysics
Parallel session 5
Performance and tools
Room 5-6
Chair: Judit Gimenez
14:00 | Miguel Castrilló | BSC | Research | BSC tools to study the computational efficiency of EC-Earth components
14:25 | Arnaud Legrand | Inria | Research | Fast and Accurate Simulation of Multithreaded Dense and Sparse Linear Algebra Solvers
14:50 | Paul F Baumeister | JSC | Research | OpenPOWER: First Performance Results for Scientific Applications
15:15 | Bill Kramer | UIUC | Research | Understanding Performance On Extreme Scale System Takes Big Data and Extreme Tools
Chair: Bernd Mohr, JSC
16:05 | Brian Wylie | JSC | Research | VI-HPS and Scalasca
16:30 | Brice Videau | Inria | Research | BOAST: Performance Portability Using Meta-Programming and Auto-Tuning
16:55 | Harald Servat Gelabert | BSC | Research | Study the use of the Folding hardware-based profiler to assist on data distribution for heterogeneous memory systems in HPC
Chair: Bernd Mohr, JSC
17:20 | Open Microphone | Defining a common objective
Parallel session 6
Numerical Methods/Algorithms
Room 7
Chair: Paul Hovland
14:00 | Vijay Mahadevan | ANL | Research | Easing computational workflows through flexible and scalable tools
14:25 | David Haensel | JSC | Research | First steps towards an automatic load balancing for the Fast Multipole Method
14:50 | Amanda Bienz | UIUC | Research | Topology-Aware Performance Modelling
15:15 | Luc Giraud | Inria | Research | Recent progress on numerical kernels for large scale computing on heterogeneous manycores
Chair: Bill Gropp, UIUC
16:05 | Robert Speck | JSC | Research | Parallel-in-Time Integration with PFASST
16:30 | Andre Schleife | UIUC | Research | Numerical integrators for a plane-wave implementation of real-time time-dependent density functional theory
16:55 | Guillaume Aupy | Inria | Research | Optimal Multistage Algorithm for Adjoint Computation
Chair: Bill Gropp, UIUC
17:20 | Open Microphone | Defining a common objective
Closing
Room 5-6
17:45 | Franck Cappello | ANL, UIUC, Inria | Reviewing defined common objectives
DINNER | 19:00 | Hotel Barcelo Sants, near the meeting rooms.

On the Road to Exascale: The next generation of DOE leadership supercomputers
Marc Snir, MCS division director, ANL

In this talk we shall discuss the next generation of DOE supercomputers and the changes application codes will need to consider in order to leverage them effectively. We shall then discuss the expected evolution toward the next (exascale) generation of leadership systems.

Creating the HPC and Data Analytics Infrastructure for the Human Brain Project
Thomas Lippert, JSC

HBP, the Human Brain Project, is one of two European flagship projects foreseen to run for 10 years. The HBP aims at creating an open, European, neuroscience-driven infrastructure for simulation and big-data-aided modelling and research, with a credible user program. The goal of the HBP is to progressively understand the structure and functionality of the human brain, strongly based on a reverse-engineering philosophy. In addition, it aims at advancements in digital computing by means of brain-inspired algorithms, with the potential to create a completely novel analogue computing technology called neuromorphic computing. The HBP simulation and data analytics infrastructure will be based on a federation of supercomputer and data centers contributing to the specific requirements of neuroscience in a complementary manner. It will encompass a variety of simulation and data analytics services ranging from the molecular level through the synaptic and neuronal levels up to cognitive and robotic models. The major challenge is that HBP research will require exascale capabilities for computing, data integration and data analytics. Mastering these challenges requires a huge interdisciplinary software and hardware co-design effort including neuroscientists, physicists, mathematicians, and computer scientists on an international scale. The HBP is a long-term endeavor and thus puts large emphasis on education and training. The maturity of a service is critical, and it is important to differentiate between an early prototype, the development phase, and the delivery of services in order to assess capability levels. The services and infrastructures of the HBP will successively include more European partners, in particular PRACE sites and EUDAT data services, and will be made available step by step to the pan-European neuroscience community.

AICS View toward Exascale
Akira Ukawa, Deputy Director, RIKEN AICS

Since this is the first time that RIKEN AICS participates in the JLESC Workshop as a member institution, we wish to present a perspective on various aspects of the road toward exascale from the AICS point of view, and the role we hope to play in JLESC in this context. We start by providing a somewhat detailed view of AICS: its founding vision and history, the organization and people, and the science being done. A brief update on the post-K project, officially named the Flagship 2020 Project, is given, with some details on application targets and co-design. We then turn to international collaboration from a science-domain point of view, taking QCD in particle theory as a case study. HPC talks these days would sound lopsided if no mention were made of big data, so we try to do so in a lopsided way in the wrap-up.

Challenges of modeling the climate system at Exascale
Rob Jacob, ANL

Exascale systems may require several changes to business-as-usual for climate modeling. Using the vertical dimension of the high-horizontal-resolution numerical grids may be necessary to obtain more parallelism. Other uses for an exaflop include using high-resolution sub-models in place of parameterizations based on the large-scale fields and moving other components, such as the land, to fully 3D representations. Ensembles are a necessary and straightforward use of exascale resources. Because of the increase in communication costs, it may no longer be possible to ignore how each model (atmosphere, ocean, etc.) is decomposed relative to the others. Climate models are a multi-physics mixed PDE-ODE application, and memory bandwidth will continue to be important for performance. Early experiments at high resolution indicate that tracer transport will dominate performance, making it possible to experiment with different exascale programming models and languages on a single mini-app. Writing to disk will be more expensive at exascale, but some data will always need to be output because a great deal of insight in climate modeling comes from comparing two or more simulations against each other. Bit-for-bit reproducibility is heavily relied on during testing and development, but it may be possible to relax that requirement for the long production runs that consume most of the time on today’s petascale systems.

Data Model and Data Redistribution for In-Situ Applications with Decaf
Matthieu Dreher, ANL

As we move toward exascale, the gap between computational power and I/O bandwidth is becoming more and more concerning. The in-situ paradigm is one promising solution to this problem: data are processed and analyzed in memory, as close as possible to the source, avoiding I/O for both the simulation and the analytics. Decaf is an infrastructure under development for connecting in-situ heterogeneous parallel codes such as simulations and analytics. The mantra of Decaf is to augment the usual NxM link between two parallel codes (called a Dataflow) with a staging area between the two codes where the user can process, transform, filter or buffer data.
To perform these operations, the user must describe the data to send through Decaf. The data model is a key feature of Decaf that enables future automatic processing within a Dataflow. However, the shapes of the data from simulations and analytics are very different, each with particular semantics. Data are often broken down into simple data types or serialized, for instance before being transmitted between codes or written to disk.
In this talk, after a brief introduction to Decaf, we present a data model able to describe complex data with enough information to chunk and assemble them automatically. While pushing data into the data model, the user can add annotations to capture the semantics of the data. The user can also redefine the automatic chunk/assemble behavior of the data model if the data present particularities. We then discuss the benefits of this data model in the case of data redistribution, which can be performed automatically without additional user code.

Easing computational workflows through flexible and scalable tools
Vijay S. Mahadevan, ANL

High-fidelity computational modeling of complex, coupled physical phenomena occurring in several scientific fields requires accurate resolution of intricate geometry features (CGM), generation of good-quality unstructured meshes that minimize modeling errors (MeshKit), scalable interfaces to load/manipulate/traverse these meshes in memory (MOAB), the ability to leverage efficient nonlinear solvers on current and future architectures (PETSc), and support for checkpointing and in-situ visualization. The application of these libraries in a component-based architecture allows flexible usage for improving scientific productivity in several use cases, in order to tackle the heterogeneous descriptions of physical models and to resolve the stiff nonlinearity in coupled multi-physics (CouPE). The usage of these scalable meshing tools and computational solvers for some coupled-physics demonstration problems in nuclear engineering will be presented, and outstanding challenges in numerics, software engineering and co-design will be discussed.

Argobots: Lightweight Low-level Threading/Tasking Framework
Sangmin Seo, ANL

In this talk, we present a lightweight low-level threading and tasking model, called Argobots, to support the massive parallelism required for applications on exascale systems. Argobots’ lightweight operations and controllable executions enable high-level programming models to easily tune performance by hiding long-latency operations with low-cost context switching or by improving locality with cooperative and deterministic scheduling. Often, complex applications are written in hybrid programming models, e.g., MPI+threads, to better exploit inter- and intra-node parallelism on large-scale clusters. Argobots enhances this combination of programming models by exposing a common runtime with interoperable capabilities, providing a shared space where programming models become complementary. We provide an implementation of Argobots as a user-level library and runtime system. Through evaluations on manycore architectures and clusters, we show that Argobots incurs very low overhead with scalable performance and that it is indeed capable of bridging the gap between different programming models.
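Argobots' actual C API is not shown here; as a purely illustrative sketch (hypothetical names, Python generators standing in for user-level threads), the cooperative, deterministic scheduling idea described above can be emulated like this:

```python
from collections import deque

def worker(name, work, log):
    """A user-level task: it yields at would-be blocking points instead
    of blocking an OS thread, so a context switch is just a function
    return (the cooperative-scheduling idea behind lightweight
    threading runtimes; this is NOT the Argobots API)."""
    for i in range(work):
        log.append((name, i))   # do one unit of work
        yield                   # cheap "context switch" to the scheduler

def run(tasks):
    """Round-robin scheduler: deterministic, cooperative resumption."""
    q = deque(tasks)
    while q:
        t = q.popleft()
        try:
            next(t)
            q.append(t)         # task not finished: resume later
        except StopIteration:
            pass                # task completed

log = []
run([worker("A", 2, log), worker("B", 2, log)])
print(log)  # [('A', 0), ('B', 0), ('A', 1), ('B', 1)]
```

The deterministic interleaving in the output is the point: scheduling order is under the runtime's control rather than the OS scheduler's.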

Swift Parallel Scripting: Novel Features and Applications
Justin M Wozniak, ANL

Dataflow languages offer a natural means to express concurrency but are not a natural representation of the architectural features of high-performance, distributed-memory computers. When used as the outermost language in a hierarchical programming model, dataflow is very effective at expressing the overall flow of a computation. In this talk, we will present strategies and techniques used by the Swift dataflow language to obtain good task-parallel performance on extremely large computing systems, managing task priorities and locations. We will also present new Swift applications in materials science and epidemiology.

Optimizing data staging based on autotuning, coordination and locality exploitation on large scale supercomputers
Florin Isaila, ANL

Efficient data handling on high-performance computing platforms is one of the critical obstacles to be overcome for reaching higher levels of scalability. This talk will outline three research activities related to data staging on large-scale supercomputers. First, I will present a novel hybrid approach to autotuning parallel I/O based on a combination of analytical and machine-learning models. Second, I will discuss a coordination framework that aims to support the global improvement of key aspects of data staging, including load balance, I/O scheduling, and resilience. Finally, I will briefly review some current efforts and results for improving the scalability and performance of the Swift workflow language by leveraging data locality through Hercules, a persistent data store built around the Memcached distributed memory object caching system.

Analytic Based Corruption Detection
Leonardo Bautista Gomez, ANL

The reliability of future high performance computing systems is one of the biggest challenges to overcome in order to achieve exascale computing. The number of components of supercomputers is increasing exponentially, and power consumption restrictions limit the amount of error verification mechanisms that can be implemented at the hardware level. The soft error rate is expected to increase dramatically in the coming years, leading to a high probability of silent data corruption. In this talk, we present a thorough overview of multiple analytics-based corruption detection mechanisms and the differences between them. This survey includes temporal and spatial detectors and multiple prediction methods. In addition, we explore the impact of correcting suspected corruptions with the predictions made by these detectors. Our results show that it is possible to achieve less than 1% error on the final results while detecting and correcting suspected corruptions automatically.
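As an illustration of the temporal-detector idea (not one of the actual detectors surveyed in the talk; `detect_and_correct` and its linear-extrapolation predictor are simplified assumptions), a point can be flagged as a suspected silent corruption when it deviates too far from a prediction, then replaced by that prediction:

```python
def detect_and_correct(series, tol):
    """Flag a point as suspect when it deviates from a linear
    extrapolation of the two previous points by more than tol,
    and replace it with the prediction (hypothetical detector)."""
    out = list(series)
    suspects = []
    for i in range(2, len(out)):
        pred = 2 * out[i - 1] - out[i - 2]   # linear extrapolation
        if abs(out[i] - pred) > tol:
            suspects.append(i)
            out[i] = pred                    # correct using the prediction
    return out, suspects

clean = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
corrupt = clean[:]
corrupt[3] = 7.0                             # inject a silent corruption
fixed, idx = detect_and_correct(corrupt, tol=0.05)
print(idx)  # [3]
```

Real detectors must also balance `tol` against false positives on legitimately sharp features, which is exactly the recall/precision trade-off the talk examines.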

Chameleon: Building a Large-scale Experimental Testbed for Cloud Research
Kate Keahey, ANL

Cloud services have become essential to all major 21st century economic activities. The new capabilities they enable gave rise to many open questions, some of the most important and contentious issues being the relationship between cloud computing and high performance computing, the suitability of cloud computing for data-intensive applications, and its position with respect to emergent trends such as Software Defined Networking. A persistent barrier to further understanding of those issues has been the lack of a large-scale, open cloud research platform. With funding from the National Science Foundation, the Chameleon project is providing such a platform to the research community. The testbed, deployed at the University of Chicago and the Texas Advanced Computing Center, will ultimately consist of ~15,000 cores and 5PB of total disk space, will leverage a 100 Gbps connection between the sites, and will consist of a mix of large-scale homogeneous hardware and a smaller investment in heterogeneous components: high-memory, large-disk, low-power, GPU, and co-processor units. The majority of the testbed is now deployed and available to Early Users, with general availability planned for July this year. This talk will provide a detailed description of the available hardware capabilities as well as the workflow allowing users to develop their own experiments. To support a broad range of experiments, the project allows full user configurability of the software stack, from provisioning of bare metal and network interconnects to delivery of fully functioning cloud environments. This is achieved using an infrastructure developed on top of two open source software components: the Grid’5000 software and the widely adopted OpenStack system. We will discuss the current state of the system as well as projected future features.

Buffer-sharing Techniques in In-Situ Workflows: Challenges and Pitfalls
Pavan Balaji, ANL

Workflows are gaining increasing popularity to address the needs of scientific computing users that require multiple applications (or components) to process raw data before it is ready to be used or analyzed by a human. Traditional workflow models have relied on files as a medium of data exchange between such applications. In the recent past, there has been a flurry of research trying to optimize this model using NVRAM-based data sharing, memory-to-memory communication techniques, and even zero-copy techniques using shared buffers. In this talk, I’ll discuss some of the challenges in such buffer-sharing techniques with respect to their impact on the computational model as well as on data representation requirements. Specifically, shared memory buffers do not have the same properties as private buffers with respect to how a compiler views them, or how the operating system assigns physical pages to them. This makes it hard for computations to be carried out directly on such shared memory regions at the same efficiency as those on private memory regions. Furthermore, different computations have different data layout requirements. Thus, even if a process can “hand off” data through a shared-memory buffer, unless the data is laid out exactly how the next application needs it, the cost of working with a bad data layout is often much higher than the cost of simply reorganizing the data before carrying out the required computation. The talk will likely raise more questions than give answers, but is intended to showcase a problem of interest and seek potential collaborations in our search for a solution.

OpenPOWER: First Performance Results for Scientific Applications
Paul Baumeister, JSC

We will report on first experiences with selected scientific applications on IBM POWER8 servers with NVIDIA K40 GPUs. Applications from different research fields will be considered, which pose different requirements on the hardware architecture. Our performance evaluation, based on currently available hardware, will be analyzed with the future roadmap for these technologies in mind.

Experiences with Blue Gene Active Storage
Nicolas Vandenbergen, JSC

We report on experiences with Blue Gene Active Storage (BGAS) on the JUQUEEN system. Use cases of the BGAS system from different fields are presented, with a focus on active-storage-centered applications.

Recent CFD Research in the SimLab FSE
Andreas Lintermann, JSC

This talk introduces the SimLab concept of the Jülich Supercomputing Center (JSC) and the Jülich Aachen Research Alliance, High Performance Computing (JARA-HPC) and summarizes recent research activities of the SimLab “Highly Scalable Fluids & Solids Engineering” (SLFSE). In more detail, the results of CFD simulations in the field of human respiration, i.e., the simulation of the flow in the human nasal cavity and of particle deposition in the human lung using a Lattice-Boltzmann solver coupled to a Lagrangian particle solver, are presented. Furthermore, recent advances in the simulation of aircraft noise and in the optimization of pharmaceutical and chemical processes are given. Finally, prospective research topics and their challenges, e.g., the efficient simulation of respiratory sleep disorders and shape optimization of chevrons for noise reduction of aircraft engines, are discussed.

Task-Local Parallel I/O Support for Parallel Performance Analysis Tools with SIONlib
Wolfgang Frings, JSC

Parallel performance analysis tools like Scalasca and the performance measurement runtime infrastructure Score-P often need to store event traces efficiently in multiple task-local files to record performance data. For very large numbers of processors, these tools often experience scalability limitations, since creating thousands of files simultaneously causes metadata-server contention and large file counts complicate file management. The parallel I/O library SIONlib, which alleviates this issue, has recently been extended to support the special API and data-management requirements of these parallel tools. In this talk we will briefly present the design principles of SIONlib and cover its additional features for tool support.
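The container-file idea behind this can be sketched as follows; this is a toy emulation (hypothetical helper names, no alignment to file-system block boundaries, no embedded metadata), not SIONlib's actual API. Each "rank" writes its task-local chunk at a precomputed offset in one shared file, so no per-task files are created:

```python
import os
import tempfile

def write_task_local(path, chunks, block_size):
    """Emulate task-local I/O into ONE shared container file: rank r
    writes its chunk at offset r * block_size. The real library
    additionally stores metadata and aligns blocks to file-system
    boundaries to avoid lock contention."""
    with open(path, "wb") as f:
        f.truncate(block_size * len(chunks))     # reserve one block per rank
        for rank, data in enumerate(chunks):
            assert len(data) <= block_size
            f.seek(rank * block_size)
            f.write(data)

def read_task_local(path, rank, size, block_size):
    """Read back one rank's task-local chunk from the container."""
    with open(path, "rb") as f:
        f.seek(rank * block_size)
        return f.read(size)

path = os.path.join(tempfile.mkdtemp(), "traces.container")
chunks = [b"rank0-events", b"rank1-events", b"rank2-events"]
write_task_local(path, chunks, block_size=64)
print(read_task_local(path, 1, len(chunks[1]), 64))  # b'rank1-events'
```

The payoff is that thousands of logical task-local files collapse into a single metadata entry on the parallel file system.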

Dynamic Node Replacement and Adaptive Scheduling for Fault Tolerance
Suraj Prabhakaran, JSC

Batch systems traditionally support only static resource management, wherein a job’s resource set is unchanged throughout execution. Node failures force the batch system to restart affected jobs on a fresh allocation (typically from a checkpoint) or to replace failed nodes with statically allocated spare nodes. As future systems are expected to have high failure rates, this solution leads to increased job restart overhead, additional long waiting times before job restart, and excessive resource wastage. In this talk, we present an extension of the TORQUE/Maui batch system with dynamic resource management facilities that enables instant replacement of failed nodes for affected jobs without requiring a job restart. The proposed batch system supports all job types for scheduling – rigid, moldable, evolving and malleable. We present an algorithm for the combined scheduling of all job types and show how the unique features of various jobs and the scheduling algorithm can expedite node replacements. The overall expected benefit of this approach is a highly resilient cluster environment that ensures timely completion of jobs and maintains high throughput even under frequent node failures.

VI-HPS and Scalasca
Brian Wylie, JSC

The Virtual Institute – High Productivity Supercomputing (VI-HPS) combines the expertise of twelve partner institutions (including JLESC members JSC and BSC) in the development and application of tools for HPC program development, analysis and optimisation. VI-HPS provides training in the application of these tools at PRACE Advanced Training Centres in Europe and at the invitation of other organizations around the world, particularly in the form of VI-HPS Tuning Workshops, where application developers bring along their own codes and are assisted in applying the tools to them. JSC contributes the open-source Scalasca toolset for scalable performance analysis of large-scale applications, along with the associated Score-P instrumentation and measurement infrastructure and CUBE analysis report utilities that are also used by other tools (including Periscope, TAU and Vampir).

First steps towards an automatic load balancing for the Fast Multipole Method
David Haensel, JSC

The Fast Multipole Method is a generic toolbox algorithm for many important scientific applications, such as molecular dynamics, plasma physics or astrophysics. To reach maximum performance on different hardware hierarchies, a more sophisticated parallelization, particularly with regard to work distribution, is required. As a very first step we introduced a high level of abstraction on the algorithm and the communication side. The communication layer features a few communicable data types used during the calculation, which hide the MPI communication. The top of the algorithm layer features work packages serving as a task manager for every rank. Work packages are constructed out of work units, depending on the requirements for the calculation of the final targets. These work units can be calculation or communication tasks and thus determine whether a value is obtained by calculation or by communication. With this structure we will be able to implement a load balancer that makes decisions based on different strategies in the future. The major strategy will be a partitioning of the communication graph that optimizes the load distribution and minimizes communication.

Parallel-in-Time Integration with PFASST
Robert Speck, JSC

The challenges arising from the extreme levels of parallelism required by today’s and future HPC systems mandate the mathematical development of new numerical methods featuring a maximum degree of concurrency. Iterative time integration methods that can provide parallelism along the temporal axis have become increasingly popular in recent years.
The recently developed “parallel full approximation scheme in space and time” (PFASST) can integrate multiple time steps simultaneously in a multigrid-like fashion. Based on multilevel spectral deferred corrections (MLSDC), PFASST is able to apply multiple coarsening strategies in space and time. Here, careful balancing between aggressive coarsening and fast convergence is necessary. In this talk, we will discuss various application-tailored coarsening strategies and show recent results on successful combinations of space-parallel solvers with PFASST. We will highlight extreme-scale benchmarks with a multigrid solver on up to 448K cores of the IBM Blue Gene/Q installation JUQUEEN and describe ongoing work on developing a space-time parallel tree code for plasma physics applications using a high-order Boris integrator.
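PFASST itself is beyond a short sketch, but its simpler relative Parareal conveys the core parallel-in-time iteration: a cheap coarse propagator G corrects expensive fine propagations F that could run concurrently across time slices. A minimal, sequentially emulated version (explicit Euler for both propagators; all names are illustrative) for dy/dt = -y:

```python
def euler(y, t0, t1, nsteps, f):
    """Explicit Euler from t0 to t1 with nsteps steps."""
    h = (t1 - t0) / nsteps
    for _ in range(nsteps):
        y = y + h * f(y)
    return y

def parareal(y0, T, N, K, f, fine_steps=100):
    """Parareal: coarse G = one Euler step per slice, fine F =
    fine_steps Euler steps per slice; K correction iterations over
    N time slices (the F sweep is the part that parallelizes)."""
    ts = [T * n / N for n in range(N + 1)]
    U = [y0]                                    # initial coarse guess
    for n in range(N):
        U.append(euler(U[n], ts[n], ts[n + 1], 1, f))
    for _ in range(K):
        Gold = [euler(U[n], ts[n], ts[n + 1], 1, f) for n in range(N)]
        F = [euler(U[n], ts[n], ts[n + 1], fine_steps, f)
             for n in range(N)]                 # embarrassingly parallel in time
        Unew = [y0]
        for n in range(N):                      # sequential coarse correction
            Gnew = euler(Unew[n], ts[n], ts[n + 1], 1, f)
            Unew.append(Gnew + F[n] - Gold[n])
        U = Unew
    return U

f = lambda y: -y
U = parareal(1.0, 1.0, N=8, K=8, f=f)
# After K = N iterations Parareal reproduces the serial fine solution,
# which is close to exp(-1):
print(abs(U[-1] - 0.36787944117144233) < 1e-2)  # True
```

PFASST improves on this pattern by interleaving SDC sweeps on a space-time hierarchy instead of running F to completion each iteration.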

Which Verification for Soft Error Detection?
Hongyang Sun, Inria

Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with a given cost and recall (the fraction of all errors that are actually detected). The main contribution of this work is to show which detector(s) to use, and to characterize the optimal computational pattern for the application: how many detectors of each type to use, together with the length of the work segment that precedes each of them. We conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to the optimal for a realistic set of evaluation scenarios.

Fast and Accurate Simulation of Multithreaded Dense and Sparse Linear Algebra Solvers
Arnaud Legrand, Inria

Multi-core architectures comprising several GPUs have become mainstream in the field of High Performance Computing. However, obtaining the maximum performance of such heterogeneous machines is challenging, as it requires carefully offloading computations and managing data movements between the different processing units. The most promising and successful approaches so far build on task-based runtimes that abstract the machine and rely on opportunistic scheduling algorithms. As a consequence, the problem gets shifted to choosing the task granularity, the task graph structure, and optimizing the scheduling strategies. Trying different combinations of these alternatives is itself a challenge: getting accurate measurements requires reserving the target system for the whole duration of experiments, and observations are limited to the few available systems at hand and may be difficult to generalize. We show how we crafted a coarse-grain hybrid simulation/emulation of StarPU, a dynamic runtime for hybrid architectures, on top of SimGrid, a versatile simulator for distributed systems. This approach makes it possible to obtain performance predictions of both classical dense and sparse linear algebra kernels that are accurate to within a few percent, in a matter of seconds, while keeping track of aspects such as memory consumption that are critical for sparse linear algebra. This allows both runtime and application designers to quickly decide which optimization or scheduler to enable, or whether it is worth investing in higher-end GPUs or additional memory. Additionally, it allows robust and extensive scheduling studies to be conducted in a controlled environment whose characteristics are very close to those of real platforms, while having reproducible behavior.

Improving parallel I/O with topology-aware aggregators mapping
Emmanuel Jeannot, Inria

The standard behavior for parallel I/O with MPI consists in electing some aggregators among a set of processes (Pset) to gather the pieces of data and write them to disk (I/O node). By default, the aggregators are elected following a greedy policy. Just as an intelligent mapping of processes can reduce the communication cost between them, a relevant choice of aggregators can yield gains in terms of congestion, access cost and communication cost.
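The flavor of such an election can be sketched as follows; the `distance` function and the linear topology in the example are hypothetical stand-ins, not the election policy actually proposed in the talk:

```python
# Illustrative sketch: elect as aggregator the process whose total topology
# distance to all members of the Pset is minimal, so gathered data travels
# the fewest network links overall.

def choose_aggregator(pset, distance):
    """pset: list of ranks; distance(a, b) -> hop count between two ranks."""
    return min(pset, key=lambda a: sum(distance(a, p) for p in pset))

# Example with a hypothetical linear topology where hop count is rank distance:
agg = choose_aggregator(list(range(5)), lambda a, b: abs(a - b))  # middle rank wins
```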

Optimal Multistage Algorithm for Adjoint Computation
Guillaume Aupy, Inria

We reexamine the work of Stumm and Walther on multistage algorithms for adjoint computation. We provide an optimal algorithm for this problem when there are two levels of checkpoints, in memory and on disk. Previously, optimal algorithms for adjoint computations were known only for a single level of checkpoints with no writing and reading costs; a well-known example is the binomial checkpointing algorithm of Griewank and Walther. Stumm and Walther extended that binomial checkpointing algorithm to the case of two levels of checkpoints, but they did not provide any optimality results. We bridge the gap by designing the first optimal algorithm in this context. We experimentally compare our optimal algorithm with that of Stumm and Walther to assess the difference in performance.
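For background, the single-level problem that the talk's two-level algorithm generalizes can be sketched with the classical Griewank-Walther recurrence: with `s` checkpoint slots (zero read/write cost), the minimal number of re-executed forward steps `t(l, s)` to reverse a chain of `l` steps is found by optimizing where to place the next checkpoint. This is a sketch of the known single-level recurrence, not the paper's two-level algorithm:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def t(l, s):
    """Minimal forward steps re-executed to reverse l steps with s checkpoints
    (in-memory checkpoints with zero write/read cost)."""
    if l <= 1:
        return 0                 # the state at hand suffices to reverse one step
    if s == 1:
        return l * (l - 1) // 2  # restart from the lone checkpoint every time
    # Advance j steps, checkpoint, reverse the tail with s-1 slots,
    # then reverse the head reusing all s slots.
    return min(j + t(l - j, s - 1) + t(j, s) for j in range(1, l))
```

The optimal `j` reproduces the binomial checkpointing schedule; the talk's contribution is the analogous optimality result when a second, slower (disk) checkpoint level with nonzero costs is added.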

Development of scalable high order finite element solvers for computational nanophotonics in the context of the C2S@Exa Inria Project Lab
Stéphane Lanteri, Inria

This talk will be concerned with the development of hybrid MIMD/SIMD high order finite element solvers for the simulation of light/matter interaction at the nanoscale. In the first part of the talk, we will present the context of this study, namely the C2S@Exa (Computer and Computational Sciences at Exascale) Inria Project Lab, an initiative launched in 2013 for a duration of 4 years. C2S@Exa is a multi-disciplinary initiative for high performance computing in computational sciences. In the second part of the talk, we will discuss our recent efforts towards the design of high performance numerical methodologies based on high order discontinuous Galerkin methods formulated on unstructured meshes, for the solution of the system of time-domain Maxwell equations coupled to dispersive material models relevant to nanophotonics.

CORSE: Compiler Optimizations and Runtime SystEms
Jean-François Mehaut, Inria

In this talk, the main research activities of the Corse Inria team will be presented. Corse is built on the combination of static and dynamic techniques from compilation and runtime systems, always with the goal of addressing high-performance and low-energy challenges. While compilers and runtime systems obviously share the common goal of improving code performance, they operate at different levels. Compilers typically apply hardware-specific optimizations (register usage, loop unrolling, pipelining, vectorization) to extract the most from the underlying architecture. Runtime systems, on their side, optimize resource allocation at a macroscopic level: they typically perform load balancing and map application data over the underlying architecture. In both cases, micro and macro, information on the underlying architecture is needed in order to select the best options. We believe the two worlds can mutually benefit from exchanging information and hints about application behavior and hardware capabilities. For instance, runtime systems have no precise information about the behavior of the tasks they have to schedule. Compilers can typically extract useful details such as computational complexity, data access patterns, etc. Transmitting such information to the runtime system (e.g., by attaching properties to tasks) could greatly improve task scheduling policies, for instance by influencing task allocation to better match the capabilities of the target processing units. The software developments will be based on LLVM, OpenMP and Charm++ to integrate the Corse contributions.

BOAST: Performance Portability Using Meta-Programming and Auto-Tuning
Brice Videau, Inria

Porting and tuning HPC applications to new platforms is of paramount importance, but tedious and costly in terms of human resources. Unfortunately, those efforts are often lost when migrating to new architectures, as optimizations are not generally applicable. In the Mont-Blanc European project, in collaboration with BSC, we tackle this problem from several angles. One of them is using a task-based runtime (OmpSs) to obtain adaptive scientific applications. Another is promoting auto-tuning of scientific applications. While computing libraries may be auto-tuned, HPC applications are usually hand-tuned. In today's fast-paced HPC world, we believe that HPC application kernels should be auto-tuned instead. Unfortunately, the investment to set up a dedicated auto-tuning framework is usually too expensive for a single application. Source-to-source transformations or compiler-based solutions exist, but sometimes prove too restrictive to cover all use cases. We thus propose BOAST, a meta-programming framework aimed at generating parametrized source code. The aim is for the programmer to express optimizations on a computing kernel orthogonally, enabling a thorough search of the optimization space. This also allows a lot of code factorization and thus a reduction of the code base. We will demonstrate the use of BOAST on a classical Laplace kernel, showing how our embedded DSL allows the description of non-trivial optimizations. We will also show how the BOAST framework enables performance and non-regression tests to be performed on the generated code versions, resulting in proven and efficient computing kernels on several architectures.
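BOAST itself is an embedded DSL for generating parametrized kernel source; the following Python sketch only illustrates the auto-tuning loop that surrounds such a generator (the parameter names and timing harness are invented for illustration):

```python
import itertools
import time

# Sketch of an auto-tuning loop: enumerate the parameter space, generate one
# kernel variant per point, benchmark it, and keep the fastest parameters.

def autotune(make_kernel, param_space, data):
    """make_kernel(**params) -> callable; param_space: {name: [values...]}."""
    best = None
    for values in itertools.product(*param_space.values()):
        params = dict(zip(param_space, values))
        kernel = make_kernel(**params)          # generate one variant
        start = time.perf_counter()
        kernel(data)                            # benchmark it
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best[1]:
            best = (params, elapsed)
    return best[0]
```

A real framework would also cache generated code, run each variant several times, and verify numerical results against a reference (the non-regression tests mentioned above).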

I/O Scheduling Algorithm Selection for Parallel File Systems
Francieli Zanon Boito, Inria

High Performance Computing applications rely on Parallel File Systems (PFS) to achieve good performance even when handling large amounts of data. It is usual for HPC systems to provide a shared storage infrastructure for applications. In this situation, when multiple applications concurrently access the shared PFS, their accesses affect each other in a phenomenon called "interference", which compromises the efficacy of I/O optimization techniques. In this talk, we focus on I/O scheduling as a tool to alleviate interference effects. We have conducted an extensive performance evaluation of five scheduling algorithms at a parallel file system's data servers. Experiments were executed on different platforms and under different access patterns. The results indicate that schedulers' results are deeply affected by applications' access patterns and by the characteristics of the underlying I/O system, especially the storage devices. Our results have shown that no scheduling algorithm is able to improve performance in all situations, and that the best choice depends on the characteristics of the applications and storage devices. For these reasons, we will discuss our approach to providing I/O scheduling that adapts to applications and devices. We use information about these two aspects to automatically select the best-fit scheduling algorithm for each situation. Our approach has provided better results than using the same algorithm in all situations, without adaptability, by successfully applying I/O scheduling techniques to improve performance while avoiding situations where they would impair performance.
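The shape of such an adaptive selection can be sketched as a lookup keyed on the observed access pattern and device type; the scheduler names and the particular mapping below are invented for illustration, not the selection logic actually evaluated in this work:

```python
# Hypothetical sketch of adaptive scheduler selection: classify the workload
# by access pattern and storage device, then pick a scheduling algorithm.

def select_scheduler(access_pattern, device):
    """access_pattern: 'contiguous' or 'strided'; device: 'hdd' or 'ssd'."""
    table = {
        ("contiguous", "hdd"): "aggregating",  # request aggregation pays off on disks
        ("strided",    "hdd"): "reordering",
        ("contiguous", "ssd"): "timeorder",    # SSDs need little reordering
        ("strided",    "ssd"): "noop",
    }
    return table[(access_pattern, device)]
```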

In-Situ Processing with FlowVR for Molecular Dynamics Simulations
Bruno Raffin, Inria

In this talk we will present the FlowVR framework for designing, deploying and executing in situ processing applications. The FlowVR framework combines a flexible programming environment with a runtime enabling efficient executions. Based on a component model, the scientist designs analytics workflows by first developing processing components that are then assembled into a dataflow graph through a Python script. At runtime the graph is instantiated according to the execution context, the framework taking care of deploying the application on the target architecture and coordinating the analytics workflows with the simulation execution. We will present our work through in situ processing scenarios developed for analysing Gromacs molecular dynamics simulations.

Chronos: Failure-Aware Scheduling in Shared Hadoop Clusters
Orcun Yildiz, Inria

Hadoop has emerged as the de facto state-of-the-art system for MapReduce-based data analytics. The reliability of Hadoop systems depends in part on how well they handle failures. Currently, Hadoop handles machine failures by re-executing all the tasks of the failed machines (i.e., executing recovery tasks). Unfortunately, this elegant solution is entirely entrusted to the core of Hadoop and hidden from Hadoop schedulers. This unawareness of failures may prevent Hadoop schedulers from operating correctly towards meeting their objectives (e.g., fairness, job priority) and can significantly impact the performance of MapReduce applications. This paper presents Chronos, a failure-aware scheduling strategy that enables early yet smart action for fast failure recovery while still operating within a specific scheduler objective. Upon failure detection, rather than waiting an uncertain amount of time to get resources for recovery tasks, Chronos leverages a waste-free preemption technique to carefully allocate these resources. In addition, Chronos considers data locality when scheduling recovery tasks to further improve performance. We demonstrate the utility of Chronos by combining it with the Fifo and Fair schedulers. The experimental results show that Chronos recovers to a correct scheduling behavior within only a couple of seconds and reduces job completion times by up to 43% compared to state-of-the-art schedulers.

Recent progress on numerical kernels for large scale computing on heterogeneous manycores
Luc Giraud, Inria

In this work we will discuss recent progress on the development, design and implementation of some basic numerical kernels for large-scale calculations, such as the FMM and the solution of sparse linear systems, including sparse direct and hybrid iterative/direct techniques.
We will detail some of their implementations on top of runtime systems to address performance portability across possibly heterogeneous manycore platforms. Finally, we will present their parallel performance on a few large-scale engineering applications, including some from the C2S@Exa initiative.

A generic Discontinuous Galerkin solver based on OpenCL task graph. Application to electromagnetic compatibility.
Philippe Helluy, Inria

We present how we have implemented a generic nonlinear Discontinuous Galerkin (DG) method in the OpenCL/MPI framework in order to achieve high efficiency. The implementation relies on a splitting of the DG mesh into sub-domains and sub-zones. Different kernels are compiled according to the zones' properties. We rely on the OpenCL asynchronous task-graph driver in order to overlap OpenCL computations and data transfers.
We show real-world industrial electromagnetic applications.

To Overlap or Not to Overlap: Optimizing Incremental MapReduce Computations for On-Demand Data Upload
Gabriel Antoniu, Inria

Research on cloud-based Big Data analytics has focused so far on optimizing the performance and cost-effectiveness of the computations, while largely neglecting an important as- pect: users need to upload massive datasets on clouds for their computations. This paper studies the problem of run- ning MapReduce applications when considering the simulta- neous optimization of performance and cost of both the data upload and its corresponding computation taken together. We analyze the feasibility of incremental MapReduce approaches to advance the computation as much as possible during the data upload by using already transferred data to calculate intermediate results. Our key finding shows that overlapping the transfer time with as many incremental computations as possible is not always efficient: a better solution is to wait for enough to fill the computational capacity of the MapReduce cluster. Results show significant performance and cost reduction compared with state-of-the-art solutions that leverage incremental computations in a naive fashion.
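The key finding suggests a launch condition of the following shape; this is a hedged sketch under assumed parameters (the split size and slot model are illustrative, not the paper's actual policy):

```python
# Sketch: trigger an incremental MapReduce round only once enough uploaded
# data has accumulated to occupy every map slot, rather than eagerly on
# every newly arrived chunk.

def should_launch(buffered_bytes, map_slots, split_size=128 * 2**20):
    """True when the buffered data yields at least one input split per slot."""
    return buffered_bytes // split_size >= map_slots
```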

Scalable and Automated GPU kernel Transformations in Production Stencil Applications
Mohamed Wahib, Riken

We present a scalable method for exposing and exploiting hidden localities in production GPU stencil applications. Our target is to find the best permutation of kernel fusions that minimizes redundant memory accesses. To achieve this, we first expose the hidden localities by analyzing inter-kernel data dependencies. Next, we use a scalable search heuristic that relies on a lightweight performance model to identify the best candidate kernel fusions. To make kernel fusion a practical choice, we developed an end-to-end method for automated transformation. A CUDA-to-CUDA transformation collectively replaces the user-written kernels with auto-generated kernels optimized for data reuse. Moreover, the automated method allows us to improve the search process by enabling kernel fission and thread-block tuning. We demonstrate the practicality and effectiveness of the proposed end-to-end automated method. With minimal intervention from the user, we improved the performance of six production applications, with speedups ranging from 1.12x to 1.76x.
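As an illustrative sketch only (the actual work uses a performance model and a scalable search heuristic not reproduced here), a greedy first step might pick the kernel pair sharing the most array accesses, as a proxy for the redundant memory traffic a fusion would eliminate:

```python
# Hypothetical sketch: kernels described by the set of arrays they touch;
# the pair with the largest overlap is the most promising fusion candidate.

def best_fusion(kernels):
    """kernels: {name: set of array names}. Return the pair with most reuse."""
    pairs = [(a, b) for a in kernels for b in kernels if a < b]
    return max(pairs, key=lambda p: len(kernels[p[0]] & kernels[p[1]]))
```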

Spare Node Substitution
Atsushi Hori, Riken

In the coming Exaflops era, fault resilience is believed to be a major issue. One of the recent research trends is user-level fault mitigation, where a user program manages failures so that it can survive them and continue its execution. However, some applications (e.g., stencil applications) can survive only when the number of nodes involved in the computation is invariant. To cope with this situation, having spare nodes to substitute for failed nodes seems to be a good idea. However, to the best of our knowledge, there has been almost no discussion of how many spare nodes should be allocated or how failed nodes should be substituted with spare nodes. In this talk, the possibility of communication performance degradation due to such substitutions will be shown, and several substitution methods will be presented and discussed.

Climate modeling towards exascale: the case of NICAM-LETKF
Hisashi Yashiro, Riken

The ensemble-based data assimilation system combining the Nonhydrostatic ICosahedral Atmospheric Model (NICAM) with the Local Ensemble Transform Kalman Filter (LETKF) is one of the proxy applications for the development of the Japanese post-K computer. This combined application requires high data throughput in every layer of the HPC system, such as memory, network and file I/O. Both an appropriate estimation of the expected hardware/middleware performance and efforts on the application side (including drastic modification of the current source code) are essential to achieve high total throughput in future exascale computing. We will introduce our co-design approach in the Japanese post-K project.

Large-Scale Parallel Image Composition for In Situ Visualization Framework
Jorji Nonaka, Riken

The in situ visualization and analysis approach has been shown to be promising for handling the ever-increasing size and complexity of large-scale parallel simulation results. In massively parallel environments, the sort-last visualization method, which requires parallel image composition at the end, has become the de facto standard. Since the image composition process requires communication among all participating nodes, its performance can potentially suffer as the number of nodes continues to increase. We have been investigating a parallel image composition approach for SURFACE (Scalable and Ubiquitous Rendering Framework for Advanced Computing Environment), a visualization framework for current and next-generation supercomputers.
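The core reduction in sort-last compositing can be sketched as a pairwise front-to-back merge with the "over" operator; this is a generic tree-composition sketch, not SURFACE's actual algorithm, and uses alpha-premultiplied single-channel pixels for brevity:

```python
# Sketch of sort-last image composition: each node holds a partial image;
# pairs are merged front-to-back with the "over" operator until one remains,
# as a binary-tree composition would across P nodes.

def over(front, back):
    """Per-pixel 'over' blend of (premultiplied value, alpha) pairs."""
    return [(vf + (1 - af) * vb, af + (1 - af) * ab)
            for (vf, af), (vb, ab) in zip(front, back)]

def composite(images):              # images ordered front to back
    while len(images) > 1:
        images = [over(images[i], images[i + 1]) if i + 1 < len(images)
                  else images[i] for i in range(0, len(images), 2)]
    return images[0]
```

Real implementations (direct-send, binary-swap, 2-3-4 decomposition) differ mainly in how the per-pixel work and communication are distributed across nodes, not in this blending rule.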

Overview and Future Plan of the XcalableMP Parallel Language
Hitoshi Murai, Riken

XcalableMP (XMP) is a PGAS language for distributed-memory parallel computers. It supports two models of parallel programming for high performance and productivity: a directive-based global-view model and an RDMA-based local-view model. In this talk, we give an overview of the XMP language specification and the implementation of our Omni XMP compiler. Furthermore, we explain the progress in designing the next version, XcalableMP 2.0.

Understanding Performance On Extreme Scale System Takes Big Data and Extreme Tools
Bill Kramer, UIUC

This talk begins by highlighting the systematic performance tools and data collection methods in place across the Blue Waters system. It will discuss the need for better data evaluation tools that can handle billions of data points per day. Using this petascale example, the talk will make some observations and propose some guiding directions for systematic extreme-scale performance evaluation. The second half of the presentation will discuss a planned initiative (proposed but awaiting funding) that would expand the Blue Waters Sustained Petascale Performance (SPP) test into a broader, NSF-wide sustained performance evaluation method.

An IDEAL Framework: Recent Work at the Intersection of Big Data and “Big Data”
Robert Sisneros, UIUC

Many types of "Big Data" are generated in the routine use and maintenance of an HPC resource. In addition to the large, structured scientific data generated by at-scale simulations run on supercomputers, the machines themselves generate substantial diagnostic data. The latter is data to which the recent explosion of popular "Big Data" techniques is applicable. In this talk we will first explore the differences in using visualization to analyze these differing types of data. We will then present current work on a web-based visualization framework for "Big Data" that is built on a scientific visualization foundation. The result is IDEAL: the Interactive, Dynamic, Etc. Analytics Library.

Topology-Aware Performance Modelling
Amanda Bienz, UIUC

Sparse matrix-vector multiplication (SpMV) is the main component of many iterative methods. The cost of an SpMV consists of the local computation as well as the cost of communicating values between processors. The computational performance is similar on many high-performance computers, as it depends only on the cost of a floating-point operation. However, the cost of communication varies widely across high-performance computers, depending on latency (the start-up cost of a message), network bandwidth, and the topology of the network. Standard performance models, such as the alpha-beta model, capture the difference between latency and bandwidth. These performance models can be improved by taking into account topological parameters, such as the distance each message must travel. Topology-aware methods, such as Abhinav Bhatele's topology manager, calculate the number of network links that must be traversed for a message to get from one node to another. The standard alpha-beta performance model can be improved with these topology-aware methods, allowing the model to capture the additional cost associated with traversing a large number of links and to model a minimal cost associated with network contention.
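The extension described above amounts to adding a per-hop term to the classic alpha-beta model; the parameter values below are placeholders, not measured machine constants:

```python
# Classic alpha-beta message-time model extended with a per-link-traversal
# term: time = alpha (latency) + hops * per_hop + nbytes * beta (1/bandwidth).

def msg_time(nbytes, hops, alpha=1e-6, beta=1e-9, per_hop=1e-7):
    """Predicted time for one point-to-point message of nbytes over `hops` links."""
    return alpha + hops * per_hop + nbytes * beta
```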

Numerical integrators for a plane-wave implementation of real-time time-dependent density functional theory
Andre Schleife, UIUC

The adiabatic Born-Oppenheimer approximation is prevalent in electronic-structure simulations and molecular dynamics studies, since it significantly reduces computational cost; however, within this approximation ultrafast electron dynamics is inaccessible. Achieving a computationally affordable, accurate description of real-time electron dynamics through time-dependent quantum-mechanical theory is arguably one of the greatest challenges in computational materials physics and chemistry today. Several groups are currently exploring real-time time-dependent density functional theory as a possible route, and we recently implemented this technique in the highly parallel Qbox/Qb@ll codes. The numerical integration of the time-dependent Kohn-Sham equations is highly non-trivial: using a plane-wave basis set leads to large Hamiltonians, which constrains what integrators can be used without losing computational efficiency. Here, we studied various integrators for propagating the single-particle wave functions explicitly in time while achieving high parallel scalability of the plane-wave pseudopotential implementation. We compare a fourth-order Runge-Kutta scheme, which we found to be conditionally stable and accurate, to an enforced time-reversal symmetry algorithm. Both are well suited for highly parallelized supercomputers, as proven by excellent performance on a large number of nodes on the BlueGene-based "Sequoia" at LLNL and the Cray XE6-based "Blue Waters" at NCSA. This allows us to apply our scheme to materials science simulations involving hundreds of atoms and thousands of electrons.
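A minimal sketch of the explicit fourth-order Runge-Kutta step for i dpsi/dt = H psi (here a tiny dense Hamiltonian, not the plane-wave Kohn-Sham operator of the actual implementation):

```python
import numpy as np

# One RK4 step for the time-dependent Schrodinger-like equation
# i dpsi/dt = H psi, i.e. dpsi/dt = -i H psi.

def rk4_step(H, psi, dt):
    f = lambda p: -1j * (H @ p)
    k1 = f(psi)
    k2 = f(psi + 0.5 * dt * k1)
    k3 = f(psi + 0.5 * dt * k2)
    k4 = f(psi + dt * k3)
    return psi + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```

For a diagonal H the exact propagator is a phase factor exp(-i E dt), which the RK4 step matches to O(dt^5) per step; the conditional stability mentioned in the abstract shows up as a maximum usable dt for a given spectral radius of H.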

The Einstein Toolkit: A Community Computational Infrastructure for Relativistic Astrophysics
Gabrielle Allen, UIUC

The Einstein Toolkit is a community-driven software platform of core computational tools to advance and support research in relativistic astrophysics and gravitational physics, with goals to broaden and support the numerical relativity and computational astrophysics communities, to facilitate interdisciplinary collaborations, and to leverage and drive advances in high-end computing cyberinfrastructure. Currently, the Einstein Toolkit involves over 100 registered users from over 50 different research groups world-wide and is among the premier tools enabling discoveries in strong and dynamical space-time phenomena. Furthermore, the toolkit is one of the platforms of choice for taking advantage of the most powerful computational hardware available to study astrophysical systems endowed with complex multi-scale/multi-physics properties and governed by Einstein's equations of General Relativity. This talk gives a brief overview of the current status of the Einstein Toolkit and provides some future directions.

BSC tools to study the computational efficiency of EC-Earth components
Miguel Castrilló, BSC

In this talk, we will present real and practical applications of the BSC performance tools, used to analyse and understand the performance of the EC-Earth climate model and to develop optimisations for it. This model is developed by the EC-Earth consortium and is widely used at BSC for climate prediction forecasts. The EC-Earth component models are IFS for the atmosphere, NEMO for the ocean, and LIM for the sea ice, coupled through OASIS. A coupled model composed of different components, running in different configurations and at different resolutions, poses a challenge for computer scientists: identifying the parts of the code that should be improved to increase application performance. The methodology and examples of the improvements made will be presented and discussed.

Performance Impacts with Reliable Parallel File Systems at Exascale Level
Ramon Nou, BSC

The introduction of Exascale storage into production systems will lead to an increase in the number of storage servers needed by parallel file systems. In this scenario, parallel file system designers should move from the current replication configurations to the more space- and energy-efficient erasure-coded configurations across storage servers. Unfortunately, the current trends in energy efficiency are directed toward creating less powerful clients, but a larger number of them (light-weight Exascale nodes), increasing the frequency of write requests and therefore creating more parity update requests. We investigate RAID-5 and RAID-6 parity-based reliability organizations in Exascale storage systems. We propose two software mechanisms to improve the performance of write requests. The first mechanism reduces the number of operations needed to update a parity block, improving the performance of writes by up to 200%. The second mechanism allows applications to notify the system when reliability is needed by the data, delaying the parity calculation and improving performance by up to 300%. Using our proposals, traditional replication schemes can be replaced by reliability models like RAID-5 or RAID-6 without the expected performance loss.
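For context, the textbook read-modify-write identity that makes small RAID-5 writes touch only one data block and the parity block (rather than the whole stripe) is new_parity = old_parity XOR old_data XOR new_data; the mechanisms above reduce or defer exactly this kind of parity work. A minimal sketch:

```python
# RAID-5 read-modify-write parity update: XOR out the old data block and
# XOR in the new one, leaving the contributions of untouched blocks intact.

def rmw_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))
```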

Large-scale Simulations for Biomedical Research at Organ Level
Mariano Vázquez, BSC

In this seminar we describe HPC-based simulations of biological systems at the organ level: targets, methods and strategies. Unlike in the molecular domain, large-scale simulations at the organ level are still far from being usual, especially when multi-scale / multi-physics aspects are involved. The interest is high indeed, as some of the latest Gordon Bell Prize awards have recognized achievements in this domain. In this talk we describe the research lines of the CASE department.

DLB: Dynamic Load Balancing Library
Marta Garcia, BSC

DLB is a dynamic library intended to increase the performance of hybrid applications by improving the load balance within a computational node. DLB redistributes the computational resources between the different processes running in a shared-memory node. The load balancing is done at runtime, allowing imbalances coming from different sources to be solved transparently to the user. DLB supports different programming models and offers an API that can be used by programming-model runtimes or by application developers to provide useful information for load-balancing purposes.

Asynchronous algorithms to mitigate faults recoveries and enable approximate computing
Marc Casas, BSC

Asynchronous algorithms have been shown to be useful for enabling different kinds of low-overhead resilience strategies. Additionally, since synchronization points constitute an important performance burden in High Performance Computing workloads, novel ideas are starting to emerge to mitigate such costs by trading accuracy for performance, dropping bits that do not contribute significantly to the final algorithm output. The talk will provide some results recently obtained at BSC to illustrate the potential of asynchronous and approximate computations for the future of HPC.

Persistent data as a first-class citizen in parallel programming models
Ana Queralt, BSC

DataClay is a storage platform that manages data in the form of objects. It enables the applications on top of it to deal with distributed persistent objects transparently, in the same way as if they were in memory. In addition, dataClay takes the concept of moving computation to the data to its limit by never separating the data from the methods that manipulate it.
By always keeping data and code together, dataClay makes it easier for programming models such as COMPSs to take advantage of data locality, for instance by means of locality-aware iterators that help to exploit parallelism. The combination of these two technologies provides a powerful solution to access and compute on huge datasets, allowing applications to easily handle objects that are too big to fit in memory or that are distributed among several nodes.
In this talk we will address how persistent data can be integrated in the programming model by presenting the integration of dataClay and COMPSs, both from the point of view of an application that manages objects and from the point of view of the runtime.

Programmability in PyCOMPSs
Rosa M Badia, BSC

One of today's concerns in application development is programmability. While the concept is well understood, it is difficult to measure. PyCOMPSs programmability is based on sequential programming and on the preservation of the expressivity and potential of programming languages, using only a few additions (annotations and a small API). This talk will present several examples of PyCOMPSs programming compared with other programming models, such as Apache Spark, as well as examples of porting libraries and codes to this programming model.
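The flavor of the annotation-based model can be sketched with a stand-in decorator; the real decorator lives in the PyCOMPSs API (`pycompss.api.task`), whereas the one below merely runs tasks inline so the sequential program stays valid without the runtime:

```python
import functools

# Stand-in for a PyCOMPSs-style @task annotation: the sequential Python code
# is untouched; under the real runtime, decorated calls become asynchronous
# tasks scheduled on the available resources. This mock executes them inline.

def task(returns=None):
    def wrap(fn):
        @functools.wraps(fn)
        def run(*args, **kwargs):
            return fn(*args, **kwargs)   # the real runtime would schedule this
        return run
    return wrap

@task(returns=int)
def increment(x):
    return x + 1

@task(returns=int)
def total(a, b):
    return a + b
```

The point of the model is that removing (or mocking) the decorators leaves a plain sequential program, which is the programmability claim made above.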

Study of the use of the Folding hardware-based profiler to assist in data distribution for heterogeneous memory systems in HPC
Harald Servat, BSC

Argonne’s research in data distribution and partitioning for heterogeneous memory compute nodes currently relies on a simulator-based data-oriented profiler as a first stage. The current profiling stage is time-consuming. We are interested in evaluating the possibility of adapting and using the profiling tool “Folding” from BSC for this purpose. Since it is based on hardware counters, it seems clear that the profiling time will be greatly reduced. Given the lossy nature of profilers based on hardware counters, however, it will be interesting to determine if this solution provides sufficient resolution for the subsequent stage to generate a well-optimized data distribution. In this talk we will present the project and its current status.

Efficient Software-based Fault Tolerance for Memory Errors in the Exascale Era
Omer Subasi, BSC

Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and in the number of memory devices in Exascale systems. Error Correcting Codes (ECC) are the most commonly used techniques for memory systems. However, state-of-the-art hardware ECCs will not be sufficient in terms of error coverage for future computing systems, and stronger hardware ECCs providing more coverage have prohibitive costs in terms of area, power and latency. Software-based solutions are needed to cooperate with the hardware. In this work, we propose three runtime-based software mechanisms with diverse fault-tolerance capabilities as well as space/memory costs. This provides the flexibility to tailor fault tolerance to the system's needs. We show that all three mechanisms incur low performance overhead, on average 3%, and are highly scalable. Somewhat surprisingly, we find that software-based CRC protection is feasible, providing correction for up to 32-bit burst (consecutive) and 5-bit arbitrary errors while incurring only 1.7% performance overhead with hardware acceleration. Finally, we provide a recipe for how and when to adapt our mechanisms, and we analyze their reliability and error coverage. We find that our design reduces the Chipkill undetected error rate by as much as 10^15 times, which is vital considering Exascale error rates.
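The detection half of the software CRC idea can be sketched in a few lines; the real mechanisms (correction of 32-bit bursts, hardware acceleration, page granularity) are far richer than this illustrative fragment:

```python
import zlib

# Sketch of software CRC memory protection: keep a CRC32 alongside each
# protected region and verify it on access. Only detection is shown here;
# correction requires additional redundancy.

def protect(region: bytes):
    """Return the region together with its CRC32 checksum."""
    return region, zlib.crc32(region)

def check(region: bytes, crc: int) -> bool:
    """True if the region still matches its stored checksum."""
    return zlib.crc32(region) == crc
```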