2nd JLESC Workshop Agenda

Note: if you open this page on your smartphone and click on a restaurant address, Google Maps will launch and show you the route (turn-by-turn navigation).

Here is a map of the meeting rooms.

 

Main Topics | Schedule | Speaker | Affiliation | Type of presentation | Title (tentative)
Sunday Nov. 23rd

Dinner Before the Workshop (meeting place: directly at the restaurant)
19:00 | Only people registered for the dinner (included) | Vapiano, 44 S Wabash Avenue, Chicago, IL 60603, tel: 312 384 1960

21:00 | Postdocs, Ph.D.s and students meeting | Co-chairs: Sheng Di and Min Si | Room: Inspiration Studio
Workshop Day 1 | Monday Nov. 24th

7:00 | Breakfast | Room: Historic Art Hall
8:00 | Registration | Room: Ballroom Prefunction (hotel, conference room level)

Welcome and Introduction
Chair: Franck Cappello, ANL | Room: Crystal Ballroom
8:30 | Franck Cappello | ANL, UIUC and Inria | Background | Welcome, Workshop objectives and organization
8:45 | Marc Snir | ANL and UIUC | Background | Argonne Novelties and vision of the collaboration
8:55 | Bill Kramer | UIUC/NCSA | Background | UIUC Novelties and vision of the collaboration
9:05 | Claude Kirchner | Inria | Background | Inria Novelties and vision of the collaboration
9:15 | Jesus Labarta | BSC | Background | BSC Novelties and vision of the collaboration
9:30 | Danny Powell | UIUC/NCSA | Background | GECAT (Global Initiative to Enhance @scale and distributed Computing and Analysis Technologies)

10:00 | BREAK | Room: Ballroom foyer
Chair: Marc Snir, ANL | Room: Crystal Ballroom
10:30 | Bill Harrod | DoE | Background | DoE Exascale Initiative
11:00 | Jesus Labarta | BSC | Background
11:30 | Naoya Maruyama | Riken | Background | The RIKEN Miniapp Suite

12:00 | LUNCH | Room: Historic Art Hall
Parallel session 1: Applications and Numerical Libraries
Chair: Bill Gropp, UIUC | Room: Crystal Ballroom
13:30 | Salman Habib | ANL | Research | Future Plans for HACC
14:00 | Marie-Alice Foujols | CNRS | Research | Climate modelling - current status of IPSL climate model and work plan for exascale climate model
14:30 | Mariano Vazquez | BSC | Research | Alya: multi-physics simulations for large-scale supercomputers
15:00 | Ivo Kabadshow | JSC | Research | More bang for the buck. Advancing FMMs for MD to the next level

15:30 | BREAK | Room: Ballroom foyer

16:00 | Andrew Siegel | ANL | Research | Trends in Next Generation HPC Architectures and Their Impact on Computational Methods for Nuclear Reactor Analysis
16:30 | Paul F. Baumeister | JSC | Research | Co-designing Exascale architectures with linear-scaling density-functional calculations
17:00 | Jean-François Mehaut | Inria | Research | Simulation of Seismic Wave Propagation on a Low Power Manycore Processor
17:30 | Jed Brown | ANL | Research | How can we quantify performance versatility?
18:00 | Adjourn
Parallel session 2: I/O, Storage, Visualization
Chair: Rob Ross, ANL | Room: Chicago/Alton
13:30 | Toni Cortes | BSC | Research | dataClay: shaping the future of data sharing
14:00 | Tom Peterka | ANL | Research | From particles to meshes to grids: Data movement within and between data analysis codes
14:30 | Florin Isaila | ANL and University Carlos III (Spain) | Research | CLARISSE: Reforming the I/O stack of high-performance computing platforms
15:00 | Matthieu Dorier | Inria | Research | Energy/Performance Tradeoffs in Post-Petascale I/O Approaches: an Insight using Damaris

15:30 | BREAK | Room: Ballroom foyer

16:00 | Venkat Vishwanath | ANL | Research | Addressing Data Movement Challenges at Extreme Scales
16:30 | Lokman Rahmani | Inria | Research | Towards a generic framework for post-processing tasks coupling for HPC applications
17:00 | Kate Keahey | ANL | Research | Chameleon: A Large-scale, Reconfigurable Experimental Environment for Next Generation Cloud Research
17:30 | Adjourn

19:00 | DINNER | Chicago Firehouse, 1401 S Michigan Ave, Chicago, IL 60605, tel: 312 786 1401
21:00 | Postdocs, Ph.D.s and students meeting | Co-chairs: Sheng Di and Min Si | Room: Inspiration Studio
Workshop Day 2 | Tuesday Nov. 25th

7:00 | Breakfast | Room: Historic Art Hall

Keynote
Chair: Sergi Girona, BSC | Room: Crystal Ballroom
8:30 | Charlie Catlett | ANL | Research | Understanding Cities: Opportunities for Computation, Data Analytics, and Embedded Systems

Parallel session 3: Resilience, Fault Tolerance
Chair: Franck Cappello, ANL | Room: Crystal Ballroom
9:00 | Marc Casas Guix | BSC | Research | Exploiting asynchronous programming models to reduce faults impact in iterative solvers
9:30 | Guillaume Aupy | Inria | Research | Scheduling computational workflows on failure-prone platforms

10:00 | BREAK | Room: Ballroom foyer

10:30 | Leonardo Bautista Gomez | ANL | Research | Detecting Silent Data Corruption for Extreme-Scale Applications through Data Mining
11:00 | Tatiana Martsinkevich | Inria | Research | Fault tolerant protocol for OmpSs+MPI hybrid HPC applications
11:30 | Christine Morin | Inria | Research | Checkpointing as a Service in Heterogeneous Cloud Environments

12:00 | LUNCH | Room: Historic Art Hall
Parallel session 4: Programming Model/Runtime and Tools
Chair: Jean-François Mehaut, Inria | Room: Chicago/Alton
9:00 | Javier Bartolome | BSC | Research | Integration of batch and performance monitoring tools
9:30 | Arnaud Legrand | Inria | Research | Update on the SMPI framework and introduction to spatio-temporal aggregation

10:00 | BREAK | Room: Ballroom foyer

10:30 | Antonio Pena | ANL | Research | The Upcoming Era of Memory Heterogeneity in Compute Nodes
11:00 | Harald Servat | BSC | Research | Folding: instantaneous performance
11:30 | Torsten Hoefler | ETH | Research | Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization

12:00 | LUNCH | Room: Historic Art Hall
Plenary talk
Chair: Sanjay Kale, UIUC | Room: Crystal Ballroom
13:30 | Mateo Valero | BSC | Forward looking | Runtime Aware Architectures

Parallel session 5: Resilience, Fault Tolerance
Chair: Yves Robert, Inria | Room: Crystal Ballroom
14:00 | Catello Di Martino | UIUC | Research | Measuring and Understanding Resilience at Extreme Scale: a Field Study of 5,000,000 Applications
14:30 | Sheng Di | ANL | Research | An Efficient Silent Data Corruption Detection Method with Error-feedback Control and Even Sampling for HPC Applications
15:00 | Aurélien Cavelan | Inria | Research | Assessing general-purpose algorithms to cope with fail-stop and silent errors

15:30 | BREAK | Room: Ballroom foyer

16:00 | Omer Subasi | BSC | Research | Reliability Modeling of Selective Redundancy for Task-parallel Dataflow Programs
16:30 | Ana Gainaru | UIUC | Research | Dealing with prediction unfriendly failures: the road to specialized predictors
17:00 | Thomas Herault | UTK | Research | Soft Error Resilience in Dynamic Task-Based Runtime Systems
17:30 | PANEL | Connection to Industry | Chairs: Bill Kramer, UIUC and Mateo Valero, BSC
18:30 | Adjourn
Parallel session 6: Programming Model/Runtime and Tools
Chair: Pavan Balaji, ANL | Room: Chicago/Alton
14:00 | Judit Gimenez | BSC | Research | BSC Performance Models
14:30 | Carsten Karbach | JSC | Research | System monitoring with LLview and the Parallel Tools Platform
15:00 | Edson L. Padoin | UFRGS-Inria | Research | Energy Consumption Reduction through Load Balancing and DVFS on Residually Imbalanced Cores

15:30 | BREAK | Room: Ballroom foyer

16:00 | Emmanuel Jeannot | Inria | Research | Topology-aware Resource Selection
16:30 | Min Si | ANL | Research | Casper: An Asynchronous Progress Model for MPI RMA on Many-Core Architectures
17:00 | Yanhua Sun | UIUC | Research | An automatic control system for runtime adaptivity
18:30 | Adjourn

19:00 | DINNER | Mercat a la Planxa, inside the Blackstone Hotel
21:00 | Postdocs, Ph.D.s and students meeting | Co-chairs: Sheng Di and Min Si | Room: Inspiration Studio
Workshop Day 3 | Wednesday Nov. 26th

7:00 | Breakfast | Room: Historic Art Hall

8:30 | Plenary | Franck Cappello, Yves Robert, Bill Kramer, Jesus Labarta | Inria, UIUC/NCSA, ANL, BSC | Open discussion about the workshop

Plenary talk
Chair: Indranil Gupta, UIUC | Room: Crystal Ballroom
9:00 | Gabriel Antoniu | Inria | Forward looking | Transfer as a Service: Towards a Cost-Effective Model for Multi-Site Cloud Data Management

Session 7: Cloud and Distributed Algorithms
Chair: Frederic Desprez, Inria | Room: Crystal Ballroom
9:30 | Rosa Badia | BSC | Research | Programming distributed platforms with PyCOMPSs and its integration with persistent storage systems
10:00 | Ian Foster | ANL | Research | RAMSES: A new project in data-driven analytical modeling of distributed systems

10:30 | BREAK | Room: Ballroom foyer

11:00 | Indranil Gupta | UIUC | Research | Probabilistic CAP and Timely Adaptive Key-value Stores
11:30 | Andrew Chien | ANL-U.Chicago | Research | Characterizing Variation in Graph Computing Behavior

12:00 | CLOSING
12:30 | LUNCH | Room: Historic Art Hall
Danny Powell

GECAT (Global Initiative to Enhance @scale and distributed Computing and Analysis Technologies):

An NSF SAVI award to NCSA to support collaborations addressing global grand challenges.  We will discuss this recent award, who is involved, what it is scoped to do, and how it fits with the JLESC effort.

Naoya Maruyama

The RIKEN Miniapp Suite

We present an overview of the RIKEN miniapp suite called FIBER, a set of miniapps being developed and maintained at RIKEN AICS to promote faster, more effective codesign towards next-generation supercomputing. It originally started as a subproject of the Exascale Feasibility Study project conducted by RIKEN AICS and the Tokyo Institute of Technology, and is now part of the next-generation flagship supercomputer project at RIKEN AICS. Most of the FIBER miniapps are based on production applications that have mainly been developed and used in the Japanese computational science communities at supercomputing facilities such as the K computer. For each such production code, we collaborate with the original application developers to identify and extract the essential parts of the code for designing and evaluating next-generation machines. While the primary objective of our miniapps is to contribute to the development of the next-generation flagship machine, they are also freely available as (mostly) open-source programs so that a wide range of collaborations, whether academia-only or involving industry, can be conducted easily. As of this talk, the suite includes five released miniapps and several more that are still under development.

Salman Habib

Future Plans for HACC

HACC (Hardware/Hybrid Accelerated Cosmology Code) is a multi-platform, high-performance computational cosmology framework. HACC is designed around a particle-based simulation model for very high global dynamic range applications. In this talk, I will briefly describe the key features of HACC and cover plans for its future development, focusing on in-situ analysis, resilience features, and physics enhancements.

Marie-Alice Foujols

Climate modelling – current status of IPSL climate model and work plan for exascale climate model

After a short presentation of European cooperation around climate modeling, I will detail the current IPSL model (IPSLCM6) and our plan, based on DYNAMICO (a new dynamical core based on an icosahedral grid), to be able to run the IPSL climate model on O(100,000) cores. Some issues will be detailed to encourage possible cooperation with the JLESC labs.

Mariano Vazquez

Alya: multi-physics simulations for large-scale supercomputers

Thanks to a multidisciplinary task force, BSC research lines cover the full range of HPC-based simulation tools in an integrated way: the hardware architecture, the programming model and compiler, the data management and storage, the performance analysis tools, the mathematical and numerical model, the parallel implementation of a simulation code and its porting, and the visualization and analysis of results. This working strategy gives BSC a unique perspective. This talk is about the simulation codes developed at the CASE department, with special focus on Alya, our multi-physics parallel code.

Ivo Kabadshow

More bang for the buck. Advancing FMMs for MD to the next level

Classical biophysical molecular dynamics simulations utilize only a small portion of today's multi-petaflop hardware. The problem size, with millions or billions of particles, is already sufficient for most MD applications and will most certainly not strong-scale to exascale levels. Since the number of available flops is still increasing rapidly, we have to find other ways to harvest the additionally available cycles for MD simulations in the future. One way to increase the workload again is to introduce more complex physical models. We will present a lambda-dynamics scheme coupled with a Fast Multipole Method (FMM) to be introduced into GROMACS. Compared to currently available Particle-Mesh Ewald (PME) implementations, the FMM together with lambda-dynamics is expected to increase scalability greatly. Besides the sheer number of available flops, a rich heterogeneous hierarchy of processing units needs to be addressed within the implementation as well. In the second part of our talk we will highlight our efforts to describe the FMM operators in a performance-portable way for optimal CPU intranode utilization. We designed our current C++ version of the FMM to intrinsically exploit hardware features like SIMD, ILP, OoOE, and SMT on multiple cores, and we show "out-of-the-box" performance results as well as comparisons with results from in-order architectures like the BG/Q A2 processor.

Andrew Siegel

Trends in Next Generation HPC Architectures and Their Impact on Computational Methods for Nuclear Reactor Analysis

Next-generation HPC platforms will in many cases force application developers to reformulate fundamental algorithmic and implementation approaches that were adopted over the previous twenty years. Overall levels of concurrency, the relative cost of FLOP/s compared to data movement, available memory per floating point unit, the depth and complexity of the memory hierarchy, awareness of power costs, and overall resilience characteristics are a few broad areas where exascale-type machines are likely to depart significantly from current practice. While constrained to some degree by the technology, in designing future HPC systems there is still considerable latitude, both in a relatively broad range of design tradeoffs and in the programming models used to optimally express them. At the same time, regardless of specific design choices, most applications will need to evolve considerably to make efficient use of these systems, including developing new algorithmic implementations, formulations, and potentially even new mathematical descriptions of the target physical problem. CESAR (The Center for Exascale Simulation of Advanced Reactors) is a project developed explicitly to address the "push" and "pull" of co-design for nuclear energy applications. In this talk I discuss in depth several concrete examples of recent developments and future research topics within the CESAR project.

Paul F Baumeister

Co-designing Exascale architectures with linear-scaling density-functional calculations

Quantum mechanical ab initio calculations in the framework of density functional theory have proven to be the best affordable method to predict a large set of material properties for any possible structure of atoms. Many real-world materials obtain their characteristics from broken symmetries, which require huge simulations. KKRnano can perform DFT calculations with order-N scaling at system sizes beyond several thousand atoms. This implementation exposes a huge task parallelism and can benefit strongly from accelerator components. We will introduce our application-oriented co-design approach for the evaluation of potential Exascale compute-node architectures and demonstrate its strength using KKRnano as an example.

Jean-François Mehaut

Simulation of Seismic Wave Propagation on a Low Power Manycore Processor

Large-scale simulation of seismic wave propagation is an active research topic. Its high demand for processing power makes it a good match for High Performance Computing (HPC). Although we have observed a steady increase in the processing capabilities of HPC platforms, their energy efficiency still lags behind. In this talk, we analyze the use of a low-power manycore processor, the MPPA-256, for seismic wave propagation simulations. First, we look at its peculiar characteristics, such as the limited amount of on-chip memory, and describe the intricate solution we brought forth to deal with this processor's idiosyncrasies. Next, we compare the performance and energy efficiency of seismic wave propagation on the MPPA-256 to other commonplace platforms such as general-purpose processors and a GPU. Finally, we wrap up with the conclusion that, even if the MPPA-256 presents increased software development complexity, it can indeed be used as an energy-efficient alternative to current HPC platforms, consuming up to 71% and 81% less energy than a GPU and a general-purpose processor, respectively.

Jed Brown

How can we quantify performance versatility?

Real applications are often constrained by factors such as external requirements for time-to-solution, a need to fit within some level of memory, or to accommodate workflow demands involving human decisions, provenance, proprietary software, or the like. Consequently, the region of multi-dimensional configuration space that an application cares about may be disjoint from the single point or path chosen by authors when promoting their new algorithm or machine. Different authors are likely to choose different configurations, leading to results that are not very useful to applications. Drawing on experience and performance data obtained while developing the HPGMG benchmark (https://hpgmg.org) and working with PETSc applications, we discuss issues and methods for collecting and presenting performance data to express versatility and improve relevance to applications.

Toni Cortes

dataClay: shaping the future of data sharing

The value of big data comes from the possibility of extracting information from large amounts of raw data. And, as in real life, the most valuable information comes from merging shared information from different sources. Unfortunately, current sharing mechanisms are either too restrictive, and thus not flexible enough, or the data provider loses control over its asset (its data). This limitation prevents data owners and potential service designers from taking advantage of the available data. In this talk we will introduce the idea of self-contained objects and how third-party enrichment of such objects can offer an environment where data providers keep full control over their data while service designers get maximum flexibility.

Tom Peterka

From particles to meshes to grids: Data movement within and between data analysis codes

Scientific discovery hinges on the ability to analyze data, but as scientists’ ability to compute or collect raw data increases exponentially, our ability to process these data will become perhaps the single largest factor determining success or failure of a scientific campaign. Motivated by a use case in the analysis of a cosmology simulation, I will present an outline for a data analysis infrastructure designed to scale with the growth of raw data and future HPC architectures. This infrastructure includes a review of how to develop a single parallel data analysis code, and also a preview of a new project to couple multiple such codes together in a data analysis workflow.

Florin Isaila

CLARISSE: Reforming the I/O stack of high-performance computing platforms

Currently, the I/O software stack of high-performance computing platforms consists of independently developed layers (scientific libraries, middleware, I/O forwarding, parallel file systems) lacking global coordination mechanisms. This uncoordinated development model negatively impacts the performance of both independent applications and ensembles of applications relying on the I/O stack for data access. This talk will present the CLARISSE project's approach of redesigning the I/O stack to facilitate global optimizations, programmability, and extensibility. We will discuss a set of new abstractions that enable novel optimizations for critical aspects of the I/O stack, including data aggregation, buffering, staging, selection, and exploitation of data locality through in-situ and in-transit data processing.

Matthieu Dorier

Energy/Performance Tradeoffs in Post-Petascale I/O Approaches: an Insight using Damaris

A major challenge for future Exascale machines is sustaining a high performance-per-watt ratio. Many recent works have explored new approaches to I/O management aiming to reduce the I/O performance bottleneck exhibited by HPC applications (and hence to improve application performance). There is comparatively little work investigating the impact of I/O approaches on energy consumption. In particular, approaches that attempt to overlap computation with I/O have a beneficial effect on performance variability and thus on energy consumption. In this presentation, we closely examine different I/O approaches implemented in the Damaris I/O middleware and perform extensive experiments with the CM1 atmospheric model to evaluate their performance and their energy consumption. We then propose and validate a mathematical model to estimate the energy consumption of a simulation under different I/O approaches. This presentation will also be the occasion to give a brief summary of the status of the Damaris middleware, developed in the KerData team since 2010, including its most recent features and the research it has allowed us to conduct.

Venkat Vishwanath

Addressing Data Movement Challenges at Extreme Scales

We present approaches to improve data movement at various scales, including within a node, among nodes of a supercomputer, and between nodes on a wide-area network. These approaches focus on developing solutions primarily for upcoming data-centric workloads, including data analysis, I/O, and data transfers, as well as computational science simulations. We will present promising results from our efforts with applications on leadership systems and architectures.

Lokman Rahmani

Towards a generic framework for post-processing tasks coupling for HPC applications

Optimizing the time-to-solution for HPC experiments requires more than optimizing the simulation itself. Indeed, current HPC scientific experiments group many analysis tasks applied to the generated data to extract useful knowledge from it. Task coupling consists of connecting those analysis tasks with the simulation and with each other. The first part of this talk shows an example of such an HPC experiment, where the data generated by a climate simulation is filtered before being visualized, to keep only scientifically relevant data (this will be a follow-up to my previous JLESC talk). The second part of the talk will focus on work in progress to define a generic, reusable, efficient way of coupling analysis tasks into a dataflow, and how to achieve it.

Charlie Catlett

Understanding Cities: Opportunities for Computation, Data Analytics, and Embedded Systems

Charlie will discuss a number of initiatives at Argonne and the University of Chicago including the use of computational models to inform urban design, predictive analytics using urban operations data, and opportunities for embedded systems research in urban sensing and new information services.

Marc Casas Guix

Exploiting asynchronous programming models to reduce faults impact in iterative solvers

In the context of fault-tolerant computing, exploiting asynchronous programming models to tolerate the latency of recovery mechanisms by overlapping them with computation is a promising idea. To be effective, this scheme must dynamically adapt the workload depending on the hardware status and the prevalence of faults, without adding any programmability burden. In this talk, we show results obtained by deploying forward recoveries in the Conjugate Gradient (CG) solver, either by overlapping them with algorithmic computations or by forcing them into the critical path of CG. We show that a trade-off exists between the two approaches, depending on the error rate the solver experiences.

Guillaume Aupy

Scheduling computational workflows on failure-prone platforms

We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and whether or not to checkpoint a task after it completes. We give a polynomial-time algorithm for fork graphs and show that the problem is NP-complete with join graphs. Our main result is a polynomial-time algorithm to compute the execution time of a workflow with specified to-be-checkpointed tasks. Using this algorithm as a basis, we propose efficient heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.
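For context, a standard building block in this literature (a well-known result, not something specific to this talk) is the expected time to execute a segment of work w followed by a checkpoint of cost C, with recovery cost R, under exponentially distributed failures of rate lambda:

```latex
\mathbb{E}(w) \;=\; e^{\lambda R}\,\frac{1}{\lambda}\left(e^{\lambda (w + C)} - 1\right)
```

The scheduling problem then amounts to choosing the task order and the set of checkpointed tasks that minimize the sum of such terms.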

Leonardo Bautista Gomez

Detecting Silent Data Corruption for Extreme-Scale Applications through Data Mining

Next-generation machines are expected to have more components and, at the same time, consume several times less energy per operation. These trends are pushing supercomputer construction to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect some soft errors, a significant percentage of those errors pass unnoticed by the hardware. Such silent errors are extremely damaging because they can make applications silently produce wrong results. In this work we propose a technique that leverages certain properties of high-performance computing applications in order to detect silent errors at the application level. Our technique detects corruption solely based on the behavior of the application datasets and is application-agnostic. We propose multiple corruption detectors, and we couple them to work together in a fashion transparent to the user. We demonstrate that this strategy can detect the majority of the corruptions, while incurring negligible overhead. We show that with the help of these detectors, applications can have up to 80% of coverage against data corruption.

Tatiana Martsinkevich

Fault tolerant protocol for OmpSs+MPI hybrid HPC applications

During my talk I will present a fault tolerant protocol that can be applied to task-parallel applications to mitigate transient errors. Our approach limits the consequences of a fault to the task that experienced it and allows a fast and asynchronous recovery that is more efficient than the conventional full application rollback-recovery. We implemented it to fully support hybrid OmpSs+MPI applications. Experimental evaluation showed that the protocol has low overhead and does not have an impact on the application scalability.

Thomas Herault

Soft Error Resilience in Dynamic Task-Based Runtime Systems

As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead resilience extension becomes a desirable property of any programming paradigm. In this talk, I will present three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms. The first recovers the application by re-executing the minimum required sub-DAG, the second takes critical checkpoints of the data flowing between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. These mechanisms have been implemented in the PaRSEC task-based runtime framework. Experimental results validate the approach and quantify the overhead introduced by such mechanisms.

Javier Bartolome Rodriguez

Integration of batch and performance monitoring tools

This presentation shows the different tools that the BSC Operations department uses to monitor and operate its HPC clusters. This set of tools tries to integrate information from different sources (batch system, performance monitoring, etc.) under a common interface.

Arnaud Legrand

Update on the SMPI framework and introduction to spatio-temporal aggregation

In the first part of this talk, I will present the recent developments in the SMPI framework (integration with Paraver, network topologies, InfiniBand models, emulation of unmodified applications, simulation of dynamic applications). Then, I will present entropy-based aggregation techniques that allow building multi-scale visualizations of parallel programs and which have been developed in the MESCAL and MOAIS teams. Analysts commonly use execution traces collected at runtime to understand the behavior of an application running on distributed and parallel systems. These traces are inspected post mortem using various visualization techniques that, however, do not scale properly to a large number of events. This issue, mainly due to human perception limitations, is also the result of bounded screen resolutions preventing the proper drawing of many graphical objects. This talk proposes a new visualization technique overcoming such limitations by providing a concise overview of the trace behavior as the result of a spatiotemporal data aggregation process. The experimental results show that this approach can help the quick and accurate detection of anomalies in traces containing up to two hundred million events.
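As background for the entropy-based part, the usual ingredient (a standard definition; the talk's exact quality measure may differ) is Shannon entropy over the aggregated parts, which quantifies how much information a candidate aggregation loses when parts with probability mass p_i are merged:

```latex
H(X) \;=\; -\sum_{i} p_i \log_2 p_i
```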

Antonio Pena

The Upcoming Era of Memory Heterogeneity in Compute Nodes

Compute nodes equipped with a variety of memory technologies, such as scratchpad memory, on-chip 3D-stacked memory, or NVRAM-based memory, in addition to the different varieties of DRAM-based memory, are already a reality. Careful use of the different memory subsystems is mandatory in order to exploit the potential of such supercomputers. I will present our view on upcoming heterogeneous memory systems, which exposes the different memory subsystems as first-class citizens to efficiently exploit their capabilities, and our ongoing research on how to efficiently partition and distribute applications' data among the different memory subsystems.

Harald Servat

Folding: instantaneous performance

In this talk, we present the folding mechanism, which provides instantaneous performance metrics using coarse-grain sampling and instrumentation. The mechanism not only helps point out the nature of the bottlenecks in a piece of code, but also identifies the associated source code. The mechanism therefore assists the analyst in understanding the application behavior in detail without incurring large overheads.

Torsten Hoefler

Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization

Remote Memory Access (RMA) programming enables direct access to low-level hardware features to achieve high performance for distributed-memory programs. However, the design of RMA programming schemes focuses on the memory access and less on process synchronization. For example, in contemporary RMA programming systems, the widely used producer-consumer pattern can only be implemented inefficiently, incurring the overhead of an additional round-trip message. We propose Notified Access, a scheme where the target process of an access can receive a completion notification. This scheme enables direct and efficient synchronization with a minimum number of messages. We implement our scheme in an open source MPI-3 RMA library and demonstrate lower overheads (two cache misses) than other point-to-point synchronization mechanisms. We also evaluate our implementation on three real-world benchmarks: a stencil computation, a tree computation, and a Cholesky factorization implemented with tasks. Our scheme always performs better than traditional message passing and other existing RMA synchronization schemes, providing up to 50% speedup on small messages. Our analysis shows that Notified Access is a valuable primitive for any RMA system. Furthermore, we provide guidance for the design of low-level network interfaces to support Notified Access efficiently.
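To make the round-trip overhead concrete, here is a minimal sketch (my illustration using standard MPI-3, not code from the talk) of a producer-consumer handoff: the payload moves with MPI_Put, but an extra two-sided message is still needed purely as a notification, which is exactly what Notified Access folds into the put itself.

```c
/* Producer-consumer with standard MPI-3 RMA (window setup omitted).
 * The MPI_Send/MPI_Recv pair carries no data: it exists only to tell
 * the consumer that the put has completed. */
#include <mpi.h>

void produce(MPI_Win win, double *data, int n, int consumer)
{
    MPI_Win_lock(MPI_LOCK_SHARED, consumer, 0, win);
    MPI_Put(data, n, MPI_DOUBLE, consumer, 0, n, MPI_DOUBLE, win);
    MPI_Win_unlock(consumer, win);   /* ensures remote completion */
    MPI_Send(NULL, 0, MPI_BYTE, consumer, 0, MPI_COMM_WORLD);
}

void consume(MPI_Win win, double *winbuf, int n, int producer)
{
    MPI_Recv(NULL, 0, MPI_BYTE, producer, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);     /* wait for the notification */
    /* winbuf (the window memory) now holds the produced data */
}
```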

Mateo Valero

Runtime Aware Architectures

In the last few years, the traditional ways to keep the increase of hardware performance at the rate predicted by Moore's Law have vanished. When uni-cores were the norm, hardware design was decoupled from the software stack thanks to a well-defined Instruction Set Architecture (ISA). This simple interface allowed developing applications without worrying too much about the underlying hardware, while hardware designers were able to aggressively exploit instruction-level parallelism (ILP) in superscalar processors. With the advent of multi-cores and parallel applications, this simple interface started to leak. As a consequence, the role of again decoupling applications from the hardware was moved to the runtime system. Efficiently using the underlying hardware from this runtime, without exposing its complexities to the application, has been the target of very active and prolific research in recent years.
Current multi-cores are designed as simple symmetric multiprocessors (SMP) on a chip. However, we believe that this is not enough to overcome all the problems that multi-cores already face. It is our position that the runtime has to drive the design of future multi-cores to overcome the restrictions in terms of power, memory, programmability and resilience that multi-cores have. In this talk, we introduce a first approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtime's perspective.

Catello Di Martino

Measuring and Understanding Resilience at Extreme Scale: a Field Study of 5,000,000 Applications

Failures are inevitable in extreme-scale systems. Many techniques have been proposed with the ultimate goal of providing resilience at scale. However, an important and often neglected step in addressing the resilience challenge is to understand how system errors and failures impact extreme-scale jobs and applications. In this talk, we present an in-depth characterization of the error/failure sensitivity of more than 5 million XE and XK applications launched by about 1000 users during the first 515 production days of Blue Waters at NCSA. The characterization is performed by mining (i) job-level logs from the scheduler, (ii) application-level logs from the application loader and placement subsystem, (iii) error logs extracted from the syslogs and system-level hardware sensors, and (iv) manual failure reports produced by the system maintenance experts. The talk also presents the results of an in-depth analysis of the factors influencing application-level resiliency, such as scale, node hours, user experience and underlying computing platform (i.e., XE or XK), as well as an overview of LogDiver, the tool we have developed to implement the analysis workflow.

Sheng Di

An Efficient Silent Data Corruption Detection Method with Error-feedback Control and Even Sampling for HPC Applications

The Silent Data Corruption (SDC) problem is attracting more and more attention as it is understood to have a great impact on exascale HPC applications. SDC faults are fairly hazardous in that they pass unnoticed by hardware and have the potential to lead to wrong computation results. In this work, we formulate SDC detection as a run-time, one-step-ahead prediction, leveraging multiple linear prediction methods in order to improve the detection results. The contributions are three-fold. (1) We propose an error feedback control model which can effectively reduce the prediction errors of different linear prediction methods. (2) We propose a spatial-data-based even-sampling method to minimize the detection overheads (including memory and computation cost). (3) We implement our algorithms in the Fault Tolerance Interface (FTI), a practical fault tolerance (FT) library with multiple checkpoint levels, such that users can conveniently protect their HPC applications against both SDC errors and fail-stop errors. We evaluate our approach using large-scale traces from well-known large-scale HPC applications, as well as by running those HPC applications in a real cluster environment. Experiments show that our error feedback control model can improve detection sensitivity by 34-189% for bit-flip memory errors injected at bit positions in the range [20,30], without any degradation in detection accuracy. Furthermore, memory size can also be reduced by 33% under our spatial-data even-sampling method, with only a slight, graceful degradation in detection sensitivity.
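A minimal sketch of the one-step-ahead idea (my reading of the abstract, not the FTI implementation): each new data value is compared against a prediction extrapolated from its history, and the prediction error is fed back to correct the next prediction.

```c
/* Toy one-step-ahead SDC detector: flag a value as suspect when it
 * deviates from the prediction by more than a threshold. The feedback
 * term nudges future predictions by the last observed error. */
#include <math.h>
#include <stdbool.h>

typedef struct { double prev; double trend; } predictor_t;

bool sdc_suspect(predictor_t *p, double observed, double threshold)
{
    double predicted = p->prev + p->trend;      /* linear extrapolation */
    if (fabs(observed - predicted) > threshold)
        return true;                            /* possible corruption */
    p->trend = observed - p->prev;              /* error feedback */
    p->prev  = observed;
    return false;
}
```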

Aurélien Cavelan

Assessing general-purpose algorithms to cope with fail-stop and silent errors

In this work, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to address both fail-stop and silent errors. The objective is to minimize either makespan or energy consumption. While DVFS is a popular approach for reducing energy consumption, using lower speeds/voltages can increase the number of errors, thereby complicating the problem. We consider an application workflow whose dependence graph is a chain of tasks, and we study three execution scenarios: (i) a single speed is used during the whole execution; (ii) a second, possibly higher speed is used for any potential re-execution; (iii) different pairs of speeds can be used throughout the execution. For each scenario, we determine the optimal checkpointing and verification locations (and the optimal speeds for the third scenario) to minimize either objective. The different execution scenarios are then assessed and compared through an extensive set of experiments.
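The DVFS trade-off at the heart of this study follows the usual first-order model (a common assumption in this literature, not necessarily the talk's exact model): running work w at speed s takes time w/s, while dynamic power grows roughly as the cube of the speed, so the dynamic energy for the work scales as w s^2; lowering s therefore saves energy but lengthens execution and, here, also increases the number of errors.

```latex
T(s) = \frac{w}{s}, \qquad E_{\mathrm{dyn}}(s) \;\propto\; s^{3} \cdot \frac{w}{s} \;=\; w\, s^{2}
```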

Ana Gainaru

Dealing with prediction unfriendly failures: the road to specialized predictors

The analysis of the Blue Waters system has shown that predictors encounter limitations not seen before when analyzing smaller systems. The number of events generated by the Blue Waters system is two orders of magnitude larger, leading to an increased number of event types that create complex patterns. Our previous studies have shown that recall and precision depend greatly on the failure type, with a few types of failures being responsible for large gaps in the overall recall. Specifically, filesystem failures are one of the main reasons for the low recall obtained for the software type.
In this talk, I will show the main causes of bottlenecks in current prediction methods and analyze the main differences between small and large systems that influence prediction results. My talk will also focus on showing how designing predictors specifically for one type of failure (e.g., filesystem failures) has the potential to improve the overall results for the Blue Waters system.

Omer Subasi

Reliability Modeling of Selective Redundancy for Task-parallel Dataflow Programs

As we get closer to the exascale era, reliability becomes one of the main concerns for future high performance computing (HPC) and exascale systems. It is widely believed that, because of limited budgets and the increasing requirements for power efficiency, future supercomputers will be assembled mainly from less complex commodity components. On the other hand, the limited resiliency support in such hardware will hardly be sufficient to meet the even higher reliability requirements of such systems. Therefore, complementary software-based techniques will be of key importance for strengthening the system's fault tolerance. Our main goal is to develop software techniques for selective task replication in task-based parallel programs. Selective task replication provides a practical solution for a real-life problem in which long-running applications on multiple nodes cannot complete successfully due to a crash or a silent data corruption caused by a fault. In this presentation we introduce our Markov-chain based theoretical reliability model; we then apply this model for selective task replication. Our validation of the reliability model confirms its accuracy and suggests that it can be applied to problems such as fault prediction and avoidance. Before executing each task, the reliability model estimates the improvement that replicating the task will have on the overall application reliability. We then utilize this model to design two selective task replication implementations. The first uses a heuristic to select the most appropriate tasks given a target task replication rate; this implementation is useful in systems that have limited spare resources for replication. The second develops another heuristic that selects the most appropriate tasks for replication given an application reliability target; this implementation is useful in systems that have distinct per-application reliability requirements.
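As a toy illustration of why selective replication pays off (my example, not the talk's Markov-chain model): if task i succeeds with probability r_i, and a failed replica can be identified and the survivor's result used, replicating the task raises its success probability to about 1 - (1 - r_i)^2 at twice the cost, so replicating the least reliable tasks first gives the largest gain in the product-form application reliability:

```latex
R_{\mathrm{app}} = \prod_{i=1}^{n} r_i, \qquad r_i^{\mathrm{replicated}} \approx 1 - (1 - r_i)^2
```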

Christine Morin

Checkpointing as a Service in Heterogeneous Cloud Environments

We will present a non-invasive, cloud-agnostic approach for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher-priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables migration of applications from one cloud platform to another.

Judit Gimenez

BSC Performance Models

The talk will introduce a model developed by BSC that decomposes efficiency as the product of what we call fundamental factors. This model can be used to identify the most influential factor affecting the performance of an execution, to study an application's scalability, and even to predict its performance at very large scale. The talk will present the model, a framework we have developed to semi-automate the process, and results from the studies we are currently carrying out.
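For readers unfamiliar with this style of model, a multiplicative decomposition of parallel efficiency typically has the following flavor (a POP-style factorization given for illustration; the talk's exact factors may differ): load balance, serialization, and data transfer each contribute a factor in [0, 1], and the smallest factor points at the dominant bottleneck.

```latex
\eta_{\mathrm{parallel}} \;=\; \mathrm{LB} \times \mathrm{Ser} \times \mathrm{Trf}, \qquad 0 \le \mathrm{LB}, \mathrm{Ser}, \mathrm{Trf} \le 1
```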

Carsten Karbach

System monitoring with LLview and the Parallel Tools Platform

The increase in size and heterogeneity of today's HPC systems makes online monitoring an essential part of production for system administrators as well as end users. The graphical monitoring tool LLview, developed by JSC, presents the live status of jobs and their distribution across compute resources. This monitoring architecture is incorporated into the Parallel Tools Platform (PTP), an integrated development environment for parallel applications. This talk introduces both monitoring systems along with recent developments intended to further optimize the monitoring architecture for large-scale systems.

Edson L. Padoin

Energy Consumption Reduction through Load Balancing and DVFS on Residually Imbalanced Cores

The power consumption of High Performance Computing (HPC) systems is an increasing concern as large-scale systems grow in size and, consequently, consume more energy. In response to this challenge, we propose two variants of a new energy-aware load balancer, implemented in Charm++, that aim at reducing the energy consumption of parallel platforms running imbalanced scientific applications without degrading their performance. Our research combines dynamic load balancing with DVFS techniques in order to reduce the clock frequency of underloaded computing cores which experience some residual imbalance even after tasks are remapped. Experimental results with benchmarks and a real-world application showed energy savings of up to 32% with our fine-grained variant that performs per-core DVFS, and of up to 34% with our coarse-grained variant that performs per-chip DVFS. For the next steps, we plan to extend our techniques with more advanced load balancing algorithms and more refined DVFS controls, and to start considering heterogeneous environments.

Emmanuel Jeannot

Topology-aware Resource Selection

The way resources are allocated to an application plays a crucial role in the performance of its execution. It has been shown recently that a non-contiguous allocation can slow down performance by more than 30%. However, a batch scheduler cannot always provide a contiguous allocation, and even with such an allocation, the way processes are mapped to the allocated resources has a big impact on performance. The reason is that the topology of an HPC machine is hierarchical and the process affinity is not uniform (some pairs of processes exchange more data than other pairs). Hence, taking into account the topology of the machine and the process affinity is an effective way to increase application performance. Nowadays, allocation and mapping are decoupled. For instance, in Zoltan, processors are first allocated to the application and then processes are mapped to the allocated resources depending on the topology and the communication pattern. Decoupling allocation and mapping can lead to suboptimal solutions where a better mapping could have been found if the resource selection had taken into account the process affinity. In this talk, we will present our work on coupling resource allocation and topology mapping. We have designed and implemented a new Slurm plug-in that takes as input the process affinity of the application and that, according to the machine topology, selects resources and maps processes taking into account these two inputs (affinity and topology). It is based on our process placement tool called TreeMatch, which provides the algorithmic engine to compute the solution. We will present our preliminary results by emulating traces of the Curie machine, which features 5040 nodes (two sockets of 8 cores each), and comparing our solution with plain Slurm.
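The objective behind such topology-aware mapping can be sketched as follows (an illustration of the general idea, not TreeMatch's actual algorithm): weight the traffic between every pair of processes by the distance between the cores they land on, and search for the mapping that minimizes the sum.

```c
/* Toy communication cost of mapping process i to core map[i]:
 * affinity[i][j] = traffic between processes i and j,
 * distance[a][b] = hop count between cores a and b in the topology.
 * Topology-aware tools search for the map[] minimizing this value. */
double mapping_cost(int n, double affinity[n][n],
                    double distance[n][n], const int map[n])
{
    double cost = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            cost += affinity[i][j] * distance[map[i]][map[j]];
    return cost;
}
```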

Min Si

Casper: An Asynchronous Progress Model for MPI RMA on Many-Core Architectures

One-sided communication semantics (also known as remote memory access or RMA) allows a process to access memory regions of other processes in distributed-memory systems. With this model, processes can move data without explicitly synchronizing with the remote target process. However, the MPI standard does not guarantee that this is asynchronous. That is, an MPI implementation might still require the remote target to make MPI calls to ensure communication progress so that any RMA operations issued on that target complete. In this talk, I will introduce "Casper," a process-based asynchronous progress solution for MPI one-sided communication on multicore and many-core architectures, which utilizes transparent MPI call redirection through PMPI and MPI-3 shared-memory windows to map memory from multiple user processes into the address space of an arbitrary number of ghost processes. The objective is to enable asynchronous progress where needed while allowing native hardware-based communication where available. I will discuss the detailed design of the proposed architecture, including several techniques for maintaining correctness according to the MPI-3 standard as well as performance optimizations where possible. I will also compare the performance of Casper with that of traditional thread- and interrupt-based asynchronous progress models.
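The PMPI redirection mentioned above works by letting a library define an MPI function itself and forward to the implementation through the PMPI_ entry point; the sketch below shows the bare mechanism (a generic illustration with the library-specific work elided, not Casper's actual code).

```c
/* Generic PMPI interception: the profiling layer's MPI_Win_allocate
 * runs first, can do its own bookkeeping (e.g., map memory for ghost
 * processes), then forwards to the real MPI via PMPI_Win_allocate. */
#include <mpi.h>

int MPI_Win_allocate(MPI_Aint size, int disp_unit, MPI_Info info,
                     MPI_Comm comm, void *baseptr, MPI_Win *win)
{
    /* ... library-specific setup would go here ... */
    return PMPI_Win_allocate(size, disp_unit, info, comm, baseptr, win);
}
```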

Yanhua Sun

An automatic control system for runtime adaptivity

Parallel programming has always been difficult due to the complexity of hardware and the diversity of applications. Although significant progress has been achieved with the remarkable efforts of researchers in academia and industry, attaining high parallel efficiency on large supercomputers with millions of cores for various applications remains challenging. Therefore, performance tuning has become even more important and challenging than ever before. In this talk, we present the design and implementation of PICS, a Performance-analysis-based Introspective Control System used to tune parallel programs. PICS provides a generic set of abstractions to the applications to expose application-specific knowledge to the runtime system. The abstractions are called control points, which are tunable parameters affecting application performance. The application behaviors are observed, measured and automatically analyzed by PICS. Based on the analysis results and expert knowledge rules, program characteristics are extracted to assist the search for optimal configurations of the control points. We have implemented the PICS control system in Charm++, an asynchronous message-driven parallel programming model. We demonstrate the utility of PICS with several benchmarks and a real-world application and show its effectiveness.

Gabriel Antoniu

Transfer as a Service: Towards a Cost-Effective Model for Multi-Site Cloud Data Management

The global deployment of cloud datacenters is enabling large web services to deliver fast response to users worldwide. This unprecedented geographical distribution of the computation also brings new challenges related to the efficient data management across sites. High throughput, low latencies, cost- or energy-related trade-offs are just a few concerns for both cloud providers and users when it comes to handling data across datacenters. Existing cloud data management solutions are limited to cloud-provided storage, which offers low performance based on rigid cost schemas. Users are therefore forced to design and deploy custom solutions, achieving performance at the cost of complex system configurations, maintenance overheads, reduced reliability and reusability. We are proposing a dedicated cloud data transfer service that supports large-scale data dissemination across geographically distributed sites, advocating for a Transfer as a Service (TaaS) paradigm. The idea is to aggregate the available bandwidth by enabling multi-route transfers across cloud sites. We argue that the adoption of such a TaaS approach brings several benefits for both users and the cloud providers who propose it. For users of multi-site or federated clouds, our proposal is able to decrease the variability of transfers and increase the throughput up to three times compared to baseline user options, while benefiting from the well-known high availability of cloud-provided services. For cloud providers, such a service can decrease the energy consumption within a datacenter down to half compared to user-based transfers. Finally, we propose a dynamic cost model schema for the service usage, which enables the cloud providers to regulate and encourage data exchanges via a data transfer market.

Kate Keahey

Chameleon: A Large-scale, Reconfigurable Experimental Environment for Next Generation Cloud Research

Cloud services have become ubiquitous in all major 21st-century economic activities. There are still, however, many open questions surrounding this new technology, among the most important and contentious being the relationship between cloud computing and high performance computing, the suitability of cloud computing for data-intensive applications, and its position with respect to emergent trends such as Software Defined Networking. A persistent barrier to further understanding of those issues has been the lack of large-scale, open cloud research platforms. With funding from the National Science Foundation (NSF), the Chameleon project will provide such a large-scale platform to the open research community, allowing them to explore transformative concepts in deeply programmable cloud services, design, and core technologies. The testbed, deployed at the University of Chicago and the Texas Advanced Computing Center, will consist of almost 15,000 cores and 5 PB of total disk space, and will leverage a 100 Gbps connection between the sites. While a large part of the testbed will consist of homogeneous hardware to support large-scale experiments, a portion of it will support heterogeneous units allowing experimentation with high-memory, large-disk, low-power, GPU, and co-processor units. To support the broad range of experiments described above, the project will support a graduated configuration system allowing full user configurability of the software stack, from provisioning of bare metal and network interconnects to delivery of fully functioning cloud environments. In addition, to facilitate experiments, Chameleon will support a set of services designed to meet researchers' needs, including support for experiment management, reproducibility, and repositories of trace and workload data of production cloud workloads.

Rosa Badia

Programming distributed platforms with PyCOMPSs and its integration with persistent storage systems

COMPSs is a programming framework that intends to simplify the execution of sequential applications in distributed infrastructures, including clusters and clouds. For that purpose, COMPSs provides both a straightforward programming model and a runtime that is able to interact with a wide variety of distributed computing middleware (e.g., gLite, Globus) and cloud APIs (e.g., OpenStack, OpenNebula, Amazon EC2). The talk will focus on the recent extensions to COMPSs: PyCOMPSs, a binding for the Python language that will enable a larger number of scientific applications in fields such as life sciences, and the integration of COMPSs with new Big Data resource management methodologies developed at BSC, such as the Wasabi self-contained objects library and Cassandra data management policies. These activities are performed under the Human Brain Project flagship and the Spanish BSC Severo Ochoa project.

Ian Foster

RAMSES: A new project in data-driven analytical modeling of distributed systems

RAMSES is a new DOE-funded project on the end-to-end analytical performance modeling of science workflows in extreme-scale science environments. It aims to link multiple threads of inquiry that have not, until now, been adequately connected: namely, first-principles performance modeling within individual sub-disciplines (e.g., networks, storage systems, applications), and data-driven methods for evaluating, calibrating, and synthesizing models of complex phenomena. What makes this fusion necessary is the drive to explain, predict, and optimize not just individual system components but complex end-to-end workflows. In this talk, I will introduce the goals of the project and some aspects of our technical approach. I hope to identify opportunities for collaboration with other JLESC participants.

Indranil Gupta

Probabilistic CAP and Timely Adaptive Key-value Stores

The CAP theorem is a fundamental result that applies to distributed storage systems. In this talk, we first present generalized versions of the CAP model and theorem. These theorems extend the CAP theorem from merely being a binary (yes-no) choice among consistency (C), availability (A), and partition-tolerance (P), to being a tradeoff involving probabilistic parameters for C, A, and P. Next, we present the design of a new system called PCAP which leverages these results. Our system PCAP allows applications to specify either an availability SLA or a consistency SLA. The system then automatically adapts, in real-time and under changing network conditions, to meet the SLA while optimizing the other C/A metric. We have incorporated PCAP into two popular key-value stores — Cassandra and Riak. Our experiments with these two deployments, under realistic workloads, reveal that the PCAP system satisfactorily meets SLAs, and performs close to the bounds dictated by our generalized CAP theorems. This is joint work with Muntasir Raihan Rahman, Lewis Tseng, Son Nguyen, and Nitin Vaidya.
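One way to read the probabilistic parameters (my paraphrase of the flavor of the model, not the paper's exact definitions): rather than demanding that every read be fresh and every response immediate, the system bounds the probability of stale reads and of missed latency deadlines, and the SLA fixes one bound while the system optimizes the other.

```latex
\Pr[\text{a read is staler than } t_c] \;\le\; p_{ic}, \qquad \Pr[\text{response time} > t_a] \;\le\; p_{ua}
```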

Andrew Chien

Characterizing Variation in Graph Computing Behavior

Graph processing is widely recognized as important for a growing range of applications, including social network analysis, machine learning, data mining, and web search. Recently, many new graph processing systems and assessments have been published in the parallel computing community. To explore the robustness of these studies, we examine the behavior of a variety of graph algorithms (graph analytics, collaborative filtering, clustering, linear solvers) on a diverse collection of graphs (size, degree distributions, values), and we characterize the variation quantitatively.