Agenda

Tuesday, September 27

18:30–20:30 – Drop-in Pizza and Drinks
Location: Hampton Inn – Lobby Meeting Space

Wednesday, September 28

08:45 – Registration and Breakfast
Location: NCSA Atrium

09:30 – Opening
Location: NCSA Auditorium

10:00 – Track 1: Short Talks 1 – Programming languages and runtimes
Session Chair: Daniel Patrick Barry
Location: NCSA 1040
Track 2: Short Talks 2 – I/O, storage and in-situ Processing
Session Chair: Robert Underwood
Location: NCSA Auditorium

12:00 – Lunch
Location: NCSA Atrium

13:00 – Track 1: Project Talks 1 – Numerics
Session Chair: Jon Cameron Calhoun
Location: NCSA 1040
Track 2: Break Out Session: QC
Location: NCSA Auditorium

14:30 – Break
Location: NCSA Atrium

15:00 – Track 1: Short Talks 3 – AI/ML
Session Chair: Philippe Swartvagher
Location: NCSA 1040
Track 2: Short Talks 4 – Numerics
Session Chair: Radita Liem
Location: NCSA Auditorium

17:00 – Buses Depart from NCSA
17:15 – Poster Session – Alice Campbell Alumni Center
18:15 – Social Event – Alice Campbell Alumni Center
19:45 – Buses Depart to return to Hampton Inn

Thursday, September 29

08:30 – Breakfast
Location: NCSA Atrium

09:00 – Keynote: Women in HPC
Session Chair: Sharon Broude Geva
Location: NCSA Auditorium
Join Remotely
Why are there so few women working in HPC? What started with this simple question resulted in the powerful international initiative “Women in High Performance Computing” (WHPC, https://womeninhpc.org). The members of this network are working for more equality, diversity and inclusion in the HPC community. The initiative is active at conferences, offers workshops and mentoring programs, and aims to raise more awareness in the HPC community with the slogan “diversity creates a stronger community.”
09:45 – Break
Location: NCSA Atrium

10:00 – Track 1: Project Talks 2 – Workflows, Performance, Cloud
Session Chair: Ivo Kabadshow
Location: NCSA Auditorium
Track 2: Break Out Session: Women in HPC
Location: NCSA 1040
Join Remotely

12:00 – Lunch
Location: NCSA Atrium

13:00 – Track 1: Short Talks 5 – Ecosystems, Services and Resilience
Session Chair: Thomas Baumann
Location: NCSA 1040
Track 2: Short Talks 6 – Performance Tools
Session Chair: Antoni Navarro
Location: NCSA Auditorium

15:00 – Break
Location: NCSA Atrium
15:30 – Panel Session: Will HPC-Cloud Convergence Happen?
Moderator: Gabriel Antoniu, Inria
Location: NCSA Auditorium

For several years we have been witnessing the emergence of complex workflows combining simulations (traditionally running on supercomputers) with data analysis codes (traditionally running on clouds and, more recently, on edge-based decentralised infrastructures). Such complex workflows seem to naturally call for the joint use of supercomputers and clouds (and potentially edge-based systems). On the other hand, some commentators have recently announced the end of the supercomputing era, claiming that the energy consumption of supercomputers is simply not sustainable and could motivate a major move towards HPC clouds, which would replace supercomputers.
This panel will address questions related to the relevance of HPC-Cloud convergence and to the challenges posed by such a convergence.

Panelists
• Bill Kramer, UIUC
• Justin Wozniak, ANL
• Fumiyoshi Shoji, RIKEN
• François Tessier, Inria

17:00 – End for the Day
17:30 – Buses Depart from Hampton Inn
18:00 – Social Event: Riggs Beer Company and Jason Mack Glass Blowing
20:00 – Buses Depart to return to Hampton Inn

Friday, September 30

08:30 – Breakfast
Location: NCSA Atrium

09:00 – Keynote: International Association of Supercomputing Centers
Session Chair: Brendan McGinty
Location: NCSA Auditorium

09:45 – Break
Location: NCSA Atrium

10:00 – Track 1: Break Out Session: FPGA
Location: NCSA Auditorium
Track 2: Break Out Session: CI
Location: NCSA 1040

12:00 – Lunch
Location: NCSA Atrium

13:00 – Track 1: Project Talks 3 – AI/ML
Session Chair: George Bosilca
Location: NCSA 1040
Track 2: Project Talks 4 – Resilience and Compression
Session Chair: Robert Speck
Location: NCSA Auditorium

14:30 – Farewell
Location: NCSA Auditorium

17:00 – Buses Depart from Hampton Inn
17:30 – Social Event
Permanent Staff: Lodgic
Temporary Staff: Jupiters at the Crossing
19:30 – Buses Depart to return to Hampton Inn

Short Talks 1 – Programming languages and runtimes

Presenter – Talk Title and Abstract
Philippe Swartvagher, INRIA – Memory contention between computations and communications: which solutions?
Our previous work showed that memory contention between computations and communications, when they are executed in parallel, can have a significant impact on the performance of both. This contention and its consequences worsen when computations are memory-bound and large messages are exchanged. To better understand this contention and be able to take it into account at the runtime-system level, we proposed a model predicting the memory bandwidth share between computations and communications, according to the number of computing cores and data placement.
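As a toy illustration of the kind of model involved (hypothetical form and parameters, not the model from the talk), one can estimate how much bandwidth is left for communication once memory-bound compute cores load the memory bus:

```python
# Toy model: share of memory bandwidth between computation and communication.
# Hypothetical form and parameters -- not the model presented in the talk.

def comm_bandwidth_share(peak_bw_gbs, n_compute_cores, per_core_demand_gbs):
    """Predict the memory bandwidth (GB/s) left for communication when
    n_compute_cores memory-bound cores each stream per_core_demand_gbs."""
    compute_demand = n_compute_cores * per_core_demand_gbs
    if compute_demand >= peak_bw_gbs:
        # Bus saturated: assume an even split among all contenders.
        return peak_bw_gbs / (n_compute_cores + 1)
    return peak_bw_gbs - compute_demand

for cores in (1, 8, 16, 32):
    print(cores, comm_bandwidth_share(peak_bw_gbs=200.0,
                                      n_compute_cores=cores,
                                      per_core_demand_gbs=12.0))
```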
Jiakun Yan, UIUC/NCSA – Efficient Message Passing support for irregular, multithreaded communication
Task-based programming systems, such as PaRSEC, HPX, and Legion, are a promising approach to programming today's increasingly parallel and heterogeneous systems. They usually use MPI or GASNet-EX as their communication backends. However, their communication is largely irregular and multithreaded. This creates mismatches between what task-based systems need and what their underlying communication libraries offer, resulting in inefficiencies such as poor multithreaded performance, unnecessary memory copies and messages, unpredictable background task processing, and inefficient polling for completion.
We are developing a low-level communication library, the Lightweight Communication Interface (LCI), to explore ways to eliminate these mismatches and provide direct communication support to high-level task-based programming systems. LCI's features include (a) flexible communication primitives, including two-sided send/recv and one-sided put/get with or without user-provided target buffers; (b) better multithreaded performance; (c) explicit user control of communication resources; and (d) flexible signaling mechanisms such as synchronizers, completion queues, and active message handlers. We believe these features can also benefit other applications and systems with irregular, multithreaded communication patterns.
Omri Mor, UIUC/NCSA – LCI: Communication for Asynchronous Many-Task Runtimes
Asynchronous Many-Task runtimes, such as PaRSEC, HPX, Legion, Charm++, and others, have very different communication needs than many other applications: there are many control messages that must be delivered with high priority and low latency; data transfer can be easily overlapped with computation and other communication; and both of these communication needs can be highly imbalanced between different nodes. MPI and other interfaces used in existing task runtimes are not designed to support these use cases well, and the alternative of using low-level interfaces such as InfiniBand Verbs, Libfabric, or UCX is error-prone and difficult for runtime developers to support. We demonstrate how a communication interface designed for modern asynchronous runtimes and hardware improves the performance and scalability of the PaRSEC runtime.
Thomas Herault, UTK – Template Task Graphs: Composability and Hybrid Systems
Template Task Graphs (TTG) have been developed to enable a straightforward expression of task parallelism for algorithms working on irregular and unbalanced data sets. The TTG Application Programming Interface employs C++ templates to build an abstract representation of the task graph and schedule it on distributed resources. It offers a scalable and efficient API to port complex applications on top of task-based runtime systems to gain access to asynchronous progress, computation/communication overlap, and efficient use of all computing resources available on the target system. In this presentation, I will provide an updated view of its performance over two runtime systems, through a variety of applications, ranging from well-known regular examples to irregular and data-dependent ones.
Michel Schanen, ANL – Julia for Rapid Prototyping in HPC
We present our experience in using Julia for rapid prototyping of GPU-focused numerical optimization algorithms in the ECP project and our current work on adjoint checkpointing in the DJ4Earth project.
Jakob Fritz, JSC – Automated Creation of GitLab Runners using Ansible
Creating GitLab runners can be time-consuming and tedious. Ansible promises to automate repetitive server-setup tasks and is used here to prepare and register new runners for use in GitLab CI. The servers that host the runners are virtual instances in OpenStack, so the creation and removal of those host instances is also automated with Ansible. Therefore, the whole creation, preparation and registration, as well as unregistration and removal, of the runners is done with Ansible by simply providing the desired number of runners. This should help reduce the entry barrier to using Continuous Integration in cases where GitLab is used but suitable runners are not (yet) available.
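The actual project is implemented as Ansible playbooks; purely as a rough Python sketch of the same workflow (assuming the openstacksdk package, SSH access to the new instances, and the standard gitlab-runner CLI; URL, token, image and flavor names are placeholders):

```python
# Rough sketch of the runner-provisioning workflow; the real project uses
# Ansible playbooks. All identifiers below are placeholders.
import subprocess
import openstack

N_RUNNERS = 3
GITLAB_URL = "https://gitlab.example.org"   # placeholder
REGISTRATION_TOKEN = "***"                  # placeholder

conn = openstack.connect(cloud="mycloud")   # credentials from clouds.yaml

for i in range(N_RUNNERS):
    # Create the OpenStack instance that will host the runner.
    server = conn.create_server(
        name=f"ci-runner-{i}",
        image="ubuntu-22.04", flavor="m1.medium", network="internal",
        wait=True, auto_ip=True,
    )
    ip = server.access_ipv4  # attribute name may vary with SDK version

    # Install/register the runner on the fresh instance over SSH.
    register = (
        "sudo gitlab-runner register --non-interactive "
        f"--url {GITLAB_URL} --registration-token {REGISTRATION_TOKEN} "
        "--executor docker --docker-image alpine:latest "
        f"--description ci-runner-{i}"
    )
    subprocess.run(["ssh", f"ubuntu@{ip}", register], check=True)
```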
Emil Vatai, R-CCS – Outlook on generating optimal HPC code with ML
With the end of Dennard scaling and the ongoing denouement of Moore's law, sustaining the performance growth of HPC systems is becoming increasingly difficult. Innovations in hardware often demand changes in software implementations to fully utilize the performance potential. Despite the aid of compilers, a vast number of experts need to work on optimizing HPC codes to obtain maximal performance on supercomputers, and researchers are turning to machine learning (ML) for aid. ML researchers are mostly examining problems such as code generation from text (ignoring performance issues), while compiler developers are using ML to improve (existing) parts of the compiler, e.g. register allocation. However, the use of ML for high-level optimization is mostly left unexplored, because in addition to ML and compiler technologies, such optimizations require a deep understanding of the target applications. We present our efforts in discovering the optimal approach to tackle the problem of high-level optimization of HPC codes using ML. We consider potential representations of the source code, candidate applications, and ML methods which would be best suited for this problem.

Short Talks 2 – I/O, storage and in-situ Processing

Presenter – Talk Title and Abstract
Sheng Di, ANL – Dynamic Quality Metric Oriented Error Bounded Lossy Compression for Scientific Datasets
Error-bounded lossy compression has been considered a very promising solution to address the big-data issue for scientific applications, because it can significantly reduce the data volume at low time cost while allowing users to control the compression errors with a specified error bound. The existing error-bounded lossy compressors, however, are all developed based on inflexible designs or compression pipelines, which cannot adapt to the diverse compression quality requirements/metrics favored by different application users. In this work, we propose a novel dynamic quality metric oriented error-bounded lossy compression framework, namely QoZ. The detailed contribution is threefold. (1) We design a novel highly parameterized multi-level interpolation-based data predictor, which can significantly improve the overall compression quality with the same compressed size. (2) We design the error-bounded lossy compression framework QoZ based on the adaptive predictor, which can auto-tune the critical parameters and optimize the compression result according to user-specified quality metrics during online compression. (3) We evaluate QoZ carefully by comparing its compression quality with multiple state-of-the-art compressors on various real-world scientific application datasets. Experiments show that, compared with the second-best lossy compressor, QoZ can achieve up to 70% compression ratio improvement under the same error bound, up to 150% compression ratio improvement under the same PSNR, or up to 270% compression ratio improvement under the same SSIM.
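As a toy 1-D illustration of the prediction-plus-quantization idea that error-bounded compressors such as SZ and QoZ build on (this is not the QoZ code; real compressors use multi-level interpolation predictors, outlier handling and entropy coding):

```python
import numpy as np

def compress(data, eb):
    """Toy error-bounded compression: predict each value from the previous
    reconstruction, then quantize the residual in steps of 2*eb."""
    quants = np.empty(len(data), dtype=np.int64)
    recon_prev = 0.0
    for i, x in enumerate(data):
        pred = recon_prev                      # trivial "previous value" predictor
        q = int(round((x - pred) / (2 * eb)))  # quantization bin of the residual
        quants[i] = q
        recon_prev = pred + q * 2 * eb         # decompressor sees the same value
    return quants                              # would be entropy-coded in practice

def decompress(quants, eb):
    recon = np.empty(len(quants))
    prev = 0.0
    for i, q in enumerate(quants):
        prev = prev + q * 2 * eb
        recon[i] = prev
    return recon

data = np.cumsum(np.random.randn(1000))        # smooth-ish test signal
eb = 1e-2
recon = decompress(compress(data, eb), eb)
assert np.max(np.abs(recon - data)) <= eb + 1e-12   # point-wise error bound holds
```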
Xavier Yepes-Arbós, BSC – Extending XIOS lossy compression functionalities using SZ
Earth system models (ESMs) have increased the spatial resolution to achieve more accurate solutions. As a consequence, the number of grid points increases dramatically, so an enormous amount of data is produced as simulation results. In addition, if ESMs manage to take advantage of the upcoming exascale computing power, their current data management system will become a bottleneck as the data production will grow exponentially.
The XML Input/Output Server (XIOS) is an MPI parallel I/O server designed for ESMs to efficiently post-process data inline as well as read and write data in NetCDF4 format. Although it offers good performance in terms of computational efficiency for current resolutions, this could change for larger resolutions since XIOS performance is very dependent on the output size. To address this problem we positively explored the use of the SZ lossy compressor, developed by Argonne National Laboratory (ANL), instead of the default HDF5 lossless compression. SZ reaches high compression ratios with enough compression speed to considerably reduce the I/O time while keeping high accuracy.
In this work we extend the XIOS lossy compression capabilities by taking advantage of different SZ functionalities. Individual compression parameters (error bounds) are read for each field and passed to the SZ filter via the NetCDF API, because in climate modeling each field requires a different accuracy depending on scientific needs. This implies a discussion with climate scientists to determine the different tolerated error bounds. In addition, we explore the possibility of adapting XIOS to use SZ with HDF5 parallel I/O to write highly compressed single NetCDF files. As a case study, the Open Integrated Forecast System (OpenIFS) is used, an atmospheric general circulation model that can use XIOS to output data.
Luan Teylo, INRIA – IO-Sets: Simple and efficient approaches for I/O bandwidth management
One of the main performance issues faced by high-performance computing platforms is the congestion caused by concurrent I/O from applications. When this happens, the platform's overall performance and utilization are harmed. In the extensive body of work in this field, I/O scheduling is the essential solution to this problem. The main drawback of current techniques is the amount of information needed about applications, which compromises their applicability. In this work, we propose a novel method for I/O management, IO-Sets. We present its potential through a scheduling heuristic called Set-10, which requires minimal information and can be easily implemented.
Alexis Bandet, INRIA – Sharing I/O nodes between applications
I/O nodes are dedicated I/O hardware placed between the compute nodes and the parallel file system. They intercept all I/O requests and enable I/O optimization techniques such as I/O balancing, request reordering, etc., to optimize the I/O performance of the machine.
We propose a simple model for sharing I/O nodes between applications in order to increase system performance in the case of an I/O-node shortage, or to reduce the number of I/O nodes needed by the machine, with minimal impact on application stretch.
Bogdan Nicolae, ANL – DataStates: Perspectives on the Versatility of a Searchable Lineage of Intermediate Data at Scale
DataStates is a data model in which users do not interact with a data service directly to read/write datasets, but rather tag datasets with properties expressing hints, constraints, and persistency semantics, which automatically adds snapshots (called data states) into the lineage – a history recording the evolution of all snapshots using an optimal I/O plan. This talk will emphasize several advantages of DataStates (it eliminates the need to explicitly interact with complex heterogeneous storage stacks at large scale; brings an incentive to collaborate more and to verify and understand results more thoroughly by sharing and analyzing intermediate results; and encourages the development of new algorithms and ideas that reuse and revisit intermediate and historical data frequently) and will discuss several scenarios where it has been successfully applied, concluding with perspectives on future collaboration opportunities.
Phil Carns, ANL – Agility versus understandability in HPC data services
An explosion of data-intensive scientific computing applications and storage technologies is driving the need for agile data services that can rapidly adapt to specialized use cases and environments. What are the challenges and opportunities associated with retaining understandability (i.e., for tuning, reproducibility, and resource allocation) in this environment?
François Tessier, INRIA – Investigating allocation of heterogeneous storage resources on HPC systems
The ability of large-scale infrastructures to store and retrieve massive amounts of data is now decisive for scaling up scientific applications. However, there is an ever-widening gap between I/O and computing performance on these systems. One way to mitigate this gap consists of deploying new intermediate storage tiers (node-local storage, burst buffers, …) between the compute nodes and the traditional global shared parallel file system. Unfortunately, without advanced techniques to allocate and size these resources, they often remain underutilized. To address this problem, we investigate how heterogeneous storage resources can be allocated on an HPC platform, in a similar way as compute resources. In that regard, we introduce StorAlloc, a simulator used as a testbed for assessing storage-aware job scheduling algorithms and evaluating various storage infrastructures.
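A toy sketch of the kind of storage-aware allocation decision such a simulator is meant to study (hypothetical tier names and numbers; not StorAlloc's actual policy):

```python
# Toy storage-aware allocator: for each request, pick the fastest tier that
# still has enough capacity. Tier names and numbers are hypothetical.
tiers = [
    {"name": "node-local NVMe", "capacity_gb": 1_600,      "bw_gbs": 25.0},
    {"name": "burst buffer",    "capacity_gb": 50_000,     "bw_gbs": 10.0},
    {"name": "parallel FS",     "capacity_gb": 10_000_000, "bw_gbs": 2.0},
]

def allocate(request_gb):
    """Return the fastest tier with enough remaining capacity, and reserve it."""
    for tier in sorted(tiers, key=lambda t: -t["bw_gbs"]):
        if tier["capacity_gb"] >= request_gb:
            tier["capacity_gb"] -= request_gb
            return tier["name"]
    return None  # request cannot be satisfied

for job, size in [("jobA", 1_200), ("jobB", 800), ("jobC", 30_000)]:
    print(job, "->", allocate(size))
```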

Short Talks 3 – AI/ML

Presenter – Talk Title and Abstract
Prasanna Balaprakash, ANL – Scalable Automated Deep Learning with DeepHyper
In recent years, deep neural networks (DNNs) have achieved considerable success in learning complex nonlinear relationships between features and targets from large datasets. Nevertheless, designing high-performing DNN architecture for a given data set is an expert-driven, time-consuming, trial-and-error manual task. A major bottleneck in the construction of DNNs is the vast search space of architectures that need to be explored in the face of new data sets. Moreover, DNNs typically require user-specified values for hyperparameters, which strongly influence performance factors such as training time and prediction accuracy. In this talk, we will introduce DeepHyper, a scalable automated machine learning package for developing a diverse set of deep neural network ensembles and leveraging them for improved prediction and uncertainty quantification in scientific machine learning applications. DeepHyper provides an infrastructure that targets experimental research in neural architecture search (NAS) and hyperparameter search (HPS) methods, scalability, and portability across different U.S. Department of Energy supercomputers.
Piotr Luszczek, UTK – Surrogate AI Benchmarking Applications’ Testing Harness
We present a project that follows the principles of FAIR benchmarking for surrogate ML/AI models that enhance ab initio scientific models.
Aaron Saxton, UIUC/NCSA – Scoping System Requirements For Large Scale ML Research
Large datasets offer the promise of unique insight. For example, satellite imagery can provide geospatial situational awareness, and HPC system monitoring data can direct efforts toward faster failure resolution. With both of these examples, most analysis to date has been done on only small subsets. For satellite data, only one or three channels are typically used, few features are studied by a single model, and training sets are cherry-picked. Analysis of systems monitoring data often consists of basic aggregations, but the actionable insight normally comes from identifying specific features of a specific metric (e.g. high filesystem I/O). The promise of novel and unique insight will only come from studying the data holistically. Modern ML models have lowered the barrier to the raw computational analysis, but it remains a challenge to host and serve data to the computing device. Therefore, the major barrier to holistic analysis of large datasets is optimizing the entire processing pipeline to host, serve, and process large-scale data. I propose that the first step toward this goal is to make clearer the requirements of the data and models and how those fit with the systems they are intended to run on. In this talk I'll present basic statistical concepts, modern ML advances, and critical features drawn from these two topics that should be used to measure the requirements for a successful ML/AI project to run at scale.
Alexandru Costan, INRIA – Supporting Efficient Workflow Deployment of Federated Learning Systems on the Computing Continuum
Federated Learning (FL) allows multiple devices to learn a shared model without exchanging private data. A typical scenario involves using constrained devices in a massively distributed environment combining Cloud, Fog and Edge resources, also called Computing Continuum (CC).
Running FL workflows across the CC involves frequent deployment and monitoring in large-scale and heterogeneous environments, while taking into account several objectives such as privacy preservation, quality of prediction, and resource consumption.
To this end, additional tools have to be used to adapt ML workflows and better support the deployment of FL systems. We propose a framework to automatically deploy FL workloads in heterogeneous environments using a formal description of the underlying infrastructure, hyperparameter optimization, and monitoring tools to ease management of the system.
Mohamed Wahib, R-CCS – Adaptively Decaying Datasets in Neural Network Training
We investigate dataset decaying to reduce the total amount of compute in each deep neural network training epoch by eliminating the samples of least importance. We propose an adaptive threshold mechanism to automatically decide whether to eliminate a sample from training based on its importance, e.g., its training loss.
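A minimal sketch of the idea (the threshold rule below, a loss quantile, is a hypothetical stand-in for the adaptive mechanism proposed in the talk):

```python
import numpy as np

def decay_dataset(sample_losses, keep_quantile=0.25):
    """Toy sample-elimination rule: drop the samples whose training loss is
    below an adaptively chosen threshold (here, a quantile of the current
    loss distribution), keeping the 'important' ones for the next epoch."""
    threshold = np.quantile(sample_losses, keep_quantile)
    keep_mask = sample_losses >= threshold
    return np.nonzero(keep_mask)[0]            # indices of samples to keep

losses = np.random.exponential(scale=1.0, size=10_000)   # fake per-sample losses
kept = decay_dataset(losses, keep_quantile=0.25)
print(f"keeping {len(kept)} of {len(losses)} samples for the next epoch")
```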
Silvina Caino-Lores, UTK – Methods, Workflows, and Data Commons for Reducing Training Costs in Neural Architecture Search on High-Performance Computing Platforms
Neural Networks (NNs) are powerful models that have been used in traditional high performance computing (HPC) scientific simulations, and are increasingly deployed in new research areas such as high-throughput data analytics to solve problems in physics, materials science, neuroscience, and medical imaging. Neural Architecture Search (NAS) automates the process of finding near-optimal NN models, but at a high training cost.
In a previous edition of the workshop we presented PENGUIN, a performance estimation engine that predicts neural network performance early in training. PENGUIN plugs into existing NAS methods; it predicts the final accuracy of the NNs that the NAS selects from the search space and reports these predictions to the NAS, enabling early training termination of the NNs in the NAS. NASes augmented by PENGUIN exhibit a throughput gain of 1.6 to 7.1 times. Furthermore, walltime tests indicate that PENGUIN can reduce training time by a factor of 2.5 to 5.3. This enables the NAS to use fewer resources, explore more architectures, or explore a larger search space.
The success of this foundational work led us to develop a full project to define methods to increase NN throughput in NAS: enabling rapid, flexible training termination early in the training process; designing workflows that decouple the search from the prediction of fitness for general NAS implementations across fitness measurements, datasets, and problems; and generating an NN data commons that shares full provenance for a variety of NNs. Our methods, workflows, and NN data commons will support users in studying a large and diverse suite of NNs and connecting those NNs with scientific knowledge embedded in real datasets.
In this presentation we will cover the building blocks to transform existing NAS implementations from tightly-coupled, monolithic software tools embedding both search and prediction into a flexible, modular, and reusable workflow in which search and prediction are decoupled. Our workflow will enable users to reduce training costs; increase NAS throughput; adapt predictions to different fitness measurements, datasets, and problems; and capture the NN’s lifespan through generation, training, and validation stages. In addition, we will present our vision of a searchable and reusable NN data commons that will enable users to study the evolution of NN performance during training and identify relationships between an NN’s architecture and its performance on a given dataset with specific properties, ultimately supporting effective searches for accurate NNs across a spectrum of real-world scientific datasets.
Seid Koric, UIUC/NCSA – Confluence of Numerical Modeling Methods and Artificial Intelligence in Physics-based Simulations
Like many computational fields, physics-based modeling has recently been revolutionized by the use of Artificial Intelligence (AI) techniques. A properly trained deep learning model can almost instantly produce (inference) results rivaling classical modeling methods, without HPC resources or modeling software. Researchers at NCSA and The Grainger College of Engineering have lately developed and used such data-driven and physics-informed surrogate deep learning models to accelerate modeling and design in topological optimization, highly nonlinear material responses, turbulence, sensitivity and design, material processing, and advanced manufacturing. In our short talk, we will provide a quick overview of this novel research.

Short Talks 4 – Numerics

Presenter – Talk Title and Abstract
Atsushi Suzuki, R-CCS – Porting of a domain decomposition solver with a direct solver package on hybrid parallel architectures
Large sparse matrices with high condition numbers, which are not easily solved by Krylov subspace methods even with multigrid preconditioners, appear after discretization of incompressible flow problems, composite material problems, or elasticity contact problems. Domain decomposition methods based on an accurate local solver and a coarse-space solution, which can recover the global information lost after domain decomposition, are powerful solvers for these kinds of problems. For elliptic problems, the GenEO preconditioner builds such a coarse space from local eigenvalue problems, and an implementation exists for distributed parallel environments.
However, to get the best performance of the solver on modern hybrid parallel environments, the local solver must utilize multi-core CPUs through a sparse direct package such as MUMPS, Pardiso, or Dissection. Porting the GenEO solver with a sparse direct solver to various modern architectures with multi-core CPUs and shared memory will benefit simulations of PDE problems.
Daniel Bielich, UTK – Improvement of QR performance in SLATE
The work presented shows what steps were taken to improve the overall performance of the QR factorization within SLATE. This includes off-loading the panel factorization onto the GPU and improving the construction of the triangular blocking factor for Householder orthogonalization.
Hartwig Anzt, UTK – Mixed Precision and Compression Techniques for Memory Bound Linear Algebra
The performance of sparse linear algebra is to a large extent constrained by the communication bandwidth, motivating the recent investigation of sophisticated techniques to avoid, reduce, and/or hide data transfers between processors and between processors and main memory. One promising strategy is to decouple the memory precision from the arithmetic precision and compress the data before invoking communication operations. While this generally comes with a loss of information, the strategy can be reasonable when operating with approximate objects like the preconditioners used in iterative methods. We will present a memory accessor separating the arithmetic precision from the memory precision, and mixed precision algorithms based on the strategy of employing lower precision formats for communication and memory access without impacting the final accuracy.
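The accessor described here is a library-level abstraction; purely as a NumPy toy of the numerical idea (low memory precision, higher arithmetic precision), not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
x64 = rng.standard_normal(1_000_000)           # "true" data in double precision

# Memory precision: store the vector as float32 (half the bytes to move).
x_mem = x64.astype(np.float32)

def dot_accessor(a_mem, b_mem):
    """Toy 'memory accessor': values are read from low-precision storage but
    every arithmetic operation is carried out in double precision."""
    return np.dot(a_mem.astype(np.float64), b_mem.astype(np.float64))

ref = np.dot(x64, x64)
val = dot_accessor(x_mem, x_mem)
print("relative error from storing in float32:", abs(val - ref) / abs(ref))
```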
Wissam Sid-Lakhdar, UTK – PAQR: Pivoting Avoiding QR factorization
The solution of linear least-squares problems is at the heart of many scientific and engineering applications.
While any method able to minimize the backward error of such problems is considered numerically stable, the theory states that the forward error depends on the condition number of the matrix in the system of equations.
On the one hand, the QR factorization is an efficient method to solve such problems, but the solutions it produces may have large forward errors when the matrix is deficient.
On the other hand, QR with column pivoting (QRCP) is able to produce smaller forward errors on deficient matrices, but its cost is prohibitive compared to QR.
The aim of this talk is to propose PAQR, an alternative solution method with the same cost (or smaller) as QR and as accurate as QRCP in practical cases, for the solution of rank-deficient linear least-squares problems.
After presenting the algorithm and its implementations on different architectures, we compare its accuracy and performance results on a variety of application problems.
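A rough NumPy illustration of the underlying idea of skipping nearly dependent columns during the factorization instead of pivoting them; this is a simplified Gram-Schmidt toy, not the PAQR algorithm or its Householder-based implementations:

```python
import numpy as np

def qr_drop_deficient(A, tol=1e-10):
    """Toy factorization: orthogonalize columns left to right and simply drop
    a column when its residual norm falls below tol, instead of moving it to
    the back of the matrix as column pivoting (QRCP) would."""
    m, n = A.shape
    Q, kept = [], []
    for j in range(n):
        v = A[:, j].astype(float).copy()
        for q in Q:
            v -= (q @ A[:, j]) * q
        nrm = np.linalg.norm(v)
        if nrm > tol * np.linalg.norm(A[:, j]):
            Q.append(v / nrm)
            kept.append(j)
        # else: column is numerically dependent on previous ones -> skip it
    Q = np.column_stack(Q)
    R = Q.T @ A[:, kept]
    return Q, R, kept

A = np.random.randn(100, 6)
A[:, 3] = A[:, 0] + 1e-14 * np.random.randn(100)   # make one column redundant
Q, R, kept = qr_drop_deficient(A)
print("kept columns:", kept)
```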
Jan Hückelheim, ANL – Mixed Mode Automatic Differentiation in PyTorch
We have developed a prototype for mixing different gradient computation strategies in PyTorch, in addition to (or instead of) simply using back-propagation. These strategies can be used to compute Jacobians, randomized / approximate gradients, or to reduce memory consumption or run time of training.
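As background on what choosing a differentiation mode means, standard PyTorch (torch.func, PyTorch 2.x) already exposes both reverse- and forward-mode Jacobians; the prototype described in the talk goes beyond this, so treat the snippet only as an illustration of the trade-off:

```python
import torch
from torch.func import jacrev, jacfwd

def f(x):
    # A small vector-valued function: R^n -> R^3.
    return torch.stack([x.sum(), (x ** 2).sum(), torch.sin(x).prod()])

x = torch.randn(1000)

# Reverse mode: one backward sweep per output row -- cheap when outputs << inputs.
J_rev = jacrev(f)(x)

# Forward mode: one JVP per input column -- cheap when inputs << outputs.
J_fwd = jacfwd(f)(x)

print(torch.allclose(J_rev, J_fwd, atol=1e-5))   # same Jacobian either way
```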
Bob Haber, UIUC/NCSA – Spacetime Solvers for Extreme Multiscale Hyperbolic Systems
A research collaboration between NCSA/Illinois and the University of Tennessee has developed a novel spacetime discontinuous Galerkin solver for hyperbolic systems. It uses characteristics-based meshing to discretize spacetime into polytopes (called patches) whose facets are all space-like. This establishes a partial ordering of patches that allows each patch to be solved as a locally implicit problem that depends only on adjacent, previously solved patches and initial/boundary data. Moreover, patch solution is interleaved with an asynchronous, adaptive, advancing-front meshing procedure such that each patch is solved immediately after it is generated, and patch generation, patch solution and localized adaptive spacetime meshing are performed as embarrassingly parallel operations at a common granularity. The result is an unstructured spacetime discretization that is free of synchronous time steps, with extremely dynamic adaptive meshing capabilities and O(N) computational complexity, with N the total number of spacetime patches.

In recent research, we extended the adaptive spacetime meshing procedure to problems in E^d x R (with d the spatial dimension) and developed a fully asynchronous parallel–adaptive solution architecture that does not involve domain decomposition. That is, we abandon the Bulk Synchronous Parallel model and the Domain Decomposition Method to achieve a fully asynchronous, barrier-free scheme for hyperbolic problems with properties that bode well for scaling on exascale platforms. We have developed a number of applications in this framework, including dynamic crack propagation (including probabilistic nucleation, growth and coalescence of fractures), earthquake rupture simulations, electromagnetics, compressible gas dynamics (inviscid Euler equations), and hyperbolic advection–diffusion models.
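A small sketch of the dependency-driven execution model described above (a generic partial-order/task-ordering toy; not the actual meshing or solver code):

```python
from collections import deque

def solve_patches(deps):
    """Toy driver: a patch is solved as soon as all the patches it depends on
    (earlier in the spacetime partial order) have been solved.
    `deps` maps patch -> set of patches whose solutions it needs."""
    remaining = {p: set(d) for p, d in deps.items()}
    dependents = {p: [] for p in deps}
    for p, d in deps.items():
        for q in d:
            dependents[q].append(p)
    ready = deque(p for p, d in remaining.items() if not d)
    order = []
    while ready:
        p = ready.popleft()
        order.append(p)            # "solve" the locally implicit patch problem
        for r in dependents[p]:
            remaining[r].discard(p)
            if not remaining[r]:
                ready.append(r)    # all inflow facets known -> patch is ready
    return order

# Tiny example: patch C needs A and B, patch D needs C.
print(solve_patches({"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}))
```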
Daniel Bielich, UTK – CAQR in SLATE: Offloading Panel to GPU
I am presenting my work in SLATE to improve the performance of CAQR. We push the panel factorization onto the device, calling vendor routines. With this work we have more than doubled performance for square factorizations on Summit and achieved up to a five-times speedup for tall-skinny matrices.

Short Talks 5 – Ecosystems, Services and Resilience

Presenter – Talk Title and Abstract
Kazutomo Yoshii, ANL – Prototyping custom hardware accelerators leveraging the RISC-V ecosystem
Hardware specialization is becoming an important factor in improving the performance and efficiency of computing in the post-Moore era. While productization of specialized accelerators may still need to be done by industry, as it requires significant effort and budget, feasibility studies (e.g., design, verification, resource estimation) of hardware algorithms at the academic level could help industry adopt our custom acceleration needs in future architectures. As RISC-V, an open instruction set architecture, and its open-source ecosystem are becoming mature, it is now possible to extend a CPU core with custom accelerators and evaluate them in a simulated environment relatively easily. In this short talk, I will give a summary of RISC-V and its ecosystem and explain the steps to extend RISC-V and simulate it.
Gabriel Antoniu, INRIA – Towards Integrated Hardware/Software Ecosystems for the Edge-Cloud-HPC Continuum: the Transcontinuum Initiative
Modern use cases such as autonomous vehicles, digital twins, smart buildings and precision agriculture, greatly increase the complexity of application workflows. They typically combine physics-based simulations, analysis of large data volumes and machine learning and require a hybrid execution infrastructure: edge devices create streams of input data, which are processed by data analytics and machine learning applications in the Cloud, and simulations on large, specialised HPC systems provide insights into and prediction of future system state. All of these steps pose different requirements for the best suited execution platforms, and they need to be connected in an efficient and secure way. This assembly is called the Computing Continuum (CC). It raises challenges at multiple levels: at the application level, innovative algorithms are needed to bridge simulations, machine learning and data-driven analytics; at the middleware level, adequate tools must enable efficient deployment, scheduling and orchestration of the workflow components across the whole distributed infrastructure; and, finally, a capable resource management system must allocate a suitable set of components of the infrastructure to run the application workflow, preferably in a dynamic and adaptive way, taking into account the specific capabilities of each component of the underlying heterogeneous infrastructure. This talk introduces TCI – the Transcontinuum Initiative – a European multidisciplinary collaborative action aiming to identify the related gaps for both hardware and software infrastructures to build CC use cases, with the ultimate goal of accelerating scientific discovery, improving timeliness, quality and sustainability of engineering artefacts, and supporting decisions in complex and potentially urgent situations.
Jakob Luettgau, UTK – Toward a Lightweight Indexing Service for the National Science Data Fabric
Across domains, massive amounts of scientific data are generated that are useful beyond their original purpose. Yet the discoverability of these data is often poor, especially for researchers and students from other domains. As part of the NSF-funded National Science Data Fabric (NSDF) initiative (http://nationalsciencedatafabric.org/) we developed a testbed to demonstrate that these boundaries can be overcome. As part of our effort, we identified the need for indexing large amounts of scientific data across scientific domains.
Instead of waiting for the development of a metadata convention across domains, we propose to build a lightweight indexing service with minimal metadata that complements existing domain-specific, rich-metadata collection efforts. The NSDF-Catalog is designed to facilitate multiple related objectives within a flexible microservice: 1) coordinate data movements and replication of data from origin repositories within the NSDF federation; 2) build an inventory of existing scientific data to inform the design of next-generation cyberinfrastructure; and 3) provide a suite of tools for discovery of datasets for cross-disciplinary research. Our service indexes at fine granularity, at the file or object level, to inform data distribution strategies and to improve the experience for users from the consumer perspective, with the goal of allowing end-to-end workflow optimizations.
Aurelien Bouteiller, UTK – How to overlap recovery costs in MPI
The User Level Failure Mitigation (ULFM) proposal to integrate fault handling in MPI is undergoing significant changes that are intended to enable a more asynchronous, non-blocking approach to recovering MPI state. This opens new opportunities for other recovery activities to happen concurrently, thus decreasing the overall cost of each recovery event.
Justin M. Wozniak, ANL – Building infrastructure and methods for policy-relevant epidemiological modeling
Our team, joint between ANL and INRIA, is building a sustainable simulation, data, decision support, and learning collaborative platform for pandemic monitoring, robust scenario analyses under uncertainty, and rapid response, known as the OSPREY platform. The OSPREY design will emphasize flexible, dynamic, and scalable approaches to directly support rapid exploration, experimentation, verification, and validation as public health crises evolve.
Jens Domke, R-CCS – Octopodes: A candidate to replace Mini-Apps and Motifs?
The use of proxy applications expanded and improved the co-design capability of modern supercomputers, but we believe that current hardware trends and software complexities require a new set of tools for the co-design of post-exascale supercomputers and federated HPC/data centers to better capture, analyze, and model existing and future workload demands. To open the floor for future, community-wide discussions, we will outline the state of the art and its shortcomings, and propose an alternative, hopefully better suited, set of highly parameterizable, easily amendable, Motif-like problem representations which we call Octopodes. These algorithms or complex operations shall not replace proxy applications entirely, but supersede them as the primary target in the co-design process. Octopodes will hopefully become the common language between HPC users, system operators, co-designers, and vendors to describe the to-be-solved scientific challenges, what needs to be computed, and how it can be computed, in an abstract way. This approach allows for more flexibility in hardware/software design and selection to match users' needs with the best architecture, instead of fine-tuning legacy architectures to legacy implementations.

Short Talks 6 – Performance Tools

Presenter – Talk Title and Abstract
Brice Videau, ANL – THAPI: Tracing Heterogeneous APIs
THAPI is a tracing framework developed at ANL and used to debug and profile HPC applications and runtimes with minimal overhead and fine granularity. THAPI supports tracing the OpenCL, Level-Zero, CUDA, and OpenMP offload programming models. THAPI's modular design and flexibility allow a broad range of usage scenarios, from creating simple run summaries to complex API usage validation and rich visualizations. In this talk we will present the developments and improvements to THAPI since we first presented the project at JLESC 11, and discuss future perspectives and features.
Frédéric Vivien, INRIA – Dynamic Scheduling Strategies for Firm Semi-Periodic Real-Time Tasks
This work introduces and assesses novel strategies to schedule firm semi-periodic real-time tasks. Jobs are released periodically and have the same relative deadline. Job execution times obey an arbitrary probability distribution and can take either bounded or unbounded values. We investigate several optimization criteria, the most prominent being the Deadline Miss Ratio (DMR). All previous work uses admission policies but never interrupts the execution of an admitted job before its deadline. On the contrary, we introduce three new control parameters to dynamically decide whether to interrupt a job at any given time. We derive a Markov model and use its stationary distribution to determine the best value of each control parameter. Finally, we conduct an extensive simulation campaign with 16 different probability distributions. The results nicely demonstrate how the new strategies help improve system performance compared with traditional approaches. In particular, we show that (i) compared to pre-execution admission rules, the control parameters make significantly better decisions; (ii) specifically, the key control parameter is to upper-bound the waiting time of each job; (iii) the best scheduling strategy decreases the DMR by up to 0.35 over traditional competitors.
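A toy Monte-Carlo sketch of the setting and of the "bound the waiting time" control parameter (single processor, FIFO order, hypothetical parameter values; the strategies and the Markov analysis in the talk are far more elaborate):

```python
import random

def simulate_dmr(n_jobs=100_000, period=1.0, deadline=2.5, wait_bound=1.0,
                 exec_time=lambda: random.expovariate(1.0)):
    """Firm semi-periodic tasks: one job released every `period`, relative
    deadline `deadline`. A job still waiting after `wait_bound` is dropped;
    a job that reaches its deadline while running is interrupted.
    Returns the Deadline Miss Ratio (DMR)."""
    free_at, misses = 0.0, 0
    for k in range(n_jobs):
        release = k * period
        start = max(free_at, release)
        if start - release > wait_bound:
            misses += 1                      # dropped before execution
            continue
        finish = start + exec_time()
        if finish > release + deadline:
            misses += 1                      # interrupted at its deadline
            free_at = release + deadline
        else:
            free_at = finish
    return misses / n_jobs

random.seed(0)
for wb in (0.5, 1.0, 2.0, float("inf")):
    print(f"wait bound {wb}: DMR = {simulate_dmr(wait_bound=wb):.3f}")
```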
Radita Liem (RWTH Aachen University), JSC – PERMAVOST: Bringing Together Stakeholders in Performance Analysis and Engineering
In this talk, I am going to present the PERMAVOST workshop, held in conjunction with HPDC in 2021 and 2022. The workshop aims to bring together the stakeholders of performance analysis: domain scientists with limited knowledge of the tools but a clear idea of what they want from their applications, performance analysts, and the tool developers themselves.
There is a need to create collaboration between these stakeholders and form a feedback loop before we create more tools that are difficult for their intended users to use or do not answer their problems, and for domain scientists to get guidelines from experts in this field on what to look out for when developing and improving their applications.
Daniel Barry, UTK – Benchmark Reproducibility as a Tool in Native Hardware Event Identification
The Counter Analysis Toolkit (CAT) is a feature of the Performance API (PAPI) which aims to identify the semantic meaning of native hardware events. CAT measures the occurrences of events through a series of benchmarks. The patterns of these event occurrences inform the high-level meanings of native events. One problem involved in analyzing these patterns is detecting which of them form a linearly independent basis. We leverage the consistency of an event's occurrence pattern across repeated executions of the benchmarks to help solve this problem.
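The linear-independence question can be illustrated with a small NumPy check (toy numbers; CAT's actual analysis works on measured event-count patterns):

```python
import numpy as np

def independent_patterns(patterns, tol=1e-8):
    """Given one event-occurrence pattern per row (counts measured across a
    series of benchmarks), report whether the patterns form a linearly
    independent set, i.e. whether the matrix has full row rank."""
    M = np.asarray(patterns, dtype=float)
    return np.linalg.matrix_rank(M, tol=tol) == M.shape[0]

# Toy example: the third "event" is just the sum of the first two,
# so its pattern carries no independent information.
p1 = [1, 2, 4, 8, 16]
p2 = [0, 1, 1, 2, 3]
p3 = [1, 3, 5, 10, 19]
print(independent_patterns([p1, p2]))       # True
print(independent_patterns([p1, p2, p3]))   # False
```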
Daniel Barry, UTK – Exa-PAPI Demo: New Features to Aid App Performance on Extreme-Scale Architectures
Application performance on different architectures can be monitored by reading the occurrences of various hardware events. The Performance API (PAPI) serves as an easy-to-use and consistent interface for monitoring hardware events across the entire compute system.
This talk demonstrates new monitoring capabilities in PAPI, such as the ‘sysdetect’ component, which detects details of the available hardware on a given compute system. The goal is to provide a consistent interface for the application to get information about the topology of the hardware exposed to their algorithms, specific aspects about the memory hierarchy, number and type of GPUs on a node, number and type of CPUs on a node, what nodes can access shared memory or are on the same rack, to name but a few examples.
In addition, we demo how users can monitor memory intensity or bandwidth of an application on different architectures, including Fujitsu A64FX, IBM Power, and the latest Intel processors.
Anshu Dubey, ANL – Language agnostic performance portability
We have developed a methodology and a set of tools that can enable performance portability for heterogeneous platforms without relying upon the application being in a specific programming language. These tools are being applied to Flash-X, a new multiphysics simulation software instrument.

Project Talks 1 – Numerics

Presenter – Talk Title and Abstract
Yifan Yao, UIUC – Fast Integrators for Scalable Quantum Molecular Dynamics
A comprehensive understanding of the ultrafast response of materials under different types of radiation is critically important for investigating and designing functional materials in semiconductors, 2D materials, and medicine. Numerical simulations within real-time time-dependent density functional theory have been successfully employed to examine such ultrafast response of electrons in materials, but at the same time are extremely challenging from a computational cost perspective. A limitation of current numerical-integration techniques for the underlying time-dependent Kohn-Sham equations is the small time step required for numerical stability. In this project, we interfaced Qb@ll with the PETSc library to test existing and explore novel numerical time-stepping algorithms in physical simulations. Here we will report on our most recent work, which includes a linear stability analysis of the eigenvalues of the Jacobian of the Hamiltonian for various materials simulation systems under different radiation conditions that are of interest in practice. These eigenvalues provide a reasonable estimate of the largest stable time step of different time-stepping methods. Further, this stability analysis guides us to explore adaptive time-stepping methods, as the eigenvalues of the Jacobian vary under some circumstances, for example when a large materials system is under strong laser irradiation, expressed by means of a time-dependent electric field.
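As a compact illustration of how Jacobian eigenvalues bound the stable step size (a toy forward-Euler stability criterion; the integrators studied in the project are more sophisticated):

```python
import numpy as np

def max_stable_dt_forward_euler(eigvals):
    """Forward Euler is stable when |1 + dt*lam| <= 1 for every eigenvalue
    lam of the Jacobian, i.e. dt <= -2*Re(lam)/|lam|^2 for each eigenvalue
    with negative real part."""
    lams = np.asarray(eigvals, dtype=complex)
    lams = lams[lams.real < 0]
    if len(lams) == 0:
        return np.inf
    return float(np.min(-2.0 * lams.real / np.abs(lams) ** 2))

# Toy Jacobian of a stiff linear system: the fastest mode dictates the step.
J = np.diag([-1.0, -10.0, -1000.0])
print(max_stable_dt_forward_euler(np.linalg.eigvals(J)))   # ~0.002
```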
Toshiyuki Imamura, R-CCS – HPC libraries for solving dense symmetric eigenvalue problems
This project has continued since the early days of JLESC, and we would like to summarize this project by reviewing the evolution of HPC eigenvalue solvers over the past seven years or thereabouts.
Sri Hari Krishna Narayanan, ANL & Jan Hückelheim – Shared Infrastructure for Source Transformation Automatic Differentiation
We will present a number of developments.
We have implemented a novel combination of reverse mode automatic differentiation and formal methods to enable efficient differentiation of (or backpropagation through) shared-memory parallel loops in OpenMP, reducing the need for atomic updates or private data copies during the parallel derivative computation. We have demonstrated this approach on a number of scientific computing benchmarks.
We have developed SICOPOLIS-AD v2, leveraging Tapenade to compute the adjoint of the SICOPOLIS ice sheet model (SImulation COde for POLythermal Ice Sheets). These adjoint sensitivities of the quantities of interest in ice sheet modeling to important independent input parameters are crucial in improving the accuracy of ice sheet simulations.
Based on the success of differentiating SICOPOLIS, we have begun differentiating the MIT global circulation model (MITgcm) using Tapenade (other AD tools were used in the past). MITgcm can be used to study both atmospheric and oceanic phenomena; one hydrodynamical kernel is used to drive forward both atmospheric and oceanic models.

Project Talks 2 – Workflows, Performance, Cloud

Presenter – Talk Title and Abstract
Orcun Yildiz and Tom Peterka, ANL – Extreme-Scale Workflow Tools – Swift, Decaf, Damaris, FlowVR
Today's scientific workflows are increasingly complex and include a large number of tasks with dynamic data and computation requirements. In this project, we explore providing increased capabilities for scientific computing to meet such requirements. In particular, we are developing LowFive, a data model specification, redistribution, and communication library built on top of HDF5-VOL. LowFive serves as the base layer for our new in situ workflow management system, Wilkins. Wilkins provides a data-centric API for defining the workflow graph, creates and launches tasks, establishes communicators between the tasks, and allows for dynamic changes to the workflow tasks once they are running. In this talk, we will present the recent developments in the LowFive and Wilkins libraries, which we hope will increase understanding and motivate further research into dynamic heterogeneous in situ workflows.
Brian Wylie – Developer tools for porting and tuning parallel applications on extreme-scale parallel systems
Developments in the partners' tools and their interoperability will be reported, along with their use with large-scale parallel applications on a variety of JLESC members' supercomputers. In particular, porting of the tools to Fujitsu A64FX and AMD EPYC Rome CPUs and AMD MI250X GPUs will be reviewed, along with early experience analysing flagship application execution performance. There will also be an update on recent and upcoming tools training and application scaling events.
Swann Perarnau, ANL – Improving the Performance and Energy Efficiency of HPC Applications Using Autonomic Computing Techniques
As applications struggle to make use of increasingly heterogeneous compute nodes, maintaining high efficiency (performance per watt) for the whole platform becomes a challenge. Alongside the growing complexity of scientific workloads, this extreme heterogeneity is also an opportunity: as applications dynamically undergo variations in workload, due to phases or data/compute movement between devices, one can dynamically adjust power across compute elements to save energy without impacting performance. With an aim toward an autonomous and dynamic power management strategy for current and future HPC architectures, this project made several advances on the use of control theory for the design of a dynamic power regulation method. Structured as a feedback loop, our approach consists of periodically monitoring application progress and choosing at runtime a suitable power cap for processors. This presentation will provide an update on recent publications in this project, and our future plans for a more adaptive and comprehensive power management infrastructure at the node level.
Daniel Rosendo, INRIA – Advancing Chameleon and Grid'5000 testbeds
Distributed digital infrastructures for computation and analytics are now evolving towards an interconnected ecosystem allowing complex applications to be executed from IoT Edge devices to the HPC Cloud (a.k.a. the Computing Continuum). Understanding end-to-end performance in such a complex continuum is challenging. This breaks down to reconciling many, typically contradictory, application requirements and constraints with low-level infrastructure design choices. One important challenge is to accurately reproduce relevant behaviors of a given application workflow and representative settings of the physical infrastructure underlying this complex continuum.
The main research goal of this project is to enable scientists to effectively reproduce and explore experiments run in the Chameleon Cloud, CHI@Edge, and Grid5000 testbeds. Our ultimate goal is to lower the barrier to reproducing research by combining the reproducible artifacts and the experimental environment. We will demonstrate how our Jupyter/Trovi approach for reproducibility helps scientists to reproduce complex Edge-to-Cloud workflows across Chameleon/CHI@Edge/G5K.

Project Talks 3 – AI/ML

Presenter – Talk Title and Abstract
Thomas Bouvier, INRIA – Towards Continual Learning at Scale
During the past decade, deep learning (DL) has supported the shift from rule-based systems towards statistical models. Deep Neural Networks (DNNs) achieve high accuracy on various benchmarks by extracting patterns from complex datasets. Although they show promising results, most existing supervised learning algorithms operate under the assumptions that the data is (i) i.i.d.; (ii) static; and (iii) available before the training process. These constraints limit their use in real-life scenarios where the aforementioned datasets are replaced by high-volume, high-velocity data streams generated over time by distributed devices. It is unfeasible to keep training models offline from scratch every time new data arrives, as this would lead to prohibitive time and/or resource costs. At the same time, it is not possible to train learning models incrementally either, due to catastrophic forgetting, a phenomenon causing typical DNNs to reinforce new patterns at the expense of previously acquired knowledge, i.e., inducing biases.
In this talk, we will present techniques based on rehearsal to achieve Continual Learning at scale. Rehearsal-based approaches leverage representative samples previously encountered during training to augment future minibatches. The key novelty we address is how to adopt rehearsal in the context of data-parallel training, which is one of the main techniques to achieve training scalability on HPC systems. The goal is to design and implement a distributed rehearsal buffer that handles the selection of representative samples and the augmentation of minibatches asynchronously in the background. We will discuss the trade-offs introduced by such a continual learning setting in terms of training time, accuracy, and memory usage.
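A single-process toy of the rehearsal mechanism (reservoir sampling into a bounded buffer, then minibatch augmentation); the contribution of the talk is making this distributed and asynchronous, which the sketch does not attempt:

```python
import random

class RehearsalBuffer:
    """Bounded buffer of representative samples maintained by reservoir
    sampling, used to augment incoming minibatches."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.samples = []
        self.seen = 0

    def add(self, sample):
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:        # keep each seen sample with prob capacity/seen
                self.samples[j] = sample

    def augment(self, minibatch, k):
        """Return the minibatch extended with up to k rehearsal samples."""
        extra = random.sample(self.samples, min(k, len(self.samples)))
        return list(minibatch) + extra

stream = [[(step, i) for i in range(32)] for step in range(100)]   # fake data stream
buffer = RehearsalBuffer(capacity=1000)
for new_batch in stream:
    train_batch = buffer.augment(new_batch, k=8)   # train on this augmented batch
    for sample in new_batch:                        # only new samples feed the reservoir
        buffer.add(sample)
```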
Mario Rüttgers, AIA RWTH, JSC – Deep Neural Networks for CFD Simulations
The collaboration in the framework of JLESC focuses on the prediction of flow fields using machine learning (ML) techniques. The basis for the project is a set of jointly developed convolutional neural networks (CNNs) with an autoencoder-decoder type architecture. Based on these CNNs, flow fields are predicted from two different perspectives. The group from RIKEN and Kobe University uses the networks to investigate dimensionality-reduction techniques for a three-dimensional flow field. They are implemented with a performance-effective distributed parallel scheme on Fugaku. Furthermore, the time evolution of the reduced-order space is evaluated using a reduced-order model (ROM) based on long short-term memory networks (LSTMs). The group from JSC and RWTH Aachen University uses the jointly developed CNNs to accelerate simulations of respiratory flows. The integration of numerical flow simulations into daily clinical practice represents a key aspect of improving diagnostics and treatments in rhinology. Such an integration is, however, only feasible if the involved numerical methods provide fast results. Considering that the convergence of steady-state simulations may take hours, even on HPC systems, an improved flow-field initialization to accelerate convergence is investigated. That is, simulations are initialized with flow fields predicted by a physics-informed CNN, whose loss function solely consists of the equations for mass and momentum conservation.
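A minimal PyTorch sketch of a mass-conservation (divergence) loss term of the kind mentioned above, applied to a 2-D velocity field predicted by a CNN (finite-difference form and grid spacing are placeholders; the loss used in the project also includes momentum conservation):

```python
import torch

def divergence_loss(u, v, dx=1.0, dy=1.0):
    """Mass-conservation penalty for an incompressible 2-D flow prediction:
    mean squared divergence du/dx + dv/dy, using central finite differences.
    u, v: velocity components of shape (batch, H, W)."""
    du_dx = (u[:, :, 2:] - u[:, :, :-2]) / (2 * dx)
    dv_dy = (v[:, 2:, :] - v[:, :-2, :]) / (2 * dy)
    # Crop to the common interior so the two derivative fields align.
    return ((du_dx[:, 1:-1, :] + dv_dy[:, :, 1:-1]) ** 2).mean()

u = torch.randn(4, 64, 64, requires_grad=True)   # stand-in for CNN output
v = torch.randn(4, 64, 64, requires_grad=True)
loss = divergence_loss(u, v)
loss.backward()        # gradients flow back into the predicted velocity field
print(loss.item())
```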
Paula Olaya, UTK – Machine Learning-driven Predictive Analysis of Protein Diffraction Data
Capturing structural information of a biological molecule is crucial to determining its function and understanding its mechanics. X-ray Free Electron Lasers (XFELs) are an experimental method used to create diffraction patterns (images) that can reveal structural information. In this project, we present the design, implementation, and evaluation of XPSI (X-ray Free Electron Laser-based Protein Structure Identifier), a framework capable of predicting three structural properties of molecules (i.e., orientation, conformation, and protein type) from their diffraction patterns. XPSI predicts these properties with high accuracy in challenging scenarios, such as recognizing orientations despite symmetries in diffraction patterns, distinguishing conformations even when they have similar structures, and identifying protein types under different noise conditions.
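A schematic sketch of such a multi-property predictor is shown below (in PyTorch; the architecture and class names are illustrative assumptions, not XPSI's actual design): a shared encoder extracts features from a diffraction image and three heads predict orientation, conformation, and protein type.

import torch
import torch.nn as nn

class MultiPropertyClassifier(nn.Module):
    """Shared CNN encoder with one prediction head per structural property."""
    def __init__(self, n_orientations, n_conformations, n_proteins):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.orientation_head = nn.Linear(32, n_orientations)
        self.conformation_head = nn.Linear(32, n_conformations)
        self.protein_head = nn.Linear(32, n_proteins)

    def forward(self, diffraction_image):
        z = self.encoder(diffraction_image)            # (batch, 32) feature vector
        return (self.orientation_head(z),
                self.conformation_head(z),
                self.protein_head(z))

# Usage: logits_orient, logits_conf, logits_prot = model(images)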

Project Talks 4 – Resilience and Compression

Presenter – Talk Title and Abstract
Yves Robert – Optimization of Fault-Tolerance Strategies for Workflow Applications
Everything is in the title. Just relax, sit back and enjoy the talk.
Jon Calhoun, Clemson University – Effective Use of Lossy Compression for Numerical Linear Algebra Resilience and Performance
As HPC users seek to solve larger and more complex problems, memory requirements grow. To run these larger problems, users need access to new systems with larger amounts of memory, which requires large capital investments. Reducing the memory footprint of an application allows it to run on current systems with no additional investment. Lossy data compression has been shown to be an effective technique for significantly reducing the size of HPC data. Using in-line lossy data compression shrinks the memory footprint, but incurs performance penalties and makes the data vulnerable to silent data corruption. In this talk, we discuss the challenges of obtaining good performance and why in-line compression increases an application's susceptibility to silent data corruption. Finally, we present in-progress work that addresses these concerns.
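The following sketch illustrates the in-line compression pattern discussed in the talk: matrix blocks are stored compressed and decompressed only while they are used. To stay self-contained, the sketch substitutes lossless zlib for a lossy HPC compressor such as SZ or ZFP, and the function names are illustrative assumptions; it also assumes the row count divides evenly into blocks.

import zlib
import numpy as np

def compress_block(a):
    """Compress a matrix block for in-line storage (zlib stands in for a lossy compressor)."""
    return zlib.compress(a.astype(np.float64).tobytes()), a.shape

def decompress_block(blob, shape):
    return np.frombuffer(zlib.decompress(blob), dtype=np.float64).reshape(shape)

def compressed_matvec(blocks, x, block_rows):
    """Blocked mat-vec where only the active block lives uncompressed in memory."""
    y = np.zeros(len(blocks) * block_rows)
    for i, (blob, shape) in enumerate(blocks):
        block = decompress_block(blob, shape)          # decompress on demand
        y[i * block_rows:(i + 1) * block_rows] = block @ x
    return y

# Usage: build compressed blocks once, then operate on them block by block.
rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 256))
block_rows = 128
blocks = [compress_block(A[i:i + block_rows]) for i in range(0, A.shape[0], block_rows)]
y = compressed_matvec(blocks, rng.standard_normal(256), block_rows)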
Argonne and RIKEN R-CCS – Compression for instruments
The talk will present the progress made for the LCLS and SPring-8 light sources to significantly reduce the data produced by their detectors while preserving the potential for scientific discoveries.
Concerning LCLS, we have prepared a new compression method, ROIBIN-SZ3, specialized for serial crystallography at scale. It combines lossless preservation of regions of interest with binning plus SZ lossy compression of the background information, enabling very high compression ratios at high throughput while preserving scientific integrity.
We have also made progress on the port of this code to NVIDIA GPUs.
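A dependency-free sketch of the ROI-plus-binning idea follows (illustrative only; ROIBIN-SZ3 itself additionally pushes the binned background through SZ3 and is engineered for throughput and GPUs).

import numpy as np

def roibin_compress(frame, roi_mask, bin_factor=2):
    """Keep region-of-interest pixels losslessly; bin the background before lossy compression.
    frame: 2D detector image; roi_mask: boolean mask marking peak regions to preserve."""
    roi_values = frame[roi_mask]                       # stored losslessly
    background = np.where(roi_mask, 0, frame)          # zero out ROI pixels
    h, w = background.shape
    h2, w2 = h // bin_factor * bin_factor, w // bin_factor * bin_factor
    binned = background[:h2, :w2].reshape(
        h2 // bin_factor, bin_factor, w2 // bin_factor, bin_factor).mean(axis=(1, 3))
    # In ROIBIN-SZ3 the binned background would now go through SZ lossy compression;
    # here it is simply returned to keep the sketch dependency-free.
    return roi_values, np.argwhere(roi_mask), binned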
Concerning SPring-8, we productized TEZip, an AI-based data compression tool. We implemented lossy compression modes in accordance with SZ: absolute error bound (abs), relative bound ratio (rel), a combination of the two, and pointwise relative error bound (pwrel). We also prepared user and developer documentation on the project's Read the Docs page.
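For illustration, these error-bound modes can be summarized as checks on the decompressed data (a sketch under the usual SZ-style definitions; this is not TEZip's actual code).

import numpy as np

def within_bound(original, decompressed, mode, bound):
    """Check an SZ-style error bound on a decompressed array (illustrative only)."""
    err = np.abs(original - decompressed)
    if mode == "abs":                        # absolute error bound
        return np.all(err <= bound)
    if mode == "rel":                        # bound relative to the global value range
        value_range = original.max() - original.min()
        return np.all(err <= bound * value_range)
    if mode == "pwrel":                      # pointwise relative error bound
        return np.all(err <= bound * np.abs(original))
    raise ValueError(mode)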
The talk will be presented by two experts: Robert Underwood for LCLS and Kento Sato for SPring-8.
Nigel Tan, UTK – Towards Scalable GPU-Accelerated Incremental Checkpointing of Sparsely Updated Data Structures
Checkpointing large amounts of related data concurrently to stable storage is a common I/O pattern of many HPC applications in a variety of scenarios: checkpoint-restart fault tolerance, coupled workflows that combine simulations with analytics, adjoint computations, etc. This pattern is challenging because it needs to happen frequently and typically leads to I/O bottlenecks that negatively impact the performance and scalability of the applications.
Furthermore, checkpoint sizes are continuously increasing and overwhelm the capacity of the storage stack, prompting the need for data reduction. A large class of applications, including graph algorithms such as graph alignment, perform sparse updates to large data structures between checkpoints. In this case, incremental checkpointing approaches that save only the differences from one checkpoint to another can dramatically reduce checkpoint sizes, which reduces both the I/O bottlenecks and the storage capacity utilization.

However, such techniques are not without challenges: it is non-trivial to transparently determine what data has changed since a previous checkpoint and to assemble the differences in a compact fashion that does not result in excessive metadata. State-of-the-art deduplication techniques have limited support for addressing these challenges in modern applications that manipulate data structures directly on GPUs.

In this work, we aim to fill this gap by proposing a hash-based incremental checkpointing technique specifically designed to take advantage of the high memory bandwidth and massive parallelism of GPUs. Our approach builds a compact representation of the differences between checkpoints using Merkle-tree-inspired data structures optimized for parallel construction and manipulation. Our results show a significant reduction in checkpointing overheads and sizes compared with traditional checkpointing approaches, and a significant reduction in metadata overheads compared with GPU-enabled deduplication techniques.
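A small CPU-side sketch of the hash-based idea follows (illustrative only; the actual approach builds Merkle-tree-like structures on the GPU): the state is split into fixed-size blocks, each block is hashed, and only blocks whose hashes changed since the previous checkpoint need to be persisted.

import hashlib
import numpy as np

def block_hashes(data, block_size):
    """Hash fixed-size byte blocks of a flat array; these are the leaves of a Merkle-style tree."""
    raw = data.tobytes()
    return [hashlib.sha256(raw[i:i + block_size]).digest()
            for i in range(0, len(raw), block_size)]

def changed_blocks(prev_hashes, curr_hashes):
    """Indices of blocks whose content differs from the previous checkpoint."""
    return [i for i, (p, c) in enumerate(zip(prev_hashes, curr_hashes)) if p != c]

# Usage: only modified blocks are written as the incremental checkpoint.
state = np.zeros(1 << 20, dtype=np.float32)
prev = block_hashes(state, block_size=4096)
state[123456] = 1.0                        # sparse update between checkpoints
curr = block_hashes(state, block_size=4096)
delta = changed_blocks(prev, curr)         # small list of block indices to persist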

Break Out Session – Heterogeneous and reconfigurable architectures for the future of computing

Session Chairs – Talk Title and Abstract
Kentaro Sano, Tomohiro Ueno, RIKEN;
Franck Cappello, ANL;
Kazutomo Yoshii, ANL;
Xavier Martorell, BSC;
Daniel Jimenez-Gonzalez, BSC;
Carlos Alvarez Martinez, BSC
The forthcoming end of Moore's law encourages us to explore new approaches for the future of computing. One promising approach is heterogeneous architecture with reconfigurable devices such as FPGAs and CGRA processors, which leverages hardware specialization and/or dataflow computing. In this break-out session, we will discuss subjects and opportunities related to custom accelerators, new architectures, and new programming paradigms, with talks on recent research activities. We also plan to exchange research ideas among attendees and discuss whether the scope and direction of our JLESC collaboration need adjustment.
– Mohamed El-Hadedy, UIUC/California State Polytechnic University, mealy@cpp.edu, “Lightweight cryptography engines for FPGA and ASIC”

– Thomas Applencourt, ANL, tapplencourt@anl.gov, “Programming heterogeneous platforms with SYCL”

– Albert Kahira, Jülich Supercomputing Centre (JSC), a.kahira@fz-juelich.de (https://www.researchgate.net/profile/Albert-Kahira), “Introduction of OPTIMA-HPC Project” (https://optima-hpc.eu)

– Tomohiro Ueno, RIKEN R-CCS, tomohiro.ueno@riken.jp, “XXXX”

– Carlos Álvarez/Daniel Jiménez/Juan Miguel de Haro, BSC-CNS, carlos.alvarez@bsc.es, “Programming heterogeneous clusters with OmpSs”.

Break Out Session – Quantum Computing and HPC

Session Chair – Talk Title and Abstract
Miwako Tsuji, RIKEN
Quantum computing is computation that exploits the properties of quantum states, and it is considered an important building block for the post-Moore era. This break-out session aims to introduce research and activities in the quantum computing area. In particular, we would like to focus on hybrid/cooperative computations combining quantum and classical computing.

– Miwako Tsuji (RIKEN)
– Kentaro Sano (RIKEN)
– Yuri Alexeev (ANL)
– Hannes Lagemann (JSC)

Break Out Session – CI for HPC

Session Chair – Talk Title and Abstract
Robert Speck, JSC
For HPC codes, clean and careful software engineering is a crucial and challenging aspect. This is particularly true for continuous testing, integration, benchmarking, and deployment (CT/CI/CB/CD = Cx): different architectures, different compilers, different degrees and types of parallelism, different software stacks, restricted access to machines, and so on. All of these aspects need to be addressed, and while various approaches exist, a common strategy in the field of HPC is still missing. The goal of this BOS is to bring together experts, beginners, and interested researchers in this field to learn from each other and to see what others are doing, how, and why.

– Robert Speck (JSC): CI for HPC – Introduction (project talk!)
– Jakob Fritz (JSC): Automated Testing of Parameter-Spaces
– Darko Marinov (NCSA): Flaky Tests in Continuous Integration
– Yoshifumi Nakamura (R-CCS): CI/CD with Fugaku at R-CCS

Break Out Session – Women in HPC

Session Chair – Talk Title and Abstract
Sharon Broude Geva, JSC
On five continents, many institutions and organizations are already working together with WHPC. In this BOS, the local networks Women@NCSA (NCSA), WiCS: Women in Computer Sciences (BSC), and JuWinHPC: Jülich Women in HPC (JSC) introduce themselves.

Topics covered in this BOS include the establishment of the individual networks, their activities, goals, and values. Members of NCSA, BSC, and JSC will learn how they can benefit from their local networks; members of other organizations may be interested in starting their own network and can get advice on how to do so. Although the networks were founded by women, ANYONE interested in the topic of equal opportunities is encouraged to join. We look forward to lively discussions, questions, and comments of all kinds.

– Jewel Malu Goodly, NCSA, “Women@NCSA”.
– Carolin Penke, JSC, “JuWinHPC: Jülich Women in HPC”.
– Marta Gonzalez, BSC, “WiCS: Women in Computer Sciences”.