About The Project
The Heterogeneous Architecture and Design research project is conducted at the Advanced Digital Sciences Center (ADSC) as part of the Interactive Digital Media (IDM) subprogram. The Principal Investigator leading our research effort is Professor Deming Chen from the University of Illinois at Urbana-Champaign.
Computing system design continues to grow in complexity. Compute performance demands continue to increase at the same time that devices demand greater power efficiency. To help meet these demands, computing systems increasingly use heterogeneous compute elements. Current popular platforms include NVIDIA's Tegra platform, Apple's A4/A6 platform, AMD's Fusion architecture, Intel's CPUs with embedded GPUs, and platforms from Xilinx and Altera with CPUs embedded in FPGA hardware. However, the additional complexity of these systems also demands increased designer expertise to properly take advantage of the platform's abilities. This project seeks to lower this barrier to entry: expert use of next-generation hardware platforms should not require extensive and detailed hardware design experience. To this end, the hardware design team at ADSC is undertaking several efforts to simplify the process of obtaining high-quality algorithm implementations on heterogeneous platforms, including work in both GPU kernel optimization and High Level Synthesis (HLS) to generate FPGA designs.
Expert Design and Interactive Digital Media
ADSC's central project theme of interactive digital media is a main driving application for expert design without expertise. With the continuing rise of interactive digital media there is a wealth of media applications, and the algorithms for these applications are rapidly evolving to meet quality, performance, and energy consumption goals. However, algorithm designers are commonly not experienced in hardware design, and the time-to-market constraints of new media applications demand fast design iterations and low end-to-end design time. In past media applications, an algorithm designer could be an expert software engineer: effectively designing their own algorithm enhancement, creating an efficient implementation, and mapping that implementation to the (normally CPU-only) platform. However, although software design experience is widespread, hardware design experience is still relatively rare, and it is rarer still for a media domain expert to also have detailed hardware design experience. Thus, it is critical to develop tool flows that allow typical software designers to effectively utilize the resources available to them on next-generation hardware platforms. We are pursuing two parallel but interrelated paths towards this goal:
- Automatic GPU kernel optimization – The CUDA and OpenCL programming models appear similar to typical software coding, but include many implicit assumptions about the underlying hardware execution environment. We are developing a range of automatic kernel optimization flows that take unoptimized GPU kernels and automatically transform them to remove performance bottlenecks induced by hardware limitations such as control flow divergence, register allocation, thread parallelism, and memory coalescing.
- High Level Synthesis – FPGA or ASIC design at the register transfer level is a time-consuming and tedious process; design times for hardware implementations are commonly one to two orders of magnitude greater than for software implementations. HLS tools are currently popular, yet there is still a performance gap between manually created and HLS-created designs, and there can be significant limitations on acceptable software coding styles. We are developing several high level synthesis tool flows to target these limitations. Current work in the FCUDA project maps CUDA kernels to FPGA implementations and has achieved performance parity with up to 90% reduction in power versus the GPU implementation.
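One of the hardware limitations named above, memory coalescing, can be made concrete with a toy model. The sketch below is our own simplified illustration (not the behavior of any specific GPU, nor of our tools): it counts how many 128-byte memory segments a single 32-thread warp touches under two access patterns.

```python
# Simplified, illustrative model of memory coalescing: count how many
# 128-byte memory segments one 32-thread warp touches per access.
# (Real GPU coalescing rules vary by architecture; this is a sketch.)

WARP_SIZE = 32
SEGMENT_BYTES = 128
ELEM_BYTES = 4  # e.g. a 32-bit float

def transactions(addresses):
    """Number of distinct 128-byte segments touched by one warp access."""
    return len({addr // SEGMENT_BYTES for addr in addresses})

# Coalesced: consecutive threads read consecutive elements.
coalesced = [tid * ELEM_BYTES for tid in range(WARP_SIZE)]

# Strided: consecutive threads read elements 32 apart (e.g. column-wise
# access of a row-major array), so each thread lands in its own segment.
strided = [tid * 32 * ELEM_BYTES for tid in range(WARP_SIZE)]

print(transactions(coalesced))  # 1: the whole warp shares one transaction
print(transactions(strided))    # 32: one transaction per thread
```

Under this model the coalesced pattern is served by a single memory transaction while the strided pattern costs one per thread; a 32X difference in memory traffic of this kind is exactly the sort of bottleneck that automatic kernel transformations aim to remove.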
GPU Optimization Challenges and Impact
Optimization of GPU kernels requires exploration of a variety of performance factors, including control flow divergence, register allocation, thread parallelism, and memory coalescing. Furthermore, all of these factors vary depending on both the application code and the underlying GPU architecture. Currently, GPU compilation does not effectively re-target code to multiple platforms: both CUDA and OpenCL kernels are commonly re-written or re-optimized for each platform on which they will execute. The performance difference between optimized and non-optimized code can be significant; it is not uncommon to double performance by optimizing for a new platform. Furthermore, this performance difference is not simply a matter of unused resources. Due to thread organization, memory coalescing, and shared register resources (among other factors), an architecture with more resources may, without optimization, perform worse than an architecture with fewer resources.
The impact of improved GPU kernel optimization is pervasive throughout computing: reduced development effort to port code between GPU architectures, improved performance and power/energy efficiency with little or no additional developer effort, and a reduced barrier to entry to effectively utilize GPU resources. Furthermore, with easy-to-use optimization tools, more application designers can use GPU resources without extensive and detailed GPU programming experience. Thus, improved optimization also increases the audience of application developers who will choose to use GPUs. Finally, these optimization techniques are not limited to GPU architectures; techniques that can effectively re-organize computation to match architectural constraints can also be applied as optimizations for High Level Synthesis, automatic parallelization for multi-core CPUs, and design space exploration for heterogeneous systems.
High Level Synthesis Challenges and Impact
High Level Synthesis is, in general, the process of mapping a high level language description of an algorithm into a register transfer level implementation suitable for hardware synthesis. There are many HLS tools in both industry and academia, but numerous challenges remain. Many HLS tools restrict the input language to a subset of features suitable for hardware, so that the transformation between the high level language and RTL is a relatively straightforward mapping, guided by experienced hardware designers who insert compiler directives to specify how to transform the code. For these HLS tools, the appeal of HLS is increased productivity through a higher level language, but the user base is still expected to be experienced hardware designers who write their high level language code in a hardware-appropriate style. However, the vast majority of application developers have no hardware design experience, and thus most applications that we may wish to accelerate with hardware were not written in a hardware-appropriate style. HLS tools have achieved excellent quality with guided implementations, but relaxing software constraints, automating pipelining, automating data marshalling, exploring the area, power/energy, and performance design space, and technology mapping of software to hardware implementations remain large challenges in HLS tool design.
The impact of improved automation in HLS is expansive: improved automation requires fewer design iterations by the user, further reducing the effort needed to map a software algorithm to a hardware implementation. As software constraints are relaxed and the quality of automated transformations improves, less and less manual work will be required, moving towards the (ideal) goal of creating good hardware implementations from any software input without user interaction. Although this goal is challenging due to the plethora of potential design goals and the numerous ways an algorithm may be implemented in software (including poorly written yet correct software), each step towards this ideal improves the suitability of HLS for use in fast design-time production environments.
Summary of Accomplishments
Over the past three years, the hardware group has published a total of 13 papers. Among these, one received a Best Paper Award (FCCM 2011, 1 out of 120 submissions); one received a Best Paper Nomination (FPT 2011, 4 out of 110 submissions); and two were invited. Four were based on collaborative efforts with other ADSC groups, and one with IHPC. All papers are published in top-tier conferences, except one in an influential workshop and another in a special issue of an open-access journal. Several representative works are summarized below.
- Multilevel Granularity Parallelism Synthesis on FPGAs (FCCM’11): built an efficient design space search heuristic to derive a near-optimal configuration for the FCUDA flow, providing 7X better performance than the initial FCUDA work.
- High Level Synthesis of Multiple Dependent CUDA Kernels for FPGAs (ASPDAC’13): demonstrated that the CUDA language provides an intrinsic advantage over the C language for FPGAs, and that the FCUDA solution provided a 16X energy reduction on FPGA over GPU.
- Improving High Level Synthesis Optimization Opportunity through Polyhedral Transformations (FPGA’13): developed a polyhedral model-based technique that performs loop transformations to achieve a 6X speedup over an existing HLS solution.
- An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization (IPDPS’12): provided a GPU performance model and thread-regrouping algorithms to optimize against control flow divergence, achieving up to 3.2X speedup.
- High Level Synthesis of Stereo Matching: Productivity, Performance, and Software Constraints (FPT’11): performed critical systematic evaluation of the productivity, performance, and common practices of HLS using stereo matching as a driver application.
- Real-time Implementation and Performance Optimization of 3D Sound Localization on GPUs (DATE’12): completed an efficient real-time implementation of 3D sound localization on GPUs; optimized various aspects of the GPU implementation; achieved 501X and 130X speedups compared to single-threaded and multi-threaded CPU implementations, respectively.