In conjunction with the IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
"Temporal
Distribution Based Software Cache Partition To Reduce I-cache
Misses" [PPT] Authors:
XiaoMi
An and Jiqiang Song and Wendong Wang (SimpLight) Abstract:
As
multimedia applications on mobile devices become more
computationally demanding, embedded processors with one level
I-cache become more prevalent, typically with a combined I-cache
and SRAM of 32KB ~ 48KB total size. Code size reduction alone is
no longer adequate for such applications since program sizes are
much larger than the SRAM and I-cache combined. For such systems,
a 3% I-cache miss rate could easily translate to more than 50%
performance degradation. As such, code layout to minimize I-cache
miss is essential to reduce the cycles lost. In this paper, we
propose a new code layout algorithm - temporal distribution based
software cache partition with focus on multimedia code for mobile
devices. This algorithm is built on top of Open64 code reordering
scheme. By characterizing code according to their temporal
reference distribution characteristics, we partition the code and
map them to logically different regions of the cache. Both
capacity and conflict misses can be significantly reduced, and the
cache is more effectively used. The algorithm has been implemented
as a part of our tool-chain for our products. We compare our
results with previous works and show more efficacy in reducing
I-cache misses with our approach, especially for applications
suffering from capacity misses.
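The partition idea can be illustrated with a toy placement routine (this is an invented sketch, not the paper's algorithm): on a direct-mapped I-cache, two routines that alternate frequently in time are placed so that their cache-set ranges do not overlap, eliminating conflict misses between them. The cache geometry and sizes below are made up for the example.

```c
#include <assert.h>

/* Toy direct-mapped I-cache: 64-byte lines, 128 sets (8 KB total). */
#define LINE  64
#define SETS  128

unsigned set_of(unsigned addr) { return (addr / LINE) % SETS; }

/* Place a routine of size2 bytes at the first address >= base whose
   cache-set range is disjoint from that of an already-placed routine
   at address a1 with size1 bytes. Returns base if no slot is found. */
unsigned place_disjoint(unsigned a1, unsigned size1,
                        unsigned base, unsigned size2)
{
    unsigned lo1 = set_of(a1), hi1 = set_of(a1 + size1 - 1);
    for (unsigned a = base; a < base + SETS * LINE; a += LINE) {
        unsigned lo2 = set_of(a), hi2 = set_of(a + size2 - 1);
        /* skip placements whose set range wraps around the cache,
           and accept the first one disjoint from [lo1, hi1] */
        if (lo2 <= hi2 && (hi2 < lo1 || lo2 > hi1))
            return a;
    }
    return base;
}
```

A 1KB routine at address 0 occupies sets 0-15, so a 512-byte routine starting the search at 0 is pushed to address 1024 (sets 16-23), where the two can no longer evict each other.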
|
"Register
Pressure Guided Unroll-and-Jam" [PPT] Authors:
Yin
Ma and Steven Carr (Univ. of Michigan) Abstract:
Unroll-and-jam
is an effective loop optimization that not only improves cache
locality and instruction level parallelism (ILP) but also benefits
other loop optimizations such as scalar replacement. However,
unroll-and-jam increases register pressure, potentially resulting
in performance degradation when the increase in register pressure
causes register spilling. In this paper, we present a low cost
method to predict the register pressure of a loop before applying
unroll-and-jam on high-level source code with the consideration of
the collaborative effects of scalar replacement, general scalar
optimizations, software pipelining and register allocation. We
also describe a performance model that utilizes prediction results
to determine automatically the unroll vector, from a given unroll
space, that achieves the best run-time performance. Our
experiments show that the heuristic prediction algorithm predicts
the floating point register pressure within 3 registers and the
integer register pressure within 4 registers. With this algorithm,
for the Polyhedron benchmark, our register pressure guided
unroll-and-jam improves the overall performance about 2% over the
model in the industry-leading optimizing Open64 backend for both
the x86 and x86-64 architectures.
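For readers unfamiliar with the transformation itself, here is a textbook sketch of unroll-and-jam (not the paper's predictor): the outer loop of a matrix-vector product is unrolled by 2 and the two copies are jammed into one inner loop, so each load of x[j] is reused for two rows. The extra simultaneously-live accumulators (s0, s1) are exactly the kind of register-pressure increase the paper's heuristic tries to predict.

```c
enum { N = 4 };

/* Original loop nest. */
void matvec_base(const int a[N][N], const int x[N], int y[N])
{
    for (int i = 0; i < N; i++) {
        int s = 0;
        for (int j = 0; j < N; j++)
            s += a[i][j] * x[j];
        y[i] = s;
    }
}

/* After unroll-and-jam by 2 on the outer i-loop. */
void matvec_uaj(const int a[N][N], const int x[N], int y[N])
{
    for (int i = 0; i < N; i += 2) {   /* outer loop unrolled by 2  */
        int s0 = 0, s1 = 0;            /* two live accumulators now */
        for (int j = 0; j < N; j++) {  /* jammed inner loop         */
            int xj = x[j];             /* loaded once, used twice   */
            s0 += a[i][j]     * xj;
            s1 += a[i + 1][j] * xj;
        }
        y[i]     = s0;
        y[i + 1] = s1;
    }
}
```

Both versions compute the same result; the jammed version halves the loads of x at the cost of more registers, which is why an unroll factor chosen without a pressure estimate can backfire.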
|
"A
Practical Stride Prefetching Implementation in Global
Optimizer" [PPT] Authors:
Hucheng
Zhou, Xing Zhou, Tianwei Sheng, Dehao Chen, Jianian Yan, Shin-ming
Liu, Wenguang Chen, Weimin Zheng (Tsinghua Univ.) Abstract:
Software
data prefetching is a key technique for hiding memory latencies on
modern high performance processors. Stride memory references are
prime candidates for software prefetches on architectures with,
and without, support for hardware prefetching. Compilers typically
implement software prefetching in the context of loop nest
optimizer (LNO), which focuses on affine references in well formed
loops but miss out on opportunities in C++ STL style codes. In
this paper, we describe a new inductive data prefetching algorithm
implemented in the global optimizer. It bases the prefetching
decisions on demand driven speculative recognition of inductive
expressions, which equals to strongly connected component
detection in data flow graph, thus eliminating the need to invoke
the loop nest optimizer. This technique allows accurate
computation of stride values and exploits phase ordering. We
present an efficient implementation after SSAPRE optimization,
which further includes sophisticated prefetch scheduling and loop
transformations to reduce unnecessary prefetches, such as loop
unrolling and splitting to further reduce unnecessary prefetches.
Our experiments using SPEC2006 on IA-64 show that we have
competitive performance to the LNO based algorithms.
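As a hand-written illustration of what the compiler inserts automatically (the prefetch distance of 16 iterations is an invented tuning value, and `__builtin_prefetch` is the GCC/Clang hint, not part of the paper): once the induction expression `i + stride` is recognized, a non-faulting prefetch for the address several iterations ahead hides the miss latency of the strided loads.

```c
#include <stddef.h>

/* Prefetch hint: a no-op on compilers without the GCC builtin. */
#if defined(__GNUC__)
#  define PREFETCH(p) __builtin_prefetch(p)
#else
#  define PREFETCH(p) ((void)(p))
#endif

long strided_sum(const int *a, size_t n, size_t stride)
{
    long sum = 0;
    for (size_t i = 0; i < n; i += stride) {
        /* fetch the element 16 iterations ahead of the current one */
        PREFETCH(&a[i + 16 * stride]);
        sum += a[i];
    }
    return sum;
}
```

The prefetch is only a hint, so running past the end of the array is harmless; the result of the loop is unchanged with or without it.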
|
"Explore
Be-Nice Instruction Scheduling in Open64 for an Embedded SMT
Processor" [PPT] Authors:
Handong
Ye, Ge Gan, Xiaomi An, Ziang Hu, Guang R. Gao (Univ. of Delaware and SimpLight) Abstract:
A
SMT processor can fetch and issue instructions from multiple
independent hardware threads at every CPU cycle. Therefore,
hardware resources are shared among the concurrently-running
threads at a very fine grain level, which can increase the
utilization of processor pipeline. However, the
concurrently-running threads in a SMT processor may interfere with
each other and stall the CPU pipeline. We call this kind of
pipeline stall inter-thread stall (ITS for short) or thread
interlock. In this paper, we present our study on the ITS problem
on an embedded heterogeneous SMT processor. Our experiments
demonstrate that, for some test cases, 50% of the total pipeline
stalls are caused by ITS. Therefore, we have developed a new
instruction scheduling algorithm called be-nice instruction
scheduling, based on Open64 Global Code Motion, to coordinate the
conflicts between concurrent threads. The instruction scheduler
uses the thread interference information (obtained by profiling)
as heuristics to decrease the number of ITS without sacrificing
the overall CPU performance. The experimental results show that,
for our current test cases the be-nice instruction scheduler can
reduce 15% of the inter-thread stall cycles, and increase the IPC
of the critical thread by 2%-3%. The experiments are performed
using the Open64 compiler infrastructure.
|
"Structure
Layout Optimizations in the Open64 Compiler: Design,
Implementation and Measurements" [PPT] Authors:
Gautam
Chakrabarti and Fred Chow (PathScale) Abstract:
A
common performance problem faced by today's application programs
is poor data locality. Real-world applications are often written
to traverse data structures in a manner that results in data cache
miss overhead. These data structures are often declared as structs
in C and classes in C++. Compiler optimizations try to modify the
layout of such data structures so that they are accessed in a more
cache-friendly manner. These compiler transformations to optimize
data layout include structure splitting, structure peeling, and
structure field reordering. In this paper, we present the design
and implementation of the transformations of structure splitting
and structure peeling, in a commercial version of the Open64
compiler. We present details of how these transformations can be
achieved in the Open64 compiler, as well as the analyses required
to safely and usefully perform the transformation. We present some
performance results from the SPEC CPU2000 and the CPU2006 suites
of benchmarks to demonstrate the effectiveness of our
implementation. Our results show that for some applications these
layout optimizations can provide substantial performance
improvement.
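A hand-applied sketch of structure peeling may help (the paper's compiler performs this automatically after safety analysis; the field names and sizes here are invented): the cold `pad` field is peeled out of the hot traversal's working set, so each cache line holds many more `key` fields.

```c
enum { COUNT = 8 };

/* Original layout: hot and cold fields interleaved in one struct,
   so a key-only traversal strides 64 bytes per element. */
struct node { int key; char pad[60]; };

/* Peeled layout: parallel arrays, hot keys packed contiguously,
   so the same traversal strides only 4 bytes per element. */
struct peeled { int key[COUNT]; char pad[COUNT][60]; };

int sum_keys_orig(const struct node *v)
{
    int s = 0;
    for (int i = 0; i < COUNT; i++) s += v[i].key;
    return s;
}

int sum_keys_peeled(const struct peeled *p)
{
    int s = 0;
    for (int i = 0; i < COUNT; i++) s += p->key[i];
    return s;
}
```

Both traversals compute the same value; the payoff of peeling is purely in cache behavior, which is why the analysis must prove the layout change invisible to the rest of the program.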
|
"NVIDIA's
Experience with Open64" [PPT] Authors:
Mike
Murphy (nVidia) Abstract:
NVIDIA
uses Open64 as part of its CUDA toolchain for general purpose
computing using GPUs. Open64 was chosen for the strength of its
optimizations, and its usage has been a success, though there have
been some difficulties. This paper will give an overview of its
usage, the modifications that were made, and some thoughts about
future usage.
|
"Quantitative
approach to ISA design and compilation for code size
reduction" [PPT] Authors:
K.
M. Lo and Lin Ma (SimpLight) Abstract:
In
this paper, an efficient code size optimization instruction set
architecture targeting embedded telecommunication applications is
introduced. Nowadays, mixed 16-bit and 32-bit size instruction set
approaches are commonly used to achieve code size reduction while
minimizing performance loss. They are usually designed with some
restrictions such as reducing the number of accessible registers,
mode switching, or special hardware logic handling. The approach
starts with a common, basic RISC ISA and a re-targetable high
performance compiler. The Open64 compiler was chosen for its
machine independent optimization so that once retargeted, the
generated code will be of high performance quality. Once
retargeted, we start our ISA compression design based on
statistics collected from the code generated. By judicious
selection from actual instructions generated, a high code
compression rate is achieved without adding restrictions to the
number of registers used and hardware implementation. Furthermore,
this approach does not introduce any noticeable performance
degradation due to the mixed 32/16-bit ISA compared to the full
32-bit ISA.
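The statistics-driven selection step can be sketched with a toy model (the numbers and the greedy rule are invented for illustration): given how often each static instruction occurs in the generated code, the k most frequent ones receive 16-bit encodings and the rest stay 32-bit, and the function returns the resulting total code size in bytes.

```c
/* Total code size in bytes if the k highest-frequency instructions
   (out of n) get 2-byte encodings and the rest keep 4 bytes.
   freq[i] is the static occurrence count of instruction i. */
unsigned code_size(const unsigned *freq, int n, int k)
{
    unsigned bytes = 0;
    for (int i = 0; i < n; i++) {
        /* rank instruction i by frequency (ties broken by index) */
        int rank = 0;
        for (int j = 0; j < n; j++)
            if (freq[j] > freq[i] || (freq[j] == freq[i] && j < i))
                rank++;
        bytes += freq[i] * (rank < k ? 2u : 4u);
    }
    return bytes;
}
```

With counts {100, 10, 1} and one 16-bit slot, the hot instruction alone shrinks the total from 444 to 244 bytes, which is the intuition behind selecting compressed encodings from actual generated-code statistics rather than fixing them a priori.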
|
"An
Open64-Based Framework Tool for Analyzing Parallel
Applications" [PPT] Authors:
Laksono
Adhianto and Barbara Chapman (Univ. of Houston) Abstract:
We
propose an infrastructure based on the Open64 compiler for
analyzing, modeling and optimizing MPI and/or OpenMP applications.
The framework consists of four main parts: a compiler,
microbenchmarks, a user interface and a runtime library. The
compiler generates the application signature containing a portable
representation of the application structure that may influence
program performance. Microbenchmarks are needed to capture the
system profile, including MPI latency and OpenMP overhead. The
user interface, based on Eclipse, is used to drive code
transformation, such as OpenMP code generation. And lastly, our
runtime library can be used to balance the MPI workload, thus
reducing load imbalance. In this paper we show that our framework
can analyze and model MPI and/or OpenMP applications. We also
demonstrate that it can be used for program understanding of large
scale complex applications.
|
"Development
of an Efficient DSP Compiler Based on Open64" [PPT] Authors:
Subrato
K De, Anshuman Dasgupta, Sundeep Kushwaha, Tony Linthicum, Susan
Brownhill, Sergei Larin, Taylor Simpson (Qualcomm) Abstract:
In this
paper we describe the development of an efficient compiler for
digital signal processors (DSP) based on the Open64 compiler
infrastructure. We have focused on the state of the art advanced
DSP architectures that allow high degree of instruction level
parallelism, has hardware loops, address-generation units,
DSP-specific addressing features (e.g., circular and
bit-reversed), and many specialized instructions. We discuss the
enhancements made to the Open64 compiler infrastructure to exploit
the architectural features of contemporary DSPs.
|
"An
Open64-based Compiler Approach to Performance Prediction and
Performance Sensitivity Analysis for Scientific Codes" [PPT] Authors:
Jeremy
Abramson and Pedro Diniz (USC and IST) Abstract:
The lack of
tools that provide adequate feedback at a level of abstraction
programmers can relate to makes the problem of performance
prediction and portability in today's or tomorrow's machines
extremely difficult. We describe an Open64-based compiler approach
to the problem of performance prediction and architecture
sensitivity analysis done at a source-level. Our analysis tool
extracts the computation's high-level dataflow-graph from the
code's WHIRL representation, and uses source-level data access
patterns information as well as register needs to derive
performance bounds for the program under various architectural
scenarios. The end result is a very fast performance prediction as
well as insight into where performance bottlenecks are. We have
experimented with a real code engineers and scientists use in
practice - a sparse matrix-vector multiplication kernel. The
results correlate very well with the execution of the code on a
real machine and allow programmers to understand the performance
bottlenecks without having to engage in very low-level
instrumentation analysis.
|
"Implementing
an Open64-based Tool for Improving the Performance of MPI
Programs" [PPT] Authors:
Anthony
Danalis and Lori Pollock and Martin Swany and John
Cavazos (Univ. of Delaware) Abstract: While
MPI parallel programming has become the primary approach to
achieving performance gains in cluster computing, the
communication overhead inherent in a cluster environment continues
to be a major obstacle. A promising approach to improve
performance is the use of computation-communication overlapping,
which is enabled by communication libraries that utilize Remote
Data Memory Access (RDMA), either directly in the form of
one-sided communication, or via two-sided communication over a low
overhead rendezvous protocol. To spare the scientific programmer
from learning how to utilize these libraries to effectively
maximize computation-communication overlap, we have developed a
tool that automatically transforms an MPI parallel program to a
semantically equivalent program with selected data exchange calls
in MPI replaced to leverage an RDMA-targeted communication
library. In this paper, we describe the implementation of this MPI
program transformer using the Open64 compiler.
|
"Extending
Global Optimizations in the OpenUH Compiler for OpenMP" [PPT] Authors:
Lei
Huang, Deepak Eachempati, Marcus W. Hervey, Barbara
Chapman (Univ. of Houston) Abstract: This
paper presents our design and implementation of a framework for
analyzing and optimizing OpenMP programs within the OpenUH
compiler, which is based on Open64. The paper describes the
existing analyses and optimizations in OpenUH, and explains why
the compiler may not apply classical optimizations to OpenMP
programs directly. It then presents an enhanced compiler framework
including Parallel Control Flow Graph and Concurrent SSA that
represent both intra-thread and inter-thread data flow. With this
framework, the compiler is able to perform traditional compiler
optimizations on OpenMP programs, and it further increases the
opportunities for more aggressive optimizations for OpenMP. We
describe our current implementation in the OpenUH compiler and use
a code example to demonstrate the optimizations enabled by the new
framework. This framework may lead to a significant improvement in
the performance of the translated code.
|
"Feedback-Directed
Optimizations with Estimated Edge Profiles from Hardware Event
Sampling" [PPT] Authors:
Vinodha
Ramasamy and Dehao Chen and Robert Hundt and Wenguang
Chen (Google and Tsinghua Univ.) Abstract: Traditional
feedback-directed optimization (FDO) uses static instrumentation
to collect edge profiles. Although this method has shown good
application performance gains, it is not commonly used in practice
due to the high runtime overhead of profile collection, the
tedious dual-compile usage model, and difficulties in generating
representative training data sets. In this paper, we show that
edge frequency estimates can be successfully constructed with
heuristics using profile data collected by sampling of hardware
events, incurring low runtime overhead (e.g., less then 2%), and
requiring no instrumentation, yet achieving competetive
performance gains. Our initial results show a 3-4% performance
gain on the SPEC C benchmarks.
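A minimal sketch of the estimation idea (not the paper's full heuristic): in a diamond-shaped CFG `entry -> {then, else} -> exit`, hardware-event samples yield basic-block execution counts, and flow conservation (a block's count equals the sum of its incoming edge counts) then recovers the two branch edge frequencies with no instrumentation at all. The struct and function names are invented for the example.

```c
/* Estimated frequencies of the two branch edges out of 'entry'. */
struct diamond_edges { unsigned to_then, to_else; };

/* entry_cnt and then_cnt are sampled basic-block counts.
   'then' has a single predecessor (entry), so the entry->then edge
   count equals then_cnt; the remaining flow must take entry->else. */
struct diamond_edges estimate(unsigned entry_cnt, unsigned then_cnt)
{
    struct diamond_edges e;
    e.to_then = then_cnt;
    e.to_else = entry_cnt - then_cnt;
    return e;
}
```

Real CFGs need iterative propagation and smoothing of noisy sample counts, but the core of the approach is this same conservation constraint applied block by block.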
|
The workshop provides a forum for discussing your findings and experiences with a broad range of Open64 researchers and developers. It is also the main opportunity for participants to exchange their expectations and wishes for the future development of Open64.
Paper Submission and Guidelines:
Chair: Guang R. Gao (Univ. of Delaware)
Co-Chairs: Suneel Jain (HP) and Barbara Chapman (Univ. of Houston)