In conjunction with the IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
"Temporal
Distribution Based Software Cache Partition To Reduce I-cache
Misses" [PPT] Authors:
XiaoMi
An and Jiqiang Song and Wendong Wang (SimpLight) Abstract:
As
multimedia applications on mobile devices become more
computationally demanding, embedded processors with one level
I-cache become more prevalent, typically with a combined I-cache
and SRAM of 32KB ~ 48KB total size. Code size reduction alone is
no longer adequate for such applications since program sizes are
much larger than the SRAM and I-cache combined. For such systems,
a 3% I-cache miss rate could easily translate to more than 50%
performance degradation. As such, code layout to minimize I-cache
miss is essential to reduce the cycles lost. In this paper, we
propose a new code layout algorithm - temporal distribution based
software cache partition with focus on multimedia code for mobile
devices. This algorithm is built on top of Open64 code reordering
scheme. By characterizing code according to their temporal
reference distribution characteristics, we partition the code and
map them to logically different regions of the cache. Both
capacity and conflict misses can be significantly reduced, and the
cache is more effectively used. The algorithm has been implemented
as a part of our tool-chain for our products. We compare our
results with previous works and show more efficacy in reducing
I-cache misses with our approach, especially for applications
suffering from capacity misses.
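The partition idea can be illustrated with a toy placement routine (this is an invented sketch, not the paper's algorithm): on a direct-mapped I-cache, two routines that alternate frequently in time are placed so that their cache-set ranges do not overlap, eliminating conflict misses between them. The cache geometry and sizes below are made up for the example.

```c
#include <assert.h>

/* Toy direct-mapped I-cache: 64-byte lines, 128 sets (8 KB total). */
#define LINE  64
#define SETS  128

unsigned set_of(unsigned addr) { return (addr / LINE) % SETS; }

/* Place a routine of size2 bytes at the first address >= base whose
   cache-set range is disjoint from that of an already-placed routine
   at address a1 with size1 bytes. Returns base if no slot is found. */
unsigned place_disjoint(unsigned a1, unsigned size1,
                        unsigned base, unsigned size2)
{
    unsigned lo1 = set_of(a1), hi1 = set_of(a1 + size1 - 1);
    for (unsigned a = base; a < base + SETS * LINE; a += LINE) {
        unsigned lo2 = set_of(a), hi2 = set_of(a + size2 - 1);
        /* skip placements whose set range wraps around the cache,
           and accept the first one disjoint from [lo1, hi1] */
        if (lo2 <= hi2 && (hi2 < lo1 || lo2 > hi1))
            return a;
    }
    return base;
}
```

A 1KB routine at address 0 occupies sets 0-15, so a 512-byte routine starting the search at 0 is pushed to address 1024 (sets 16-23), where the two can no longer evict each other.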
|
"Register
Pressure Guided Unroll-and-Jam" [PPT] Authors:
Yin
Ma and Steven Carr (Univ. of Michigan) Abstract:
Unroll-and-jam
is an effective loop optimization that not only improves cache
locality and instruction level parallelism (ILP) but also benefits
other loop optimizations such as scalar replacement. However,
unroll-and-jam increases register pressure, potentially resulting
in performance degradation when the increase in register pressure
causes register spilling. In this paper, we present a low cost
method to predict the register pressure of a loop before applying
unroll-and-jam on high-level source code with the consideration of
the collaborative effects of scalar replacement, general scalar
optimizations, software pipelining and register allocation. We
also describe a performance model that utilizes prediction results
to determine automatically the unroll vector, from a given unroll
space, that achieves the best run-time performance. Our
experiments show that the heuristic prediction algorithm predicts
the floating point register pressure within 3 registers and the
integer register pressure within 4 registers. With this algorithm,
for the Polyhedron benchmark, our register pressure guided
unroll-and-jam improves the overall performance about 2% over the
model in the industry-leading optimizing Open64 backend for both
the x86 and x86-64 architectures.
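For readers unfamiliar with the transformation itself, here is a textbook sketch of unroll-and-jam (not the paper's predictor): the outer loop of a matrix-vector product is unrolled by 2 and the two copies are jammed into one inner loop, so each load of x[j] is reused for two rows. The extra simultaneously-live accumulators (s0, s1) are exactly the kind of register-pressure increase the paper's heuristic tries to predict.

```c
enum { N = 4 };

/* Original loop nest. */
void matvec_base(const int a[N][N], const int x[N], int y[N])
{
    for (int i = 0; i < N; i++) {
        int s = 0;
        for (int j = 0; j < N; j++)
            s += a[i][j] * x[j];
        y[i] = s;
    }
}

/* After unroll-and-jam by 2 on the outer i-loop. */
void matvec_uaj(const int a[N][N], const int x[N], int y[N])
{
    for (int i = 0; i < N; i += 2) {   /* outer loop unrolled by 2  */
        int s0 = 0, s1 = 0;            /* two live accumulators now */
        for (int j = 0; j < N; j++) {  /* jammed inner loop         */
            int xj = x[j];             /* loaded once, used twice   */
            s0 += a[i][j]     * xj;
            s1 += a[i + 1][j] * xj;
        }
        y[i]     = s0;
        y[i + 1] = s1;
    }
}
```

Both versions compute the same result; the jammed version halves the loads of x at the cost of more registers, which is why an unroll factor chosen without a pressure estimate can backfire.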
|
"A
Practical Stride Prefetching Implementation in Global
Optimizer" [PPT] Authors:
Hucheng
Zhou, Xing Zhou, Tianwei Sheng, Dehao Chen, Jianian Yan, Shin-ming
Liu, Wenguang Chen, Weimin Zheng (Tsinghua Univ.) Abstract:
Software
data prefetching is a key technique for hiding memory latencies on
modern high performance processors. Stride memory references are
prime candidates for software prefetches on architectures with,
and without, support for hardware prefetching. Compilers typically
implement software prefetching in the context of loop nest
optimizer (LNO), which focuses on affine references in well formed
loops but miss out on opportunities in C++ STL style codes. In
this paper, we describe a new inductive data prefetching algorithm
implemented in the global optimizer. It bases the prefetching
decisions on demand driven speculative recognition of inductive
expressions, which equals to strongly connected component
detection in data flow graph, thus eliminating the need to invoke
the loop nest optimizer. This technique allows accurate
computation of stride values and exploits phase ordering. We
present an efficient implementation after SSAPRE optimization,
which further includes sophisticated prefetch scheduling and loop
transformations to reduce unnecessary prefetches, such as loop
unrolling and splitting to further reduce unnecessary prefetches.
Our experiments using SPEC2006 on IA-64 show that we have
competitive performance to the LNO based algorithms.
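As a hand-written illustration of what the compiler inserts automatically (the prefetch distance of 16 iterations is an invented tuning value, and `__builtin_prefetch` is the GCC/Clang hint, not part of the paper): once the induction expression `i + stride` is recognized, a non-faulting prefetch for the address several iterations ahead hides the miss latency of the strided loads.

```c
#include <stddef.h>

/* Prefetch hint: a no-op on compilers without the GCC builtin. */
#if defined(__GNUC__)
#  define PREFETCH(p) __builtin_prefetch(p)
#else
#  define PREFETCH(p) ((void)(p))
#endif

long strided_sum(const int *a, size_t n, size_t stride)
{
    long sum = 0;
    for (size_t i = 0; i < n; i += stride) {
        /* fetch the element 16 iterations ahead of the current one */
        PREFETCH(&a[i + 16 * stride]);
        sum += a[i];
    }
    return sum;
}
```

The prefetch is only a hint, so running past the end of the array is harmless; the result of the loop is unchanged with or without it.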
|
"Explore
Be-Nice Instruction Scheduling in Open64 for an Embedded SMT
Processor" [PPT] Authors:
Handong
Ye, Ge Gan, Xiaomi An, Ziang Hu, Guang R. Gao (Univ. of Delaware and SimpLight) Abstract:
A
SMT processor can fetch and issue instructions from multiple
independent hardware threads at every CPU cycle. Therefore,
hardware resources are shared among the concurrently-running
threads at a very fine grain level, which can increase the
utilization of processor pipeline. However, the
concurrently-running threads in a SMT processor may interfere with
each other and stall the CPU pipeline. We call this kind of
pipeline stall inter-thread stall (ITS for short) or thread
interlock. In this paper, we present our study on the ITS problem
on an embedded heterogeneous SMT processor. Our experiments
demonstrate that, for some test cases, 50% of the total pipeline
stalls are caused by ITS. Therefore, we have developed a new
instruction scheduling algorithm called be-nice instruction
scheduling, based on Open64 Global Code Motion, to coordinate the
conflicts between concurrent threads. The instruction scheduler
uses the thread interference information (obtained by profiling)
as heuristics to decrease the number of ITS without sacrificing
the overall CPU performance. The experimental results show that,
for our current test cases the be-nice instruction scheduler can
reduce 15% of the inter-thread stall cycles, and increase the IPC
of the critical thread by 2%-3%. The experiments are performed
using the Open64 compiler infrastructure.
|
"Structure
Layout Optimizations in the Open64 Compiler: Design,
Implementation and Measurements" [PPT] Authors:
Gautam
Chakrabarti and Fred Chow (PathScale) Abstract:
A
common performance problem faced by today's application programs
is poor data locality. Real-world applications are often written
to traverse data structures in a manner that results in data cache
miss overhead. These data structures are often declared as structs
in C and classes in C++. Compiler optimizations try to modify the
layout of such data structures so that they are accessed in a more
cache-friendly manner. These compiler transformations to optimize
data layout include structure splitting, structure peeling, and
structure field reordering. In this paper, we present the design
and implementation of the transformations of structure splitting
and structure peeling, in a commercial version of the Open64
compiler. We present details of how these transformations can be
achieved in the Open64 compiler, as well as the analyses required
to safely and usefully perform the transformation. We present some
performance results from the SPEC CPU2000 and the CPU2006 suites
of benchmarks to demonstrate the effectiveness of our
implementation. Our results show that for some applications these
layout optimizations can provide substantial performance
improvement.
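A hand-applied sketch of structure peeling may help (the paper's compiler performs this automatically after safety analysis; the field names and sizes here are invented): the cold `pad` field is peeled out of the hot traversal's working set, so each cache line holds many more `key` fields.

```c
enum { COUNT = 8 };

/* Original layout: hot and cold fields interleaved in one struct,
   so a key-only traversal strides 64 bytes per element. */
struct node { int key; char pad[60]; };

/* Peeled layout: parallel arrays, hot keys packed contiguously,
   so the same traversal strides only 4 bytes per element. */
struct peeled { int key[COUNT]; char pad[COUNT][60]; };

int sum_keys_orig(const struct node *v)
{
    int s = 0;
    for (int i = 0; i < COUNT; i++) s += v[i].key;
    return s;
}

int sum_keys_peeled(const struct peeled *p)
{
    int s = 0;
    for (int i = 0; i < COUNT; i++) s += p->key[i];
    return s;
}
```

Both traversals compute the same value; the payoff of peeling is purely in cache behavior, which is why the analysis must prove the layout change invisible to the rest of the program.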
|
"NVIDIA's
Experience with Open64" [PPT] Authors:
Mike
Murphy (nVidia) Abstract:
NVIDIA
uses Open64 as part of its CUDA toolchain for general purpose
computing using GPUs. Open64 was chosen for the strength of its
optimizations, and its usage has been a success, though there have
been some difficulties. This paper will give an overview of its
usage, the modifications that were made, and some thoughts about
future usage.
|
"Quantitative
approach to ISA design and compilation for code size
reduction" [PPT] Authors:
K.
M. Lo and Lin Ma (SimpLight) Abstract:
In
this paper, an efficient code size optimization instruction set
architecture targeting embedded telecommunication applications is
introduced. Nowadays, mixed 16-bit and 32-bit size instruction set
approaches are commonly used to achieve code size reduction while
minimizing performance loss. They are usually designed with some
restrictions such as reducing the number of accessible registers,
mode switching, or special hardware logic handling. The approach
starts with a common, basic RISC ISA and a re-targetable high
performance compiler. The Open64 compiler was chosen for its
machine independent optimization so that once retargeted, the
generated code will be of high performance quality. Once
retargeted, we start our ISA compression design based on
statistics collected from the code generated. By judicious
selection from actual instructions generated, a high code
compression rate is achieved without adding restrictions to the
number of registers used and hardware implementation. Furthermore,
this approach does not introduce any noticeable performance
degradation due to the mixed 32/16-bit ISA compared to the full
32-bit ISA.
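The statistics-driven selection step can be sketched with a toy model (the numbers and the greedy rule are invented for illustration): given how often each static instruction occurs in the generated code, the k most frequent ones receive 16-bit encodings and the rest stay 32-bit, and the function returns the resulting total code size in bytes.

```c
/* Total code size in bytes if the k highest-frequency instructions
   (out of n) get 2-byte encodings and the rest keep 4 bytes.
   freq[i] is the static occurrence count of instruction i. */
unsigned code_size(const unsigned *freq, int n, int k)
{
    unsigned bytes = 0;
    for (int i = 0; i < n; i++) {
        /* rank instruction i by frequency (ties broken by index) */
        int rank = 0;
        for (int j = 0; j < n; j++)
            if (freq[j] > freq[i] || (freq[j] == freq[i] && j < i))
                rank++;
        bytes += freq[i] * (rank < k ? 2u : 4u);
    }
    return bytes;
}
```

With counts {100, 10, 1} and one 16-bit slot, the hot instruction alone shrinks the total from 444 to 244 bytes, which is the intuition behind selecting compressed encodings from actual generated-code statistics rather than fixing them a priori.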
|
"An
Open64-Based Framework Tool for Analyzing Parallel
Applications" [PPT] Authors:
Laksono
Adhianto and Barbara Chapman (Univ. of Houston) Abstract:
We
propose an infrastructure based on the Open64 compiler for
analyzing, modeling and optimizing MPI and/or OpenMP applications.
The framework consists of four main parts: a compiler,
microbenchmarks, a user interface and a runtime library. The
compiler generates the application signature containing a portable
representation of the application structure that may influence
program performance. Microbenchmarks are needed to capture the
system profile, including MPI latency and OpenMP overhead. The
user interface, based on Eclipse, is used to drive code
transformation, such as OpenMP code generation. And lastly, our
runtime library can be used to balance the MPI workload, thus
reducing load imbalance. In this paper we show that our framework
can analyze and model MPI and/or OpenMP applications. We also
demonstrate that it can be used for program understanding of large
scale complex applications.
|
"Development
of an Efficient DSP Compiler Based on Open64" [PPT] Authors:
Subrato
K De, Anshuman Dasgupta, Sundeep Kushwaha, Tony Linthicum, Susan
Brownhill, Sergei Larin, Taylor Simpson (Qualcomm) Abstract:
In this
paper we describe the development of an efficient compiler for
digital signal processors (DSP) based on the Open64 compiler
infrastructure. We have focused on the state of the art advanced
DSP architectures that allow high degree of instruction level
parallelism, has hardware loops, address-generation units,
DSP-specific addressing features (e.g., circular and
bit-reversed), and many specialized instructions. We discuss the
enhancements made to the Open64 compiler infrastructure to exploit
the architectural features of contemporary DSPs.
|
"An
Open64-based Compiler Approach to Performance Prediction and
Performance Sensitivity Analysis for Scientific Codes" [PPT] Authors:
Jeremy
Abramson and Pedro Diniz (USC and IST) Abstract:
The lack of
tools that provide adequate feedback at a level of abstraction
programmers can relate to makes the problem of performance
prediction and portability in today's or tomorrow's machines
extremely difficult. We describe an Open64-based compiler approach
to the problem of performance prediction and architecture
sensitivity analysis done at a source-level. Our analysis tool
extracts the computation's high-level dataflow-graph from the
code's WHIRL representation, and uses source-level data access
patterns information as well as register needs to derive
performance bounds for the program under various architectural
scenarios. The end result is a very fast performance prediction as
well as insight into where performance bottlenecks are. We have
experimented with a real code engineers and scientists use in
practice - a sparse matrix-vector multiplication kernel. The
results correlate very well with the execution of the code on a
real machine and allow programmers to understand the performance
bottlenecks without having to engage in very low-level
instrumentation analysis.
|
"Implementing
an Open64-based Tool for Improving the Performance of MPI
Programs" [PPT] Authors:
Anthony
Danalis and Lori Pollock and Martin Swany and John
Cavazos (Univ. of Delaware) Abstract: While
MPI parallel programming has become the primary approach to
achieving performance gains in cluster computing, the
communication overhead inherent in a cluster environment continues
to be a major obstacle. A promising approach to improve
performance is the use of computation-communication overlapping,
which is enabled by communication libraries that utilize Remote
Data Memory Access (RDMA), either directly in the form of
one-sided communication, or via two-sided communication over a low
overhead rendezvous protocol. To spare the scientific programmer
from learning how to utilize these libraries to effectively
maximize computation-communication overlap, we have developed a
tool that automatically transforms an MPI parallel program to a
semantically equivalent program with selected data exchange calls
in MPI replaced to leverage an RDMA-targeted communication
library. In this paper, we describe the implementation of this MPI
program transformer using the Open64 compiler.
|
"Extending
Global Optimizations in the OpenUH Compiler for OpenMP" [PPT] Authors:
Lei
Huang, Deepak Eachempati, Marcus W. Hervey, Barbara
Chapman (Univ. of Houston) Abstract: This
paper presents our design and implementation of a framework for
analyzing and optimizing OpenMP programs within the OpenUH
compiler, which is based on Open64. The paper describes the
existing analyses and optimizations in OpenUH, and explains why
the compiler may not apply classical optimizations to OpenMP
programs directly. It then presents an enhanced compiler framework
including Parallel Control Flow Graph and Concurrent SSA that
represent both intra-thread and inter-thread data flow. With this
framework, the compiler is able to perform traditional compiler
optimizations on OpenMP programs, and it further increases the
opportunities for more aggressive optimizations for OpenMP. We
describe our current implementation in the OpenUH compiler and use
a code example to demonstrate the optimizations enabled by the new
framework. This framework may lead to a significant improvement in
the performance of the translated code.
|
"Feedback-Directed
Optimizations with Estimated Edge Profiles from Hardware Event
Sampling" [PPT] Authors:
Vinodha
Ramasamy and Dehao Chen and Robert Hundt and Wenguang
Chen (Google and Tsinghua Univ.) Abstract: Traditional
feedback-directed optimization (FDO) uses static instrumentation
to collect edge profiles. Although this method has shown good
application performance gains, it is not commonly used in practice
due to the high runtime overhead of profile collection, the
tedious dual-compile usage model, and difficulties in generating
representative training data sets. In this paper, we show that
edge frequency estimates can be successfully constructed with
heuristics using profile data collected by sampling of hardware
events, incurring low runtime overhead (e.g., less then 2%), and
requiring no instrumentation, yet achieving competetive
performance gains. Our initial results show a 3-4% performance
gain on the SPEC C benchmarks.
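A minimal sketch of the estimation idea (not the paper's full heuristic): in a diamond-shaped CFG `entry -> {then, else} -> exit`, hardware-event samples yield basic-block execution counts, and flow conservation (a block's count equals the sum of its incoming edge counts) then recovers the two branch edge frequencies with no instrumentation at all. The struct and function names are invented for the example.

```c
/* Estimated frequencies of the two branch edges out of 'entry'. */
struct diamond_edges { unsigned to_then, to_else; };

/* entry_cnt and then_cnt are sampled basic-block counts.
   'then' has a single predecessor (entry), so the entry->then edge
   count equals then_cnt; the remaining flow must take entry->else. */
struct diamond_edges estimate(unsigned entry_cnt, unsigned then_cnt)
{
    struct diamond_edges e;
    e.to_then = then_cnt;
    e.to_else = entry_cnt - then_cnt;
    return e;
}
```

Real CFGs need iterative propagation and smoothing of noisy sample counts, but the core of the approach is this same conservation constraint applied block by block.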
|
The workshop provides a forum for discussing your findings and experiences with a broad range of Open64 researchers and developers. It is also the main opportunity for participants to exchange their expectations and wishes for the future development of Open64.
Paper Submission and Guidelines:
Chair: Guang R. Gao (Univ. of Delaware)
Co-Chairs: Suneel Jain (HP) and Barbara Chapman (Univ. of Houston)