Iterative Optimization in the Polyhedral Model: One-Dimensional Scheduling case
Abstract
Emerging micro-processors introduce unprecedented parallel computing capabilities and deeper memory hierarchies, increasing the importance of loop transformations in optimizing compilers. Because compiler heuristics rely on simplistic performance models, and because they are bound to a limited set of transformation sequences, they only uncover a fraction of the peak performance on typical benchmarks. Iterative optimization is a maturing framework addressing these limitations, but so far it has not been successfully applied to complex loop transformation sequences because of the combinatorics of the optimization search space.
Introduction
Feedback-directed and iterative optimizations have become essential defenses in optimizing compilers' fight to stay competitive with hand-optimized code: they freshen the static information flow with dynamic properties, adapting to complex architecture behaviors and compensating for the inaccuracy of single-shot, model-based heuristics. This holds whether one considers a single application (for client-side iterative optimization) or a reference benchmark suite (for in-house compiler tuning).
Related Work
Iterative compilation aims at selecting the best parameterization of the optimization chain, for a given program or for a given application domain. It typically affects optimization flags (switches), parameters (e.g., loop unrolling, tiling), and the phase ordering [6, 5, 3, 20, 1, 25].
This paper studies a different search space: instead of relying on compiler options to transform the program, we statically construct a set of candidate program versions, considering the distinct results of all legal transformations in a particular class. Our method is more tightly coupled with the compiler transformations and is thus complementary to other forms of iterative optimization. Furthermore, it is completely independent of the compiler back-end.
Because iterative compilation relies on multiple, costly "runs" (each including compilation and execution), the current emphasis is on reducing the profiling cost of individual program versions [20, 13], or the total number of runs, using, e.g., genetic algorithms [19] or machine learning [31, 1]. Our meta-heuristic is tuned to the rich mathematical properties of the underlying polyhedral model of the search space, and exploits the regularity of this model to reduce the number of runs.
Generating a Variety of Program Versions
Program restructuring is usually broken into sequences of primitive transformations. In the case of loops, typical primitives are loop fusion, loop tiling, and loop interchange [2]. This approach has severe drawbacks. First, it is difficult to decide the completeness of a set of directives and to understand their interactions. Many different sequences lead to the same target code, and it is typically impossible to build an exhaustive set of candidate transformed programs in this way. Next, each basic transformation comes with its own application criteria, such as legality checks or pattern-matching rules. For instance, it is unlikely that a compiler would apply loop fusion if the bounds of the original loops do not match, even though matching bounds may be the result of an earlier transformation in the sequence. Finally, long sequences of transformations contribute to code size explosion, polluting the instruction cache and potentially forbidding further compiler optimizations.
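
As a minimal illustration (a toy Python sketch, not drawn from the paper), loop fusion merges two loops scanning the same iteration range; when the ranges differ, a pattern-matching criterion rejects the fusion even though an earlier shift or peel could have made the ranges match:

    # Two loops over the same range: fusion applies and improves locality on a.
    N = 1024
    b = list(range(N))
    a = [0] * N
    c = [0] * N

    # Before fusion: two separate scans over a.
    for i in range(N):
        a[i] = b[i] + 1
    for i in range(N):
        c[i] = 2 * a[i]

    # After fusion: one scan; a[i] is still in cache when c[i] is computed.
    for i in range(N):
        a[i] = b[i] + 1
        c[i] = 2 * a[i]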
Building The Search Space
In general, restructuring a program changes its semantics. When a transformation preserves the original program semantics, we say that it is legal. Previous work on iterative optimization using a polyhedral representation ensures this property by checking, after computing a transformation, whether it is legal or not [22, 23] (non-iterative optimization algorithms either use a similar approach [18] or consider programs simple enough that nearly every transformation is possible [36]). This leads to huge search spaces, since every illegal or redundant solution has to be checked, and to a significant computational overhead for each legality check (most of which merely conclude that the transformation must not be applied). Such an approach cannot scale, since the number of redundant and/or illegal transformations grows exponentially faster, with the size of the input program, than the number of distinct, legal transformations.
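
To make this concrete, the following toy Python sketch (the statement, dependence, and coefficient bounds are assumptions for illustration) constructs only the legal one-dimensional affine schedules theta(i) = a*i + b, instead of generating every schedule and filtering the illegal ones afterwards:

    from itertools import product

    # One statement S(i), i in [0, N-1], with a flow dependence S(i) -> S(i+1).
    # Legality of a 1-D schedule theta requires the source to execute strictly
    # before the target: theta(i+1) - theta(i) >= 1, which means a >= 1 here.
    N = 8
    legal = []
    for a, b in product(range(-2, 3), repeat=2):
        if all((a * (i + 1) + b) - (a * i + b) >= 1 for i in range(N - 1)):
            legal.append((a, b))

    print(legal)  # only the schedules with a >= 1 survive, for any b

In the polyhedral setting such legality conditions are expressed once as affine constraints on the schedule coefficients, so the legal space is characterized statically rather than checked point by point.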
Exhaustive Space Scanning
Because our search space contains only legal solutions, the number of solutions for our various kernel benchmarks is small enough to allow an exhaustive search in a reasonable amount of time. Figure 4 summarizes our results. The Benchmark column states the original input program; the Compiler column shows the compiler used to build each program version of the search space (the GCC version was 4.1.1); the Options column specifies the compiler options; the Parameters column gives the values of the global parameters (for instance, the array sizes); the Improved column shows the number of transformations that achieve better performance than the original program (the total number of versions is shown in Figure 3); the ID best column gives the identifier of the best solution; lastly, the Speedup column gives the speedup achieved by the best solution with respect to the original program's performance.
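
The scanning driver itself is straightforward; the sketch below is a hypothetical Python harness (the file names, compiler flags, and timing scheme are all assumptions, not the paper's tool) that compiles and times every candidate version to fill in the ID best and Speedup entries of such a table:

    import subprocess, time

    def run_time(src):
        """Compile one candidate with GCC -O3 and return its wall-clock time."""
        subprocess.run(["gcc", "-O3", src, "-o", "kernel"], check=True)
        start = time.perf_counter()
        subprocess.run(["./kernel"], check=True)
        return time.perf_counter() - start

    baseline = run_time("original.c")                  # reference program
    versions = ["version_0.c", "version_1.c"]          # generated candidates
    best_id, best_time = None, float("inf")
    for vid, src in enumerate(versions):
        t = run_time(src)
        if t < best_time:
            best_id, best_time = vid, t

    print("ID best =", best_id, "speedup = %.2fx" % (baseline / best_time))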
Future Work
Affine multidimensional schedules. It is always possible to find a multidimensional affine schedule for a SCoP, while a one-dimensional schedule may not exist. Unfortunately, the generalization of our method to multidimensional schedules leads to a well-known combinatorial barrier: while there is exactly one way to choose the set of dependences to satisfy in the one-dimensional case (they must all be satisfied in one dimension, i.e., in one set), there is a combinatorial number of ways to partition the dependences into sets as soon as there is more than one dimension. Feautrier proposed a greedy algorithm that satisfies a maximal set of dependences at a given depth, incrementing the depth while unresolved dependences remain [11]. This yields the minimal sequential depth of the schedule [33], but the combinatorics remains if we want to explore all the legal schedules.
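
As a small illustration, here is a toy Python sketch of that greedy scheme (the dependence names and candidate dimensions are invented for the example, not taken from [11]):

    # Toy abstraction: each candidate schedule dimension is tagged with the
    # set of dependences it can strongly satisfy; the greedy scheme picks,
    # at each depth, the dimension resolving the most remaining dependences.
    candidates = {
        "dim_a": {"d1", "d2"},
        "dim_b": {"d3"},
        "dim_c": {"d2", "d3"},
    }
    remaining = {"d1", "d2", "d3"}

    schedule = []                       # one chosen dimension per depth
    while remaining:
        best = max(candidates, key=lambda d: len(candidates[d] & remaining))
        solved = candidates[best] & remaining
        if not solved:
            raise RuntimeError("no candidate dimension makes progress")
        schedule.append(best)
        remaining -= solved

    print(schedule)                     # e.g. ['dim_a', 'dim_b']: depth 2

The number of loop iterations is the sequential depth of the resulting schedule; exploring all legal multidimensional schedules, rather than making this single greedy choice, is where the combinatorics returns.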
Conclusion
Iterative and empirical search techniques are one of the last hopes to let compilers harness the complexity of modern processors and hide it from application programmers. We focus on loop transformations because they are both critically important for performance and very hard to drive in an optimizing compiler.
Iterative loop nest optimization is intrinsically difficult because of the large number of distinct, legal transformations for a given program, and it is complicated by the inability of classical loop transformation frameworks to statically characterize this set. So far, all attempts have relied on a separate filtering step to remove redundant and/or illegal candidate transformations from the search space. Our experiments show that such approaches are likely to be impractical, even for tiny kernels of a few lines of code.