Near-Optimal Placement of MPI processes on Hierarchical NUMA Architectures





Introduction

The landscape of parallel computing has undergone tremendous changes since the introduction of multicore architectures. Multicore machines feature hardware characteristics that are a novelty, especially when compared to cluster-based architectures. Indeed, the number of cores available within each system is much higher and the memory hierarchy is much more complex than before. Thus, communication performance can change dramatically depending on where processes are located within the system, since the closer the data is to the process, the faster the access will be. This is known as the Non-Uniform Memory Access (NUMA) effect and is commonly experienced in modern computers. As the number of cores per node is expected to grow sharply in the near future, all these
changes have to be taken into account in order to exploit such architectures at their full
potential. However, there is a gap between the hardware and the software. Indeed, as far
as programming is concerned, the change is less drastic since users still rely on standards
such as MPI or OpenMP. Hybrid programming (that is, mixing both message-passing and
shared memory paradigms) is one of the keys to obtaining the best performance from hierarchical
multicore machines. This implies new programming practices that users should follow
and apply. However, legacy MPI applications can already take advantage of the computing
power offered by such complex architectures. One way of achieving this goal is to match
an application’s communication pattern to the underlying hardware. That is, the processes
that communicate the most are bound to cores that share the most levels of the memory hierarchy (e.g. caches). The idea is therefore to build a correspondence between the list of MPI process ranks and the list of core numbers in the machine. The placement of MPI processes thus relies entirely on the matching computed by a suitable algorithm. The issue is then to find an algorithm that yields a satisfactory solution to our problem. In this paper, we introduce a new algorithm, called TreeMatch, that efficiently computes a
solution to our problem by taking into account the specificities of the underlying hardware.
The rest of this paper is organized as follows: section 2 describes related work. Section 3 exposes the problem and shows how it can be modeled, while section 4 describes our TreeMatch algorithm. Both theoretical and empirical results are detailed in section 5. Finally, section 6 concludes this paper.
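To make the rank-to-core correspondence concrete, here is a minimal sketch, not the paper's actual code, of how each MPI process could be bound to the core that a previously computed matching assigns to its rank. The core_of_rank array below is a hypothetical placeholder for such a matching; the binding itself uses hwloc.

```c
/* Minimal, hypothetical sketch: bind each MPI process to the core assigned
 * to its rank by a previously computed matching. The core_of_rank array is
 * a placeholder, not part of the paper. */
#include <mpi.h>
#include <hwloc.h>

static void bind_to_core(unsigned core_index)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Find the requested core and bind the whole process to it. */
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, core_index);
    if (core != NULL)
        hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_PROCESS);

    hwloc_topology_destroy(topo);
}

int main(int argc, char **argv)
{
    /* Placeholder matching for 4 ranks: rank i goes to core core_of_rank[i].
     * In practice this permutation would come from the matching algorithm. */
    unsigned core_of_rank[] = { 0, 2, 1, 3 };
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    bind_to_core(core_of_rank[rank]);

    /* ... application communication happens here ... */

    MPI_Finalize();
    return 0;
}
```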

Related work

Concerning process placement, pioneering work was done by Kruskal and Snir in [9], where
the problem is modeled by a multicommodity flow. The MPIPP [2] framework takes into
consideration a multicluster context and strives to dispatch the various MPI processes on the
different clusters used by an application. Graph theory algorithms are widely used in order
to determine the matching (a list of (MPI process rank,core number) couples). For instance,
several vendor MPI implementations, such as the ones provided by Hewlett-Packard [3] (HP-MPI, since replaced by Platform MPI) or by IBM (according to [4]), make use of such a mechanism. [6] also formalizes the problem
with graphs. In these cases, however, the algorithm computing the final mapping is MPI
implementation-specific and does not take into account the complexity of the hierarchy
encountered in multicore NUMA machines nor their topologies. In a previous paper [10],
we used a graph partitioning algorithm called Scotch [5] to perform the task of mapping
computation. However, Scotch is able to work on any type of graph, not just trees as in our case. Making some hypotheses about the graph structure can lead to improvements, which is why we have developed a new algorithm, called TreeMatch, tailored to fit exactly
our specific needs.

Problem Modeling
Hardware Architecture


The first step for determining a relevant process placement consists of gathering information
about the underlying hardware. Retrieving the information about the memory hierarchy
in a portable way is not a trivial task. Indeed, no tool is able to provide information
about the various cache levels (such as their respective sizes and which cores can access
them) on a wide spectrum of systems. To this end, we participated in the development of
a specific software tool that fulfills this goal: Hardware Locality or Hwloc [1]. Thanks to
Hwloc, we can get all the needed information about the architecture, that is, the number of
NUMA nodes, sockets and cores as well as the information about the memory hierarchy. In
previous works [2,10] the architecture is modeled by a topology matrix. Entries in this matrix correspond to the communication speed between two cores and take into account the number of elements of the memory hierarchy shared between cores. However, such a representation
induces a side effect as it “flattens” the view of the hardware’s structure. The result is a loss of
valuable information that could be exploited by the matching algorithm in order to compute
the best possible placement. Indeed, a NUMA machine is most of the time hierarchically
structured. A tree can thus provide a more reliable representation than a topology matrix.
Since Hwloc internally uses a tree-like data structure to store all information, we can easily
build another data structure derived from Hwloc’s. Formally, we have a tree of hardware
elements: the depth of an element in this tree corresponds to its depth in the hierarchy, and cores (or other computing elements) are the leaves of the tree.
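As a small illustration of this tree view, the following sketch (based on the kind of level-by-level traversal the hwloc C API supports, not taken from the paper) walks the topology depth by depth and prints the objects found at each level; the cores or processing units appear at the deepest level, i.e. as the leaves.

```c
/* Illustrative sketch: print the hardware hierarchy as a tree, level by
 * level, using hwloc. Deeper levels correspond to deeper hierarchy elements;
 * cores/PUs are the leaves. */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    int topodepth = hwloc_topology_get_depth(topology);
    for (int depth = 0; depth < topodepth; depth++) {
        unsigned nobjs = hwloc_get_nbobjs_by_depth(topology, depth);
        printf("depth %d: %u object(s)\n", depth, nobjs);
        for (unsigned i = 0; i < nobjs; i++) {
            hwloc_obj_t obj = hwloc_get_obj_by_depth(topology, depth, i);
            printf("  %s #%u\n", hwloc_obj_type_string(obj->type),
                   obj->logical_index);
        }
    }

    hwloc_topology_destroy(topology);
    return 0;
}
```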

Application Communication Patterns

The second vital piece of information is the target application’s communication pattern. In
our case, this pattern consists of the global amount of data exchanged between each pair of
processes in the application and is stored in a p × p communication matrix (where p is the number of processes). This approximation yields a “flattened” view of the communication
activities that occur during an application’s execution. In order to gather the needed data,
we chose to introduce a small amount of profiling code within an existing MPI implementation
(MPICH2). By modifying the low-level communication channels in the MPICH2
stack we are able to trace data exchanges for both point-to-point and collective communication. Since this tracing is very lightweight, it does not noticeably disturb an application’s execution. This approach has also been adopted by other vendor MPI implementations,
such as HP-MPI (now Platform MPI) [3]. The main drawback is that a preliminary run of
the application is mandatory in order to get the communication pattern and a change in
the execution (number of processors, input data, etc.) often requires rerunning the profiling step.
However, there are other possible approaches. For instance, the MPIPP framework uses a
tool called FAST [11] that combines static analysis of the application’s code and dynamic
execution of a modified application that executes much faster while retaining the same
communication pattern as the original source application. The goal is to reduce the time necessary to determine the communication pattern by several orders of magnitude. It should be noted
that in all approaches, an application, either the original one or a simpler version, has to
be executed.
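For illustration only, the sketch below shows one conventional way a p × p communication matrix could be accumulated with the standard PMPI profiling interface, here intercepting MPI_Send. This is a simplified stand-in for the channel-level instrumentation inside MPICH2 described above (it ignores collectives and receive-side traffic), and every name in it is ours, not the authors'.

```c
/* Simplified, hypothetical sketch: count bytes sent between each pair of
 * ranks by intercepting MPI_Send through the PMPI profiling interface.
 * This is not the MPICH2 channel-level tracing described in the text. */
#include <stdlib.h>
#include <mpi.h>

static double *comm_matrix = NULL;   /* row-major p x p byte counts */
static int nprocs = 0, myrank = 0;

int MPI_Init(int *argc, char ***argv)
{
    int err = PMPI_Init(argc, argv);
    PMPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    PMPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    comm_matrix = calloc((size_t)nprocs * nprocs, sizeof(double));
    return err;
}

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    /* Assumes comm == MPI_COMM_WORLD for simplicity. */
    int size;
    PMPI_Type_size(datatype, &size);
    comm_matrix[(size_t)myrank * nprocs + dest] += (double)count * size;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

int MPI_Finalize(void)
{
    /* Gather the full p x p pattern on rank 0, which could then write it
     * out for the matching algorithm. */
    double *full = (myrank == 0)
        ? calloc((size_t)nprocs * nprocs, sizeof(double)) : NULL;
    PMPI_Reduce(comm_matrix, full, nprocs * nprocs, MPI_DOUBLE,
                MPI_SUM, 0, MPI_COMM_WORLD);
    free(full);
    free(comm_matrix);
    return PMPI_Finalize();
}
```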