Seminar Topics & Project Ideas On Computer Science Electronics Electrical Mechanical Engineering Civil MBA Medicine Nursing Science Physics Mathematics Chemistry ppt pdf doc presentation downloads and Abstract

Full Version: A Scheduling Strategy Supporting OpenMP Task on Heterogeneous Multicore
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
[attachment=70742]



Abstract- One of the most important topics in software
industry is how to utilize the OpenMP 3.0 programming model
to improve the execution of irregular and unstructured
applications. In this paper, we present an original task
scheduling strategy- Hybrid strategy, which is suited to the
execution of OpenMP programs on Cell heterogeneous
multicore. Hybrid scheduling strategy creates tasks in
breadth-first order while executes tasks in work-first fashion
during application execution. The former is capable of
creating enough tasks, which prevents the worker threads
from idling. While the latter guarantees task dependence is
freed quickly and consequently, the overhead with respect to
task searching is significant decreased. The evaluation, with a
variety of Barcelona OpenMP Task Suite, is conducted on a
PS3 heterogeneous multicore. And the experimental results
indicate that Hybrid policy outperforms the existing work-first
and breadth-first scheduling strategies for most irregular and
unstructured benchmarks, with speedups from 1.5 to 4.6 when
6 SPEs are used.



INTRODUCTION
With the ever increasing of semconductor industry,
multicore system has become one of the research hotspots in
integrated circuit design. Homogeneous multicore could
greatly improve the application parallelism, but it fails in
exploiting the concurrency available in some complex
applications. Heterogeneous multicore, which integrates a
number of diverse processors in a single chip, is inevitable
in computer architecture [1].
The Cell Broadband Engine (Cell BE)[2] is a
representative heterogeneous multicore processor, which has
drawn substantial attention from both industry and academia.
It comprises a conventional Power Processor Element (PPE)
that controls eight synergistic processing elements (SPEs).
The PPE has two levels of hardware cache while SPEs don’t
have caches but each has a 256KB of local store (LS). The
PPE can access the main memory while SPE can only
operate directly on its LS. Such architectural characteristics
make Cell architecture compelling for some specific
scientific computing.
As the hardware complexity increases, modern
applications are getting more sophisticated. Irregular and
dynamic structures, such as recursive kernels, unbounded
loops are prevalent in high performance computing.
Therefore, many mainstream programming environments adopt tasks as high level abstraction to solve such problems,
with OpenMP 3.0[10] as a prominent example. OpenMP 3.0
specification, which adds a new directive task, has shifted
from a thread-centric to a task-centric execution model. Task
mechanism allows the programmers to explicitly identify
units of independent work, leaving the scheduling strategy
of how and when to execute the created tasks to the runtime
system. As a consequence, an appropriate and flexible
scheduling decision of OpenMP 3.0 task mechanism on Cell
architecture has become a rapidly growing area of research.
Broadly speaking, these policies are approximately
categorized into work-first scheduling strategy (WF) and
breadth-first scheduling strategy (BF).
However, there are still some deficiencies existing in the
WF and BF strategies, which will be specified elaborately in
the following related works. Considering this, we design and
implement an original task scheduling strategy - Hybrid
scheduling strategy which supports OpenMP 3.0 task model
targeting Cell heterogeneous multicore. Hybrid scheduling
strategy combines the advantages of existing BF and WF
scheduling strategies. It creates tasks in BF order while
executes tasks in WF fashion. The former could spawn
enough tasks within a short time, which prevents the
computation threads from idling and thus decreases the
number of suspended tasks. Additionally, it avoids a
situation that all tied tasks in OpenMP 3.0 standard tie to a
certain thread in WF task creation fashion. While the latter
guarantees that the tasks at the bottom in the task-tree are
executed primarily. Therefore, task dependence is freed
earlier and the overhead with respect to task lookup is
drastically decreased. Moreover, the overload balance is
achieved.
To evaluate the performance of Hybrid strategy, we
conducted the experiments on a set of Barcelona OpenMP
Task Suite (BOTS) benchmarks. The experimental results
suggest that Hybrid scheduling strategy achieves better
performance improvements than WF and BF policies for
most benchmarks, with the speedups up to 4.6 when 6 SPEs
are used.
The rest of the paper is organized as follows. The related
works are presented in section II. The design of Hybrid
scheduling strategy on Cell BE is presented in section III.
Section IV describes the execution process of Hybrid policy.
Section V shows the evaluation results. And the last section
concludes the paper.



RELATED WORKS
There has been considerable research on task scheduling
strategies. Here we concentrate our discussion of related
works on task scheduling decisions targeting OpenMP 3.0
specification due to space considerations. Specifically for
the OpenMP 3.0 task mechanism, the approaches presented
in NANOS runtime are the mainstream proposals in broad
sense, and some other policies [13-20] are derived from them.
NANOS runtime[11-12] covers two scheduling policies:
work-first scheduling strategy and breadth-first scheduling
strategy, which implement the restrictions about scheduling
of tied tasks.
Work-first scheduling strategy is also named depth-first
scheduling strategy, and it tries to follow the serial execution
path hoping that if the sequential algorithm was well
designed it will potentially benefit better data locality. WF
works as follows: whenever a task is created, the parent task
is suspended and the executing thread switches to the newly
created task. When a task is suspended, the task is placed in
a thread’s local pool. Cilk programming models[7] developed
by MIT, as well as Intel TBB[8], both adopt the similar
scheduling strategy.
WF performs well for some recursive applications in that
the task dependence is freed as soon as possible, yet it has
some deficiencies as follows:
First, the initial task creation is based on the depthfirst
manner, which is incompatible with tied
OpenMP 3.0 tasks. If this scheme were used for tied
tasks in OpenMP 3.0, all the tasks would remain tied
to a single thread, which hampers the application
parallelism.
Second, WF policy is designed for scenarios in
which working-stealing is rare, but overhead related
to work-stealing is obviously increased with the
number of threads increasing.
Breadth-first scheduling strategy is a naive scheduler in
which newly created tasks are placed into the team pool and
execution of the parent task continues. This strategy
guarantees that all tasks in the current recursion level are
generated before a thread executes tasks from the next level.
Microsoft TPL (task parallel library)[9] applies BF
scheduling strategy. BF achieves ideal performance for some
applications with the iterative structure since it creates
enough tasks in the early step. However, it has the following
shortcomings:
The application locality is affected in that the newly
created tasks are put into the global task queue
instead of immediately executed.
Extra synchronization overhead is incurred when
large scale program is involved, which hinders
application scalability.
BF policy spreads out the tasks in the task-tree too
early, which is a waste of valuable memory
resources.
Except for the strategies mentioned above, there are
some scheduling schemes specifically for the Cell
processor[21-22]. Blagojevic et al.[21] investigate the mapping
of parallel computation to the Cell BE, showing the importance of the careful management of the interaction
between the PPE and SPEs. However, the Linux’s scheduler
lacks of knowledge about the asynchronous nature of the
SPE computation, which is likely to lead to the latency of
the response to the events on the SPE side. The same team[22]
proposes an event-driven cooperative scheduling mechanism.
The kernel level implementation is an extension to the Linux
kernel with data structures and system calls which allow the
SPEs to request activation of a specific process on the PPE
side. For the user-level approach shared memory
abstractions and a work-stealing strategy is exploited.