15-12-2012, 04:00 PM
A SURVEY ON INSTRUCTION-LEVEL PARALLELISM PROCESSORS AND TECHNIQUES
Abstract
Sequential computers have, since their earliest designs, suffered from slow execution for several reasons, among them that only a single operation is performed per cycle and that functional units are limited. To overcome this difficulty, a number of hardware and software techniques were proposed; instruction-level parallelism was one of them. Instruction-level parallelism (ILP) is both a technique and a class of processor that executes multiple operations (that is, multiple instructions of a sequential program) in parallel. Processors that exploit instruction-level parallelism successfully include VLIW and superscalar designs, and these processors expose parallelism using software techniques such as software pipelining and trace or region scheduling.
INTRODUCTION
Instruction-level parallelism (ILP) is both a technique and a class of
processor that executes multiple operations (that is, multiple
instructions of a sequential program) in parallel. It increases
execution speed by carrying out individual machine operations, such as
memory loads and stores, integer additions, and floating-point
multiplications, in parallel. Processors that exploit the success of
instruction-level parallelism include VLIW and superscalar designs, and
these processors expose parallelism using software techniques such as
software pipelining and trace or region scheduling. We will examine
some of these briefly.
ILP Execution
Consider the execution hardware of a simplified ILP processor
consisting of four functional units and a branch unit connected to a
common register file (Table 1). Typically, ILP execution hardware
allows multiple-cycle operations to be pipelined, so that in each cycle
a total of four new operations can be initiated. Such hardware might
therefore have up to ten operations "in flight" at once, which would
give it a maximum possible speed-up of 10 over a sequential processor
with similar execution hardware.
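To make this concrete, the issue model above can be sketched in a few lines of Python. The functional-unit mix, the operation latencies, and the particular operation stream are illustrative assumptions, not taken from any real processor.

```python
# Toy model of pipelined, multi-issue execution: up to four independent
# operations are initiated each cycle, and every unit is fully pipelined,
# so a unit can begin a new operation every cycle.

def cycles_sequential(ops):
    """One operation at a time; the next starts when the previous finishes."""
    return sum(latency for _, latency in ops)

def cycles_ilp(ops, issue_width=4):
    """Issue up to `issue_width` operations per cycle; all assumed independent."""
    finish = 0
    for cycle, start in enumerate(range(0, len(ops), issue_width)):
        batch = ops[start:start + issue_width]   # operations issued this cycle
        finish = max(finish, max(cycle + lat for _, lat in batch))
    return finish

# Ten independent operations "in flight": memory loads, integer adds,
# floating-point multiplies (latencies in cycles are assumptions).
ops = [("load", 2), ("add", 1), ("fmul", 3), ("add", 1), ("load", 2),
       ("fmul", 3), ("add", 1), ("load", 2), ("add", 1), ("fmul", 3)]

print(cycles_sequential(ops))   # → 19
print(cycles_ilp(ops))          # → 5
```

With this assumed mix the ILP model finishes in 5 cycles versus 19 sequentially, a speed-up of 3.8; as the stream of independent operations grows, the achievable speed-up approaches the number of operations the hardware keeps in flight.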
ILP Architecture
The end result of instruction-level parallel execution is that multiple
operations are simultaneously in execution as a result of having been
issued simultaneously.
A computer architecture is a contract between the class of programs
that are written for the architecture and the set of processor
implementations of that architecture. Usually this contract is concerned
with the instruction format and the interpretation of the bits that
constitute an instruction, but in the case of ILP architectures it extends
to information embedded in the program pertaining to the available
parallelism between the instructions or operations in the program. With
this in mind, ILP architectures can be classified as follows.
Dependence architectures and dataflow processors
The objective of a dataflow processor is to execute an instruction
at the earliest possible time subject only to the availability of the
input operands and a functional unit upon which to execute the
instruction [29, 30]. To do so, it uses information provided by
the program or by the hardware at run time. Typically, this is
accomplished by including in each instruction a list of successor
instructions.
Each time an instruction completes, it creates a copy of its result
for each of its successor instructions. As soon as all of the input
operands of an instruction are available, the hardware fetches the
instruction, which specifies the operation to be performed and the
list of successor instructions. The instruction is then executed as
soon as a functional unit of the requisite type is available.
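The firing rule just described can be sketched as a tiny interpreter. The program below (which computes 6 × 7) and its instruction format are invented for illustration; real dataflow machines tag and route result tokens in hardware rather than in a software queue.

```python
# Minimal sketch of the dataflow firing rule: each instruction carries a
# list of successors; when it completes, it delivers a copy of its result
# to each successor, and a successor fires as soon as all of its input
# operands have arrived.
import operator
from collections import deque

# name -> (operation, number of inputs, successor names)  [assumed format]
program = {
    "a":   (lambda: 6,    0, ["mul"]),
    "b":   (lambda: 7,    0, ["mul"]),
    "mul": (operator.mul, 2, ["out"]),
    "out": (lambda x: x,  1, []),
}

def run(program):
    operands = {name: [] for name in program}   # input tokens arrived so far
    ready = deque(n for n, (_, k, _s) in program.items() if k == 0)
    results = {}
    while ready:
        name = ready.popleft()
        op, _, succs = program[name]
        results[name] = op(*operands[name])
        for s in succs:                          # copy result to successors
            operands[s].append(results[name])
            if len(operands[s]) == program[s][1]:  # all operands ready: fire
                ready.append(s)
    return results

print(run(program)["out"])   # → 42
```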
Dataflow processors have traditionally counted on using control
parallelism alone to fully utilize the functional units. A dataflow
processor is more successful than the others at looking far down
the execution path to find abundant control parallelism. As far as
the authors are aware, there have been no commercial products
built based on the dataflow architecture, except in a limited sense
[44]. There have, however, been a number of research prototypes
built, for instance, the ones built at the University of Manchester
[31] and at MIT [45].
Global acyclic scheduling
A number of studies have established that basic blocks are quite
short, typically about 5 to 20 instructions on average. So,
whereas local scheduling can generate a near-optimal schedule,
data dependences and execution latencies conspire to make the
optimal schedule, itself, rather disappointing in terms of its
speedup over the original sequential code. Further improvements
require overlapping the execution of successive basic blocks,
which is achieved by global scheduling. Early strategies for
global scheduling attempted to automate and emulate the ad hoc
techniques that hand coders practiced of first performing local
scheduling of each basic block and then attempting to move
operations from one block into an empty slot in a neighboring
block [135, 133]. The shortcoming of such an approach is that,
during local compaction, too many arbitrary decisions have
already been made which failed to take into account the needs of
and opportunities in the neighboring blocks. Many of these
decisions might need to be undone before the global schedule can
be improved. In one very important way, the mindset inherited
from microprogramming was an obstacle to progress in global
scheduling.
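The gap between a good local schedule and the machine's capacity can be seen in a small sketch. The dependence graphs, the unit latency, and the hypothetical two-issue machine below are all assumptions made for illustration: scheduling a short dependence chain locally leaves issue slots empty, while moving one independent operation in from the neighboring block fills them at no extra cost.

```python
# Greedy list scheduling on a hypothetical 2-issue machine with unit
# latencies; deps maps each operation to the set of operations it must
# follow (the graph is assumed acyclic).

def list_schedule(deps, issue_width=2):
    """Return the number of cycles the schedule takes."""
    done_at = {}                 # op -> cycle in which its result is ready
    cycle = 0
    pending = set(deps)
    while pending:
        issued = 0
        for op in sorted(pending):
            # an op may issue once every predecessor's result is ready
            if all(done_at.get(p, cycle + 1) <= cycle for p in deps[op]):
                done_at[op] = cycle + 1          # unit latency assumed
                issued += 1
                if issued == issue_width:
                    break
        pending -= set(done_at)
        cycle += 1
    return max(done_at.values())

chain = {"a": set(), "b": {"a"}, "c": {"b"}}     # one serial basic block
neighbor = {"x": set()}                          # independent op next door

separate = list_schedule(chain) + list_schedule(neighbor)   # → 4
merged = list_schedule({**chain, **neighbor})               # → 3
```

Scheduled block by block, the two blocks take 3 + 1 = 4 cycles; scheduled globally, `x` slips into an empty slot alongside the chain and the total drops to 3, which is exactly the kind of improvement local compaction alone cannot find.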
Register Allocation
In conventional, sequential processors, instruction scheduling is
not an issue: the program's execution time is barely affected by
the order of the instructions, only by their number. Accordingly,
the emphasis of the code generator is on generating the minimum
number of instructions and using as few registers as
possible [194-199]. However, in the context of pipelined or
multiple-issue processors, where instruction scheduling is
important, the issue of the phase-ordering between it and register
allocation has been a topic of much debate. There are advocates
both for performing register allocation before scheduling
[185, 200, 192] as well as for performing it after scheduling [183,
201-203]. Each phase-ordering has its advantages and neither one
is completely satisfactory.
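One reason allocating registers before scheduling is problematic can be sketched concretely: reusing a register creates anti- and output dependences that were not in the original program. The register names and instruction format below are invented for illustration, and the dependence test is deliberately conservative (every earlier write to a read register is treated as a true dependence).

```python
# Each instruction is (destination register, list of source registers).

def dependences(code):
    """Return edges (i, j) meaning instruction j must follow instruction i,
    due to a true, anti-, or output dependence on a register."""
    edges = set()
    for j, (dest_j, srcs_j) in enumerate(code):
        for i, (dest_i, srcs_i) in enumerate(code[:j]):
            if dest_i in srcs_j:      # true: j reads a register i wrote
                edges.add((i, j))
            if dest_j in srcs_i:      # anti: j overwrites a register i reads
                edges.add((i, j))
            if dest_j == dest_i:      # output: both write the same register
                edges.add((i, j))
    return edges

# Two independent computations allocated to distinct registers: only the
# two true dependences remain, so a scheduler may interleave the pairs.
many_regs = [("r1", ["a"]), ("r2", ["r1"]),
             ("r3", ["b"]), ("r4", ["r3"])]

# A tighter allocation reuses r1 and r2 for the second computation,
# adding anti- and output dependences that serialize the sequence.
few_regs  = [("r1", ["a"]), ("r2", ["r1"]),
             ("r1", ["b"]), ("r2", ["r1"])]

print(len(dependences(many_regs)))   # → 2
print(len(dependences(few_regs)))    # → 6
```

Under the reused allocation the six edges form a total order, so no overlap between the two computations survives; with four registers only the two true dependences remain. This is the tension behind the phase-ordering debate: allocating first can cost parallelism, while scheduling first can raise register pressure.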