21-09-2013, 02:36 PM
Vector Processors and DSPs
Review
Speculation: Out-of-order execution, In-order commit (reorder buffer + rename registers) => precise exceptions
Branch Prediction
Branch History Table: 2-bit counters for loop accuracy
Recently executed branches correlated with next branch?
Branch Target Buffer: include branch address & prediction
Predicated Execution can reduce number of branches, number of mispredicted branches
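The 2-bit Branch History Table entry mentioned above can be sketched as a saturating counter. This is an illustrative model (class and method names are my own), not any specific processor's implementation:

```python
# Minimal sketch of a 2-bit saturating counter, the per-entry state
# of a Branch History Table (illustrative model; names are assumed).
class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # 0,1 = predict not-taken; 2,3 = predict taken

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        # Saturate at 0 and 3: a single misprediction at a loop exit
        # does not flip a strongly-taken entry, so the loop branch is
        # predicted correctly again the next time the loop runs.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

With one bit, the loop exit would mispredict twice per loop (once on exit, once on re-entry); the second bit is what gives the "loop accuracy" above.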
Software Pipelining
Symbolic loop unrolling (instructions from different iterations) to optimize pipeline with little code expansion, little overhead
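The overlap that software pipelining creates can be sketched in plain Python. This is a hand-made illustration (function name and the doubling "compute" stage are assumed), showing the load for iteration i+1 issued while iteration i computes, with only a small prologue instead of full unrolling:

```python
# Software-pipelining sketch (illustrative): each trip through the new
# loop mixes stages from different original iterations, keeping the
# pipeline full with little code expansion.
def pipelined(src):
    n = len(src)
    out = [None] * n
    loaded = src[0]                  # prologue: load for iteration 0
    for i in range(n):
        computed = loaded * 2        # compute stage of iteration i
        if i + 1 < n:
            loaded = src[i + 1]      # load stage of iteration i+1, overlapped
        out[i] = computed            # store stage of iteration i
    return out
```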
Superscalar and VLIW (“EPIC”): CPI < 1 (IPC > 1)
Dynamic issue vs. Static issue
More instructions issue at same time => larger hazard penalty
# independent instructions = # functional units × latency
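As a worked instance of the rule above (the unit counts and latency here are illustrative, not from the source):

```python
# "# independent instructions = # functional units × latency":
# to keep every unit busy, enough independent work must be in flight
# to cover each unit's full latency.
functional_units = 4  # e.g. 2 integer + 2 FP units (assumed numbers)
latency = 3           # cycles before a result can feed a dependent op

independent_needed = functional_units * latency
print(independent_needed)  # 12 independent instructions in flight
```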
Review: Instruction Level Parallelism
High-speed execution based on instruction-level parallelism (ILP): the potential of short instruction sequences to execute in parallel
High-speed microprocessors exploit ILP by:
1) pipelined execution: overlap instructions
2) superscalar execution: issue and execute multiple instructions per clock cycle
3) Out-of-order execution (commit in-order)
Memory accesses for high-speed microprocessor?
Data Cache, possibly multiported, multiple levels
Properties of Vector Processors
Each result independent of previous result
long pipeline, compiler ensures no dependencies
high clock rate
Vector instructions access memory with known pattern
highly interleaved memory
amortize memory latency over the 64 elements
no (data) caches required! (Do use instruction cache)
Reduces branches and branch problems in pipelines
Single vector instruction implies lots of work (equivalent to an entire loop)
fewer instruction fetches
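The fetch savings can be made concrete with a plain-Python sketch (names and the 64-element length are taken from the text; the code itself is illustrative, not machine code): the scalar loop pays a fetch, an index update, and a branch per element, while one vector add covers all 64 elements with a single instruction fetch.

```python
# Scalar loop vs the work of one vector instruction, sketched in Python.
VLEN = 64  # elements per vector register, as in the text

def scalar_add(a, b):
    c = []
    for i in range(VLEN):      # 64 trips: fetch + branch + bookkeeping each
        c.append(a[i] + b[i])
    return c

def vadd(a, b):
    # One vector instruction: the hardware pipelines all 64 element
    # operations itself; no per-element branch or instruction fetch.
    return [x + y for x, y in zip(a, b)]
```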
Styles of Vector Architectures
memory-memory vector processors: all vector operations are memory to memory
vector-register processors: all vector operations between vector registers (except load and store)
Vector equivalent of load-store architectures
Includes all vector machines since late 1980s:
Cray, Convex, Fujitsu, Hitachi, NEC
We assume vector-register for rest of lectures
Vector Surprise
Use vectors for inner loop parallelism (no surprise)
One dimension of array: A[0, 0], A[0, 1], A[0, 2], ...
think of machine as, say, 32 vector regs each with 64 elements
1 instruction updates 64 elements of 1 vector register
and for outer loop parallelism!
1 element from each column: A[0,0], A[1,0], A[2,0], ...
think of machine as 64 “virtual processors” (VPs)
each with 32 scalar registers! (like a multithreaded processor)
1 instruction updates 1 scalar register in 64 VPs
Hardware identical, just 2 compiler perspectives
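Those two compiler perspectives can be sketched as follows (an illustrative model with assumed function names): the same 2-D update expressed row-wise, where one instruction updates 64 elements of one vector register, and column-wise, where one instruction updates one scalar register in each of 64 "virtual processors". Both produce the identical result, matching the point that only the compiler's view differs.

```python
# The "vector surprise": the same hardware seen two ways.
VLEN = 64  # elements per vector register / number of virtual processors

# Inner-loop view: one vector register holds one row of A,
# so one vector instruction updates 64 elements of that register.
def row_view(A):
    return [[x + 1 for x in row] for row in A]

# Outer-loop view: element j of every register belongs to virtual
# processor j; one instruction updates one scalar register in each VP.
def vp_view(A):
    cols = list(zip(*A))                               # column j = VP j's data
    out_cols = [[x + 1 for x in col] for col in cols]  # one op across all VPs
    return [list(row) for row in zip(*out_cols)]
```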
Exception handling: Page Faults
Option 2: expand the memory pipeline to check addresses before sending them to memory, plus a memory buffer between the address check and the registers
multiple queues to transfer from the memory buffer to the registers; check the last address in the queues before loading the 1st element from the buffer.
Pre-Address Instruction Queue (PAIQ), which sends addresses to the TLB and memory while, in parallel, instructions go to the Address Check Instruction Queue (ACIQ)
When an instruction passes its checks, it goes to the Committed Instruction Queue (CIQ) to be there when its data returns.
On page fault, only save instructions in PAIQ and ACIQ
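A toy model of this three-queue flow may help. Everything here is an assumption-laden sketch of the scheme described above (the function name, the `page_ok` callback, and the dict result are mine, not a real machine's interface):

```python
# Toy model of the PAIQ -> ACIQ -> CIQ flow for vector element loads.
from collections import deque

def issue_vector_loads(addresses, page_ok):
    """page_ok(addr) -> False simulates a page fault at that address."""
    paiq, aciq, ciq = deque(addresses), deque(), deque()
    while paiq:
        addr = paiq.popleft()
        # Address goes to TLB/memory while the instruction enters ACIQ.
        aciq.append(addr)
        if not page_ok(addr):
            # Page fault: only PAIQ + ACIQ entries need saving for
            # restart; CIQ entries passed all checks and will complete.
            return {"committed": list(ciq),
                    "restart": list(aciq) + list(paiq)}
        # Check passed: move from ACIQ to CIQ to await returning data.
        ciq.append(aciq.popleft())
    return {"committed": list(ciq), "restart": []}
```

The point of the split is exactly the last bullet: on a fault, the committed queue is untouched, so the exception is precise without replaying the whole vector load.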
Summary
Vector is alternative model for exploiting ILP
If code is vectorizable, the hardware is simpler, more energy efficient, and offers a better real-time model than out-of-order machines
Design issues include number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, conditional operations
Will multimedia popularity revive vector architectures?