07-05-2012, 04:31 PM
An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm
mm.ppt (Size: 681.5 KB / Downloads: 35)
One PE is in charge of the computation of one column that corresponds to the updating of S with respect to one single bit Xi.
The delay between two contiguous PEs is 2 clock cycles.
The minimum computation time in terms of clock cycle is 2•n+e given (e+1)/2 PEs are implemented to work in parallel.
Avoid the extra clock cycle delay
One singe PE is responsible to update one fixed word in S
It has two branches corresponding to two possibilities of S(i+1)0
The correct results, the carry and the S(i)w-1, is selected from two sets of possible results by S(i+1)0, both available and registered at the same moment
The overall architecture
e PEs are required to compute the e words in S respectively.
Two shift registers, one providing single bits in X and one providing the parities of S(0)0, parallel these PEs.
(n+e-1) clock cycles are required to process the Montgomery multiplication of two n-bit operands.
Conclusion
An optimized hardware architecture to implement MWR2MM algorithm is proposed
The radix-2 version of this architecture takes (n+e-1) clock cycles to process the Montgomery multiplication of two n-bit operands
Compared to original architecture by Tenca & Koç, the new approach takes half time for processing and introduces less than 10% area penalty
The same optimization technique can be applied onto the original architecture by Tenca & Koç, keeping the scalability while reducing the processing latency to half