25-08-2017, 09:32 PM
An Algorithmic and Architectural Study on Montgomery Exponentiation in RNS
1An Algorithmic and Architectural.pdf (Size: 2.21 MB / Downloads: 35)
Abstract
The modular exponentiation on large numbers is computationally intensive. An effective way for performing this operation
consists in using Montgomery exponentiation in the Residue Number System (RNS). This paper presents an algorithmic and
architectural study of such exponentiation approach. From the algorithmic point of view, new and state-of-the-art opportunities that
come from the reorganization of operations and precomputations are considered. From the architectural perspective, the design
opportunities offered by well-known computer arithmetic techniques are studied, with the aim of developing an efficient arithmetic cell
architecture. Furthermore, since the use of efficient RNS bases with a low Hamming weight are being considered with ever more
interest, four additional cell architectures specifically tailored to these bases are developed and the tradeoff between benefits and
drawbacks is carefully explored. An overall comparison among all the considered algorithmic approaches and cell architectures is
presented, with the aim of providing the reader with an extensive overview of the Montgomery exponentiation opportunities in RNS.
INTRODUCTION
THE hardware implementation of modular exponentiation
for very large integers plays an important role in
various fields, being security among the most notable ones.
In recent years, many research efforts specifically focused
on the hardware implementation of Montgomery exponentiation
(ME) in the residue number system (RNS). In RNS,
very long integer multiplications and additions can be split
in independent short integer operations. However, other
operations such as division and modular reduction may be
difficult to execute [1]. ME, which is based on Montgomery
multiplication (MM) [2], can be performed by avoiding the
standard modular reduction approach.
ALGORITHMIC STUDY
In this section, a suitable reorganization of the operations
and of precomputations capable of reducing the number of
multiplications in RNS ME algorithms is described. The RNS
ME corresponds to a classic square and multiply algorithm,
where RNS MMs are iterated in a loop. The RNSMMfollows
the normal Montgomery multiplication approach, but it is
adapted to RNS. The main difference is the presence of two
BEs, which are required by RNS MM in order to execute the
modular reduction. The BE corresponds to an algorithm that
calculates a value represented on an RNS base on a different
base. Since the RNS ME uses the RNS MM that, in turn,
requires the BE, after a brief background on RNS the three
algorithms will be discussed separately by following a topdown
presentation approach.
RNS Montgomery Multiplication
In the state-of-the-art algorithms [4], [7], RNS MM is
performed on two RNS bases, A and B. Since B is used as
the Montgomery constant, B1
A must be precomputed on the
base A, where B1
A is the multiplicative inverse on A of B.
In Fig. 2, the state-of-the-art and reorganized RNS MM
algorithms are presented (and, as before, [4] and [7] are
treated together).Bis used as theMontgomery constant; thus,
by executing the multiplication of Step 3 in B, the modular
reduction byBis immediate. Since a division byBis required,
which must be executed in a base composed by elements
relatively prime to B, the BE to A of Step 4 is performed. The
multiplication in Step 5 and the addition in Step 6 are only
executed onA, since the result of Step 6 onB wouldbe equal to
0. The multiplication by B1 in Step 7, which corresponds to
the division, is only performed onA, as previously described.
Afinal BE to B is required, in order to reach a valid result that
could be passed in input to other MMs.
BE Approach by Bajard et al.
The MM algorithm proposed by Bajard et al. in [7] (and
applied in the context of RNS ME in [5]) uses two BE
algorithms; Algorithm 9 in Fig. 5 shows the first one, which
does not perform the BE correction in order to save time.
Algorithm 11 in Fig. 6, originally proposed by Shenoy and
Kumaresan [8], calculates the correct result. The error in the
result of the first BE does not affect the final result of the
MM, but larger bases are required, i.e., A;B > Nðk þ 2Þ2.
Algorithm 10 in Fig. 5 corresponds to the first reorganized
BE based on the approach by Bajard et al. (BBE1) and
includes the operations of Algorithm 9 and the operations
of Algorithm 3, Steps 3, 5, 6, and 7.
Number of Modular Multiplications
In [4] and [5], the metric used to evaluate the performance of
the proposed approach was based on the number of modular
multiplications needed (as shown in Table 3). The reorganized
RNS MM algorithms requires 3k modular multiplications
less than the state-of-the-art algorithms with the
same BE technique. The reorganized RNS ME requires
2k additional modular multiplications. Considering an RSA
scenario with an exponent of size g ¼ 1;024 and k ¼ 33, the
maximum number of modular multiplications would be
reduced from 4,938,450 to 4,735,566 (95.8 percent). Considering
a smaller base where g ¼ 640 and k ¼ 21, the maximum
number of modular multiplications would be reduced from
1,319,178 to 1,238,454 (93.9 percent). The percentage reduction
is slight better with a smaller base size, since the relative
weight of 3k (i.e., the reduction) over k2 (i.e., main contribution
to the total cost) is greater.
ARCHITECTURAL STUDY
In [6], Nozaki et al. proposed a cell architecture suitable to
the algorithm presented in [4]. The cell is matched to a single
element per RNS base; hence, a set of k cells is required to
execute the exponentiation. A cell can be matched to more
than one element per base, in order to reduce the area
(though increasing the delay). The cell is composed by a
three-stage pipelined Modular Multiplier and Accumulator
Unit (MMAU), a Cox unit (that evaluates and corrects the
BE error) and some memory elements. The MMAU used by
Nozaki et al. is also suitable for the approach proposed by
Bajard et al.; nevertheless, in this case, an additional
redundant cell matched to the redundant base element is
used, instead of the Cox unit.
Proposed Four-Stage MMAU Architecture
An efficient cell can be obtained by introducing redundancy,
increasing the level of pipeline, and moving the accumulation
from the MAU to the FMRU. By using a carry-save
representation, the final adder of the MAU and of the FMRU
can be removed. Based on theoretical observations, the
SMRU can be split in two units, thus augmenting the degree
of pipelining. The area and the delay of the four-stage
architecture can be reduced by moving the addition
operation in the FMRU, thus limiting the number of input
lines of the MAU and of the FMRU.
CONCLUSION
In this paper, an algorithmic and architectural study on
RNS ME has been presented. The opportunities that come
from the reorganization of the operations and of precomputations
have been investigated. In the architectural
study, five new cell architectures have been presented, by
exploiting well-known design techniques emerging from
computer arithmetic and by considering efficient RNS
bases. All the algorithms and the cell architectures have
been compared. Results have shown that, by rearranging
operations and preprocessing, the delay can be significantly
reduced. Moreover, the analysis carried out on the cells for
efficient bases has shown that their performance are
particularly interesting only when their constraints are
strict, hence limiting their applicability.