09-11-2012, 03:36 PM
Effective Search Point Reduction Algorithm and Its VLSI Design for HDTV H.264/AVC Variable Block Size Motion Estimation
Effective Search Point Reduction.pdf (Size: 6.04 MB / Downloads: 59)
Abstract
Variable block size motion estimation (VBSME) in
H.264/AVC has greatly led to achieve an optimal inter frame
encoding. However, the computation burden of the VBSME
becomes the bottleneck of the H.264/AVC encoders. The conventional
architecture in hardware realization is hard to adopt a fast
software algorithm suitable to reduce the VBSME computation
burden. Therefore, this paper presents a search point reduction
(SPR) algorithm with an efficient hardware design, able to
decrease the motion estimation time while maintaining the coding
performance of H.264. The effectiveness of the proposed method
is compared with those of existing methods with respect to chip
area, operation frequency, and throughput rate. The proposed
SPR algorithm increases the coding speed by around 90%;
with a peak signal-to-noise ratio drop of less than 0.1 dB than
that achieved by the JM reference software. The proposed SPR
algorithm can operate at 200MHz with 191 k gate count, which
supports high-definition television 1280 × 720 format.
Introduction
H.264/AVC, THE block-based matching video coding
scheme, incorporated with many advanced coding
schemes, has achieved an unprecedented video coding performance
[1], [2]. Among all the available schemes in H.264/AVC
coder, the inter prediction with the variable block size motion
estimation (VBSME) feature is the most demanding aspect
in terms of computational burden. Achieving the optimum
coding performance depends on the ability of the rate distortion
optimization (RDO) to identify the optimum motion
vector (MV) among all variable block sizes in the inter mode
decision. Therefore, the inter mode decision in the RDO approach
becomes the dominant computation in the H.264/AVC
encoder. Thus, VBSME inter mode decision is the most timeconsuming
task in the H.264/AVC encoder.
Motion Search with Early Termination
In the proposed scheme, we observed that the large block
SADs are generated by means of adding another 4 × 4
block SAD which has a minimum value and become the
candidate position. Based on this observation, we proposed an
improved ME algorithm. In the proposed algorithm, the SAD
of 4×4 blocks is first calculated and each candidate position is
recorded when its SAD is smaller than the minimum matching
error (MME), i.e., stored into a buffer. The first candidate
position of the SAD is always assigned to be the first MME.
The smaller the buffer capacity, the faster the ME will be
performed. The cost of keeping a smaller buffer capacity is
degradation of video quality. The same procedure is performed
for the other 4 × 4 blocks. The motion search for the 4 × 4
block is stopped when the buffer capacity reaches the value
initially assigned. The bufferlarge block capacity is also set when
performing the large block size which is obtained by the
combination of the sixteen 4 × 4 blocks. Such as Fig. 1, the
combination of 4×4a, 4×4b, 4×4e, and 4×4f forms a SAD of
block 8×8a. Therefore, it only needs to search the candidate
positions that are stored in the buffer for the large block ME.
Simulation Results of the SPR Algorithm
Seven video sequences were used in our simulations. These
sequences cover a wide range of motion content with three
different formats. A buffer size below 10 incurs a significant
data loss although a small memory cost is attained. Instead,
at buffer 1000, negligible degradation in BD-PSNR and BDBR
is achieved but with high memory cost. Whereas when
the buffer size is 100, PSNR and bit-rate performance do
not significantly degrade which is a good tradeoff between
performance and hardware cost. Therefore, the buffer size
of 100 was selected for the software simulation and also
for the hardware architecture design. Table III depicts the
performance with different sequences compared to JM11.0.
The (+) sign in BDBR and (−) sign in BDPSNR indicate
the coding loss. Fig. 12 presents the rate-distortion plot for
Foreman sequences with various buffer sizes compared to
JM software. Finally, the comparison of the proposed method
with other fast motion algorithms is summarized in Table IV.
Reference [14] only provides the coding results with single QP.
Compared with [15] and [16], the proposed approach need less
hardware overhead. Notably, the proposed algorithm can save
90% of ME time and yields an acceptable performance, which
is deemed worthy for hardware implementation.
Conclusion
This paper presented an efficient architecture design based
on the SRP algorithm for HDTV variable block size ME
of H.264/AVC. Additionally, the accuracy of the proposed
algorithm made it highly promising for hardware design implementation.
The hardware architecture was implemented with
the 2-D systolic array and its related supporting circuits that
were successfully simulated for the H.264/AVC. The chip can
operate at 200MHz with a gate count of 191 k, including the
memory modules. The proposed hardware IP achieved a better
throughput rate than conventional methods and increased the
efficiency of H.264/AVC hardware design.