the candidate and reference blocks. In this paper, a field-programmable gate-array (FPGA) design is proposed for rapidly computing the minimum SAD. Two goals are achieved through the use of online arithmetic (OLA): it is possible to implement a full 16×16 macroblock SAD in a single FPGA device; and computation can be sped up by early termination of the SAD calculation when the candidate SAD exceeds the current reference SAD. Reconfigurable devices enable us to switch between 8×8 and 16×16 pixels per block quickly and easily. A 16×16 SAD unit requires 1945 look-up tables (LUTs) and runs at 425 MHz. A comparison with other related works is provided.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Motion estimation; FPGA; Sum of absolute differences; Online arithmetic
1. Introduction
Motion estimation (ME) plays an important role in today’s video coding and processing systems since motion vectors provide critical information for temporal redundancy reduction. It has been widely used in the H.26x, MPEG-1, -2 and -4 video compression standards. Motion estimation is defined as searching for the best motion vector, being the displacement of the coordinates of the most similar block in the previous frame compared to the block in the current frame.
Full-search block-matching is the most popular algorithm to perform ME, and it searches through every candidate location to find the best match. To do this, the current frame is partitioned into two-dimensional blocks (typically 8×8 or 16×16 pixel blocks) and a search window in the reference frame is defined. Each block of the current frame is compared with all the blocks of a previous frame within the same window. The final motion vector corresponds to the block with minimum distortion within the search window. The most commonly used metric to calculate the distortion is the sum of absolute differences (SAD) [1], which adds up the absolute differences between corresponding elements in the candidate and reference block.
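As an illustration of the search just described, the following is a minimal software sketch of full-search block matching with the SAD metric. The function names, the list-of-rows frame layout and the `search_range` parameter are assumptions made for this example and do not correspond to any hardware block of the design.

```python
def sad(cur, ref, bx, by, rx, ry, n):
    """SAD between the n x n block of `cur` at (bx, by) and the
    n x n block of `ref` at (rx, ry). Frames are lists of rows."""
    return sum(
        abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
        for j in range(n)
        for i in range(n)
    )


def full_search(cur, ref, bx, by, n, search_range):
    """Try every displacement within +/- search_range and return
    (motion_vector, minimum_sad) for the block at (bx, by)."""
    height, width = len(ref), len(ref[0])
    best_vec, best_sad = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            s = sad(cur, ref, bx, by, rx, ry, n)
            # Keep the first candidate with the smallest distortion.
            if best_sad is None or s < best_sad:
                best_vec, best_sad = (dx, dy), s
    return best_vec, best_sad
```

Every candidate requires a complete SAD in this sketch, which is the cost that fast search strategies and dedicated architectures try to reduce.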
The heavy computational cost of block matching algorithms (BMAs) can be a significant problem in real-time coding applications. To reduce computational complexity, many fast algorithms have been proposed which search only a subset of candidate blocks [2,3]. In addition, different architectures have been designed to speed up the massive arithmetic calculations involved [1,4].
However, the need for specialized hardware contradicts the flexibility demanded by current video coding systems. A feasible solution to this problem is to use a programmable processor core along with a field-programmable gate-array (FPGA) device which is in charge of performing critical tasks. The reasons for using FPGAs include the following advantages: increased flexibility and rapid adaptation to new developments; appropriate performance; and faster design times achieved by re-using IP cores and high-level design languages (such as VHDL). In this context, our design is intended to speed up computation of the minimum SAD by its implementation in an FPGA (SAD processor in Fig. 1), while a data dispatcher supplies the reference and candidate blocks to the FPGA device (Fig. 1). An FPGA architecture to compute the minimum SAD is proposed in this paper. This design can be integrated with any BMA (full search or another efficient search strategy).
0141-9331/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.micpro.2005.12.006
* Corresponding author. Address: Dept. Arquitectura de Computadores, Univ. de Malaga, Malaga, Spain. Tel.: +34 952132787; fax: +34 952132790. E-mail addresses: [email protected] (J. Olivares), [email protected] (J. Hormigo), [email protected] (J. Villalba), [email protected] (I. Benavides), [email protected] (E.L. Zapata).
Despite the parallelism inherent to SAD, full parallel implementation has proved difficult since it requires a large number of operands for typical block sizes (an 8×8 pixel block requires 128 8-bit operands, and a 16×16 pixel macroblock needs 512 8-bit operands). Due to the large amount of hardware, only the computation of the SAD on one row of a macroblock (16×1) is implemented on an FPGA device in Ref. [1], whose authors propose replicating or pipelining the design to obtain the 16×16 computation. Four FPGA chips with 1234 I/O pins each are used in Ref. [5] for a completely parallel design. On the other hand, the use of online arithmetic (OLA) for motion estimation is proposed in Ref. [6] to speed up the computation by early termination of the SAD calculation. A serial architecture (pixel by pixel) for 4×4 blocks is proposed in Ref. [6] based on ASIC implementation.
This paper is organized as follows: in Section 2 a brief description of the OLA techniques is provided; in Section 3 we deal with the computation of the minimum SAD using OLA; Section 4 presents the implementation of the proposed design in FPGA devices; the results of several simulations are shown in Section 5 to illustrate the clock cycles saved with early termination; a comparison with other works is described in Section 6; and finally, the most relevant results of this paper are summarized in Section 7.
2. Online arithmetic
Online arithmetic techniques have been considered as the solution to many signal processing problems, such as digital filtering, Fourier transform, and others [7–10]. Recent works have shown the suitability of OLA for FPGA designs [11].
The basic idea of OLA is to perform computations which overlap with the digit-by-digit communication of operands and results [7]. OLA algorithms operate in a digit-serial manner, beginning with the most significant digit (MSD). To generate the first digit of the result, δ + 1 digits of the input operands are needed. Thus, after δ digits of the operands have been received, a new digit of the result is obtained for each new digit of the operands. For this reason, δ is known as the online delay. Due to the online delay, after the last digits of the inputs are introduced into the system, a number of zero digits equal to the online delay have to be introduced to ensure a correct result.
The most-significant-digit-first mode of computation requires flexibility in computing digits on the basis of partial information about inputs. This is achieved by using a redundant representation system. In a redundant representation with radix r, each digit has more than r possible values. This permits several representations of a given value. Therefore, there is flexibility in choosing an output digit at a given step, so that a compensation can be introduced if needed.
A signed-digit (SD) representation system [12] is used in this paper. In radix-2 SD representation, the digit set is {−1, 0, 1}. Two bits are required to represent each digit, as shown in Table 1. The first bit is negatively weighted and the second one is positively weighted. This number representation system eliminates the long carry propagation chains in the addition operation, although it requires the carry of the two previous digits.
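The redundancy of this representation can be seen in a small sketch. The code below decodes two-bit digit codes using the weighting rule stated above (first bit negative, second bit positive) and enumerates every SD digit string of a given length that evaluates to a chosen value. The names `DIGIT_OF_CODE`, `decode_sd` and `sd_representations` are hypothetical and introduced only for this illustration.

```python
from itertools import product

# Two-bit codes written as (negatively weighted bit, positively weighted bit).
# Digit value = positive bit - negative bit, so both "00" and "11" encode 0.
DIGIT_OF_CODE = {"00": 0, "11": 0, "01": 1, "10": -1}


def decode_sd(codes):
    """Value of a radix-2 SD number given its two-bit codes, MSD first."""
    value = 0
    for code in codes:
        value = 2 * value + DIGIT_OF_CODE[code]
    return value


def sd_representations(value, n_digits):
    """All n-digit SD digit strings (MSD first) that evaluate to `value`."""
    matches = []
    for digits in product((-1, 0, 1), repeat=n_digits):
        acc = 0
        for d in digits:
            acc = 2 * acc + d
        if acc == value:
            matches.append(digits)
    return matches
```

For instance, `sd_representations(3, 3)` lists the digit strings (0, 1, 1), (1, −1, 1) and (1, 0, −1), all of which represent the value 3. This freedom is what lets an online unit commit to an output digit before all input digits are known.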
In short, the advantages of using online arithmetic are as follows: it reduces the number of signal lines connecting modules due to its serial-digit character; the MSD-first computation allows subsequent calculations to occur at a much earlier stage; and it eliminates carry propagation chains, since it uses a redundant number representation system.
3. Online computation of the minimum SAD
The goal of our FPGA design is to find which of the candidate blocks (supplied by the dispatcher) best matches the
Fig. 1. Motion estimation system.
Table 1
Digit codification in radix-2 signed-digit representation

Digit value    Digit representation
+1             01
0              00
0              11
−1             10
to ensure that SADc is greater than SADr.
3.1. Online SAD computation
The SAD adds up the absolute differences between corresponding elements in the candidate and reference block:

$$\mathrm{SAD} = \sum_{i=1}^{N} \sum_{j=1}^{N} \left| c_{i,j} - r_{i,j} \right|, \qquad (1)$$

where $r_{i,j}$ are the elements of the reference block, and $c_{i,j}$ the elements of the candidate block. Thus, the computation of the SAD is divided into three steps:
– Compute the differences between corresponding elements, $d_{i,j} = c_{i,j} - r_{i,j}$
– Determine the absolute value of each difference, $|d_{i,j}|$
– Add all absolute values
We now describe how each of these operations is performed using online arithmetic, and how the pixel values are converted into radix-2 SD representation.
3.1.1. Conversion to SD representation and difference computation
In radix-2 signed-digit representation, each digit is composed of two bits, the first one negatively weighted and the second positively weighted. Thus, a signed-digit number can be interpreted as the difference between two unsigned numbers, one composed of positively weighted bits for each digit, minus the one composed of negatively weighted bits. In fact, this difference must be computed to convert an SD number into a non-redundant representation.
This property is used to simultaneously convert each pixel value into SD representation and compute the difference between the pixels of the reference block and the current block at no computational cost. In this way, each digit of the value $d_{i,j} = c_{i,j} - r_{i,j}$ is obtained in SD representation by only taking the corresponding bit of $c_{i,j}$ as the positively weighted one and the corresponding bit of $r_{i,j}$ as the negatively weighted one, since $c_{i,j}$ and $r_{i,j}$ are unsigned numbers.
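The following sketch models this conversion in software. It pairs the bits of two unsigned pixels position by position and checks that the resulting SD digit string evaluates to their difference. The function names and the default 8-bit width are assumptions for this example; in hardware the pairing is pure wiring.

```python
def pixel_difference_sd(c, r, n_bits=8):
    """SD digits (MSD first) of c - r for unsigned n_bits-wide pixels.

    Bit k of c is taken as the positively weighted bit and bit k of r
    as the negatively weighted bit, so digit k equals c_k - r_k.
    """
    digits = []
    for k in range(n_bits - 1, -1, -1):
        pos = (c >> k) & 1  # candidate bit, weight +2**k
        neg = (r >> k) & 1  # reference bit, weight -2**k
        digits.append(pos - neg)
    return digits


def sd_to_int(digits):
    """Convert MSD-first SD digits to an ordinary integer."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value
```

With 3-bit inputs, `pixel_difference_sd(5, 3, 3)` pairs 101 with 011 and gives the digits [1, −1, 0], which evaluate to 2. No subtractor is involved at any point.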
no online delay.
3.1.3. Sum of absolute differences
The absolute differences of all the pixels corresponding to the current and reference blocks are computed in parallel. Thus, $N^2$ absolute difference blocks are required. An online adder tree is used to obtain the sum of all $|d_{i,j}|$ values. In Fig. 2 this structure is shown for 4×4 pixels per block ($N = 4$). Each OLA-adder in this figure corresponds to a standard SD online adder (Fig. 3).
The number of addition steps of the complete adder tree is $\log_2(N^2)$. In radix-2 signed-digit representation, the online delay of the addition is two, i.e. the MSD of the result is obtained two cycles after the MSD of the inputs has been sent to the adder. Nevertheless, in our case, the carry bit is used as the MSD of the results and this digit is obtained one cycle before. Therefore, the online delay of the complete adder tree is $2\log_2(N^2)$, but the first digit of the results is obtained $\log_2(N^2)$ cycles earlier.
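The depth of such a tree can be checked with a short software model. The sketch below sums values by pairwise reduction, counts the number of levels, and multiplies that count by the per-adder online delay of two quoted above. It operates on ordinary integers rather than digit-serial SD streams, and the function names are hypothetical.

```python
def adder_tree_sum(values):
    """Sum `values` with a balanced binary tree of two-input adders.

    Returns (total, levels), where levels is the number of addition steps.
    """
    level = list(values)
    if not level:
        return 0, 0
    levels = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad an odd level with a zero operand
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels += 1
    return level[0], levels


def tree_online_delay(levels, adder_delay=2):
    """Online delay of the whole tree, one adder delay per level."""
    return adder_delay * levels
```

For a 4×4 block the tree has 16 inputs and 4 levels, giving an online delay of 8; for a 16×16 block it has 256 inputs and 8 levels, giving 16.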
3.2. Signed-digit online comparison
Once the first digit of the SAD corresponding to the current block is obtained, the comparison between the current SAD and the minimum SAD can begin. Thanks to the fact that the MSD-first mode of computation is used, an efficient comparison algorithm can be applied. Nevertheless, since SD representation allows several representations for a given value, the comparison operation between two values is not as simple as in conventional representations.
In Refs. [6,13] a comparison algorithm and its hardware implementation are proposed. The two SD numbers are first converted to sign-magnitude format and then a standard comparison is used. The magnitude computation and comparison are performed on-the-fly in an MSD-first manner. Nevertheless, this comparator has an online delay of two.
We propose a comparison algorithm with no online delay. This is based on the analysis of the sign of the difference between the two values to be compared. In this way, the online delay of two associated with the subtraction operation is avoided.
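Let us define the SD numbers $A$ and $B$, where

$$A = \sum_{i=0}^{n-1} a_i \times 2^i, \quad a_i \in \{-1, 0, 1\} \qquad (2)$$

and $B$ has a similar expression. The result of the operation $A - B$ is

$$A - B = \sum_{i=0}^{n-1} (a_i - b_i) \times 2^i, \quad a_i, b_i \in \{-1, 0, 1\} \qquad (3)$$

Let $R$ be the result of the difference

$$R = A - B = \sum_{i=0}^{n-1} r_i \times 2^i, \quad r_i \in \{-2, -1, 0, 1, 2\} \qquad (4)$$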
Let us assume that, when using an online comparator, the sign of $R$ can be determined at digit $k$ if the partial accumulated sum $R_k$ complies with

$$|R_k| = \left| \sum_{i=k}^{n-1} r_i \times 2^i \right| \ge 2^{k+1} \qquad (5)$$
Given the previous definition of $R_k$, $R$ can be redefined as

$$R = A - B = R_k + \sum_{i=0}^{k-1} r_i \times 2^i \qquad (6)$$
Fig. 2. Online design for the sum of the absolute differences.
$$|R_k| = \left| \sum_{i=k}^{n-1} r_i \times 2^{i-k} \right| \ge 2 \qquad (10)$$
The value $R_k$ can be computed using an online recurrence (note that $k$ ranges from $n - 1$ to 0)

$$R_k = 2 \times R_{k+1} + (a_k - b_k) \qquad (11)$$
The value $R_k$ only depends on its previous value and the current digits, thus an online comparator, as well as minimum or maximum algorithms, can be implemented with no online delay based on this computation.
An online comparator requires the value $R_k$ to be computed in each iteration, starting at $k = n - 1$ (MSD), until $|R_k| \ge 2$ or $k = 0$. At this point, the decision is determined based on the sign of $R_k$.
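A behavioural sketch of this comparator follows. It applies the recurrence of Eq. (11) digit by digit from the MSD and stops as soon as the accumulated value reaches magnitude two. It is a software model for illustration only, not the state machine of Ref. [14], and the function names are assumptions.

```python
def sd_value(digits):
    """Integer value of MSD-first SD digits, each in {-1, 0, 1}."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def online_compare(a_digits, b_digits):
    """Compare two equal-length SD numbers MSD first.

    Returns (sign, consumed): sign is +1 if A > B, -1 if A < B and 0 if
    they are equal; consumed is the number of digit pairs examined.
    """
    r = 0
    consumed = 0
    for a, b in zip(a_digits, b_digits):
        r = 2 * r + (a - b)  # Eq. (11): R_k = 2 * R_{k+1} + (a_k - b_k)
        consumed += 1
        if abs(r) >= 2:
            break  # the remaining digits can no longer change the sign
    sign = (r > 0) - (r < 0)
    return sign, consumed
```

For example, comparing the digits (1, 0, 0, 0) with (0, −1, 0, 0), i.e. 8 against −4, is decided after two digit pairs, whereas (1, 0, −1, 1) against (0, 1, 1, 1), two representations of 7, needs all four pairs to report equality.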
In Ref. [14] we evaluate different hardware designs for the comparator. Faster implementation is accomplished if the design is implemented as a state machine following the state-flow diagram represented in Fig. 4. Each state represents a possible value of $R_k$, i.e. equal, possibly greater, greater, possibly less or less. The transitions between states are determined by the digits $a_k$ and $b_k$. The design used in this paper is a simplification of this state machine.
4. FPGA implementation of the SAD processor
Fig. 5 presents the architecture of the design corresponding to the SAD processor. The absolute value of the differences is computed for each pair of pixels ($|c_{i,j} - r_{i,j}|$) and their summation is calculated on the $N^2$-operand OLA adder. The result is stored digit-by-digit in a SADc register and is
absolute value blocks and the comparator have no online delay, eight zeroes are required in this case.
The worst case occurs when a new minimum SAD is found, and then 21 cycles are required for the full process, where the last cycle is run to store SADc in SADr. However, as Fig. 6 shows, a new SAD computation can start after 16 cycles (after the eight digits and eight zeroes are introduced), in which case this period of time is the maximum between two consecutive SAD computations. This period is reduced if the candidate SAD is rejected earlier. In the best case, this happens after analysing the MSD of the candidate SAD, i.e. after nine cycles. Therefore, the number of cycles for a SAD computation and comparison is between 9 and 16 for a 4×4 SAD processor. This period ranges from 13 to 20 cycles for an 8×8 block, and from 17 to 24 cycles for a 16×16 block.
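The cycle ranges quoted above are consistent with a simple closed form: the best case is the adder-tree online delay plus one cycle, and the worst case is the adder-tree online delay plus the eight pixel digits. This formula is our reading of the quoted figures, not one stated explicitly in the text, and the function name and parameters are hypothetical.

```python
def sad_cycle_range(n, pixel_bits=8, adder_delay=2):
    """Return (best, worst) cycles between consecutive SAD computations
    for an n x n block, with n a power of two.

    Assumed model: tree delay = adder_delay * log2(n * n);
    best = tree delay + 1 (rejection on the MSD of the candidate SAD);
    worst = pixel digits + tree delay (all digits plus flushing zeroes).
    """
    levels = (n * n).bit_length() - 1  # exact log2 for powers of two
    tree_delay = adder_delay * levels
    return tree_delay + 1, pixel_bits + tree_delay


for size in (4, 8, 16):
    print(size, sad_cycle_range(size))
```

Under this model each doubling of the block side adds two levels to the adder tree and therefore four cycles to both ends of the range.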
The design has been implemented on the Xilinx SPARTAN-II and VIRTEX-II FPGA families for three different block sizes. For compilation, simulation and implementation, we use Xilinx ISE Series 5.2i. The main results of the implementation are shown in Table 2. The area/number of pixels ratio is relatively low, due to the serial-digit character of online computation. The maximum clock frequency is independent of block size because when the number of operators increases, only the number of parallel operations and the number of steps in the adder-tree increase. Although this value strongly depends on the technology used (as shown in Table 2), our results are very promising.
Table 3 shows how the area and delay are distributed among the different parts of the design for the 16×16 SAD processor. Note that the percentage given refers to the total number of look-up tables (LUTs) of the SAD processor. The maximum clock frequency of the global system is determined either by the delay of the comparator or the adder (although both values are similar), depending on the FPGA family since the basic cells are slightly different. The area is mainly occupied by the absolute value blocks and the adder-tree due to the large number of operands for this block size.
The general performance of these implementations is shown in Table 4, where the number of SADs per second and the number of frames per second (fps) are given for a 640×480 pixel frame.
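As a consistency check on the throughput figures, the VIRTEX-II entries of Table 4 match the 425 MHz clock divided by the worst-case number of cycles per SAD. That relationship is inferred here from the published numbers rather than stated in the text, and the function below is a hypothetical helper written under that assumption.

```python
def sads_per_second(clock_hz, n, pixel_bits=8, adder_delay=2):
    """Estimated SAD throughput for an n x n block (n a power of two),
    assuming one SAD completes every worst-case period."""
    levels = (n * n).bit_length() - 1  # adder-tree depth, log2(n * n)
    worst_cycles = pixel_bits + adder_delay * levels
    return clock_hz / worst_cycles


for size in (4, 8):
    rate = sads_per_second(425e6, size)
    print(f"{size}x{size}: {rate / 1e6:.2f} million SADs per second")
```

Because the estimate uses the worst-case period, any early rejection of a candidate can only raise the effective rate.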
5. Early termination of SAD calculation
Several video sequences have been processed to estimate the number of clock cycles saved. The parameters used are:
– 16×16 block size.
– 24×24 search window.
– Full-search block matching algorithm.
– 150 frames of each video have been evaluated.
The traditional model shown in Fig. 5 uses a final comparator for the SAD comparison. A new model is proposed (as shown in Fig. 7), which introduces several comparison levels into the adder tree to evaluate partial SAD information. It is possible that partial SADs of 64 pixels or 128 pixels of a 16×16 block are greater than the reference SAD; if so, the SAD calculation can be stopped before running the entire number of cycles, which cannot be done with the traditional model. Fig. 7 shows the new model for partial comparison. This property is demonstrated in the present section. The added cost of the new model is the area occupied by six new comparators. Nevertheless, each comparator only requires 6 LUTs, and together they involve less than 2% of the final area.
Fig. 8 shows the results obtained for three versions of the implemented algorithm: one with only one final comparator for ‘256 PIXELS PROCESSED LEVEL’ called C256P; one with a final comparator plus two comparators for ‘128 PIXELS PROCESSED LEVEL’ called C128P; and one with a final comparator plus two comparators for ‘128 PIXELS PROCESSED LEVEL’ and four comparators for ‘64 PIXELS PROCESSED LEVEL’ called C64P.
Fig. 5. SAD processor architecture.
The videos tested were:
– hall_monitor.mpeg
– flower.mpeg
– tennis.mpeg
– coast_guard.mpeg
The number of clock cycles saved for the C64P model ranges from 4.5 to 13%, in contrast to the conventional C256P model with only one comparator, which saves between 3.3 and 4.53% clock cycles. Introducing partial comparators allows us to improve the efficiency of the system.
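The decision logic behind the partial comparators can be sketched in software. Since every absolute difference is non-negative, a partial SAD over 64 or 128 pixels that already exceeds the reference SAD guarantees that the full SAD does too. The function below models only that decision for a 256-pixel block; it does not model the digit-serial timing of the hardware, and its name, its flat pixel layout and its return format are assumptions for this example.

```python
def sad_with_partial_checks(cand, ref, sad_ref):
    """Evaluate a 256-pixel block with checks at the 64-, 128- and
    256-pixel levels, mirroring the C64P arrangement.

    Returns (level, total): level is "64", "128" or "256" when the
    candidate is rejected at that level (total is None), or "accepted"
    together with the full SAD when it does not exceed sad_ref.
    """
    if len(cand) != 256 or len(ref) != 256:
        raise ValueError("expected two flat blocks of 256 pixels")
    diffs = [abs(c - r) for c, r in zip(cand, ref)]
    # Four partial SADs, one per group of 64 pixels.
    quarters = [sum(diffs[i:i + 64]) for i in range(0, 256, 64)]
    if any(q > sad_ref for q in quarters):
        return "64", None
    # Two partial SADs, one per group of 128 pixels.
    halves = [quarters[0] + quarters[1], quarters[2] + quarters[3]]
    if any(h > sad_ref for h in halves):
        return "128", None
    total = halves[0] + halves[1]
    if total > sad_ref:
        return "256", None
    return "accepted", total
```

A candidate whose mismatch is concentrated in one quarter of the block is rejected at the first level, which is the situation in which the extra comparators save the most cycles.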
6. Comparison with other works
In this section, we compare our design to other recent works, the main ones being Refs. [1,5,6,13].
The use of online arithmetic to compute the minimum SAD was proposed in Refs. [6,13] for ASIC implementation. An SD-adder was used for the computation of the differences, whereas our approach does not use such hardware since we merge this computation and the SD conversion, saving both time and area. Note that since a difference computation is required for each pixel, the amount of hardware saved is considerable.
In Refs. [6,13], SD numbers are first converted to sign-magnitude format, and then a standard comparison is used. The magnitude computation and comparison are performed on-the-fly in an MSD-first manner. Nevertheless, this comparator has an online delay of two and relatively high complexity. The main advantage of our design is that no online delay is required for the comparison operation, thus speeding up computation. Furthermore, our design is based on a simpler method involving less hardware cost.
The authors do not provide enough data regarding their ASIC implementation to enable us to perform a quantitative comparison in terms of area and delay. According to Refs. [6,13], the cycle time corresponds to one SD-adder plus one 2-to-1 MUX, one AND and one three-input OR gate. The cycle time of our design is only one SD-adder. Despite the fact that our design is intended for an FPGA implementation, we estimate that an ASIC implementation of our design would significantly improve on the performance of the design in Refs. [6,13].
In Ref. [1], the computation of the SAD for 16 pixels (SAD16), which is equivalent to a macroblock row for MPEG, is implemented on an FPGA device. The design is based on carry-save adders, which perform the computation in parallel over all the digits of the data. According to the authors, the design is synthesized using FPGA Express from Synopsys by targeting the FLEX20KE family from Altera, obtaining an area of 1699 LUTs and a maximum frequency of 197 MHz, with a latency of 19 cycles (96 ns). The estimated bandwidth for this design is 50.4 Gbps and the estimated throughput is 197 million SADs per second. The results of our implementation using the VIRTEX-II family are used for comparison, since it provides similar performance. The worst case for our equivalent design (4×4 or 16 pixels) occurs when a new minimum SAD is found, and then 21 cycles are required to complete the full process (Section 4); that is, to compute SAD16, compare the result with the previous minimum and store it; this lasts 49 ns at a frequency of 425 MHz. The bandwidth of our design is 27.2 Gbps, which is less than in Ref. [1] since data are serially transmitted. As shown in Table 4, the throughput is 26.56 million SADs per second, which is about seven times less than in Ref. [1]. On the other hand, our design only requires 241 LUTs, which is about one-seventh of the area in Ref. [1]. However, the current compression standard systems require 16×16 blocks (and also 8×8 for MPEG-4).

Table 3
Block                      Delay (ns)                 LUTs    %
                           SPARTAN-II    VIRTEX-II
Absolute difference        3.675         1.839        1024    52.7
Adder-tree                 4.325         2.353        768     39.5
Comparator                 4.887         2.048        6       0.3
Control and connectivity   –             –            146     7.5

Table 4
Number of SAD calculations and frames per second

Block size   Window size   SPARTAN-II                        VIRTEX-II
                           SAD (millions per second)   fps   SAD (millions per second)   fps
4×4          8×8           14.45                     77.08   26.56                     141.66
8×8          16×16         11.56                     30.50   21.25                     56.06
The authors of Ref. [1] state briefly how to extend the design to compute a 16×16 SAD in two ways. The first one is based on using 16 SAD16 units (one for each row) and a final adder tree. They estimate that 27 clock cycles are required.
Fig. 7. New comparators for partial SAD comparison.
pins. This design uses 7765 LCs and requires 29 cycles for a SAD computation at 380 MHz. This means that our design obtains better performance regarding time while requiring far less hardware.
We would like to emphasize that the previous comparisons refer to our worst case (16 cycles for 4×4 SAD and 24 cycles for 16×16 SAD). However, the best case means that after analysing the MSD of the candidate SAD we then reject it; this involves only nine cycles for 4×4 SAD and 17 cycles for 16×16 SAD (Section 4). Moreover, the timing results used for our design include the comparison operation (which involves a few more clock cycles due to carry propagation), whereas the designs in Refs. [1,5] do not include this operation time.
7. Conclusion
An FPGA implementation of a motion estimation core based on the computation of the minimum SAD has been presented in this paper. The proposed core can be integrated with a full-search algorithm or any more efficient search strategy. The computation is carried out by using online arithmetic. The different operations involved in the SAD computation have been efficiently adapted to online arithmetic, and a new comparator design with no online delay has been proposed. This allows us to implement the design on a single FPGA device. The proposed core can speed up the computation by early termination of the SAD calculation when the candidate SAD is bigger than the current reference SAD. Furthermore, the FPGA implementation of the design makes it possible to reconfigure the hardware to deal with 8×8 and 16×16 pixel blocks, according to the MPEG-4 standard requirements.
algorithms using systolic arrays, IEEE Trans. Circuits Syst. Video Technol. 6 (1996) 67–73.
[5] S. Wong, B. Stougie, S. Cotofana, Alternatives in FPGA-based SAD implementations, Proceedings of the IEEE International Conference on Field-Programmable Technology, 2002, pp. 449–452.
[6] C. Su, C. Jen, Motion estimation using MSD-first processing, IEE Proc. Circuits Devices Syst. 150 (2) (2003) 124–133.
[7] M. Ercegovac, T. Lang, On-line arithmetic for DSP applications, 32nd Midwest Symposium on Circuits and Systems, 1989, pp. 365–368.
[8] M.D. Ercegovac, T. Lang, On-line arithmetic: a design methodology and applications in digital signal processing, in: VLSI Signal Processing III, 1988, pp. 252–263 (Reprinted in E.E. Swartzlander, Computer Arithmetic, vol. 2, IEEE Computer Society Press Tutorial, Los Alamitos, CA, 1990).
[9] D. Lau, A. Schneider, M.D. Ercegovac, J. Villasenor, FPGA-based structures for on-line FFT and DCT, Proceedings of the Seventh IEEE Symposium Field-Programmable Custom Computing Machines, 1999, pp. 310–311.
[10] S. Rajagopal, J. Cavallaro, On-line arithmetic for detection in digital communication receivers, 15th IEEE Symposium on Computer Arithmetic, 2001, pp. 257–265.
[11] R. McIlhenny, M.D. Ercegovac, On the design of an on-line FFT network for FPGA’s, 33rd Asilomar Conference on Signals, Systems, and Computers, vol. 2, 1999, pp. 1484–1488.
[12] A. Avizienis, Signed digit number representation for fast parallel arithmetic, IRE Trans. Electron. Comput. EC-10 (1961) 389–400.
[13] C. Su, C. Jen, Motion estimation using on-line arithmetic, IEEE International Symposium on Circuits and Systems (ISCAS-2000), May 28–31, 2000, pp. 683–686.
[14] J. Hormigo, J. Olivares, J. Villalba, I. Benavides, New on-line comparator with no on-line delay, Eighth World Multiconference on Systemics, Cybernetics and Informatics, 2004.
[15] J. Villalba, J. Hormigo, Analysis of the mistakes in the paper motion estimation using MSD-first processing, IEE Circuits Devices Syst. 150 (2) (2003), Internal Report Department of Computer Architecture, University of Málaga, December 2004, http://www.ac.uma.es/cgi-bin/htgrep/pub-search.cgi?isindexZVillalba