14-07-2012, 02:47 PM
Design and Implementation of a Field Programmable CRC Circuit Architecture
crc.doc (Size: 788 KB / Downloads: 66)
1. INTRODUCTION
Cyclic redundancy check (CRC) is an error detecting code that is widely used to detect corruption in blocks of data that have been transmitted or stored. A standalone intellectual property (IP) core is ideal for accelerating CRC computation in many network and server applications. Hardware configurability that will allow unrestricted CRC sizes and polynomials to be deployed enables a wide range of network transmission, storage and security applications to be supported at a low cost. The cost of chip design continues to increase due to factors such as high mask and respin costs. Next generation system-on-chip (SoC) designs are highly expensive and therefore must be configurable to a range of applications and future proof where either product updates or protocol migration can occur. Adding flexibility through in-field hardware configurability is a key method that enables the cost of designs to be reduced.
In this paper, we derive a fully field programmable, parallel architecture for a CRC computation circuit. The objective was to explore a domain specific programmable architecture capable of supporting 5 GB/s line rates at a minimal area cost. The resulting architecture is able to support all types and sizes of CRC polynomial, for all types of protocols and data encryption. Furthermore, the circuit can handle a variable number of input octets in runtime for byte orientated variable sized protocols. An embedded self-reconfiguration controller allows any CRC function to be configured, while minimizing programming time and complexity. This paper explores the architecture and functions of the field programmable CRC computation circuit and analyses its performance when implemented using standard cell UMC 130-nm technology.
II. CYCLIC REDUNDANCY CHECK
Data integrity is imperative for many network protocols, especially data-link layer protocols. [1]Techniques using parity codes and Hamming codes can be used for data verification, but CRC is the preferred and most efficient method used for detecting bit errors produced from medium related noise. For example, Ethernet uses a 32-bit CRC polynomial for error detection. Data storage is another area where CRC error detection is becoming increasingly important. iSCSI [2] implementations that utilize the TCP/IP protocol to implement Storage Area Networks (SANs) require error detection to be deployed. These operate using multi-gigabit connection speeds and thus require CRC checks to be executed at high speed as well. In such systems, it is becoming common to offload TCP/IP operations to hardware. CRC computation faces similar computational overhead constraints in software and is thus an ideal candidate for offloading to specialized hardware. As this is an evolving area of research, new standards and protocols are likely to emerge in the short and medium term future. The ability of CRC implemented in hardware to be reconfigurable to handle new protocols will offer a key advantage in this fast developing area.
The fact that the reconfigurable CRC circuit that has been implemented can quickly switch between any polynomial gives it a key Advantage over the other circuits referenced, in terms of flexibility and ease of upgrade for new and emerging applications and standards. Advanced NPUs have to support most communications protocols, while maintaining a throughput performance of 1–10 GB/s. Such performance cannot be achieved without a highly flexible CRC offload engine capable of parallel computation. iSCSI in particular is a good illustration of an application where new standards or more suitable polynomials can emerge. None of the publications referenced here offer the ability to reconfigure hardware in the field to operate with an entirely new CRC polynomial.
A. CRC Related Background
A large number of CRC polynomials of various lengths are available to use over a range of applications. Reference [3] investigates a total of 48 polynomials, ranging in length from 3- to 16-bits; those are suitable for embedded network applications utilizing CRC error detection. The paper shows how the various polynomials have been assessed for their ability to detect error patterns in messages. It shows that for different data word lengths, different CRC polynomials can be more suitable than others. This assessment is carried out based on maximum hamming distances. Similarly [4] investigates a number of 32-bit CRC Polynomials, all suitable for network applications such as Ethernet and iSCSI.
CRC functions have been widely implemented in software using methods such as lookup tables [5] and shift and addition [6]. Further research has investigated hardware architectures that can better exploit parallelism. The fundamental work on parallel CRC computation was introduced by Pei in 1992 [7]. Braun [8] addressed the hardware mapping problem of the parallel CRC algorithm by introducing a slightly different matrix computation technique than Pei. Braun incorporated Pre- and post- CRC computation circuits to achieve a 64-bit checksum word at 450 Mbps using FPGA technology in 1996. [9] Addresses a technique that allows pipelining to increase the circuit speed, independent of the underlying technology.
Reference [10] derives a VLSI implementation of a 64-bit CRC generator circuit based on Galois field arithmetic and look-ahead blocks. With an eighth-order look-ahead function this circuit can operate at 100 MHz despite the dated 0.6-micron technology. The circuits are flexible in terms of the number of input bits processed at a time, up to 32-bits, but they are restricted to using one CRC polynomial.
Reference [11] addresses the problem of processing variable sized packets in parallel by simply duplicating circuits and multiplexing between multiple custom implementations as required, i.e., if processing 32 bits and the last cycle of data is only 8 bits wide then this implementation multiplexes the data from a 64-bit circuit to an 8-bit circuit. The research details the VLSI implementation of a CRC-64 circuit for Ethernet. A standard cell and full custom implementation are presented using 180- and 350-nm technology respectively; operating at 1.09 GHz and 625 MHz. The circuits presented are highly customized and targeted for the CRC-64 polynomial selected. Although they operate very fast, the designs are not flexible or adaptable as they are intended for a single polynomial.
[12] Describes a pipelined and parallel implementation for an FPGA based CRC function. The level of parallelism can be varied between 8- to 32-bits and claims performance results of 1 to 4 GB/s (depending on the level of parallelism selected). Any polynomial can be selected before synthesis, but not after.
[13] Describes the derivation of VHDL code with a generic construct that allows a designer to synthesize CRC circuits for any desired polynomial of length up to 64 bits. Word widths of 8, 12, 16, 32 and 64 bits have been analyzed. The research concentrates on generating code in a generic style that includes parallelism in its structure, which is based on the linear feedback shift register (LFSR) presented by Pei. While this generic description is useful in terms of design reusability, it is only configurable presynthesis, after which the hardware is fixed and the CRC function is not configurable.
[14] Uses a recursive mathematical formula to derive parallel CRC circuits that can be generated automatically. The examples Use MATLAB code to generate the VHDL code for the circuit. The polynomial and number of bits to be processed in parallel can be specified separately. The method is flexible and is likely to save both time and cost in the design phase, yet like the other circuits, this one will be fixed to a single polynomial as the circuit itself is inflexible post-synthesis.
Reference [15] is a commercially available core that operates on FPGA. Again, this uses a fixed CRC polynomial that cannot be reconfigured after deployment. The CRC-64 core is able to support 10/40 GB/s line speeds by utilizing 64-/256-bit data buses, respectively. It is the wide data buses that allow this performance to be achieved. However, using wide input buses adds complexity to the CRC calculation where the end of a word does not fully fill the input bus. If the end of a word is 16-bits wide then the CRC must be computed for 16 bits, this cannot be done using a 32-bit input configuration.
Reference [16] presents a software implementation of the iSCSI protocol that includes implementing CRC error detection, which is recognized as the key bottleneck in the system. The overall implementation operates on a 1.7 GHz Pentium M processor, which supports 3.6 GB/s.
None of the aforementioned state-of-the-art options support full in-field configuration flexibility at high speed specifically on hardware. Some allow flexibility in the design phase and others offer very high line-speed performance, however none offer high line-speed with full flexibility, such as the support of different data-path widths and CRC generator polynomials. Although the software option [16] is likely to be very flexible, it comes at the expense of a Pentium processor.
The next section outlines the derivation of a CRC circuit implementation that fulfils the outlined flexibility criteria.
III. DERIVATION AND IMPLEMENTATION OF THE FIELD PROGRAMMABLE CRC COMPUTATION CIRCUIT
CRC is a polynomial-based block coding method for detecting errors in blocks or frames of data. A set of check digits is computed for each frame scheduled for transmission over a medium that may introduce error and is appended to its end. The computed check digits are known as the frame check sequence (FCS). A CRC value is calculated as a remainder of the modulo-2 division of the original transmitted data with a specific CRC generator polynomial. For example, Ethernet uses the 32-bit polynomial value
To find the FCS, first a number of zeroes equal to the number of FCS digits to be generated are appended to the message M(x)This is equivalent to multiplying M(x) by 2n where n is the number of FCS digits. This value is then divided by the generator polynomial G(x), which contains one more digit than the FCS. The division uses modulo-2 arithmetic, where each digit is independent of its neighbor and numbers are not carried or borrowed, thus addition and subtraction are performed via an exclusive-OR (XOR) function. The remainder R(x) is appended to the end of the message before transmission. At the receiver, the message plus the FCS is divided by the same polynomial. If the remainder is zero then it can be assumed that no error has occurred.
The field programmable CRC design is based on the fundamentals established in [7], which derives the_ matrix (1) that effectively forms the logic array of the CRC calculator circuit. Due to space restrictions the full derivation is not included here, but can be found in the reference
Each value in G is multiplied (ANDed) with the corresponding value in T. The results are XORed together, producing a D matrix of 0’s or 1’s. The position of the 1’s in D determines the position of XOR gates within the logic array while is the width of the input port in bits. Enabling programmability for parallel CRC computation requires the D matrix to be configurable for all known generator polynomials G(X).Furthermore, configuration logic for the configurable XOR array is required to adjust the input and CRC sizes.
Fig. 1 shows a diagram illustrating the architecture of the fully programmable, reconfigurable CRC circuit. The circuit is composed of six main components, the programmable input and feedback multiplexers, the configurable XOR array, the array configuration circuit and the CRC
Fig. 1. Field programmable CRC architecture.
Configuration processor. The top of the diagram shows the logic associated with the CRC cell array. The input data enters the array down the columns and the outputs are formed along the rows. The current CRC value is held in a register at the array output, which is fed back and XORed with the input data of the next clock cycle as part of the CRC computation process. The outputs are then stored in the registers for the next clock cycle.
The CRC configuration parameters are passed via a microprocessor interface. The desired CRC polynomial G(X) and the input port size are stored in registers. By selecting a signal Generate Matrix, a process is initiated that computes the configuration data and configures the CRC cell array with the required data. The configuration circuitry computes and configures one row of data every clock cycle. This reduces the memory required since the calculation of each row of the D matrix is based on the result of the calculation of the previous row. The configuration data is broadcast to every column of the array, with one column enabled for configuration at a time, via one-hot encoding using a counter. When the CRC cells are fully configured, the configuration Processor is not used. The Port Size Configure and CRC Size Configure signals control a set of multiplexers that enable/disable input and CRC feedback data to cater for the size of the CRC polynomial and the size of the input port. The Port Size Configure signal is also responsible for reconfiguring the circuit to process various input word sizes, e.g., if the port size is configured as 64-bit and the last cycle of the payload data contains only 32-bits to be processed, the Port Size Configure signal; switches input bits 32 to 63 to the “0” input of each multiplexer; switches the bottom 16 multiplexers to the previous CRC data input at the left-hand side of the array and finally routes bits 32 to 63 of the previous CRC data to rows 32 to 63 of the array.
The configurable XOR array is comprised of interconnected cells, corresponding to the D matrix. Each cell can be configured as an XOR gate (a “1” in D), or as a basic input output connection (a “0” in
D). Fig. 2 shows a diagram of one of the field programmable elements that facilitate this function in the array. The data-path can be configured so that the output will XOR the two inputs, or simply output Input 1. The control-path contains a configuration register which selects the data-path function and is programmed via the Config Data input when the Config Enable input is set high.
The computation of the D matrix is an iterative process, where the computation of each row is based on the result from the previous row. Therefore, one row is computed every clock cycle and configuring the whole matrix requires 33 clock cycles. This is the minimum time possible due to the feedback required. Fig. 3 show the logic used to compute a D matrix row in one clock cycle for an example 4-bit CRC polynomial, using the T matrix data and the current D matrix row signal. The diagram shows how a single bit of the D matrix is computed on the first loop using the rows and columns shaded grey in the D and T matrices. Subsequent loops progress through the remaining rows and columns.
For a CRC-64 or higher, two or more of the data-path circuits (top of Fig. 1) are combined in a diagonal cascade with additional feedback wiring, so that column and row 0 of the second circuit become column and row 64 of the CRC-64. Similarly for circuits smaller than 64-bit, e.g., 32-bit, the XOR array is configured so that four CRC-8’s run diagonally from the top left. The XOR array and wiring is used to connect the four blocks together with the routing necessary to compute the CRC-8’s. This means 64-bits can be calculated in parallel for all CRC sizes and all CRC sizes support the same line-speed.
Both the programmable architecture and an optimized implementation of the Ethernet CRC-64 polynomial were synthesized for the Altera Stratix II FPGA, to allow the cost of full programmability to be established. The results showed that the programmable circuit operated at 117 MHz, while the Ethernet polynomial circuit operated at 233 MHz, nearly twice the frequency. The increase in area cost was approximately 700%, which is due to the introduction of feedback logic, the configurable array and the reconfiguration controller. Overall this is a significant increase in hardware, but the circuit now has significantly improved capabilities to operate an extensive range of CRC polynomials and widths instead of just one.
IV. SYNTHESIS RESULTS AND PERFORMANCE ANALYSIS
The circuit has been described in VHDL and synthesized for a 64x64 cell array using Synopsis Physical Compiler and Cadence SoC Encounter EDAtools. The post layout synthesis results generated using UMC 130-nm technology libraries are shown in Table I. Fig. 4 shows the circuit layout.
Table II compares the circuit with other designs included in Section II. The flexibility of [13]–[15] is limited to configuration options in the design phase but the circuits themselves are not reprogrammable.[12] allows flexibility by multiplexing between predefined CRC circuits, but does not allow full reconfigurability. [16] Utilizes software, which is likely to be fully reconfigurable in-field; however this comes at the rather high expense of operating on a Pentium processor. On their 1.7 GHz test-bench, the CRC computation used 40%–50% of the processor overhead, which also represents a high cost in terms of power consumption. Given that CRC computation can account for 29% of the total computational cost in storage area networks [17], the power dissipation of the field programmable CRC circuit, which is less than 6 mW, can be considered very low. Unfortunately, power figures for the other references are not available for comparison.
Although it is difficult to directly compare performance and area parameters for different technologies, some valid comparisons can be made. Comparing [14] with our initial Ethernet polynomial implementation on FPGA, the number of LUTs used are of the same order, so it can be expected that the same trade-offs that apply in our test implementations can also be directly applied here. Furthermore, to add flexibility to [14] would require taking the FPGA offline for reconfiguration, whereas the field programmable CRC circuit can be reconfigured in less than 1 _s, even on FPGA.
In terms of area, the circuit of [11] is approximately 20 times larger than the programmable CRC circuit, with a similar throughout speed.
Notes:
[11] Normally 64-bit parallelism, selects 8, 16, 24 or 32-bit if last word of data is not 64-bits.
[14] 350 nm platform example implementation: 4.38 Gb/s when utilizing 64-bits in parallel.
[15] 64-bit/256-bit parallelism.
[16] Specifically supports iSCSI at this line speed, 1.7 GHz Pentium M processor.
The difference in area cost could be expected to be around 7:1 purely on account of the different process technologies (350/130 nm). The difference in area cost is therefore comparatively insignificant, given that the programmable CRC has significantly more flexibility.
The total area of 0.15 mm