26-09-2016, 11:17 AM
1.1 Introduction of VLSI
VLSI stands for "Very Large Scale Integration". This is the field that involves packing more and more logic devices into smaller and smaller areas. With VLSI, circuits that would once have taken boards full of space can now fit into a space a few millimetres across! VLSI circuits are everywhere: your computer, your car, your brand-new state-of-the-art digital camera, your cell phone, and so on. All this involves a lot of expertise on many fronts within the same field, which we will look at in later sections. The way familiar blocks like latches and gates are implemented is different from what students have seen so far, but the behaviour remains the same. All this miniaturization brings new things to consider: a lot of thought has to go into the actual implementation as well as the design.
Large, complicated circuits running at very high frequencies have one big problem to tackle: the delays in the propagation of signals through gates and wires, even across areas only a few micrometres wide. The operating speeds are so high that, as the delays add up, they can actually become comparable to the clock period.
Another effect of high operating frequencies is increased power consumption. This has a two-fold effect: devices drain batteries faster, and heat dissipation increases. Coupled with the fact that surface areas have decreased, heat poses a major threat to the stability of the circuit itself.
Laying out the circuit components is a task common to all branches of electronics. What is special in our case is that there are many possible ways to do this: there can be multiple layers of different materials on the same silicon, there can be different arrangements of the smaller parts for the same component, and so on. The choice between these alternatives is determined by how we choose to lay out the circuit components. Layout also affects the fabrication of VLSI chips, making the components either easy or difficult to implement on the silicon.
1.2 Introduction to VHDL
A digital system can be described at different levels of abstraction and from different points of view. An HDL should faithfully and accurately model and describe a circuit, whether already built or under development, from either the structural or behavioral views, at the desired level of abstraction. Because HDLs are modeled after hardware, their semantics and use are very different from those of traditional programming languages.
1.3 Limitations of traditional programming languages
There is a wide variety of computer programming languages, from Fortran to C to Java. Unfortunately, they are not adequate for modeling digital hardware. To understand their limitations, it is beneficial to examine the development of a language. A programming language is characterized by its syntax and semantics. The syntax comprises the grammatical rules used to write a program, and the semantics is the “meaning” associated with language constructs. When a new computer language is developed, the designers first study the characteristics of the underlying processes and then develop syntactic constructs and their associated semantics to model and express these characteristics.
Most traditional general-purpose programming languages, such as C, are modeled after a sequential process. In this process, operations are performed in sequential order, one operation at a time. Since an operation frequently depends on the result of an earlier operation, the order of execution cannot be altered at will. The sequential process model has two major benefits. At the abstract level, it helps the human thinking process to develop an algorithm step by step. At the implementation level, the sequential process resembles the operation of a basic computer model and thus allows efficient translation from an algorithm to machine instructions.
The characteristics of digital hardware, on the other hand, are very different from those of the sequential model. A typical digital system is normally built from smaller parts, with customized wiring that connects their input and output ports. When a signal changes, the parts connected to it are activated and a set of new operations is initiated accordingly. These operations are performed concurrently, and each operation takes a specific amount of time, representing the propagation delay of a particular part, to complete. After completion, each part updates the value of its corresponding output port. If the value changes, the output signal will in turn activate all the connected parts and initiate another round of operations. This description shows several unique characteristics of digital systems, including the connections of parts, concurrent operations, and the concepts of propagation delay and timing. The sequential model used in traditional programming languages cannot capture these characteristics, so there is a need for special languages (HDLs) designed to model digital hardware.
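These concurrent, delay-driven semantics can be sketched in ordinary code. The event-driven simulator below is a hypothetical illustration (the gate table, the delay model, and the SR-latch example are all invented for this sketch), not how any real HDL simulator is organized:

```python
import heapq

# Hypothetical event-driven sketch of HDL-style semantics: a gate re-evaluates
# whenever one of its inputs changes, and its new output takes effect after
# that gate's propagation delay.
def simulate(gates, initial, until=100):
    """gates: name -> (function, input names, delay). Returns signal values."""
    values = dict(initial)
    events = [(0, name) for name in gates]   # evaluate every gate once at t=0
    heapq.heapify(events)
    while events:
        t, name = heapq.heappop(events)
        if t > until:
            break
        fn, inputs, _ = gates[name]
        new = fn(*(values[i] for i in inputs))
        if values.get(name) != new:
            values[name] = new
            # A changed signal concurrently activates every gate that reads it.
            for g, (_, ins, delay) in gates.items():
                if name in ins:
                    heapq.heappush(events, (t + delay, g))
    return values

# Usage: a cross-coupled NAND pair (SR latch, active-low inputs) settling.
gates = {
    "q":  (lambda a, b: 1 - (a & b), ("s", "qb"), 1),
    "qb": (lambda a, b: 1 - (a & b), ("r", "q"), 1),
}
final = simulate(gates, {"s": 1, "r": 0, "q": 0, "qb": 0})
```

Note how the simulation only terminates once no signal changes any more, mirroring a circuit settling after a round of concurrent updates.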
VHDL includes facilities for describing logical structure and function of digital systems at a number of levels of abstraction, from system level down to the gate level. It is intended, among other things, as a modeling language for specification and simulation. We can also use it for hardware synthesis if we restrict ourselves to a subset that can be automatically translated into hardware.
VHDL arose out of the United States government’s Very High Speed Integrated Circuits (VHSIC) program. In the course of this program, it became clear that there was a need for a standard language for describing the structure and function of integrated circuits (ICs). Hence the VHSIC Hardware Description Language (VHDL) was developed. It was subsequently developed further under the auspices of the Institute of Electrical and Electronic Engineers (IEEE) and adopted in the form of the IEEE Standard 1076, Standard VHDL Language Reference Manual, in 1987. This first standard version of the language is often referred to as VHDL-87.
After the initial release, various extensions were developed to facilitate various design and modeling requirements. These extensions are documented in several IEEE standards:
i. IEEE standard 1076.1-1999, VHDL Analog and Mixed Signal Extensions (VHDL-AMS): defines the extension for analog and mixed-signal modeling.
ii. IEEE standard 1076.2-1996, VHDL Mathematical Packages: defines extra mathematical functions for real and complex numbers.
iii. IEEE standard 1076.3-1997, Synthesis Packages: defines arithmetic operations over a collection of bits.
iv. IEEE standard 1076.4-1995, VHDL Initiative Towards ASIC Libraries (VITAL): defines a mechanism to add detailed timing information to ASIC cells.
v. IEEE standard 1076.6-1999, VHDL Register Transfer Level (RTL) Synthesis: defines a subset that is suitable for synthesis.
vi. IEEE standard 1164-1993, Multivalue Logic System for VHDL Model Interoperability (std_logic_1164): defines new data types to model multivalue logic.
vii. IEEE standard 1029.1-1998, VHDL Waveform and Vector Exchange to Support Design and Test Verification (WAVES): defines how to use VHDL to exchange information in a simulation environment.
1.4 Field-Programmable Gate Array
A field-programmable gate array (FPGA) is a semiconductor device that can be configured by the customer or designer after manufacturing—hence the name "field-programmable". To program an FPGA, one specifies how the chip should work with a logic circuit diagram or with source code in a hardware description language (HDL). FPGAs can be used to implement any logical function that an application-specific integrated circuit (ASIC) could perform, but the ability to update the functionality after shipping offers advantages for many applications.
FPGAs contain programmable logic components called "logic blocks", and a hierarchy of reconfigurable interconnects that allow the blocks to be "wired together"—somewhat like a one-chip programmable breadboard. Logic blocks can be configured to perform complex combinational functions, or merely simple logic gates like AND and XOR. In most FPGAs, the logic blocks also include memory elements, which may be simple flip-flops or more complete blocks of memory.
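The "logic block as lookup table" idea can be made concrete with a small sketch. The `LUT` class below is purely hypothetical (no vendor's configuration format looks like this); it only shows how reprogramming the stored bits changes the function while the structure stays the same:

```python
# Illustrative sketch: a k-input logic block realized as a lookup table,
# the way FPGA logic blocks implement arbitrary combinational functions.
class LUT:
    def __init__(self, k, truth_table):
        assert len(truth_table) == 2 ** k
        self.k = k
        self.table = truth_table   # table[i] = output for input pattern i

    def __call__(self, *bits):
        index = 0
        for b in bits:             # pack the input bits into a table index
            index = (index << 1) | b
        return self.table[index]

# "Programming" the same 2-input block as AND, then as XOR,
# changes only the stored bits, not the hardware.
and2 = LUT(2, [0, 0, 0, 1])
xor2 = LUT(2, [0, 1, 1, 0])
```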
For any given semiconductor process, FPGAs are usually slower than their fixed ASIC counterparts. They also draw more power, and generally achieve less functionality using a given amount of circuit complexity. But their advantages include a shorter time to market, ability to re-program in the field to fix bugs, and lower non-recurring engineering costs. Vendors can also take a middle road by developing their hardware on ordinary FPGAs, but manufacture their final version so it can no longer be modified after the design has been committed.
Field Programmable Gate Array (FPGA) devices were introduced by Xilinx in the mid-1980s. They differ from CPLDs in architecture, storage technology, number of built-in features, and cost, and are aimed at the implementation of high performance, large-size circuits.
The basic architecture of an FPGA is illustrated in figure 2. It consists of a matrix of CLBs (Configurable Logic Blocks), interconnected by an array of switch matrices.
The internal architecture of a CLB is different from that of a PLD. First, instead of implementing SOP expressions with AND gates followed by OR gates (as in SPLDs), its operation is normally based on a LUT (lookup table). Moreover, in an FPGA flip-flops are much more abundant than in a CPLD, allowing the construction of more sophisticated sequential circuits. Besides JTAG support and interfaces to diverse logic levels, other features are also included in FPGA chips, like SRAM memory, clock multiplication (PLL or DLL), a PCI interface, etc. Some chips also include dedicated blocks, like multipliers, DSPs, and microprocessors.
Another fundamental difference between an FPGA and a CPLD refers to the storage of the interconnects. While CPLDs are non-volatile (that is, they make use of antifuse, EEPROM, Flash, etc.), most FPGAs use SRAM, and are therefore volatile. This approach saves space and lowers the cost of the chip because FPGAs present a very large number of programmable interconnections, but requires an external ROM. There are, however, non-volatile FPGAs (with antifuse), which might be advantageous when reprogramming is not necessary.
FPGAs can be very sophisticated. Chips manufactured with state-of-the-art 0.09 µm CMOS technology, with nine copper layers and over 1,000 I/O pins, are currently available. A few examples of FPGA packages are illustrated in figure A6, which shows one of the smallest FPGA packages on the left (64 pins), a medium-size package in the middle (324 pins), and a large package (1,152 pins) on the right. Several companies manufacture FPGAs, including Xilinx, Actel, Altera, and QuickLogic.
Notice that all Xilinx FPGAs use SRAM to store the interconnects, so they are reprogrammable but volatile (thus requiring an external ROM). On the other hand, Actel FPGAs are non-volatile (they use antifuse) but non-reprogrammable (except one family, which uses Flash memory). Since each approach has its own advantages and disadvantages, the actual application will dictate which chip architecture is most appropriate.
[1] Content-addressable memory (CAM) circuits and architectures: A tutorial and survey, K. Pagiamtzis and A. Sheikholeslami
We survey recent developments in the design of large-capacity content-addressable memory (CAM). A CAM is a memory that implements the lookup-table function in a single clock cycle using dedicated comparison circuitry. CAMs are especially popular in network routers for packet forwarding and packet classification, but they are also beneficial in a variety of other applications that require high-speed table lookup. The main CAM-design challenge is to reduce the power consumption associated with the large amount of parallel active circuitry, without sacrificing speed or memory density. In this paper, we review CAM-design techniques at the circuit level and at the architectural level. At the circuit level, we review low-power matchline sensing techniques and searchline driving approaches. At the architectural level, we review three methods for reducing power consumption.
[2]Nearly-optimal associative memories based on distributed constant weight codes V. Gripon and C. Berrou
A new family of sparse neural networks achieving nearly optimal performance has been recently introduced. In these networks, messages are stored as cliques in clustered graphs. In this paper, we interpret these networks using the formalism of error-correcting codes. To achieve this, we introduce two original codes, the thrifty code and the clique code, both sub-families of binary constant-weight codes. We also provide the networks with an enhanced retrieving rule that enables a property of answer correctness and improves performance.
Index Terms—associative memory, classification, constant weight codes, clique code, thrifty code, sparse neural networks.
One can split the family of memories into two main branches. The first contains indexed memories. In an indexed memory, data messages are stored at specific indexes. Thus, messages do not overlap, and directly accessing a stored message requires knowing its address. It is a convenient paradigm as long as the data itself is not useful a priori. For example, a postman just needs to know your address to bring you mail, and does not care about the content of the mail nor the colour of your front door. The second branch is that of associative memories. An associative memory is such that a previously learned message can be retrieved from part of its content. It is tricky to define how large the “part” of the content necessary to retrieve the data must be. A reasonable definition is to consider this “part” to be close to the minimum amount of data required to unambiguously address a unique previously learned message. Contrary to indexed memories, messages are likely to overlap one another in associative memories.
This paradigm is convenient when trying to find data from other data. For example, a detective might be interested in remembering the name of the woman he questioned who owns a car of the same brand as the murderer's. It is obviously possible to simulate one memory using the other, given unlimited computational power. Indeed, to obtain an associative memory, one can read all the stored messages in an indexed memory, compare them with the partial message given as input, and select the one that best matches the input.
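The contrast between the two branches can be shown with a toy sketch. The messages and field layout below are invented for illustration; the point is that an associative lookup recovers a whole stored message from a fragment of its content, where an indexed memory would need the address:

```python
# Toy sketch of the two paradigms: an indexed memory needs the address,
# an associative memory retrieves a stored message from part of its content.
indexed = {0: "red-ford", 1: "blue-ford", 2: "red-fiat"}

def associative_lookup(memory, partial):
    """Return stored messages consistent with the known fragment
    (None marks a field whose value is unknown)."""
    matches = []
    for msg in memory.values():
        fields = msg.split("-")
        if all(p is None or p == f for p, f in zip(partial, fields)):
            matches.append(msg)
    return matches

# Knowing only the colour "blue" is enough to recover the full message,
# because it unambiguously addresses one stored entry.
result = associative_lookup(indexed, ("blue", None))
```

Note that the simulation of one memory by the other, as the text describes, is exactly what this function does: it scans every indexed entry and keeps the best-matching ones.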
[3]Architecture and implementation of an associative memory using sparse clustered networks H. Jarollahi, N. Onizawa, V. Gripon, and W. J. Gross,
Associative memories are alternatives to indexed memories that when implemented in hardware can benefit many applications such as data mining. The classical neural network based methodology is impractical to implement since in order to increase the size of the memory, the number of information bits stored per memory bit (efficiency) approaches zero. In addition, the length of a message to be stored and retrieved needs to be the same size as the number of nodes in the network causing the total number of messages the network is capable of storing (diversity) to be limited. Recently, a novel algorithm based on sparse clustered neural networks has been proposed that achieves nearly optimal efficiency and large diversity. In this paper, a proof-of-concept hardware implementation of these networks is presented. The limitations and possible future research areas are discussed.
[4] A low-power content-addressable memory (CAM) using pipelined hierarchical search scheme, K. Pagiamtzis and A. Sheikholeslami
This paper presents two techniques to reduce power consumption in content-addressable memories (CAMs). The first technique is to pipeline the search operation by breaking the match-lines into several segments. Since most stored words fail to match in their first segments, the search operation is discontinued for subsequent segments, hence reducing power. The second technique is to broadcast small-swing search data on less capacitive global search-lines, and only amplify this signal to full swing on a shorter local search-line. Since few match-line segments are active, few local search-lines will be enabled, again saving power. We have employed the proposed schemes in a 1024 x 144-bit ternary CAM in 1.8-V 0.18-μm CMOS, achieving an overall power reduction of 60% compared to a nonpipelined, nonhierarchical architecture. The ternary CAM achieves a 7-ns search cycle time at 2.89 fJ/bit/search.
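The early-termination effect of segmenting the match-lines can be sketched behaviourally. The code below is an invented illustration, counting evaluated segments as a crude stand-in for the dynamic energy the paper measures; it is not the circuit technique itself:

```python
# Behavioural sketch of segmented match-lines: the search proceeds segment by
# segment and stops at the first mismatching segment, so most words only ever
# activate their first segment.
def segmented_search(stored_words, search_word, seg_len):
    segments_activated = 0
    matches = []
    for addr, word in enumerate(stored_words):
        for start in range(0, len(word), seg_len):
            segments_activated += 1   # this segment was precharged/evaluated
            if word[start:start + seg_len] != search_word[start:start + seg_len]:
                break                 # miss: all later segments stay idle
        else:
            matches.append(addr)      # every segment matched
    return matches, segments_activated

words = ["0101", "1111", "0100", "0001"]
matches, energy = segmented_search(words, "0100", seg_len=2)
```

In this toy example only 6 of the 8 segments are evaluated; in a real CAM, where almost all words miss early, the fraction of idle segments is far larger.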
[5] Use of selective precharge for low power on the match lines of content-addressable memories, C. Zukowski and S.-Y. Wang
With current architectures, CAMs typically take more area and power, and sometimes more delay, compared to location-addressed memories of the same capacity. If these penalties are traded against each other, there will be many new applications for CAMs that are not feasible or practical today. Our work aims to combine various CAM design methods from the literature, each of which improves only a single aspect of the problem, with our further improvements, so as to address multiple problems simultaneously and meet the requirements of today's new applications. In this report, we overview the most effective methods found in our survey at a circuit level and an architectural level. By combining the current-race scheme with precomputation and selective precharge, we can achieve considerable power savings.
EXISTING AND PROPOSED SYSTEM
3.1 EXISTING SYSTEM
A content-addressable memory (CAM) is a type of memory that can be accessed using its contents rather than an explicit address. In order to access a particular entry in such memories, a search data word is compared against previously stored entries in parallel to find a match. Each stored entry is associated with a tag that is used in the comparison process. Once a search data word is applied to the input of a CAM, the matching data word is retrieved within a single clock cycle if it exists. This prominent feature makes CAM a promising candidate for applications where frequent and fast look-up operations are required, such as in translation look-aside buffers (TLBs), network routers, database accelerators, image processing, parametric curve extraction, Hough transformation, Huffman coding/decoding, virus detection, Lempel–Ziv compression, and image coding.
A new family of associative memories based on sparse clustered networks (SCNs) has been recently introduced, and implemented using field-programmable gate arrays (FPGAs). Such memories make it possible to store many short messages instead of a few long ones, as in the conventional Hopfield network, with a significantly lower level of computational complexity. Furthermore, a significant improvement is achieved in terms of the number of information bits stored per memory bit (efficiency). In this paper, a variation of this approach and a corresponding architecture are introduced to construct a classifier that can be trained with the association between a small portion of the input tags and the corresponding addresses of the output data. The term CAM refers to binary CAM (BCAM) throughout this paper. Preliminary results were originally introduced for an architecture with particular parameters, conditioned on a uniform distribution of the input patterns. In this paper, an extended version is presented that elaborates the effect of the design's degrees of freedom, and the effect of non-uniformity of the input patterns, on energy consumption and performance.
The architecture (SCN-CAM) of this paper consists of an SCN-based classifier coupled to a CAM-array. The CAM-array is divided into several equally sized sub-blocks, which can be activated independently. For a previously trained network and given an input tag, the classifier only uses a small portion of the tag and predicts very few sub-blocks of the CAM to be activated. Once the sub-blocks are activated, the tag is compared against the few entries in them while keeping the rest deactivated and thus lowers the dynamic energy dissipation.
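The sub-block activation idea can be sketched as follows. Here a hypothetical prefix-based routing function stands in for the trained SCN classifier, which the actual design implements very differently; the point is only that a single sub-block is searched per lookup while the rest stay deactivated:

```python
# Schematic model of the SCN-CAM idea: a classifier (here a stand-in hash on
# a small portion of the tag) predicts which CAM sub-block to activate.
class SubBlockedCAM:
    def __init__(self, n_blocks, prefix_bits):
        self.blocks = [dict() for _ in range(n_blocks)]
        self.prefix_bits = prefix_bits
        self.n_blocks = n_blocks

    def _block_of(self, tag):
        # Stand-in classifier: route on the tag's first few bits.
        return int(tag[: self.prefix_bits], 2) % self.n_blocks

    def train(self, tag, data):
        self.blocks[self._block_of(tag)][tag] = data

    def search(self, tag):
        # Only one sub-block is activated; the full tag is compared against
        # the few entries stored there, the other blocks dissipate nothing.
        return self.blocks[self._block_of(tag)].get(tag)

cam = SubBlockedCAM(n_blocks=4, prefix_bits=2)
cam.train("0110", "A")
cam.train("1011", "B")
```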
Content Addressable Memories
We now take a more detailed look at CAM architecture. A small model is shown in Fig. 4. The figure shows a CAM consisting of 4 words, with each word containing 3 bits arranged horizontally (corresponding to 3 CAM cells). There is a match line corresponding to each word (ML0, ML1, etc.) feeding into match line sense amplifiers (MLSAs), and there is a differential search line pair corresponding to each bit of the search word (SL0, SL1, SL0’, SL1’, etc.).
A CAM search operation begins with loading the search-data word into the search-data registers, followed by precharging all match lines high, putting them all temporarily in the match state. Next, the search line drivers broadcast the search word onto the differential search lines, and each CAM core cell compares its stored bit against the bit on its corresponding search lines. Match lines on which all bits match remain in the precharged-high state. Match lines that have at least one bit that misses discharge to ground. The MLSA then detects whether its match line has a matching condition or a miss condition. Finally, the encoder maps the match line of the matching location to its encoded address.
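The steps above translate directly into a small behavioural model. This sketch is an invented illustration of the precharge/discharge/encode sequence, not a circuit description:

```python
# Behavioural model of one CAM search cycle: precharge every matchline,
# discharge any line with a mismatching bit, then encode the surviving line.
def cam_search(stored_words, search_word):
    matchlines = [True] * len(stored_words)   # precharge: all lines "match"
    for i, word in enumerate(stored_words):
        for stored_bit, search_bit in zip(word, search_word):
            if stored_bit != search_bit:
                matchlines[i] = False         # one missing bit discharges it
                break
    # Encoder: map the (first) matching line to its address.
    for addr, hit in enumerate(matchlines):
        if hit:
            return addr
    return None                               # miss condition

words = ["101", "011", "110", "010"]
```

In hardware all word comparisons happen in parallel within the single cycle; the sequential loop here is only a software stand-in for that parallelism.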
A CAM cell serves two basic functions: bit storage (as in RAM) and bit comparison (unique to CAM). Fig. 6 shows the NOR-type CAM cell (a) and the NAND-type CAM cell (b). The bit storage in both cases is an SRAM cell in which cross-coupled inverters implement the bit-storage nodes D and D’. To simplify the schematic, we omit the nMOS access transistors and bit lines that are used to read and write the SRAM storage bit. Although some CAM cell implementations use lower-area DRAM cells, CAM cells typically use SRAM storage. The bit comparison, which is logically equivalent to an XOR of the stored bit and the search bit, is implemented in a somewhat different fashion in the NOR and NAND cells.
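The logical difference between the two cell styles can be sketched as follows; this is an invented illustration contrasting the parallel pull-down of a NOR match-line with the series chain of a NAND match-line, both of which compute the same match function:

```python
# Both cell types compute an XOR of the stored and search bits; they differ
# in how the per-bit results combine on the matchline.
def nor_matchline(stored, search):
    # NOR style: every cell watches the precharged line in parallel, and any
    # single mismatching cell pulls it low.
    return not any(s != q for s, q in zip(stored, search))

def nand_matchline(stored, search):
    # NAND style: the signal propagates through a series chain of cells and
    # only reaches the end while every cell matches; the first mismatch
    # breaks the chain.
    for s, q in zip(stored, search):
        if s != q:
            return False
    return True
```

The two functions are logically identical; in silicon the trade-off is speed and power (parallel pull-down is fast, the series chain saves energy on misses).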
Sparse clustered networks
Spectral algorithms are classic approaches to clustering and community detection in networks. However, for sparse networks the standard versions of these algorithms are suboptimal, in some cases completely failing to detect communities even when other algorithms such as belief propagation can do so. Here we introduce a new class of spectral algorithms based on a non-backtracking walk on the directed edges of the graph. The spectrum of this operator is much better behaved than that of the adjacency matrix or other commonly used matrices, maintaining a strong separation between the bulk eigenvalues and the eigenvalues relevant to community structure even in the sparse case. We show that our algorithm is optimal for graphs generated by the stochastic block model, detecting communities all the way down to the theoretical limit. We also show the spectrum of the non-backtracking operator for some real-world networks, illustrating its advantages over traditional spectral clustering.
Detecting communities or modules is a central task in the study of social, biological, and technological networks. Two of the most popular approaches are statistical inference, where we fit a generative model such as the stochastic block model to the network, and spectral methods, where we classify vertices according to the eigenvectors of a matrix associated with the network, such as its adjacency matrix or Laplacian. Both statistical inference and spectral methods have been shown to work well in networks that are sufficiently dense, or when the graph is regular. However, for sparse networks the community detection problem is harder. Indeed, it was recently shown that there is a phase transition below which communities present in the underlying block model are impossible for any algorithm to detect. While standard spectral algorithms succeed down to this transition when the network is sufficiently dense, with an average degree growing as a function of network size, in the case where the average degree is constant these methods fail significantly above the transition. Thus there is a large regime in which statistical inference succeeds in detecting communities, but where current spectral algorithms fail.
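The non-backtracking operator itself is easy to write down for a small graph. The sketch below builds the matrix B on directed edges, where the entry for the pair (u→v), (v→w) is 1 whenever w ≠ u; the triangle example is invented for illustration:

```python
import numpy as np

# Sketch of the non-backtracking operator B on directed edges:
# B[(u->v), (x->w)] = 1 iff x == v and w != u, i.e. the walk may continue
# from v anywhere except straight back to u.
def non_backtracking_matrix(edges):
    directed = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    index = {e: i for i, e in enumerate(directed)}
    B = np.zeros((len(directed), len(directed)))
    for (u, v) in directed:
        for (x, w) in directed:
            if x == v and w != u:   # continue from v, but not back to u
                B[index[(u, v)], index[(x, w)]] = 1
    return B

# A triangle has 6 directed edges; each one can continue to exactly one
# other edge, since the only alternative would backtrack.
B = non_backtracking_matrix([(0, 1), (1, 2), (2, 0)])
```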
3.2.3 Recurrent Neural Networks
The fundamental feature of a Recurrent Neural Network (RNN) is that the network contains at least one feed-back connection, so the activations can flow round in a loop. That enables the networks to do temporal processing and learn sequences, e.g., perform sequence recognition/reproduction or temporal association/prediction. Recurrent neural network architectures can have many different forms. One common type consists of a standard Multi-Layer Perceptron (MLP) plus added loops. These can exploit the powerful non-linear mapping capabilities of the MLP, and also have some form of memory. Others have more uniform structures, potentially with every neuron connected to all the others, and may also have stochastic activation functions.
For simple architectures and deterministic activation functions, learning can be achieved using similar gradient descent procedures to those leading to the back-propagation algorithm for feed-forward networks. When the activations are stochastic, simulated annealing approaches may be more appropriate. The following will look at a few of the most important types and features of recurrent networks.
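The feed-back connection can be sketched in a few lines. The weights below are random and purely illustrative; the point is that each new activation depends on the previous hidden state as well as the current input, which is what gives the network its memory:

```python
import numpy as np

# Minimal sketch of the feed-back idea: the hidden state h is fed back into
# the next step, giving the network a memory of the sequence so far.
def rnn_run(inputs, W_in, W_rec, h0=None):
    h = np.zeros(W_rec.shape[0]) if h0 is None else h0
    states = []
    for x in inputs:
        # New activation depends on the current input AND the previous state.
        h = np.tanh(W_in @ x + W_rec @ h)
        states.append(h)
    return states

rng = np.random.default_rng(0)
W_in = rng.standard_normal((3, 2))    # input-to-hidden weights
W_rec = rng.standard_normal((3, 3))   # recurrent (feed-back) weights
seq = [np.ones(2), np.zeros(2)]
states = rnn_run(seq, W_in, W_rec)
```

With a zero second input the second state is determined entirely by the feed-back path; a feed-forward network would simply output tanh(0) = 0 there.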