27-10-2016, 12:17 PM
AUTHENTICATION OF EXPLOSIVE DETECTION ROBOT THROUGH FACE DETECTION AND CONTROLLED THROUGH ZIGBEE
INTRODUCTION
The project is aimed at evaluating the performance of an operating system on an embedded system. Before delving into its implementation, an introduction is needed to the parts involved in the project. The whole report is centered around the field of embedded systems and the use of Linux to run applications on them. Hence an introduction to Embedded Systems and using Linux as an OS in them is provided.
1.1 Embedded Systems
An embedded system is a special-purpose computer system designed to perform a small set of designated activities. Embedded systems date back to the late 1960s, when they were used to control electromechanical telephone switches. The first recognizable embedded system was the Apollo Guidance Computer developed by Charles Draper and his team. Later they found their way into the military, medical sciences, and the aerospace and automobile industries.
Today they are widely used to serve various purposes like:
• Network equipment such as firewalls, routers, switches, and so on.
• Consumer equipment such as MP3 players, cell phones, PDAs, digital cameras, camcorders, home entertainment systems and so on.
• Household appliances such as microwaves, washing machines, televisions and so on.
• Mission-critical systems such as satellites and flight control.
The key factors that differentiate an embedded system from a desktop computer:
• They are cost sensitive.
• Most embedded systems have real time constraints.
• A multitude of CPU architectures, such as ARM, MIPS, and PowerPC, are used in embedded systems, and application-specific processors are often employed.
• Embedded systems have, and require, far fewer resources in terms of ROM and I/O devices than a desktop computer.
1.1.1 Types of Setup
Embedded systems generally have a setup that includes a host, which is usually a personal computer, and a target that actually executes the embedded applications. The various types of host/target architectures used in embedded systems are:
Linked Setup:
In this setup, the target and the host are permanently linked together using a physical cable. This link is typically a serial cable or an Ethernet link. The main property of this setup is that no physical hardware storage device is being transferred between the target and the host. The host contains the cross-platform development environment while the target contains an appropriate boot loader, a functional kernel, and a minimal root file system.
Removable Storage Setup:
In the removable setup, there are no direct physical links between the host and the target. Instead, a storage device is written by the host, is then transferred into the target, and is used to boot the device. The host contains the cross-platform development environment. The target, however, contains only a minimal boot loader. The rest of the components are stored on a removable storage media, such as a Compact Flash IDE device, MMC Card, or any other type of removable storage device.
Standalone Setup:
The target is a self-contained development system and includes all the required software to boot, operate, and develop additional software. In essence, this setup is similar to an actual workstation, except the underlying hardware is not a conventional workstation but rather the embedded system itself. This one does not require any cross-platform development environment, since all development tools run in their native environments. Furthermore, it does not require any transfer between the target and the host, because all the required storage is local to the target.
1.1.2 Operating Systems
In an embedded system, when there is only a single task to be performed, only a binary needs to be loaded into the target controller and executed. However, when there are multiple tasks to be executed or multiple events to be handled, there has to be a program that handles and prioritizes these events. This program is the Operating System (OS), which is familiar from desktop PCs.
Embedded Operating Systems are classified into two categories:
1. Real-time Operating Systems (RTOS)
Real Time Operating Systems are those which guarantee responses to each event within a defined amount of time. This type of operating system is mainly used by time-critical applications such as measurement and control systems. Some commonly used RTOS for embedded systems are: VxWorks, OS-9, Symbian, RTLinux.
2. Non-Real-time Operating Systems
Non-Real Time Operating Systems do not guarantee defined response times. These systems are mostly used if multiple applications are needed. Windows CE and PalmOS are examples of such embedded operating systems.
There is a wide range of motivations for choosing Linux over a traditional embedded OS.
The following are the criteria for which Linux is preferred:
1. Quality and Reliability of Code
Quality and reliability are subjective measures of the level of confidence in the code that comprises software such as the kernel and the applications that are provided by distributions. Some properties that professional programmers expect from a “quality” code are modularity and structure, readability, extensibility and configurability. “Reliable” code should have features like predictability, error recovery and longevity. Most programmers agree that the Linux kernel and other projects used in a Linux system fit this description of quality and reliability. The reason is the open source development model, which invites many parties to contribute to projects, identify existing problems, debate possible solutions, and fix problems effectively.
2. Availability of Code
Code availability relates to the fact that the Linux source code and all build tools are available without any access restrictions. The most important Linux components, including the kernel itself, are distributed under the GNU General Public License (GPL). Access to these components’ source code is therefore compulsory (at least for users who have purchased a system running GPL-based software, and in any case they have the right to redistribute the source once they obtain it). Code availability has implications for standardization and commoditization of components, too. Since it is possible to build Linux systems based entirely upon software for which source is available, there is a lot to be gained from adopting standardized embedded software platforms.
3. Hardware Support
Broad hardware support means that Linux supports different types of hardware platforms and devices. Although a number of vendors still do not provide Linux drivers, considerable progress has been made and more is expected. Because a large number of drivers are maintained by the Linux community itself, you can confidently use hardware components without fear that the vendor may one day discontinue driver support for that product line. Linux also provides support for dozens of hardware architectures. No other OS provides this level of portability.
1. Hardware
Linux normally requires at least a 32-bit CPU containing a memory management unit (MMU). A sufficient amount of RAM must be available to accommodate the system. Minimal I/O capabilities are required if any development is to be carried out on the target with reasonable debugging facilities. The kernel must be able to load a root filesystem through some form of permanent storage, or access it over a network.
2. Linux Kernel
Immediately above the hardware sits the kernel, the core component of the operating system. Its purpose is to manage the hardware in a coherent manner while providing familiar high-level abstractions to user-level software. It is expected that applications using the APIs provided by a kernel will be portable among the various architectures supported by this kernel with little or no changes.
The low-level interfaces are specific to the hardware configuration on which the kernel runs and provide for the direct control of hardware resources using a hardware-independent API. Higher-level components provide the abstractions common to all UNIX systems, including processes, files, sockets, and signals. Since the low-level APIs provided by the kernel are common among different architectures, the code implementing the higher-level abstractions is almost constant, regardless of the underlying architecture. Between these two levels of abstraction, the kernel sometimes needs what could be called interpretation components to understand and interact with structured data coming from or going to certain devices. Filesystem types and networking protocols are prime examples of sources of structured data the kernel needs to understand and interact with in order to provide access to data going to and coming from these sources.
3. Applications and Libraries
Applications do not operate directly above the kernel, but rely on libraries and special system daemons to provide familiar APIs and abstract services that interact with the kernel on the application’s behalf to obtain the desired functionality. The main library used by most Linux applications is the GNU C library, glibc. For Embedded Linux systems, substitutes for this library that are much smaller than glibc are preferred.
RASPBERRY PI
2.1 Introduction
The Raspberry Pi is a series of credit card–sized single-board computers developed in the United Kingdom by the Raspberry Pi Foundation with the intent to promote the teaching of basic computer science in schools and developing countries. The original Raspberry Pi and Raspberry Pi 2 are manufactured in several board configurations through licensed manufacturing agreements with Newark element14 (Premier Farnell), RS Components and Egoman. The hardware is the same across all manufacturers.
Several generations of Raspberry Pi have been released. The first generation (Pi 1) was released in February 2012 in a basic model A and a higher-specification model B. A+ and B+ models followed in 2014. Raspberry Pi 2 model B was released in February 2015 and Raspberry Pi 3 model B in February 2016. These boards are priced between US$20 and US$35. A cut-down Compute Module was released in April 2014, and a Pi Zero with a smaller footprint and limited I/O (GPIO) capabilities was released in November 2015 for US$5.
All models feature a Broadcom system on a chip (SoC), which includes an ARM-compatible CPU and an on-chip graphics processing unit (GPU, a VideoCore IV). CPU speed ranges from 700 MHz to 1.2 GHz for the Pi 3, and on-board memory ranges from 256 MB to 1 GB RAM. Secure Digital (SD) cards are used to store the operating system and program memory in either the SDHC or MicroSDHC sizes. Most boards have between one and four USB slots, HDMI and composite video output, and a 3.5 mm phono jack for audio. Lower-level output is provided by a number of GPIO pins which support common protocols like I2C. Some models have an RJ45 Ethernet port, and the Pi 3 has on-board WiFi 802.11n and Bluetooth.
2.2 Hardware
The Raspberry Pi hardware has evolved through several versions that feature variations in memory capacity and peripheral-device support.
2.3 Processor
The system on a chip (SoC) used in the first generation Raspberry Pi is somewhat equivalent to the chip used in older smartphones (such as the iPhone, 3G, and 3GS). The Raspberry Pi is based on the Broadcom BCM2835 SoC, which includes a 700 MHz ARM1176JZF-S processor, a VideoCore IV graphics processing unit (GPU), and RAM. It has a Level 1 cache of 16 KB and a Level 2 cache of 128 KB. The Level 2 cache is used primarily by the GPU. The SoC is stacked underneath the RAM chip, so only its edge is visible.
The Raspberry Pi 2 uses a Broadcom BCM2836 SoC with a 900 MHz 32-bit quad-core ARM Cortex-A7 processor, with 256 KB shared L2 cache.
The Raspberry Pi 3 uses a Broadcom BCM2837 SoC with a 1.2 GHz 64-bit quad-core ARM Cortex-A53 processor, with 512 KB shared L2 cache.
2.4 Performance of first generation models
While operating at 700 MHz by default, the first generation Raspberry Pi provided a real-world performance roughly equivalent to 0.041 GFLOPS. At the CPU level the performance is similar to a 300 MHz Pentium II of 1997–99. The GPU provides 1 Gpixel/s or 1.5 Gtexel/s of graphics processing or 24 GFLOPS of general-purpose computing performance. The graphics capabilities of the Raspberry Pi are roughly equivalent to the level of performance of the Xbox of 2001.
The LINPACK single node compute benchmark results in a mean single precision performance of 0.065 GFLOPS and a mean double precision performance of 0.041 GFLOPS for one Raspberry Pi Model-B board. A cluster of 64 Raspberry Pi Model-B computers, labeled "Iridis-pi", achieved a LINPACK HPL suite result of 1.14 GFLOPS (n=10240) at 216 watts for c. US$4,000.
Raspberry Pi 2 is based on Broadcom BCM2836 SoC, which includes a quad-core Cortex-A7 CPU running at 900 MHz and 1 GB RAM. It is described as 4–6 times more powerful than its predecessor. The GPU is identical to the original.
2.5 Overclocking
The first generation Raspberry Pi chip operated at 700 MHz by default, and did not become hot enough to need a heat sink or special cooling unless the chip was overclocked. The second generation runs at 900 MHz by default; it also does not become hot enough to need a heatsink or special cooling, although overclocking may heat up the SoC more than usual.
Most Raspberry Pi chips could be overclocked to 800 MHz and some even higher to 1000 MHz. There are reports the second generation can be similarly overclocked, in extreme cases even to 1500 MHz (discarding all safety features and over-voltage limitations). In the Raspbian Linux distro the overclocking options can be set at boot by running "sudo raspi-config", without voiding the warranty. In those cases the Pi automatically shuts the overclocking down if the chip reaches 85 °C (185 °F), but it is possible to override the automatic over-voltage and overclocking settings (voiding the warranty). In that case, an appropriately sized heatsink is needed to keep the chip from heating up far above 85 °C.
Newer versions of the firmware contain the option to choose between five overclock ("turbo") presets that, when turned on, try to get the most performance out of the SoC without impairing the lifetime of the Pi. This is done by monitoring the core temperature of the chip and the CPU load, and dynamically adjusting clock speeds and the core voltage. When the demand on the CPU is low, or it is running too hot, the performance is throttled, but if the CPU has much to do and the chip's temperature is acceptable, performance is temporarily increased, with clock speeds of up to 1 GHz, depending on the individual board and on which of the turbo settings is used. The settings, including the entries for the Pi 2 and Pi 3, are:
• none; 700 MHz ARM, 250 MHz core, 400 MHz SDRAM, 0 overvolt,
• modest; 800 MHz ARM, 250 MHz core, 400 MHz SDRAM, 0 overvolt,
• medium; 900 MHz ARM, 250 MHz core, 450 MHz SDRAM, 2 overvolt,
• high; 950 MHz ARM, 250 MHz core, 450 MHz SDRAM, 6 overvolt,
• turbo; 1000 MHz ARM, 500 MHz core, 600 MHz SDRAM, 6 overvolt,
• Pi2; 1000 MHz ARM, 500 MHz core, 500 MHz SDRAM, 2 overvolt,
• Pi3; 1100 MHz ARM, 550 MHz core, 500 MHz SDRAM, 6 overvolt. In system information the CPU speed will appear as 1200 MHz. When idle, the speed drops to 600 MHz.
In the highest (turbo) preset the SDRAM clock was originally 500 MHz, but this was later changed to 600 MHz because 500 MHz sometimes causes SD card corruption. Simultaneously in high mode the core clock speed was lowered from 450 to 250 MHz, and in medium mode from 333 to 250 MHz.
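The presets listed above can be held in a small lookup table and turned into configuration lines. The following is only a sketch: the values are copied from the list, while the key names (arm_freq, core_freq, sdram_freq, over_voltage) are an assumption about how the four columns map onto config.txt entries, and the function name is made up for illustration.

```python
# Values copied from the preset list in the text.
# Key names are assumed to correspond to config.txt options.
PRESETS = {
    "none":   {"arm_freq": 700,  "core_freq": 250, "sdram_freq": 400, "over_voltage": 0},
    "modest": {"arm_freq": 800,  "core_freq": 250, "sdram_freq": 400, "over_voltage": 0},
    "medium": {"arm_freq": 900,  "core_freq": 250, "sdram_freq": 450, "over_voltage": 2},
    "high":   {"arm_freq": 950,  "core_freq": 250, "sdram_freq": 450, "over_voltage": 6},
    "turbo":  {"arm_freq": 1000, "core_freq": 500, "sdram_freq": 600, "over_voltage": 6},
    "pi2":    {"arm_freq": 1000, "core_freq": 500, "sdram_freq": 500, "over_voltage": 2},
    "pi3":    {"arm_freq": 1100, "core_freq": 550, "sdram_freq": 500, "over_voltage": 6},
}


def config_lines(preset):
    """Return 'key=value' lines for the named preset (hypothetical helper)."""
    settings = PRESETS[preset]
    return [f"{key}={value}" for key, value in settings.items()]


if __name__ == "__main__":
    for line in config_lines("turbo"):
        print(line)
```

Keeping the presets as data rather than code makes it easy to compare two settings or to check a board's current values against a known preset.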
The Raspberry Pi Zero runs at 1 GHz.
2.6 RAM
On the older beta model B boards, 128 MB was allocated by default to the GPU, leaving 128 MB for the CPU. On the first 256 MB release model B (and model A), three different splits were possible. The default split was 192 MB (RAM for CPU), which should be sufficient for standalone 1080p video decoding, or for simple 3D, but probably not for both together. 224 MB was for Linux only, with only a 1080p frame buffer, and was likely to fail for any video or 3D. 128 MB was for heavy 3D, possibly also with video decoding (e.g. XBMC). By comparison, the Nokia 701 uses 128 MB for the Broadcom VideoCore IV.
For the new model B with 512 MB RAM, new standard memory split files were initially released (arm256_start.elf, arm384_start.elf, arm496_start.elf) for 256 MB, 384 MB and 496 MB CPU RAM (and 256 MB, 128 MB and 16 MB video RAM). About a week later the Raspberry Pi Foundation released a new version of start.elf that could read a new entry in config.txt (gpu_mem=xx) and dynamically assign an amount of RAM (from 16 to 256 MB in 8 MB steps) to the GPU. The older method of memory splits thus became obsolete, and a single start.elf worked the same for 256 MB and 512 MB Raspberry Pi boards.
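The arithmetic behind the gpu_mem split can be checked with a few lines of code. This is a sketch under the constraints stated above (16 to 256 MB in 8 MB steps); the function name and the extra check that the GPU share must be smaller than the total are assumptions made for illustration.

```python
def cpu_memory_mb(total_mb, gpu_mem_mb):
    """Return the RAM (in MB) left for the CPU after the GPU share is taken."""
    if not 16 <= gpu_mem_mb <= 256:
        raise ValueError("gpu_mem must be between 16 and 256 MB")
    if gpu_mem_mb % 8 != 0:
        raise ValueError("gpu_mem must be a multiple of 8 MB")
    if gpu_mem_mb >= total_mb:
        raise ValueError("gpu_mem must be smaller than the total RAM")
    return total_mb - gpu_mem_mb


if __name__ == "__main__":
    # The three fixed splits of the 512 MB model B, expressed as gpu_mem values.
    for gpu in (256, 128, 16):
        print(f"gpu_mem={gpu} leaves {cpu_memory_mb(512, gpu)} MB for the CPU")
```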
The Raspberry Pi 2 and the Raspberry Pi 3 have 1 GB of RAM. The Raspberry Pi Zero has 512 MB of RAM.
2.7 Networking
Though the model A and A+ and Zero do not have an 8P8C ("RJ45") Ethernet port, they can be connected to a network using an external user-supplied USB Ethernet or Wi-Fi adapter. On the model B and B+ the Ethernet port is provided by a built-in USB Ethernet adapter. The Raspberry Pi 3 is equipped with 2.4 GHz WiFi 802.11n and Bluetooth 4.1 in addition to the 10/100 Ethernet port.
2.8 Real-time clock
The Raspberry Pi does not come with a real-time clock, which means it cannot keep track of the time of day while it is not powered on. As alternatives, a program running on the Pi can get the time from a network time server or from user input at boot time. A real-time clock with battery backup (such as the DS1307, a full binary-coded decimal clock/calendar) may be added, often via the I²C interface.
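A program that reads an external clock chip usually has to convert the raw register bytes into numbers. The sketch below assumes the chip returns each time field as packed binary-coded decimal (BCD), with bit 7 of the seconds byte used as a flag and the hours kept in 24-hour form; these layout details come from the DS1307 datasheet as recalled, not from the text above, and the function names are hypothetical. No I²C access is shown.

```python
def bcd_to_int(byte):
    """Decode one packed BCD byte (two decimal digits) into an integer."""
    tens = (byte >> 4) & 0x0F
    ones = byte & 0x0F
    if tens > 9 or ones > 9:
        raise ValueError(f"not a valid BCD byte: {byte:#04x}")
    return tens * 10 + ones


def decode_time(seconds_reg, minutes_reg, hours_reg):
    """Return (hours, minutes, seconds) from three raw register bytes.

    Assumed layout: bit 7 of the seconds byte is a flag, and the top two
    bits of the hours byte are unused in 24-hour mode.
    """
    seconds = bcd_to_int(seconds_reg & 0x7F)
    minutes = bcd_to_int(minutes_reg & 0x7F)
    hours = bcd_to_int(hours_reg & 0x3F)
    return hours, minutes, seconds


if __name__ == "__main__":
    print(decode_time(0x59, 0x34, 0x12))
```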
ARM ARCHITECTURE: AN OVERVIEW
3.1 Introduction
ARM is a 32-bit RISC processor architecture developed by the ARM corporation. ARM processors possess a unique combination of features that makes ARM the most popular embedded architecture today. First, ARM cores are very simple compared to most other general-purpose processors, which means that they can be manufactured using a comparatively small number of transistors, leaving plenty of space on the chip for application specific macro cells. A typical ARM chip can contain several peripheral controllers, a digital signal processor, and some amount of on-chip memory, along with an ARM core. Second, both ARM ISA and pipeline design are aimed at minimising energy consumption — a critical requirement in mobile embedded systems.
Third, the ARM architecture is highly modular: the only mandatory component of an ARM processor is the integer pipeline; all other components, including caches, MMU, floating point and other co-processors are optional, which gives a lot of flexibility in building application-specific ARM-based processors. Finally, while being small and low-power, ARM processors provide high performance for embedded applications.
For example, the PXA255 XScale processor running at 400MHz provides performance comparable to Pentium 2 at 300MHz, while using fifty times less energy.
3.1.1 ARM vs RISC
In most respects, ARM is a RISC architecture. Like all RISC architectures, the ARM ISA is a load-store one, that is, instructions that process data operate only on registers and are separate from instructions that access memory. All ARM instructions are 32 bits long and most of them have a regular three-operand encoding. Finally, the ARM architecture features a large register file with 16 general-purpose registers. All of the above features facilitate pipelining of the ARM architecture. However, the ARM architecture deviates from the original RISC designs in some respects. To reduce complexity, ARM did not include the register windows used by the original RISC architectures. To improve performance, the ARM architecture introduced an auto-indexing addressing mode, where the value of an index register is incremented or decremented while a load or store is in progress. ARM also supports multiple-register transfer instructions that allow loading or storing up to 16 registers at once.
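The effect of auto-indexing and of a multiple-register load can be illustrated with a toy model. This is a sketch, not a description of real hardware: memory is a plain dictionary of word addresses, the register file is a list of 16 integers, and the helper names are invented for the example.

```python
def load_post_indexed(memory, regs, dest, base, step=4):
    """Load regs[dest] from the address held in regs[base], then advance the base."""
    regs[dest] = memory[regs[base]]
    regs[base] += step  # the index register is updated as part of the load


def load_multiple(memory, regs, base, dest_regs):
    """Load several registers from consecutive words, lowest register first."""
    address = regs[base]
    for reg in sorted(dest_regs):
        regs[reg] = memory[address]
        address += 4
    regs[base] = address  # write the final address back to the base register


if __name__ == "__main__":
    memory = {0x100: 11, 0x104: 22, 0x108: 33}
    regs = [0] * 16
    regs[1] = 0x100
    load_post_indexed(memory, regs, dest=0, base=1)
    load_multiple(memory, regs, base=1, dest_regs=[2, 3])
    print(regs[0], regs[2], regs[3], hex(regs[1]))
```

One instruction thus replaces a load plus a separate address update, or a whole sequence of single loads.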
Thumb instruction set extension
The Thumb instruction set was introduced in the fourth version of the ARM architecture in order to achieve higher code density for embedded applications. Thumb provides a subset of the most commonly used 32-bit ARM instructions which have been compressed into 16-bit wide opcodes. On execution, these 16-bit instructions can be either decompressed to full 32-bit ARM instructions or executed directly using a dedicated Thumb decoding unit. Although Thumb code uses 40% more instructions than equivalent 32-bit ARM code, it typically requires 30% less space. Thumb code is 40% slower than ARM code; therefore Thumb is usually used only in non-performance-critical routines in order to reduce memory and power consumption of the system.
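The two figures quoted above are consistent with each other: 40% more instructions at half the width gives about 70% of the original size. A minimal check, with a function name chosen only for this example:

```python
def thumb_size_ratio(extra_instructions=0.40, arm_bytes=4, thumb_bytes=2):
    """Return Thumb code size as a fraction of the equivalent ARM code size."""
    return (1 + extra_instructions) * thumb_bytes / arm_bytes


if __name__ == "__main__":
    ratio = thumb_size_ratio()
    print(f"Thumb code is about {ratio:.0%} of the ARM size, "
          f"a saving of about {1 - ratio:.0%}")
```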
The 3-stage pipeline
It is a classical fetch-decode-execute pipeline which, in the absence of pipeline hazards and memory accesses, completes one instruction per cycle. The first pipeline stage reads an instruction from memory and increments the value of the instruction address register, which stores the address of the next instruction to be fetched. This value is also stored in the PC register. The next stage decodes the instruction and prepares the control signals required to execute it. The third stage does all the actual work: it reads operands from the register file, performs ALU operations, reads or writes memory if necessary, and finally writes back modified register values. If the instruction being executed is a data processing instruction, the result generated by the ALU is written directly to the register file and the execution stage completes in one cycle.
If it is a load or store instruction, the memory address computed by the ALU is placed on the address bus and the actual memory access is performed during the second cycle of the execute stage. This pipeline remained unchanged from the first ARM processor to the ARM7TDMI core.
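The timing described above can be turned into a rough cycle count. The model below is deliberately simplified and is an assumption of this sketch: no hazards, two cycles to fill the pipeline, one execute cycle for a data processing instruction and two for a load or store, exactly as in the description. Real cores have further cases that are not modelled here.

```python
# Execute-stage cycles per instruction kind, following the description in the text.
EXECUTE_CYCLES = {"data": 1, "load": 2, "store": 2}


def total_cycles(instructions):
    """Return the cycles needed to run a hazard-free instruction sequence."""
    if not instructions:
        return 0
    fill = 2  # fetch and decode of the first instruction before any execute
    return fill + sum(EXECUTE_CYCLES[kind] for kind in instructions)


if __name__ == "__main__":
    program = ["data", "load", "data", "store"]
    print(total_cycles(program), "cycles for", len(program), "instructions")
```

In this model every load or store costs one extra cycle during which no new instruction completes, which is the stall that the 5-stage design sets out to remove.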
The 5 stage pipeline
The 3-stage pipeline stalls when a memory read or write operation is in progress and the next instruction is to be fetched. The solution to this problem was to use separate instruction and data caches. In addition, to make the pipeline more balanced, ARM9TDMI moved the register read step to the decode stage, since the instruction decode stage was much shorter than the execute stage. The execute stage was also split into 3 stages. The first stage performs arithmetic computations, the second stage performs memory accesses (this stage remains idle when executing data processing instructions) and the third stage writes the results back to the register file. This results in a much better balanced pipeline, which can run at a faster clock rate, but there is one new complication: the need to forward data among pipeline stages to resolve data dependencies between stages without stalling the pipeline. The ARM10 and ARM11 introduced 6-stage and 8-stage pipelines respectively.
The ARM1176JZF-S processor incorporates an ARM11 integer core that implements ARM architecture v6. It supports the ARM and Thumb™ instruction sets, Jazelle technology to enable direct execution of Java bytecodes, and a range of SIMD DSP instructions that operate on 16-bit or 8-bit data values in 32-bit registers.
3.2 ARM Bus Technology
Embedded systems use bus technologies different from those of PCs. The most common PC bus technology is the Peripheral Component Interconnect (PCI) bus, which connects devices such as video cards and disk controllers to the x86 processor bus. This type of technology is called external or off-chip bus technology. Embedded devices use an on-chip bus that is internal to the chip and allows different peripheral devices to be interconnected with an ARM core.
There are two different types of devices connected to the bus:
• Bus Master: A logical device capable of initiating a data transfer with another device across the same bus (the ARM processor core is a bus master).
• Bus Slave: A logical device capable only of responding to a transfer request from a bus master device (peripherals are bus slaves).
Generally a bus has two architecture levels:
• Physical level: covers the electrical characteristics and the bus width (16, 32, or 64 bits).
• Protocol level: deals with the bus protocol.
NOTE: ARM is primarily a design company. It seldom implements the electrical characteristics of the bus, but it routinely specifies the bus protocol.
3.2.1 AMBA (Advanced Microcontroller Bus Architecture) Bus protocol
The AMBA bus was introduced in 1996 and has been widely adopted as the on-chip bus architecture used for ARM processors.
The first AMBA buses were:
• ARM System Bus (ASB)
• ARM Peripheral Bus (APB)
Later ARM introduced another bus design called the ARM High Performance Bus (AHB).
Using AMBA:
• Peripheral designers can reuse the same design on multiple projects.
• A peripheral can simply be bolted onto the on-chip bus without having to redesign an interface for each different processor architecture.
This plug-and-play interface for hardware developers improves availability and time to market.
AHB provides higher data throughput than ASB because it is based on a centralized multiplexed bus scheme rather than the ASB bidirectional bus design. This change allows the AHB bus to run at widths of 64 bits and 128 bits.
ARM introduced two variations on the AHB bus
• Multi-layer AHB
• AHB-Lite
In contrast to the original AHB, which allows a single bus master to be active on the bus at any time, the Multi-layer AHB bus allows multiple active bus masters.
AHB-Lite is a subset of the AHB bus and it is limited to a single bus master. This bus was developed for designs that do not require the full features of the standard AHB bus.
AHB and Multi-layer AHB support the same protocol for master and slave but have different interconnects. The new interconnects in Multi-layer AHB are good for systems with multiple processors. They permit operations to occur in parallel and allow for higher throughput.
3.3 The ARM1176JZF-S processor features:
• TrustZone™ security extensions
• Provision for Intelligent Energy Management (IEM™)
• High-speed Advanced Microcontroller Bus Architecture (AMBA) Advanced Extensible Interface (AXI) level two interfaces supporting prioritized multiprocessor implementations
• An integer core with integral EmbeddedICE-RT logic
• An eight-stage pipeline
• Branch prediction with return stack
• Low interrupt latency configuration
• Internal coprocessors CP14 and CP15
• Vector Floating-Point (VFP) coprocessor support
• External coprocessor interface
• Instruction and Data Memory Management Units (MMUs), managed using MicroTLB structures backed by a unified Main TLB
• Instruction and data caches, including a non-blocking data cache with Hit-Under-Miss (HUM)
• Virtually indexed and physically addressed caches
• 64-bit interface to both caches
• Level one Tightly-Coupled Memory (TCM) that you can use as a local RAM with DMA
• Trace support
• JTAG-based debug
ARM1176JZF-S architecture with Jazelle technology:
The ARM1176JZF-S processor has three instruction sets:
• The 32-bit ARM instruction set used in ARM state, with media instructions
• The 16-bit Thumb instruction set used in Thumb state
• The 8-bit Java byte codes used in Jazelle state.
3.3.1 Thumb instruction set
The Thumb instruction set is a subset of the most commonly used 32-bit ARM instructions. Thumb instructions are 16 bits long, and have a corresponding 32-bit ARM instruction that has the same effect on the processor model. Thumb instructions operate with the standard ARM register configuration, enabling excellent interoperability between ARM and Thumb states.
Thumb has all the advantages of a 32-bit core:
• 32-bit address space
• 32-bit registers
• 32-bit shifter and Arithmetic Logic Unit (ALU)
Thumb therefore offers a long branch range, powerful arithmetic operations, and a large address space. The availability of both 16-bit Thumb and 32-bit ARM instruction sets gives you the flexibility to emphasize performance or code size at the subroutine level, according to the requirements of your applications. For example, you can code critical loops for applications such as fast interrupts and DSP algorithms using the full ARM instruction set, and link them with Thumb code.
Java byte codes
ARM architecture v6 with Jazelle technology executes variable length Java byte codes. Java byte codes fall into two classes:
Hardware execution
Byte codes that perform stack-based operations.
Software execution
Byte codes that are too complex to execute directly in hardware are executed in software. An ARM register is used to access a table of exception handlers to handle these particular byte codes
Integer core
The ARM1176JZF-S processor is built around the ARM11 integer core. It is an implementation of the ARMv6 architecture, which runs the ARM, Thumb, and Java instruction sets. The processor contains EmbeddedICE-RT™ logic and a JTAG debug interface to enable hardware debuggers to communicate with the processor.
3.3.2 Instruction set categories
The main instruction set categories are:
• Branch instructions
• Data processing instructions
• Status register transfer instructions
• Load and store instructions
• Coprocessor instructions.
• Exception-generating instructions.
Conditional execution
The processor conditionally executes nearly all ARM instructions. You can decide if the condition code flags, Negative, Zero, Carry, and Overflow, are updated according to their result.
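Conditional execution means each instruction carries a condition that is tested against the four flags before it runs. The sketch below models that test in software. The condition mnemonics and their flag expressions are the standard ARM ones; the helper that derives flags from a 32-bit comparison, and both function names, are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF


def compare_flags(a, b):
    """Return (N, Z, C, V) as set by comparing unsigned 32-bit a with b (a - b)."""
    result = (a - b) & MASK32
    n = bool(result >> 31)
    z = result == 0
    c = a >= b  # carry set means no borrow occurred
    v = bool(((a ^ b) & (a ^ result)) >> 31)  # signed overflow
    return n, z, c, v


def condition_passes(cond, n, z, c, v):
    """Return True if an instruction with condition 'cond' would execute."""
    checks = {
        "EQ": z, "NE": not z,
        "CS": c, "CC": not c,
        "MI": n, "PL": not n,
        "VS": v, "VC": not v,
        "HI": c and not z, "LS": (not c) or z,
        "GE": n == v, "LT": n != v,
        "GT": (not z) and n == v, "LE": z or n != v,
        "AL": True,
    }
    return bool(checks[cond])


if __name__ == "__main__":
    flags = compare_flags(5, 7)
    print("LT executes:", condition_passes("LT", *flags))
    print("GE executes:", condition_passes("GE", *flags))
```

Short conditional sequences of this kind let a compiler avoid branches, which keeps the pipeline full.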
Registers
The ARM1176JZF-S core contains:
• 33 general-purpose 32-bit registers
• 7 dedicated 32-bit registers.
Modes and exceptions:
The core provides a set of operating and exception modes to support systems combining complex operating systems, user applications, and real-time demands. There are eight operating modes, six of which are exception processing modes:
• User
• Supervisor
• Fast interrupt
• Normal interrupt
• Abort
• System
• Undefined
• Secure Monitor
DSP instructions:
The DSP extensions to the ARM instruction set provide:
• 16-bit data operations
• saturating arithmetic
• MAC operations.
The processor executes multiply instructions using a single-cycle 32x16 implementation. The processor can perform 32x32, 32x16, and 16x16 multiply instructions (MAC).
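Saturating arithmetic and multiply-accumulate can be illustrated without reference to any particular instruction. The sketch below clamps a result to the range of a signed integer of a given width and uses that in a 16x16 multiply with a 32-bit accumulator; the function names are invented for this example and do not correspond to specific ARM opcodes.

```python
def saturate(value, bits):
    """Clamp value to the range of a signed integer with the given bit width."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return max(low, min(high, value))


def mac16(acc, x, y):
    """Multiply two 16-bit values and accumulate into a saturating 32-bit sum."""
    product = saturate(x, 16) * saturate(y, 16)
    return saturate(acc + product, 32)


if __name__ == "__main__":
    acc = 0
    for sample, coeff in [(1000, 3), (-2000, 2), (500, -4)]:
        acc = mac16(acc, sample, coeff)
    print("accumulator:", acc)
```

Clamping instead of wrapping is what makes these operations useful for signal processing, where an overflow that wraps around would appear as a large spike of the opposite sign.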
Data path:
The data path consists of three pipelines:
• ALU, shift and Sat pipeline
• MAC pipeline
• load or store pipeline
Memory system
The level-one memory system provides the core with:
• Separate instruction and data caches
• Separate instruction and data Tightly-Coupled Memories
• 64-bit data paths throughout the memory system
• Virtually indexed, physically tagged caches
• Memory access controls and virtual memory management
• Support for four sizes of memory page
• Two-channel DMA into TCMs
• I-fetch, D-read/write interface, compatible with multi-layer AMBA AXI
• 32-bit dedicated peripheral interface
Instruction and data caches:
The core provides separate instruction and data caches. The cache has the following features:
• Independent configuration of the instruction and data cache during synthesis to sizes between 4KB and 64KB.
• 4-way set-associative instruction and data caches. You can lock each way independently.
• Pseudo-random or round-robin replacement.
• Eight word cache line length.
• The Micro TLB entry determines whether cache lines are write-back or write-through.
• Ability to disable each cache independently, using the system control coprocessor.
• Data cache misses that are non-blocking. The processor supports up to three outstanding Data cache misses.
• Streaming of sequential data from LDM and LDRD operations, and sequential instruction fetches.
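The cache parameters listed above fix how an address is divided for lookup. As a sketch, assuming 4-byte words so that an eight-word line is 32 bytes, the number of sets and the offset and index widths follow directly from the cache size; the function name is hypothetical.

```python
WORD_BYTES = 4  # assumed word size for the eight-word cache line


def cache_geometry(size_kb, ways=4, line_words=8):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    line_bytes = line_words * WORD_BYTES
    sets = (size_kb * 1024) // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the number of sets
    return sets, offset_bits, index_bits


if __name__ == "__main__":
    for size_kb in (4, 16, 64):
        sets, offset_bits, index_bits = cache_geometry(size_kb)
        print(f"{size_kb} KB: {sets} sets, "
              f"{offset_bits} offset bits, {index_bits} index bits")
```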
3.3.3 DMA features:
The DMA controller has the following features:
• Runs in background of CPU operations
• Enables CPU priority access to TCM during DMA
• Programmed with Virtual Addresses
• Controls DMA to either the instruction or data TCM
• Allocated by a privileged process (OS)
• Software can check and monitor DMA progress
• Interrupts on DMA event
• Ability to configure each channel to transfer data between Secure TCM and Secure external memory.
3.3.4 Memory Management Unit:
The Memory Management Unit (MMU) has a unified Translation Lookaside Buffer (TLB) for both instructions and data. The MMU includes a 4KB page mapping size to enable a smaller RAM and ROM footprint for embedded systems and operating systems such as Windows CE that have many small mapped objects. The ARM1176JZF-S processor implements the Fast Context Switch Extension (FCSE). The MMU is responsible for protection checking, address translation, and memory attributes, and some of these can be passed to an external level-two memory system. The memory translations are cached in MicroTLBs for each of the instruction and data caches, with a single Main TLB backing the MicroTLBs.
The MMU has the following features:
• Matches Virtual Address, ASID, and NSTID
• Each TLB entry is marked with the NSTID
• Checks domain access permissions
• Checks memory attributes
• Translates virtual-to-physical address
• Supports four memory page sizes
• Maps accesses to cache, TCM, peripheral port, or external memory
• Hardware handles TLB misses
Paging
Four page sizes are supported:
• 16MB super sections
• 1MB sections
• 64KB large pages
• 4KB small pages.
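Each page size fixes how many low-order address bits pass through translation unchanged as the offset, and how many high-order bits are replaced by the physical base. A minimal sketch of that split for the four sizes follows. It assumes 32-bit addresses; the names `PAGE_SIZES`, `split_virtual_address`, and `translate` are hypothetical, and a real MMU also performs the permission and attribute checks listed earlier.

```python
PAGE_SIZES = {
    "supersection": 16 * 1024 * 1024,
    "section": 1024 * 1024,
    "large page": 64 * 1024,
    "small page": 4 * 1024,
}


def split_virtual_address(va: int, page_kind: str) -> tuple:
    """Return (page base, offset within page) for a 32-bit virtual address."""
    size = PAGE_SIZES[page_kind]
    offset = va & (size - 1)
    base = va & ~(size - 1) & 0xFFFFFFFF
    return base, offset


def translate(va: int, page_kind: str, physical_base: int) -> int:
    """Combine a page-aligned physical base with the untranslated offset bits."""
    size = PAGE_SIZES[page_kind]
    if physical_base & (size - 1):
        raise ValueError("physical base must be aligned to the page size")
    _, offset = split_virtual_address(va, page_kind)
    return physical_base | offset
```

For the address 0x12345678, a small page keeps the low 12 bits (0x678) as the offset, while a section keeps the low 20 bits (0x45678).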
The ARM Architecture Version 6 (ARMv6)
A microprocessor’s architecture defines the instruction set and programmer’s model for any processor that will be based on that architecture. Different processor implementations may be built to comply with the architecture. Each processor may vary in performance and features, and be optimized to target different applications.
Future processors based on the new ARMv6 architecture will provide developers of embedded systems with higher levels of system performance, whilst maintaining excellent power and area efficiency.
The Evolution of the ARM Architecture
The ARM architecture has evolved steadily to respond to the changing needs of ARM’s partners, and of the design community in general.
At each major revision of the ARM architecture, significant features have been added. Between major architecture revisions, new features have been included as variants on the architectures. The key letters appended to the core names indicate specific architecture enhancements within each implementation.
• V3 introduced 32-bit addressing, and architecture variants:
T – Thumb state: 16-bit instruction execution.
M – long multiply support (32 x 32 => 64 or 32 x 32 + 64 => 64). This feature became standard in architecture V4 onwards.
• V4 added halfword load and store.
• V5 improved ARM and Thumb interworking, added the count leading zeros (CLZ) instruction, and introduced architecture variants:
E – enhanced DSP instructions, including saturated arithmetic operations and 16-bit multiply operations.
J – support for the new Java state, offering hardware and optimized software acceleration of bytecode execution.
To maintain backwards compatibility, ARMv6 also includes ARMv5-compliant memory management and exception handling. This enables the significant third-party developer community to exploit existing development effort, and supports the reuse of existing software and design experience. The introduction of a new architecture does not replace existing architectures, or make them redundant. Where the provisions of ARMv4 or ARMv5 meet market needs, new cores and derivative products will continue to be based on these architectures, whilst tracking technology and process trends. For example, the ARM7TDMI core based on the V4T architecture is still being ‘designed-in’ to many new products, where a performance level of 100 MIPS or so is adequate. Processors based on the ARMv5 architecture continue in development.
3.4 Key ARMv6 Improvements
In developing the ARMv6 architecture, effort has been focused on five key areas:
Memory Management
System design and performance are heavily affected by the way that memory is managed. The memory management architectural enhancements improve overall processor performance significantly, especially for platform-type applications where operating systems need to manage frequent task changes. With the changes in ARMv6, average instruction fetch and data latency are greatly reduced; the processor spends less time waiting for instruction or data cache misses to be serviced. The memory management improvements will provide a boost in overall system performance by as much as 30%.
Multiprocessing
Application convergence is driving system implementations towards the need for multiprocessor systems. Wireless platforms, especially for 2.5G and 3G, are typical applications that demand integration between ARM processors, ARM and DSPs, or other application accelerators. Multiprocessor systems share data efficiently by sharing memory. New ARMv6 capabilities in data sharing and synchronization will make it easier to implement multiprocessor systems, as well as improving their performance. New instructions enable more complex synchronization schemes, greatly improving system efficiency.
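The synchronization instructions introduced in ARMv6 are the exclusive load and store pair (LDREX and STREX): a store succeeds only if no other processor has written the location since the matching exclusive load. The following is a toy Python model of that idea, not a description of the hardware monitor. The class `ExclusiveMemory` and the function `atomic_increment` are hypothetical names, and the model tracks one reservation per CPU at word granularity as a simplifying assumption.

```python
class ExclusiveMemory:
    """Toy shared memory with one exclusive reservation per CPU."""

    def __init__(self):
        self.words = {}          # address -> value
        self.reservations = {}   # cpu id -> reserved address

    def load_exclusive(self, cpu: int, addr: int) -> int:
        """Read a word and mark the address as reserved for this CPU."""
        self.reservations[cpu] = addr
        return self.words.get(addr, 0)

    def store_exclusive(self, cpu: int, addr: int, value: int) -> int:
        """Return 0 on success, 1 if this CPU no longer holds the reservation."""
        if self.reservations.get(cpu) != addr:
            return 1
        self.words[addr] = value
        # A successful store clears every reservation on that address.
        self.reservations = {
            c: a for c, a in self.reservations.items() if a != addr
        }
        return 0


def atomic_increment(mem: ExclusiveMemory, cpu: int, addr: int) -> int:
    """Retry the load/modify/store sequence until the exclusive store succeeds."""
    while True:
        old = mem.load_exclusive(cpu, addr)
        if mem.store_exclusive(cpu, addr, old + 1) == 0:
            return old + 1
```

If CPU 0 performs an exclusive load and CPU 1 then completes an increment on the same address, CPU 0's subsequent exclusive store reports failure and CPU 0 must retry, so no update is lost.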
Multimedia Support
Single Instruction Multiple Data (SIMD) capabilities enable more efficient software implementation of high-performance media applications such as audio and video encoders. Over sixty SIMD instructions are added to the ARMv6 Instruction Set Architecture (ISA). Adding the SIMD instructions will provide performance improvements of between 2x and 4x, depending on the multimedia application. The SIMD capabilities will enable developers to implement high-end features such as video codecs, speaker-independent voice recognition and 3D graphics, especially relevant for next generation wireless applications.
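The SIMD idea is that one 32-bit register holds several narrower values, and one instruction operates on all of them at once without carries crossing between lanes. A minimal Python sketch of a dual 16-bit add follows; `pack16`, `unpack16`, and `add16x2` are hypothetical names chosen for illustration, and the model wraps on overflow within each lane.

```python
def pack16(lo: int, hi: int) -> int:
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack16(word: int) -> tuple:
    """Unpack a 32-bit word into two signed 16-bit values (low lane, high lane)."""
    def signed(half: int) -> int:
        return half - 0x10000 if half & 0x8000 else half
    return signed(word & 0xFFFF), signed((word >> 16) & 0xFFFF)


def add16x2(a: int, b: int) -> int:
    """Add the two 16-bit lanes independently; carries do not cross lanes."""
    lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF
    hi = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF
    return (hi << 16) | lo
```

Two stereo audio samples, for example, can be held in one word and mixed with a single lane-wise add instead of two separate additions, which is where the speed-up for media code comes from.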
Data Handling
A system’s endianness refers to the byte order in which data is referenced and stored in a processor’s memory. With increasing system-on-a-chip (SoC) integration, a single chip is more likely to contain little-endian OS environments and interfaces (such as USB, PCI), but with big-endian data (TCP/IP packets, MPEG streams). With ARMv6, support for mixed-endian systems has been improved. As a result, handling data in mixed-endian systems under ARMv6 is far more efficient.
Unaligned data is data that is not aligned to its natural size boundary. For example, within DSP applications there is sometimes a requirement to treat words with half-word data alignment. For a processor to handle this situation efficiently, it must be able to load a word aligned to any half-word boundary. Current versions of the architecture require a number of instructions to manage unaligned data. ARMv6-compliant architectures will manage unaligned data more efficiently in hardware. In algorithms that rely heavily on DSP operations with unaligned data, ARMv6 implementations will have a performance advantage and may also benefit from reduced code size. Unaligned support also makes it more efficient for ARM to emulate other processors, such as Motorola’s 68000 family.
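Both points can be made concrete with a few bytes of memory. The sketch below is a plain Python illustration: it reads the same four bytes as a big-endian and as a little-endian word, reverses the byte order of a word, and reads a word from an address that is only half-word aligned. The helper names `load_word` and `byte_reverse32` are hypothetical.

```python
def load_word(memory: bytes, addr: int, byteorder: str) -> int:
    """Read a 32-bit word at any byte address, aligned or not."""
    if addr < 0 or addr + 4 > len(memory):
        raise IndexError("word access outside memory")
    return int.from_bytes(memory[addr:addr + 4], byteorder)


def byte_reverse32(word: int) -> int:
    """Swap a 32-bit word between big-endian and little-endian byte order."""
    return int.from_bytes(word.to_bytes(4, "little"), "big")


memory = bytes([0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88])

as_big = load_word(memory, 0, "big")           # network-style byte order
as_little = load_word(memory, 0, "little")     # little-endian byte order
unaligned = load_word(memory, 2, "little")     # word at a half-word boundary
```

Here the same bytes read as 0x11223344 in big-endian order and 0x44332211 in little-endian order, and `byte_reverse32` converts one into the other. Without hardware support, the access at address 2 would have to be assembled from several narrower loads and shifts.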
Similar to recent ARMv5 implementations such as ARM10 and XScale, ARMv6 is based on a 32-bit processor. ARMv6 will support implementations based on bus widths of 64 bits and above; ARM10 and XScale support 64-bit buses today. This provides bus throughput equivalent to, or even better than, a 64-bit machine, but without the power and area overhead of a full 64-bit CPU.