ARM (Advanced RISC Machine) is one of the most licensed and thus most widespread processor cores in the world. It is used especially in portable devices because of its low power consumption and reasonable performance. Several interesting extensions are available, such as the Thumb instruction set and the Jazelle Java machine.
5. Why ARM?
ARM is one of the most licensed and
thus widespread processor cores in the
world.
Used especially in portable devices due
to low power consumption and
reasonable performance
Several interesting extensions are available,
such as the Thumb instruction set and
the Jazelle Java machine
6. Computer Architecture
Describes the User's View of the Computer
E.g.
Instruction Set
Visible Registers
Memory Management Table Structure
Exception Handling Models etc.
7. Computer Organization
Describes the User-Invisible
Implementation of the Architecture
E.g.
Pipeline Structure
Transparent Cache
Translation Lookaside Buffers etc.
8. RISC vs. CISC Architecture
RISC                                         CISC
Fixed-width instructions                     Variable-length instructions
Few instruction formats                      Several instruction formats
Load/store architecture                      Memory values can be used as operands in instructions
Large register bank                          Small register bank
Instructions are pipelinable                 Pipelining is complex
Single-cycle execution of all instructions   Multi-cycle execution of instructions
9. RISC Advantages
A small Die Size
A Shorter Development Time
Higher Performance
Smaller things have higher natural
frequencies.
11. ARM History
ARM – Acorn RISC Machine (1983-1985)
Acorn Computers Limited, Cambridge,
England
ARM – Advanced RISC Machine (1990)
ARM Limited, 1990
ARM has been licensed to many
semiconductor manufacturers
12. Architecture Revisions
[Figure: ARM architecture versions (v4, ARMv5, ARMv6, ARMv7) plotted against time, 1994 to 2006. Cores shown: StrongARM®, ARM7TDMI-S™, ARM720T™, ARM92xT, SC100™, ARM9x6E, ARM102xE, ARM926EJ-S™, ARM1026EJ-S™, XScale™, SC200™, ARM1136JF-S™, ARM1176JZF-S™, ARM1156T2F-S™]
XScale is a trademark of Intel Corporation
13. Features used from RISC
A Load/Store Architecture
Fixed Length 32 bit Instructions
3-Address Instruction Formats
14. Load Store Architecture
Memory can be accessed only through two
dedicated instructions
LDR ; Move word from memory to register
STR ; Move word from register to memory
All other instructions have to work on
registers only
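The load/store rule above can be illustrated with a toy model. This is only a sketch in Python, not a real ARM simulator: the memory contents, the addresses and the helper names `ldr`, `str_` and `add` are assumptions made for the example.

```python
# Toy load/store machine (illustrative only; addresses and values are made up).
memory = {0x100: 7, 0x104: 35, 0x108: 0}   # word-addressed memory
regs = [0] * 16                            # R0..R15

def ldr(rd, addr):
    """LDR: move a word from memory into register rd."""
    regs[rd] = memory[addr]

def str_(rs, addr):
    """STR: move a word from register rs into memory."""
    memory[addr] = regs[rs]

def add(rd, rn, rm):
    """ADD works on registers only; it never touches memory."""
    regs[rd] = (regs[rn] + regs[rm]) & 0xFFFFFFFF

# mem[0x108] = mem[0x100] + mem[0x104] needs two loads, one ADD and one store.
ldr(0, 0x100)
ldr(1, 0x104)
add(2, 0, 1)
str_(2, 0x108)
```

The point of the sketch is that the arithmetic step has no way to name a memory address; only the two transfer helpers do.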
15. 3 Address Instruction Format
Function   Dest. Addr.   Op2 Addr.   Op1 Addr.
f bits     n bits        n bits      n bits
Example
ADD d, s1, s2 ; d = s1 + s2
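As a rough illustration of the format above, the sketch below packs and unpacks a 3-address instruction word in Python. The field widths (f = 8, n = 8), the field order taken from the slide, and the function code for ADD are all assumptions chosen for the example; they are not the real ARM encoding.

```python
F_BITS = 8   # assumed width of the function field
N_BITS = 8   # assumed width of each address field

def encode(function, dest, op2, op1):
    """Pack the four fields into one word: function | dest | op2 | op1."""
    for value, width in ((function, F_BITS), (dest, N_BITS),
                         (op2, N_BITS), (op1, N_BITS)):
        if not 0 <= value < (1 << width):
            raise ValueError("field does not fit in its width")
    word = function
    for field in (dest, op2, op1):
        word = (word << N_BITS) | field
    return word

def decode(word):
    """Split a word back into (function, dest, op2, op1)."""
    mask = (1 << N_BITS) - 1
    op1 = word & mask
    op2 = (word >> N_BITS) & mask
    dest = (word >> (2 * N_BITS)) & mask
    function = word >> (3 * N_BITS)
    return function, dest, op2, op1

ADD = 0x01                                   # hypothetical function code
word = encode(ADD, dest=3, op2=2, op1=1)     # ADD r3, r1, r2 in this toy format
```

With these assumed widths the whole instruction fits in f + 3n = 32 bits, which is why a fixed 32-bit instruction can name a destination and two sources at once.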
16. Pipelining
Break instructions into steps
Work on instructions like in an assembly line
Allows for more instructions to be executed in
less time
An n-stage pipeline is n times faster than a
non-pipelined processor (in theory)
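The "n times faster (in theory)" claim can be checked with a small cycle count. The sketch below assumes an ideal pipeline in which every stage takes one cycle and nothing ever stalls; the function names are made up for the example.

```python
def cycles_unpipelined(n_stages, n_instructions):
    """Each instruction runs all of its stages before the next one starts."""
    return n_stages * n_instructions

def cycles_pipelined(n_stages, n_instructions):
    """The first instruction fills the pipeline; each later one adds a cycle."""
    return n_stages + (n_instructions - 1)

def speedup(n_stages, n_instructions):
    return (cycles_unpipelined(n_stages, n_instructions)
            / cycles_pipelined(n_stages, n_instructions))
```

For a 5-stage pipeline and 1000 instructions this gives 5000 cycles against 1004, a speedup just under 5. The factor of n is only approached for long instruction streams, and a single instruction gains nothing.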
18. Without Pipelining
Normally, you would perform the fetch, decode,
execute, operate, and write steps of an instruction
and then move on to the next instruction
20. With Pipelining
The processor is able to perform each stage
simultaneously.
If the processor is decoding an instruction, it may
also fetch another instruction at the same time.
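The overlap described above can be made visible by listing which instruction occupies which stage in each cycle. This is a minimal sketch assuming an ideal three-stage pipeline with no stalls; the stage names and the `schedule` helper are illustrative choices.

```python
STAGES = ["fetch", "decode", "execute"]

def schedule(n_instructions, stages=STAGES):
    """Return {cycle: [(instruction, stage), ...]} for an ideal pipeline."""
    timeline = {}
    for i in range(n_instructions):
        for s, stage in enumerate(stages):
            # Instruction i enters stage s in cycle i + s.
            timeline.setdefault(i + s, []).append((i, stage))
    return timeline

for cycle, work in sorted(schedule(3).items()):
    print(cycle, work)
```

In cycle 1 of this model, instruction 0 is being decoded while instruction 1 is being fetched, which is exactly the situation the slide describes.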
22. Pipeline (cont.)
The clock period of the pipeline is set by the longest step
Thus in RISC, all instructions were made to be the
same length
Each stage takes 1 clock cycle
In theory, an instruction should be finished each
clock cycle
23. Pipeline changes for ARM9TDMI
ARM7TDMI (3 stages)
FETCH:   Instruction Fetch
DECODE:  Thumb→ARM decompress, ARM decode, Reg Select
EXECUTE: Reg Read, Shift, ALU, Reg Write
ARM9TDMI (5 stages)
FETCH:   Instruction Fetch
DECODE:  ARM or Thumb Inst Decode, Reg Decode, Reg Read
EXECUTE: Shift + ALU
MEMORY:  Memory Access
WRITE:   Reg Write
24. ARM10 vs. ARM11 Pipelines
ARM10 (6 stages)
FETCH:   Branch Prediction, Instruction Fetch
ISSUE:   ARM or Thumb Instruction Decode
DECODE:  Reg Read
EXECUTE: Shift + ALU, Multiply
MEMORY:  Memory Access, Multiply Add
WRITE:   Reg Write
ARM11 (8 stages)
Fetch 1, Fetch 2, Decode, Issue, then one of three parallel pipelines:
Shift, ALU, Saturate
MAC 1, MAC 2, MAC 3
Address, Data Cache 1, Data Cache 2
followed by Write back
25. ARM Design Policy
ARM core uses RISC Architecture
Reduced Instruction Set
Load Store Architecture
Large Number of General Purpose Registers
Parallel execution with Pipelines
But some differences from RISC
Enhanced instructions (DSP instructions)
THUMB State
Conditional Execution Instructions
32 bit Barrel Shifter
26. Registers
ARM has Load Store Architecture
General Purpose Registers can hold
data or address
Total of 37 Registers each of 32 bit
There are 17 or 18 active registers
16 data registers
1 or 2 status registers (CPSR, plus SPSR in exception modes)
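The register totals above can be reproduced with a short count. The sketch assumes the classic banked-register layout (as on cores such as the ARM7TDMI), in which FIQ mode has private copies of R8 to R14 and the other exception modes have private copies of R13 and R14; the slides do not spell this breakdown out, so treat it as background rather than as part of the deck.

```python
SHARED = 16  # R0-R15 as seen from User and System mode
# Private (banked) general-purpose registers per exception mode (assumed layout).
BANKED = {"fiq": 7, "irq": 2, "svc": 2, "abt": 2, "und": 2}

general_purpose = SHARED + sum(BANKED.values())  # 31
status = 1 + len(BANKED)                         # CPSR + one SPSR per exception mode
total = general_purpose + status

def active_registers(mode):
    """Registers visible at once: R0-R15, CPSR, plus SPSR in exception modes."""
    return 16 + 1 + (1 if mode in BANKED else 0)
```

Under this assumption the total comes to 37, with 17 registers active in User or System mode and 18 in any exception mode.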
27. Registers
Registers R0-R12 are General Purpose
Registers
R13 is used as Stack Pointer(SP)
R14 is used as Link Register(LR)
R15 is used as Program Counter(PC)
CPSR is Current Program Status Register
SPSR is Saved Program Status Register
[Figure: register bank, R0 through R15]
28. CPSR
Fields, most significant bit first: N Z C V | J | undefined | I F T | Mode
• Condition code flags (hold information about the most recently performed ALU operation)
– N = Negative result from ALU
– Z = Zero result from ALU
– C = ALU operation carried out
– V = ALU operation overflowed
• Interrupt disable bits
– I = 1: Disables the IRQ
– F = 1: Disables the FIQ
• T bit
– Architecture xT only
– T = 0: Processor in ARM state
– T = 1: Processor in Thumb state
• Mode bits
– Specify (set) the processor operating mode
• J bit
– Architecture 5TEJ only
– J = 1: Processor in Jazelle state
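A CPSR value can be split into these fields with a few shifts and masks. The sketch below assumes the standard bit positions (N, Z, C, V in bits 31 to 28, J in bit 24, I, F, T in bits 7, 6, 5 and the mode in bits 4 to 0); only the mode position, CPSR[4:0], is given in the slides, and the function name is made up for the example.

```python
def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its named fields."""
    def bit(n):
        return bool((cpsr >> n) & 1)
    return {
        "N": bit(31),          # negative result
        "Z": bit(30),          # zero result
        "C": bit(29),          # carry out
        "V": bit(28),          # overflow
        "J": bit(24),          # Jazelle state
        "I": bit(7),           # IRQ disabled
        "F": bit(6),           # FIQ disabled
        "T": bit(5),           # Thumb state
        "mode": cpsr & 0x1F,   # CPSR[4:0]
    }

fields = decode_cpsr(0x600000D3)
```

For the example value 0x600000D3, Z and C are set, both interrupts are disabled, the processor is in ARM state and the mode bits are 10011.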
29. Operation Modes
Mode                    Registers   CPSR[4:0]
User                    User        10000
FIQ                     _fiq        10001
IRQ                     _irq        10010
Supervisor              _svc        10011
Abort                   _abt        10111
Undefined Instruction   _und        11011
System                  User        11111
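The table above translates directly into a lookup on the low five bits of the CPSR. This is a small sketch; the names `MODES`, `mode_name` and `is_privileged` are chosen for the example, and the privilege rule simply treats every listed mode other than User as privileged.

```python
# CPSR[4:0] -> (mode name, register bank suffix), copied from the table.
MODES = {
    0b10000: ("User", "User"),
    0b10001: ("FIQ", "_fiq"),
    0b10010: ("IRQ", "_irq"),
    0b10011: ("Supervisor", "_svc"),
    0b10111: ("Abort", "_abt"),
    0b11011: ("Undefined Instruction", "_und"),
    0b11111: ("System", "User"),
}

def mode_name(cpsr):
    """Name of the mode selected by CPSR[4:0], or 'unknown'."""
    return MODES.get(cpsr & 0x1F, ("unknown", None))[0]

def is_privileged(cpsr):
    """Every listed mode except User is treated as privileged."""
    bits = cpsr & 0x1F
    return bits in MODES and bits != 0b10000
```

Note that System mode shares the User register bank but still counts as privileged in this model.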
31. Processor Modes
Processor modes are execution modes that determine
the active registers and privileges
List of Modes
Abort
Fast Interrupt
Interrupt
Supervisor
System
Undefined
User
All except user mode are privileged
User mode is for normal execution of programs and applications
Privileged modes allow full Read/Write to CPSR.
32. Processor Modes
User         Unprivileged mode in which most applications run
FIQ          Entered when a fast interrupt request is raised
IRQ          Entered when a normal interrupt request is raised
Supervisor   Entered on reset and when a software interrupt instruction is executed
Abort        Entered when a data access or instruction prefetch is aborted
Undefined    Entered when an undefined instruction is executed
System       Privileged mode that uses the User mode registers, for the operating system
34. Exceptions
Generated by internal and external events
Seven types of exception are supported
Reset – handled in Supervisor mode
Software Interrupt – handled in Supervisor mode
IRQ – handled in IRQ mode, on an IRQ interrupt
FIQ – handled in FIQ mode, on an FIQ interrupt
Data Abort – handled in Abort mode
Undefined Instruction – handled in Undefined mode
Prefetch Abort – handled in Abort mode
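What happens to the CPSR on exception entry can be sketched as follows. The exception-to-mode mapping comes from the list above and the mode bits from the operation modes table. The rest is an assumption about classic ARM cores that the slides do not state: the old CPSR is saved in the new mode's SPSR, the T bit is cleared, IRQs are disabled, and FIQs are also disabled on reset and FIQ. Vector addresses and the return address in LR are left out.

```python
MODE_BITS = {"svc": 0b10011, "irq": 0b10010, "fiq": 0b10001,
             "abt": 0b10111, "und": 0b11011}

EXCEPTION_MODE = {
    "reset": "svc",
    "software_interrupt": "svc",
    "irq": "irq",
    "fiq": "fiq",
    "data_abort": "abt",
    "undefined_instruction": "und",
    "prefetch_abort": "abt",
}

I_BIT, F_BIT, T_BIT, MODE_MASK = 1 << 7, 1 << 6, 1 << 5, 0x1F

def enter_exception(cpsr, exception):
    """Return (mode, new CPSR, saved SPSR) for a simplified exception entry."""
    mode = EXCEPTION_MODE[exception]
    # Assumption: reset does not preserve the old CPSR, so nothing is saved.
    spsr = None if exception == "reset" else cpsr
    # Clear the old mode and the T bit, select the new mode, disable IRQ.
    new_cpsr = (cpsr & ~(MODE_MASK | T_BIT)) | MODE_BITS[mode] | I_BIT
    if exception in ("reset", "fiq"):
        new_cpsr |= F_BIT          # assumed: FIQ is also masked in these cases
    return mode, new_cpsr & 0xFFFFFFFF, spsr
```

For example, an IRQ taken in User mode while in Thumb state (CPSR 0x30) leaves the processor in IRQ mode, in ARM state, with IRQs masked (CPSR 0x92), and the old value kept as the saved status.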
37. ARM Processor Families
Naming Convention
ARM[x][y][z][T][D][M][I][E][J][F][S]
x – Family
y – Memory management/protection
z – Cache
T – Thumb mode
D – JTAG debugging
M – Multiplier
I – EmbeddedICE macrocell
E – Enhanced instructions (implies TDMI)
J – Jazelle hardware-accelerated Java
F – Floating-point unit
S – Synthesizable version
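The naming convention can be applied mechanically with a regular expression. The sketch below is only an approximation: it assumes the family number is one of 7, 9, 10 or 11, accepts an optional hyphen before the S, and returns None for names that use letters outside the list above (for example ARM1176JZF-S). The function and key names are made up for the example.

```python
import re

# ARM[x][y][z][T][D][M][I][E][J][F][-S]; the family alternatives are an assumption.
NAME_RE = re.compile(
    r"^ARM(?P<x>7|9|10|11)(?P<y>\d)?(?P<z>\d)?"
    r"(?P<flags>[TDMIEJF]*)(?:-?(?P<s>S))?$"
)

def parse_core_name(name):
    """Split a core name into family, y/z digits and feature letters."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    flags = set(m.group("flags"))
    if "E" in flags:
        flags |= set("TDMI")       # E implies TDMI
    if m.group("s"):
        flags.add("S")
    return {
        "family": int(m.group("x")),
        "memory_digit": m.group("y"),   # y: memory management/protection
        "cache_digit": m.group("z"),    # z: cache
        "features": sorted(flags),
    }
```

For instance, ARM926EJ-S parses as family 9 with y = 2 and z = 6, and its feature set includes T, D, M and I even though the name only spells out E, J and S.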
39. Introduction to ARM7TDMI
Architecture version 4T
Von Neumann architecture
32-bit data bus
Data size can be byte, half word or word
Word: 4-byte aligned
Half word: 2-byte aligned
Supports
Thumb: 16-bit compressed instruction set
Debug: on-chip debug support
Enhanced Multiply: higher performance, long multiply
EmbeddedICE hardware
40. Cortex Family
The ARM Cortex family comprises three series, which
all implement the Thumb-2 instruction set to
address the increasing demands of various
markets:
1 ARM Cortex-A Series: application processors
for complex OS and user applications
2 ARM Cortex-R Series: embedded processors
for real-time systems
3 ARM Cortex-M Series: deeply embedded
processors optimized for cost-sensitive
applications, such as microcontrollers.
41. TrustZone
Provides hardware support for two separate address
spaces, i.e. code executing in the non-secure world cannot
gain access to any address space marked as secure
A new mode, ‘Secure Monitor’, within the core acts as a
gatekeeper and reliably switches the system between
secure and non-secure states
Protects on-chip and off-chip memory and peripherals
from software attack
Supports services such as network virus protection, m-commerce
transactions and the protection of user secrets such as
keys
43. Thumb State
Subset of the ARM instructions
Higher code density (35% reduction)
Better performance than 16 bit processors
Suitable for use with 16-bit memory
devices (160% better performance)
Transparently decompressed to 32 bit
instructions
44. ARM State
Able to access large memories more
efficiently
32-bit integer arithmetic in a single
cycle
Larger number of instructions
Better performance
45. Switching States
ARM to Thumb
Execute the BX instruction with state
bit = 1
Thumb to ARM
Execute the BX instruction with state
bit = 0
An interrupt or exception occurs
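The effect of BX can be modelled in a couple of lines. The sketch assumes that the "state bit" is bit 0 of the branch target held in the operand register, as on ARM cores with Thumb support; the function name is made up for the example.

```python
def bx(target):
    """Model BX: return (new PC, thumb_state) for a branch target value."""
    thumb = bool(target & 1)        # state bit: 1 selects Thumb, 0 selects ARM
    pc = target & 0xFFFFFFFE        # the state bit is not part of the address
    return pc, thumb
```

So branching to 0x8001 continues at address 0x8000 in Thumb state, while branching to 0x8000 continues at the same address in ARM state.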
46. Which State to Use
Low-memory system: use Thumb
16-bit memory: use Thumb
Performance is critical: use ARM
Example: execution of interrupt
routines
Performance is critical and memory is low:
use both ARM and Thumb
Example: in interrupt routines
47. ARM Debug Architecture
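[Figure: block diagram showing the ARM core, ETM, EmbeddedICE logic and TAP controller on chip, the trace port and JTAG port, and an Ethernet link to the debugger (plus optional trace tools)]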
EmbeddedICE Logic
Provides breakpoints and processor/system
access
JTAG interface (ICE)
Converts debugger commands to JTAG signals
Embedded Trace Macrocell (ETM)
Compresses real-time instruction and data access
trace
Contains ICE features (trigger & filter logic)
Trace port analyzer (TPA)
Captures trace in a deep buffer
Versions mostly refer to the instruction set that the ARM core executes.
The ARM7, which is still the most often used core in low-power designs, executes the version 4T instruction set. Architectural extensions were added for version 5TE to include DSP instructions, such as 16-bit signed MLA instructions and saturation arithmetic. The ARM926EJ-S and ARM1026EJ-S cores are examples of Version 5 architectures. Version 6 added instructions for doing byte manipulations and graphics algorithms more efficiently. The ARM11 family implemented the Version 6 architecture. Version 7 architectures (which include the Cortex family of cores, such as the Cortex-A8, Cortex-M3 and Cortex-R4) extended the functionality by adding things such as Thumb-2, low-power features, and more security.
Pipeline Comparison
The point of this foil is to show that with the ARM7TDMI a lot of work was carried out in the execute stage of the pipeline. Now with ARM9TDMI the execute stage has been split out into three stages to allow greater throughput.
This then means the CPI is about 1.5 compared against 1.9 for ARM7TDMI, and the operating frequency is approximately double for ARM9TDMI over ARM7TDMI on the same fabrication process. Therefore, at least double the processing power is available.
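The "at least double" figure follows from the usual relation that instruction throughput is proportional to clock frequency divided by CPI. The sketch below plugs in the numbers quoted above; the ARM7TDMI clock is normalised to 1.0, which is an arbitrary choice for the example.

```python
def relative_throughput(clock, cpi):
    """Instructions per unit time, up to a constant factor."""
    return clock / cpi

arm7 = relative_throughput(clock=1.0, cpi=1.9)   # ARM7TDMI, normalised clock
arm9 = relative_throughput(clock=2.0, cpi=1.5)   # ARM9TDMI, roughly twice the clock
ratio = arm9 / arm7                              # 2 * 1.9 / 1.5
```

With these figures the ratio comes out at about 2.5, so the doubled clock accounts for a factor of 2 and the lower CPI for the remainder.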
It is possible for the pipeline to interlock. Forwarding paths have been provided to minimise interlocks as much as possible, but they can still occur. With a little care when writing code, they can almost be eliminated.
ARM10: another stage was added to the ARM9’s pipeline to provide additional time for coprocessor instruction decode and branch prediction. The multiplier is now spread over two stages, execute and memory, since the multiplier is also pipelined.
Note that the ARM9E multiplier is also pipelined (like ARM10), so the upper diagram strictly applies only to the ARM9TDMI.
ARM11: the processor is a single-issue processor, meaning that only one instruction per cycle can be issued from the issue stage to one of the three back-end pipelines.
While instructions are issued in order, they may complete out of order. This depends on the availability of data, the length of execution and memory access times.
Debugger trace tools
Have copy of the code image
Configure ETM trace via JTAG
Receive compressed trace from ETM
Decompress ETM trace using code image