True 64 bit processor (unlike AMD64 and EM64T, which are 64 bit extensions to the IA-32 architecture).
EPIC architecture (Explicitly Parallel Instruction Computing): tight coupling between hardware (processor) and software (compiler).
Performance depends on compilers generating efficient code (Instruction Level Parallelism, ILP), which requires more intelligence in the compiler back-end.
Itanium Instruction Set Architecture (ISA), Itanium System Environment (ISE).
Initiated by HP in 1990 (code name PA-WideWord, PA-WW), jointly developed by Intel and HP from 1993, originally as a follow-up to their respective x86 and PA-RISC processors. Other processors that would be obsoleted by Itanium were DEC/Compaq Alpha and SGI MIPS.
Itanium has had problems gaining market share since its introduction in 2001. The first processors shipped two years late. Performance was poor, particularly in the IA-32 compatibility unit, which was buggy as well. SMP configurations (Symmetric Multi-Processing) were limited to four processors, and not enough native Itanium applications were available.
This forced HP to extend its PA-RISC roadmap with an extra generation (8900 series), and forced SGI to add another two generations to its MIPS roadmap.
Sun cancelled its Solaris port for Itanium. And with Linux emerging, IBM, SCO, and Sequent cancelled their joint effort to combine their respective AIX, UnixWare, and PTX UNIXes into a single operating system (Monterey-64).
Finally, after AMD introduced 64 bit extensions (AMD64) to its processors, Intel was forced to follow with EM64T, pushing Itanium further up and away from the server market segment dominated by Intel Xeon and AMD Opteron processors.
Nowadays, all HP engineers working on Itanium have been transferred to Intel.
Itanium is only sold in high-end server and High Performance Computing (HPC) markets.
Since IBM and Dell dropped Itanium servers from their product portfolios in 2005, the vast majority of Itanium systems have been sold by HP. That makes Itanium mostly a replacement for HP's Alpha (DEC/Compaq) and PA-RISC processors, competing against IBM Power and Sun/Fujitsu UltraSparc. Other manufacturers selling Itanium systems are Bull, Fujitsu, Hitachi, NEC (together with Unisys), and SGI.
Today, Itanium is supported by Linux (Trillian, Red Hat, and SuSE), Compaq Tru64 Unix (HP), and HP-UX of course. Currently SGI is in a transition from its traditional IRIX on MIPS systems to Linux on Itanium.
Longhorn Server will support Itanium only for specific workloads: databases, custom jobs, and line-of-business applications.
In 2005, the companies currently manufacturing or selling Itanium hardware founded the Itanium Solutions Alliance, to promote the availability and acceptance of Itanium solutions in the market.
256 application registers:
64 predicate registers (PR0 - PR63): contain predicate test (compare) results, for conditional execution of instructions,
first 16 registers static: available to all programs,
rest rotating: can be renamed to accelerate loops.
8 branch registers (BR0 - BR7).
128 application registers (AR0 - AR127): special-purpose data and control registers.
4 Privilege Levels (PL): 0-3.
Current Privilege Level (CPL) in PSR.cpl (Processor Status Register, PSR).
Bi-endian memory access: controlled by UM.be bit (User Mask, UM).
Memory mapped I/O.
Processor virtualization: enabled by PSR.vm bit, managed by PAL.
Virtual Machine Monitor (VMM): managing and virtualizing system resources, creating a virtual environment (Virtual Processor Descriptor, VPD).
IA-32 compatibility mode: IA-32 System Environment, i.e. Pentium III.
16 bit Real Mode, 16 bit VM86, 16/32 bit Protected Mode, memory segmentation.
Multimedia instruction sets: MMX, SSE.
Switch between Itanium and IA-32 instruction sets using JMPE, br.ia, and rtfi.
All interruptions handled by Itanium instruction set code.
Current execution mode in PSR.is.
PA-RISC supported through Aries emulator.
Operating system: supported through Extensible Firmware Interface (EFI).
System Abstraction Layer (SAL): firmware providing platform initialization, configuration, and test, operating system boot, and run-time functionality (i.e. BIOS (Basic Input Output System), Machine Checks, and Platform Management Interruptions (PMI, successor to the IA-32 System Management Mode (SMM))).
Processor Abstraction Layer (PAL): firmware providing processor specific Machine Checks, initialization, PMI, power management, configuration, and error recovery.
Developer's Interface Guide for IA-64 Servers (DIG64): design guidelines for building blocks and interfaces of IA-64 systems, providing an interoperable and stable baseline hardware interface for software developers.
On-die L1 cache (Harvard architecture):
On-die, unified L2 cache:
96 kbyte,
6-way set-associative, 64 byte line size,
write back,
6 cycles minimum latency for integers, 9 cycles minimum latency for FPs,
max. 2 requests per clock (banks).
Cache coherency through MESI protocol.
L3 cache:
2 or 4 Mbyte, apart in package, connected through Front Side Bus (FSB),
4-way set-associative, 64 byte line size,
21 cycles minimum latency for integers, 24 cycles minimum latency for FPs,
bandwidth 16 bytes per core cycle (64 bit DDR, 128 bit bus to core),
12 Gbyte/s max. throughput.
Cache coherency through MESI protocol.
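The L2 and L3 figures above (size, associativity, line size) determine the number of sets in each cache. The following is a minimal sketch of that arithmetic, assuming that kbyte and Mbyte denote 1024 and 1024 x 1024 bytes; the function name `cache_sets` is made up for this illustration.

```python
KBYTE = 1024           # assumption: binary units
MBYTE = 1024 * KBYTE


def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("cache size is not a multiple of ways * line size")
    return sets


# L2: 96 kbyte, 6-way set-associative, 64 byte line size
l2_sets = cache_sets(96 * KBYTE, ways=6, line_bytes=64)

# L3: 2 or 4 Mbyte, 4-way set-associative, 64 byte line size
l3_sets_2m = cache_sets(2 * MBYTE, ways=4, line_bytes=64)
l3_sets_4m = cache_sets(4 * MBYTE, ways=4, line_bytes=64)

print(l2_sets, l3_sets_2m, l3_sets_4m)
```

The unusual 96 kbyte size follows from the 6-way organisation: 256 sets x 6 ways x 64 bytes.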
Translation Look-aside Buffer (TLB) and Virtual Hash Page Table (VHPT):
Hardware Page Walker (HPW): loads VHPT from L2 cache / L3 cache / memory at TLB misses.
64 entry, fully associative,
32 entry, fully associative,
10 cycles penalty at miss,
96 entry, fully associative,
page size 4 kbyte - 256 Mbyte supported.
Advanced Load Address Table (ALAT): between L1 data cache (L1D) and DTLB, keeps track of speculative data loads,
32 entry, two-way set-associative.
Double pipeline: 10 stage in-order, 6 instructions wide.
Split issue dispersal: three instructions (16 bytes) per bundle.
Scoreboarding, non-blocking caches (for compile-time non-determinism).
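As an illustration of the three-instruction, 16 byte bundle mentioned above: in the Itanium instruction set a 128 bit bundle consists of a 5 bit template followed by three 41 bit instruction slots. That bit layout is not stated on this page, so treat it as background from the architecture manuals. The sketch below, with hypothetical helper names, only splits a bundle into its fields and does not decode the instructions themselves.

```python
TEMPLATE_BITS = 5   # template field in bits 4..0
SLOT_BITS = 41      # three instruction slots of 41 bits each
BUNDLE_BYTES = 16   # 5 + 3 * 41 = 128 bits


def decode_bundle(raw: bytes) -> tuple[int, list[int]]:
    """Split a 16 byte bundle into (template, [slot0, slot1, slot2])."""
    if len(raw) != BUNDLE_BYTES:
        raise ValueError("a bundle is exactly 16 bytes")
    value = int.from_bytes(raw, "little")  # assumption: little-endian storage
    template = value & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (value >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(3)
    ]
    return template, slots


def encode_bundle(template: int, slots: list[int]) -> bytes:
    """Inverse of decode_bundle, used here only to build test data."""
    value = template & ((1 << TEMPLATE_BITS) - 1)
    for i, slot in enumerate(slots):
        value |= (slot & ((1 << SLOT_BITS) - 1)) << (TEMPLATE_BITS + i * SLOT_BITS)
    return value.to_bytes(BUNDLE_BYTES, "little")


raw = encode_bundle(0x10, [0x1, 0x2, 0x3])
print(decode_bundle(raw))
```

The template tells the hardware which execution unit type each slot needs, which is how the compiler passes its scheduling decisions to the processor.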
17 execution units:
9 issue ports:
Dynamic prefetch, branch prediction, speculative execution.
Branch prediction:
512 entry, two-level.
Branch Target Address Cache (BTAC): 64 entry.
Interval Time Counter (ITC): register for timing ticks.
In 32 bit compatibility mode: Time Stamp Counter (TSC).
Streamlined Advanced PIC (SAPIC): based on IA-32 APIC (Advanced Programmable Interrupt Controller),
for Aborts, Interrupts, Faults, and Traps:
Virtual address space: 64 bit, no segmentation.
Multiple Address Space (MAS): each process has its own unique Virtual Region (flat linear address space).
8 Virtual Regions of 61 bits each (Virtual Region Number, VRN; Region Identifier, RID), 2^24 Virtual Address Spaces of 2^61 bytes.
4 kbyte - 256 Mbyte pages (Virtual Page Number, VPN).
Physical address space: 63 bit.
Up to 50 bits supported in page tables.
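The split of a 64 bit virtual address into a region number and a 61 bit offset can be sketched as follows. This assumes the usual layout in which the upper three bits select one of the 8 Virtual Regions (VRN) and the remaining 61 bits address within that region; the helper names are hypothetical, and the region register lookup (VRN to RID) is left out.

```python
VRN_BITS = 3       # 8 virtual regions
OFFSET_BITS = 61   # offset within a region


def split_virtual_address(va: int) -> tuple[int, int]:
    """Return (VRN, offset within the region) for a 64 bit virtual address."""
    if not 0 <= va < (1 << (VRN_BITS + OFFSET_BITS)):
        raise ValueError("virtual address must fit in 64 bits")
    vrn = va >> OFFSET_BITS
    offset = va & ((1 << OFFSET_BITS) - 1)
    return vrn, offset


def virtual_page_number(offset: int, page_bytes: int = 4 * 1024) -> int:
    """Virtual Page Number of an offset for a given (power of two) page size."""
    if page_bytes <= 0 or page_bytes & (page_bytes - 1):
        raise ValueError("page size must be a power of two")
    return offset // page_bytes


vrn, offset = split_virtual_address(0xE000_0000_0000_1234)
print(vrn, hex(offset), virtual_page_number(offset))
```

Because the VRN is part of the address itself, a process can have up to eight regions mapped at once without any segmentation.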
Write Coalescing (WC): streams of non-cachable writes can be combined into a single bus write transaction.
WC Buffer (WCB): two-entry, 64 byte.
Enhanced Machine Check Architecture (EMCA): parity and ECC (Error-Correcting Code) on all major address and data busses.
44 bit address bus.
Physical addressing:
133 MHz DDR bus (Merced bus): 64 bit data.
Source Synchronous Signaling (SSS).
2.1 Gbyte/s max. throughput.
Assisted Gunning Transceiver Logic signaling (AGTL+),
based on GTL+ bus of Intel Pentium III and Pentium III Xeon processors.
1.5 V ± 1.5 %.
Power pod connector.
Tests:
Processor performance monitoring and profiling:
SMP (Symmetric Multi-Processing): glueless up to four processors (max. 16 in IA-32 compatibility mode).
Shared memory, cache coherency through MESI protocol.
Multiplier (Phase Lock Loop, PLL):
set through pins during reset:
| multiplier\pin | LINT[1] LINT[0] IGNNE# A20M# |
|---|---|
| 2/11 | 0000 |
| 2/12 | 0111 |
Power and performance management:
P-states:
Performance:
Thermal management: via on-die thermal diode:
System management: System Management Bus (SMBus):
Marking:
CPUID: 8 byte registers:
CPUID return values:
| 0x10 | L1D: 16 kbyte, 4-way set-associative, 32 byte line size |
| 0x15 | L1I: 16 kbyte, 4-way set-associative, 32 byte line size |
| 0x1A | L2: 96 kbyte, 6-way set-associative, 64 byte line size |
| 0x88 | L3: 2 Mbyte, 4-way set-associative, 64 byte line size |
| 0x89 | L3: 4 Mbyte, 4-way set-associative, 64 byte line size |
| 0x8A | L3: 8 Mbyte, 4-way set-associative, 64 byte line size |
| 0x90 | ITLB: 64 entry, fully associative, 4 kbyte - 256 Mbyte pages |
| 0x96 | DTLB0: 32 entry, fully associative, 4 kbyte - 256 Mbyte pages |
| 0x9B | DTLB1: 96 entry, fully associative, 4 kbyte - 256 Mbyte pages |
With the EAX register set to 2, CPUID returns the following descriptor bytes in the EAX, EBX, ECX, and EDX registers (MSB - LSB):
| EAX | 0x00 | 0x15 | 0x10 | 0x00 |
| EBX | 0x00 | 0x00 | 0x88/0x89 | 0x00 |
| ECX | 0x00 | 0x9B | 0x00 | 0x00 |
| EDX | 0x80 | 0x00 | 0x00 | 0x00 |
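The register table above can be decoded mechanically: each register holds four descriptor bytes, and each non-zero byte is looked up in the descriptor list. The sketch below makes two assumptions that come from the general IA-32 CPUID convention rather than from this page: a register with bit 31 set carries no valid descriptors, and a 0x00 byte is a null descriptor. Descriptor 0x89 is taken to be an L3 descriptor, since the EBX row pairs it with 0x88. Only the descriptors that actually appear in the register table are included.

```python
DESCRIPTORS = {
    0x10: "L1D: 16 kbyte, 4-way set-associative, 32 byte line size",
    0x15: "L1I: 16 kbyte, 4-way set-associative, 32 byte line size",
    0x88: "L3: 2 Mbyte, 4-way set-associative, 64 byte line size",
    0x89: "L3: 4 Mbyte, 4-way set-associative, 64 byte line size",  # assumed L3
    0x9B: "DTLB1: 96 entry, fully associative, 4 kbyte - 256 Mbyte pages",
}


def decode_leaf2(registers: dict[str, int]) -> list[tuple[str, int, str]]:
    """Return (register, descriptor byte, description) for every valid byte."""
    found = []
    for name in ("EAX", "EBX", "ECX", "EDX"):
        value = registers[name]
        if value & 0x8000_0000:        # bit 31 set: no valid descriptors
            continue
        for byte in value.to_bytes(4, "big"):   # MSB first, as in the table
            if byte == 0x00:           # null descriptor
                continue
            found.append((name, byte, DESCRIPTORS.get(byte, "unknown descriptor")))
    return found


# Register values from the table, for the 2 Mbyte L3 variant (EBX byte 0x88).
merced = {"EAX": 0x0015_1000, "EBX": 0x0000_8800,
          "ECX": 0x009B_0000, "EDX": 0x8000_0000}

for register, byte, text in decode_leaf2(merced):
    print(f"{register} 0x{byte:02X} {text}")
```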
Used by HP as PA-RISC replacement, and in High Performance Computing (HPC).
Only a few thousand delivered.
Succeeded by Itanium 2 in 2002.
Successor of Itanium processor.
Major revision:
256 application registers:
64 predicate registers (PR0 - PR63): contain predicate test (compare) results, for conditional execution of instructions,
first 16 registers static: available to all programs,
rest rotating: can be renamed to accelerate loops.
8 branch registers (BR0 - BR7).
128 application registers (AR0 - AR127): special-purpose data and control registers.
4 Privilege Levels (PL): 0-3.
Current Privilege Level (CPL) in PSR.cpl (Processor Status Register, PSR).
Bi-endian memory access: controlled by UM.be bit (User Mask, UM).
Memory mapped I/O.
Processor virtualization: enabled by PSR.vm bit, managed by PAL.
Virtual Machine Monitor (VMM): managing and virtualizing system resources, creating a virtual environment.
From Montecito: Intel VT for Itanium (VT-i),
Virtual Processor Descriptor (VPD): description of resources of a single virtual processor,
rest of Virtual Processor State (VPS) maintained by VMM.
IA-32 compatibility mode: IA-32 System Environment, i.e. Pentium III.
16 bit Real Mode, 16 bit VM86, 16/32 bit Protected Mode, memory segmentation.
Multimedia instruction sets: MMX, SSE, SSE2 (from Intel Itanium 2 9000 series processor).
Switch between Itanium and IA-32 instruction sets using JMPE, br.ia, and rtfi.
All interruptions handled by Itanium instruction set code.
Current execution mode in PSR.is.
From Madison, IA-32 support implemented in software, as part of operating system (IA-32 Execution Layer, EL),
IA-32 EL provided by Intel for Linux and Windows,
erratum: segmentation not supported in IA-32 EL versions 4.3, 4.4, 5.3, and 6.5,
erratum: 16 bit application mode not supported in IA-32 EL versions 4.3, 4.4, 5.3, and 6.5,
note: CPUID returns only manufacturer and family of emulated processor model.
PA-RISC supported through Aries emulator.
Operating system: supported through Extensible Firmware Interface (EFI).
System Abstraction Layer (SAL): firmware providing platform initialization, configuration, and test, operating system boot, and run-time functionality (i.e. BIOS (Basic Input Output System), Machine Checks, and Platform Management Interruptions (PMI, successor to the IA-32 System Management Mode (SMM))).
Processor Abstraction Layer (PAL): firmware providing processor specific Machine Checks, initialization, PMI, power management, configuration, and error recovery.
Developer's Interface Guide for IA-64 Servers (DIG64): design guidelines for building blocks and interfaces of IA-64 systems, providing an interoperable and stable baseline hardware interface for software developers.
On-die L1 cache (Harvard architecture):
On-die, unified L2 cache:
256 kbyte,
8-way set-associative, 128 byte line size,
write back, write-allocate,
non-blocking, out-of-order,
5 cycles minimum latency for integers, 6 cycles minimum latency for FPs,
16 byte banks,
32 Gbyte/s max. reading speed.
Cache coherency through MESI protocol.
From Itanium 2 9000 series processors: on-die L2 cache (Harvard architecture):
Cache coherency through MESI protocol.
8-way set-associative, 128 byte line size,
7 cycle latency,
8-way set-associative, 128 byte line size,
write-back, write-allocate,
non-blocking, out-of-order,
5 cycles minimum latency for integers, 6 cycles minimum latency for FPs,
16 byte banks,
32 Gbyte/s max. reading speed.
Intel Cache Safe Technology: protection of data and tags: double bit detection, single bit correction (ECC, Error-Correcting Code).
On-die, unified L3 cache:
up to 2x 12 = 24 Mbyte,
McKinley and Madison: 4-way set-associative per Mbyte, Madison 9M: 2-way set-associative per Mbyte; 128 byte line size,
fully pipelined, non-blocking,
McKinley: 12 cycles minimum latency for integers, 13 cycles minimum latency for FPs; Madison and Madison 9M: 14 cycles minimum latency for integers, 15 cycles minimum latency for FPs; Montecito: 14 cycles latency,
bandwidth to core 32 bytes per core cycle (256 bit bus to core),
6.2 Gbyte/s max. traffic speed to memory,
providing data to core at up to 48 Gbyte/s.
Cache coherency through MESI protocol,
Intel Cache Safe Technology: protection of data and tags: double bit detection, single bit correction (ECC, Error-Correcting Code).
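The two L3 bandwidth figures above are related by the core clock: 32 bytes per core cycle only becomes 48 Gbyte/s at one particular frequency. The sketch below works that out, assuming 1 Gbyte/s = 10^9 bytes/s; the 1.5 GHz clock it arrives at is derived from those two figures and is not stated in this list.

```python
def core_bandwidth_gbyte_s(bytes_per_cycle: int, core_mhz: float) -> float:
    """Peak bandwidth in Gbyte/s for a bus moving bytes_per_cycle every core cycle."""
    return bytes_per_cycle * core_mhz / 1000


def required_core_mhz(bytes_per_cycle: int, gbyte_s: float) -> float:
    """Core clock in MHz needed to reach gbyte_s at bytes_per_cycle per cycle."""
    return gbyte_s * 1000 / bytes_per_cycle


# 256 bit bus to core = 32 bytes per core cycle, quoted peak 48 Gbyte/s.
clock = required_core_mhz(32, 48)
print(clock, core_bandwidth_gbyte_s(32, clock))
```

At lower core clocks the same 256 bit bus delivers proportionally less, for example 32 Gbyte/s at 1 GHz.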
Translation Look-aside Buffer (TLB) and Virtual Hash Page Table (VHPT):
Hardware Page Walker (HPW): loads VHPT from L2 cache / L3 cache / memory at TLB misses.
64 entry, fully associative,
only page size of 4 kbyte supported,
2 cycles latency,
128 entry, fully associative,
page size 4 kbyte - 256 Mbyte supported.
32 entry, fully associative,
2 cycles penalty at miss,
only page size of 4 kbyte supported,
128 entry, fully associative,
page size 4 kbyte - 4 Gbyte supported.
Advanced Load Address Table (ALAT): between L1 data cache (L1D) and DTLB, keeps track of speculative data loads,
32 entry, fully associative.
Double pipeline: 8 stage in-order, 6 instructions wide.
Split issue dispersal: three instructions (16 bytes) per bundle.
Scoreboarding, non-blocking caches (for compile-time non-determinism).
Execution units, all fully pipelined:
1 cycle latency,
2 cycles latency,
executing 1 SIMD FP operation per cycle,
only a single issue port for PMUL and POPCNT,
ANSI/IEEE-754,
FMAC: Floating Point Multiply Add Calculation: multiply and add of 82 bit floating point values in one cycle (for matrix calculations),
4 cycles latency,
4 cycles latency,
11 issue ports:
serving the execution units above.
Dynamic prefetch, optimized branch prediction, speculative execution.
Branch prediction:
512 entry, two-level.
Branch Target Address Cache (BTAC): 64 entry.
Interval Time Counter (ITC): register for timing ticks.
In 32 bit compatibility mode: Time Stamp Counter (TSC).
Streamlined Advanced PIC (SAPIC): based on IA-32 APIC (Advanced Programmable Interrupt Controller),
for Aborts, Interrupts, Faults, and Traps:
Virtual address space: 64 bit, no segmentation.
Multiple Address Space (MAS): each process has its own unique Virtual Region (flat linear address space).
8 Virtual Regions of 61 bits each (Virtual Region Number, VRN; Region Identifier, RID), 2^24 Virtual Address Spaces of 2^61 bytes.
4 kbyte - 4 Gbyte pages (Virtual Page Number, VPN).
Physical address space: 63 bit.
Up to 50 bits supported in page tables.
Write Coalescing (WC): streams of non-cachable writes can be combined into a single bus write transaction.
WC Buffer (WCB): two-entry, 128 byte.
Enhanced Machine Check Architecture (EMCA): parity and ECC (Error-Correcting Code) on all major address and data busses.
50 bit address bus.
Physical addressing:
200/266/333 MHz DDR bus (McKinley bus, Scalability Port): 128 bit data.
Source Synchronous Signaling (SSS).
6.4/8.5/10.6 Gbyte/s max. throughput.
Assisted Gunning Transceiver Logic signaling (AGTL+),
based on GTL+ bus of Intel Pentium III and Pentium III Xeon processors.
1.5 V ± 1.5 %.
Power pod connector.
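The bus throughput figures above follow from the bus clock, the double data rate, and the 128 bit data width. A minimal sketch of that calculation, assuming 1 Gbyte/s = 10^9 bytes/s and nominal clocks of exactly 200, 266, and 333 MHz:

```python
def ddr_bus_gbyte_s(bus_mhz: float, data_bits: int) -> float:
    """Peak throughput of a DDR bus: two transfers per clock, data_bits wide."""
    transfers_per_second = bus_mhz * 1_000_000 * 2
    bytes_per_transfer = data_bits // 8
    return transfers_per_second * bytes_per_transfer / 1_000_000_000


for mhz in (200, 266, 333):
    print(mhz, "MHz:", round(ddr_bus_gbyte_s(mhz, 128), 3), "Gbyte/s")
```

This gives 6.4, 8.512, and 10.656 Gbyte/s. The quoted 8.5 and 10.6 Gbyte/s therefore appear to be truncated rather than rounded values.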
Tests:
Processor performance monitoring and profiling:
Hyper-Threading Technology (HTT) (from Montecito),
Temporal Multi-Threading (TMT; Switch-on-Event Multi-Threading, SoEMT): threads not running simultaneously, core switches in case of high-latency event.
SMP (Symmetric Multi-Processing): glueless up to four processors (max. 16 in IA-32 compatibility mode).
Max. four processors at 200 MHz, max. two processors at 266 or 333 MHz.
Shared memory, cache coherency through MESI protocol.
Multiplier (Phase Lock Loop, PLL):
set through pins during reset:
| multiplier\pin | A21# - A17# |
|---|---|
| 2/9 | 10110 |
| 2/10 | 10101 |
| 2/13 | 10010 |
| 2/14 | 10001 |
| 2/15 | 10000 |
| 2/16 | 01111 |
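The multiplier settings above can be turned into core clock frequencies. The sketch below reads an entry such as 2/9 as the bus-to-core clock ratio, so the core runs at 9/2 times the bus clock; that reading, and the 200 MHz bus clock used in the example, are assumptions. The pin strings are copied as listed (A21# down to A17#) without interpreting the electrical levels.

```python
from fractions import Fraction

# Pin pattern A21# - A17# -> bus-to-core clock ratio, as listed above.
RATIO_BY_PINS = {
    "10110": Fraction(2, 9),
    "10101": Fraction(2, 10),
    "10010": Fraction(2, 13),
    "10001": Fraction(2, 14),
    "10000": Fraction(2, 15),
    "01111": Fraction(2, 16),
}


def core_clock_mhz(bus_mhz: int, pins: str) -> Fraction:
    """Core clock in MHz for a bus clock and a multiplier pin pattern."""
    try:
        ratio = RATIO_BY_PINS[pins]
    except KeyError:
        raise ValueError(f"pin pattern {pins!r} is not in the table") from None
    return bus_mhz / ratio


for pins in RATIO_BY_PINS:
    print(pins, float(core_clock_mhz(200, pins)), "MHz")
```

With a 200 MHz bus clock this yields 900, 1000, 1300, 1400, 1500, and 1600 MHz.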
Power and performance management:
P-states:
Performance:
Thermal management: via on-die thermal diode:
System management: System Management Bus (SMBus):
Marking:
CPUID: 8 byte registers:
| Family number | Model number | Processor |
|---|---|---|
| 0x07 | 0x00 | Itanium Merced |
| 0x1F | 0x00 | Itanium 2 McKinley |
| 0x1F | 0x01 | Itanium 2 Madison, Deerfield, Hondo |
| 0x1F | 0x02 | Itanium 2 Madison 9M, Fanwood |
| 0x20 | 0x00 | Itanium 2 9000 series Montecito, Millington |
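The family and model numbers listed above can be extracted from the 8 byte CPUID version register and mapped to a processor name. The field layout used below (number in bits 7..0, revision in 15..8, model in 23..16, family in 31..24, architecture revision in 39..32 of CPUID register 3) is an assumption taken from the Itanium architecture manuals and is not given on this page; the sample register value is invented for the example.

```python
PROCESSORS = {
    (0x07, 0x00): "Itanium Merced",
    (0x1F, 0x00): "Itanium 2 McKinley",
    (0x1F, 0x01): "Itanium 2 Madison, Deerfield, Hondo",
    (0x1F, 0x02): "Itanium 2 Madison 9M, Fanwood",
    (0x20, 0x00): "Itanium 2 9000 series Montecito, Millington",
}


def decode_version(cpuid3: int) -> dict[str, int]:
    """Split the 64 bit version register into its byte-wide fields."""
    return {
        "number": cpuid3 & 0xFF,            # assumed: bits 7..0
        "revision": (cpuid3 >> 8) & 0xFF,   # assumed: bits 15..8
        "model": (cpuid3 >> 16) & 0xFF,     # assumed: bits 23..16
        "family": (cpuid3 >> 24) & 0xFF,    # assumed: bits 31..24
        "archrev": (cpuid3 >> 32) & 0xFF,   # assumed: bits 39..32
    }


def identify(cpuid3: int) -> str:
    """Look up the processor name for a version register value."""
    fields = decode_version(cpuid3)
    return PROCESSORS.get((fields["family"], fields["model"]), "unknown")


# Hypothetical value: family 0x1F, model 0x01, revision 0x05, number 0x04.
sample = (0x1F << 24) | (0x01 << 16) | (0x05 << 8) | 0x04
print(identify(sample))
```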
CPUID return values:
| 0x10 | L1D: 16 kbyte, 4-way set-associative, 32 byte line size |
| 0x15 | L1I: 16 kbyte, 4-way set-associative, 32 byte line size |
| 0x1A | L2: 96 kbyte, 6-way set-associative, 64 byte line size |
| 0x88 | L3: 2 Mbyte, 4-way set-associative, 64 byte line size |
| 0x89 | L3: 4 Mbyte, 4-way set-associative, 64 byte line size |
| 0x8A | L3: 8 Mbyte, 4-way set-associative, 64 byte line size |
| 0x90 | ITLB: 64 entry, fully associative, 4 kbyte - 256 Mbyte pages |
| 0x96 | DTLB0: 32 entry, fully associative, 4 kbyte - 256 Mbyte pages |
| 0x9B | DTLB1: 96 entry, fully associative, 4 kbyte - 256 Mbyte pages |
With the EAX register set to 2, CPUID returns the following descriptor bytes in the EAX, EBX, ECX, and EDX registers (MSB - LSB):
| EAX | 0x00 | 0x15 | 0x10 | 0x00 |
| EBX | 0x00 | 0x00 | 0x88/0x89 | 0x00 |
| ECX | 0x00 | 0x9B | 0x00 | 0x00 |
| EDX | 0x80 | 0x00 | 0x00 | 0x00 |
IA-32 CPUID cache return values:
| 0x67 | L1D: 64 kbyte, 4-way set-associative, 64 byte line size |
| 0x77 | L1I: 64 kbyte, 4-way set-associative, 64 byte line size |
| 0x7E | L2: 256 kbyte, 8-way set-associative, 128 byte line size |
| 0x7E | L3: 3 Mbyte, 12-way set-associative, 128 byte line size |
Used by HP as PA-RISC replacement, and in High Performance Computing (HPC).