The ChipList, by Adrian Offerman; The Processor Portal

Intel Itanium 2 processor

Successor of Itanium processor.
Major revision:

  • improvements in number of pipeline units, number of clock cycles,
  • according to Intel and HP one-and-a-half to two times faster than the Itanium processor,
  • brl instruction: 64 bit long branch,
  • scoreboarding of multi-cycle instructions, e.g. L1D misses, multimedia, FP.

Compatibility


256 application registers:

  • 128 64 bit general purpose registers (integer and multimedia):
    32 general registers (GR0 - GR31): static, available to all programs,
    rest (GR32 - GR127) stacked: available per program,
    managed by Register Stack Engine (RSE) (stack pointer (SP): Current Frame Marker, CFM),
  • 128 82 bit floating point registers (FR0 - FR127):
    first 32 registers static: available to all programs,
    rest rotating: can be renamed to accelerate loops.

64 predicate registers (PR0 - PR63): contain predicate test (compare) results, for conditional execution of instructions,
first 16 registers static: available to all programs,
rest rotating: can be renamed to accelerate loops.
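
As a rough illustration of the register partitioning above, the following Python sketch classifies a register number as static, stacked, or rotating. The function name `classify_register` and the kind labels are made up for this example; only the ranges come from the listing.

```python
def classify_register(kind: str, number: int) -> str:
    """Return the partition a register falls in, per the ranges listed above.

    kind is "GR", "FR" or "PR" (hypothetical labels used only in this sketch).
    """
    # kind -> (register count, first non-static register, name of upper partition)
    layout = {
        "GR": (128, 32, "stacked"),   # GR32 - GR127, managed by the RSE
        "FR": (128, 32, "rotating"),  # FR32 - FR127, can be renamed
        "PR": (64, 16, "rotating"),   # PR16 - PR63, can be renamed
    }
    count, first_upper, upper_name = layout[kind]
    if not 0 <= number < count:
        raise ValueError(f"{kind}{number} does not exist")
    return "static" if number < first_upper else upper_name
```

For example, `classify_register("GR", 31)` falls in the static partition, while `classify_register("GR", 32)` falls in the stacked one.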

8 branch registers (BR0 - BR7).

128 application registers (AR0 - AR127): special-purpose data and control registers.

4 Privilege Levels (PL): 0-3.
Current Privilege Level (CPL) in PSR.cpl (Processor Status Register, PSR).

Bi-endian memory access: controlled by UM.be bit (User Mask, UM).

Memory mapped I/O.

Processor virtualization: enabled by PSR.vm bit, managed by PAL.
Virtual Machine Monitor (VMM): managing and virtualizing system resources, creating a virtual environment.
From Montecito: Intel VT for Itanium (VT-i),
Virtual Processor Descriptor (VPD): description of resources of a single virtual processor,
rest of Virtual Processor State (VPS) maintained by VMM.

IA-32 compatibility mode: IA-32 System Environment, i.e. Pentium III.
16 bit Real Mode, 16 bit VM86, 16/32 bit Protected Mode, memory segmentation.
Multimedia instruction sets: MMX, SSE, SSE2 (from Intel Itanium 2 9000 series processor).
Switch between Itanium and IA-32 instruction sets using JMPE, br.ia, and rfi.
All interruptions handled by Itanium instruction set code.
Current execution mode in PSR.is.
From Madison, IA-32 support implemented in software, as part of operating system (IA-32 Execution Layer, EL),
IA-32 EL provided by Intel for Linux and Windows,
erratum: segmentation not supported in IA-32 EL versions 4.3, 4.4, 5.3, and 6.5,
erratum: 16 bit application mode not supported in IA-32 EL versions 4.3, 4.4, 5.3, and 6.5,
note: CPUID returns only manufacturer and family of emulated processor model.

PA-RISC supported through Aries emulator.

Operating system: supported through Extensible Firmware Interface (EFI).
System Abstraction Layer (SAL): firmware providing platform initialization, configuration, and test, operating system boot, run-time functionality (i.e. BIOS (Basic Input Output System), Machine Checks, and Platform Management Interruptions (PMI, successor of IA-32 System Management Mode (SMM))).
Processor Abstraction Layer (PAL): firmware providing processor specific Machine Checks, initialization, PMI, power management, configuration, and error recovery.

Developer's Interface Guide for IA-64 Servers (DIG64): design guidelines for building blocks and interfaces of IA-64 systems, providing an interoperable and stable baseline hardware interface for software developers.

Cache


On-die L1 cache (Harvard architecture):

  • 16 kbyte instruction cache (L1I):
    4-way set-associative, 64 byte line size,
    1 cycle latency,
    32 Gbyte/s max. reading speed.
  • 16 kbyte data cache (L1D):
    4-way set-associative, 64 byte line size,
    write-through, no write-allocate,
    1 cycle latency for integers; FPs and semaphores bypass the L1 data cache,
    32 Gbyte/s max. reading speed, 16 Gbyte/s max. writing speed.

On-die, unified L2 cache:
256 kbyte,
8-way set-associative, 128 byte line size,
write back, write-allocate,
non-blocking, out-of-order,
5 cycles minimum latency for integers, 6 cycles minimum latency for FPs,
16 byte banks,
32 Gbyte/s max. reading speed.
Cache coherency through MESI protocol.
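
The set counts implied by the L1 and L2 figures above follow from size / (ways x line size). A minimal Python sketch of that arithmetic, where the helper `cache_sets` is a name chosen for this example:

```python
def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size is not a whole number of sets")
    return sets


KBYTE = 1024

l1_sets = cache_sets(16 * KBYTE, ways=4, line_bytes=64)    # L1I and L1D
l2_sets = cache_sets(256 * KBYTE, ways=8, line_bytes=128)  # unified L2
print(l1_sets, l2_sets)  # 64 256
```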

From Itanium 2 9000 series processors: on-die L2 cache (Harvard architecture):

  • 1 Mbyte instruction cache (L2I):
    8-way set-associative, 128 byte line size,
    7 cycle latency,
  • 256 kbyte data cache (L2D):
    8-way set-associative, 128 byte line size,
    write-back, write-allocate,
    non-blocking, out-of-order,
    5 cycles minimum latency for integers, 6 cycles minimum latency for FPs,
    16 byte banks,
    32 Gbyte/s max. reading speed.
Cache coherency through MESI protocol.
Intel Cache Safe Technology: protection of data and tags: double bit detection, single bit correction (ECC, Error-Correcting Code).

On-die, unified L3 cache:
up to 2x 12 = 24 Mbyte,
McKinley and Madison: 4-way set-associative per Mbyte, Madison 9M: 2-way set-associative per Mbyte; 128 byte line size,
fully pipelined, non-blocking,
McKinley: 12 cycles minimum latency for integers, 13 cycles minimum latency for FPs; Madison and Madison 9M: 14 cycles minimum latency for integers, 15 cycles minimum latency for FPs; Montecito: 14 cycles latency,
bandwidth to core 32 bytes per core cycle (256 bit bus to core),
6.2 Gbyte/s max. traffic speed to memory,
providing data to core at up to 48 Gbyte/s.
Cache coherency through MESI protocol,
Intel Cache Safe Technology: protection of data and tags: double bit detection, single bit correction (ECC, Error-Correcting Code).
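
The 48 Gbyte/s figure is consistent with 32 bytes per core cycle at a 1.5 GHz core clock. The clock value is an assumption made for this check (it is not stated in the L3 listing), and Gbyte is taken as 10^9 bytes:

```python
def core_bandwidth_gbyte_s(bytes_per_cycle: int, core_clock_hz: float) -> float:
    """Peak cache-to-core bandwidth in Gbyte/s (1 Gbyte = 10**9 bytes)."""
    return bytes_per_cycle * core_clock_hz / 1e9


# Assumed core clock: 1.5 GHz
print(core_bandwidth_gbyte_s(32, 1.5e9))  # 48.0
```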

Translation Look-aside Buffer (TLB) and Virtual Hash Page Table (VHPT):

  • Two-level instruction TLB (ITLB): between instruction fetch and decode:
    • L1 ITLB (ITLB1):
      64 entry, fully associative,
      only page size of 4 kbyte supported,
      2 cycles latency,
    • L2 ITLB (ITLB2):
      128 entry, fully associative,
      page size 4 kbyte - 256 Mbyte supported.
  • Two-level data TLB (DTLB): between data caches and registers:
    • L1 DTLB (DTLB1):
      32 entry, fully associative,
      2 cycles penalty at miss,
      only page size of 4 kbyte supported,
    • L2 DTLB (DTLB2):
      128 entry, fully associative,
      page size 4 kbyte - 4 Gbyte supported.
Hardware Page Walker (HPW): loads VHPT from L2 cache / L3 cache / memory at TLB misses.
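
The amount of memory the first-level TLBs can map without a miss (entries x page size) can be worked out from the figures above. A small Python sketch, with `tlb_reach` a name chosen for this example:

```python
KBYTE = 1024


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_bytes


itlb1_reach = tlb_reach(64, 4 * KBYTE)  # L1 ITLB: 64 entries of 4 kbyte
dtlb1_reach = tlb_reach(32, 4 * KBYTE)  # L1 DTLB: 32 entries of 4 kbyte
print(itlb1_reach // KBYTE, dtlb1_reach // KBYTE)  # 256 128
```

With 4 kbyte pages only, the first-level TLBs cover 256 kbyte of code and 128 kbyte of data; larger pages are handled by the second-level TLBs.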

Advanced Load Address Table (ALAT): between L1 data cache (L1D) and DTLB, keeps track of speculative data loads,
32 entry, fully associative.

Architecture


Double pipeline: 8 stage in-order, 6 instructions wide.
Split issue dispersal: three instructions (16 bytes) per bundle.
Scoreboarding, non-blocking caches (for compile-time non-determinism).

Execution units, all fully pipelined:

  • 6 Integer units (ALU, Arithmetic Logic Unit): ALU0-5,
    1 cycle latency,
  • 6 Multimedia units (ALU): PALU0-5 (compare HP MAX-2, Intel MMX and SSE),
    2 cycles latency,
  • 1 SHIFT unit: ISHIFT,
  • 2 parallel shift units: PSMU0, PSMU1,
  • 1 parallel multiply unit: PMUL,
    executing 1 SIMD FP operation per cycle,
  • 1 population count unit (for popcnt instruction): POPCNT,
    only a single issue port for PMUL and POPCNT,
  • 2 Extended Precision Floating Point (FP) units: FMAC0, FMAC1,
    ANSI/IEEE-754,
    FMAC: Floating Point Multiply Add Calculation: multiply and add of 82 bit floating point values in one cycle (for matrix calculations),
    4 cycles latency,
  • 2 FPUs for other FP operations: FMISC0, FMISC1,
    4 cycles latency,
  • 4 memory ports in Data Cache Unit (DCU): 2 load units, 2 store units,
  • 3 branch units: B0, B1, B2.

11 issue ports:

  • 4 memory/ALU/multimedia: M0, M1, M2, M3,
  • 2 integer/ALU/multimedia: I0, I1,
  • 2 FP: F0, F1,
  • 3 branch: B0, B1, B2,
serving the execution units above.

Dynamic prefetch, optimized branch prediction, speculative execution.

Branch prediction:
512 entry, two-level.
Branch Target Address Cache (BTAC): 64 entry.

Interval Time Counter (ITC): register for timing ticks.
In 32 bit compatibility mode: Time Stamp Counter (TSC).

Streamlined Advanced PIC (SAPIC): based on IA-32 APIC (Advanced Programmable Interrupt Controller),
for Aborts, Interrupts, Faults, and Traps:

  • handled by operating system: to Interrupt Vector Address (IVA) through Interrupt Vector Table (IVT),
  • handled by PAL firmware.
Interruption Status Register (ISR).
256 interrupt vectors:
  • 0 - 15: special, high priority,
  • 16 - 255: freely assignable.
Support for Intel 8259A interrupt controllers.

Virtual address space: 64 bit, no segmentation.
Multiple Address Space (MAS): each process has its own unique Virtual Region (flat linear address space).
8 61 bit Virtual Regions (Virtual Region Number, VRN; Region Identifier, RID), 2^24 Virtual Address Spaces of 2^61 bytes.
4 kbyte - 4 Gbyte pages (Virtual Page Number, VPN).
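
With 8 regions of 61 bits each, the upper 3 bits of a 64 bit virtual address select the region and the lower 61 bits give the offset within it. A minimal Python sketch of that split; the function name is hypothetical:

```python
OFFSET_BITS = 61  # each Virtual Region spans 61 bits
ADDRESS_BITS = 64


def split_virtual_address(va: int) -> tuple[int, int]:
    """Split a 64 bit virtual address into (VRN, offset within region)."""
    if not 0 <= va < 1 << ADDRESS_BITS:
        raise ValueError("not a 64 bit address")
    vrn = va >> OFFSET_BITS                 # upper 3 bits: region 0 - 7
    offset = va & ((1 << OFFSET_BITS) - 1)  # lower 61 bits
    return vrn, offset


print(split_virtual_address(0xE000000000001000))  # (7, 4096)
```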

Physical address space: 63 bit.
Up to 50 bits supported in page tables.

Write Coalescing (WC): streams of non-cachable writes can be combined into a single bus write transaction.
WC Buffer (WCB): two-entry, 128 byte.

Enhanced Machine Check Architecture (EMCA): parity and ECC (Error-Correcting Code) on all major address and data busses.

50 bit address bus.
Physical addressing:

  • 32 bit: 0-4 Gbyte,
  • 36 bit: 4-64 Gbyte,
  • 44 bit: 64 Gbyte - 16 Tbyte.
Virtual addressing: 54 bit.
Page sizes: 4 kbyte - 4 Gbyte.

200/266/333 MHz DDR bus (McKinley bus, Scalability Port): 128 bit data.
Source Synchronous Signaling (SSS).
6.4/8.5/10.6 Gbyte/s max. throughput.
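
The throughput figures follow from the bus width and the double data rate: 16 bytes per transfer, two transfers per clock. A short Python sketch of the calculation, taking the clock values at face value and 1 Gbyte as 10^9 bytes:

```python
def bus_throughput_gbyte_s(clock_mhz: float,
                           data_bits: int = 128,
                           transfers_per_clock: int = 2) -> float:
    """Peak bus throughput in Gbyte/s for a DDR bus of the given width."""
    bytes_per_transfer = data_bits // 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


for clock in (200, 266, 333):
    print(clock, "MHz:", round(bus_throughput_gbyte_s(clock), 2), "Gbyte/s")
```

This gives 6.4 Gbyte/s at 200 MHz and values close to the listed 8.5 and 10.6 Gbyte/s for the two faster buses; the small differences depend on whether the nominal clocks are read as 266/333 MHz or 266.67/333.33 MHz.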

Assisted Gunning Transceiver Logic signaling (AGTL+),
based on GTL+ bus of Intel Pentium III and Pentium III Xeon processors.
1.5 V ± 1.5 %.

Power pod connector.

Tests:

  • Built-In Self Test (BIST),
  • Test Access Port (TAP): IEEE 1149.1 (JTAG),
  • In-Target Probe (ITP): debugging interface for board integration,
    JTAG TAP, access to registers, memory, and I/O,
    ITP700 Debug Port (DB): command and control interface for ITP,
    max. 16 MHz,
  • Logic Analyzer Interface (LAI),
  • code debugging:
    Instruction and Data Breakpoint Registers (IBR, DBR),
    single stepping (through PSR.ss),
    breaks, taken branches (through PSR.tb), privileges,
    instruction and data debugging.

Processor performance monitoring and profiling:

  • Performance Monitor Configuration (PMC),
  • Performance Monitor Data Registers (PMD),
  • 4 48 bit performance counters,
    Montecito: 12 48 bit performance counters per thread.
Dynamic processor behaviour (instruction execution, caches, branch prediction, virtual memory translation) can be monitored with real-world operating systems, applications, and systems, and be fed back into the code generation process.

Multi-processing


Hyper-Threading Technology (HTT) (from Montecito),
Temporal Multi-Threading (TMT; Switch-on-Event Multi-Threading, SoEMT): threads not running simultaneously, core switches in case of high-latency event.

SMP (Symmetric Multi-Processing): glueless up to four processors (max. 16 in IA-32 compatibility mode).
Max. four processors at 200 MHz, max. two processors at 266 or 333 MHz.
Shared memory, cache coherency through MESI protocol.

Multiplier


Multiplier (Phase Lock Loop, PLL):
set through pins during reset:

multiplier    pins A21# - A17#
2/9           10110
2/10          10101
2/13          10010
2/14          10001
2/15          10000
2/16          01111
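
Reading each "2/N" entry as a bus-to-core clock ratio (an interpretation, not stated in the table), the core clock is the bus clock times N/2. A small Python lookup under that assumption; the names are chosen for this example:

```python
# Pin pattern on A21# - A17# (as listed above) -> ratio "2/N" as (2, N)
RATIO_BY_PINS = {
    "10110": (2, 9),
    "10101": (2, 10),
    "10010": (2, 13),
    "10001": (2, 14),
    "10000": (2, 15),
    "01111": (2, 16),
}


def core_clock_mhz(pins: str, bus_clock_mhz: float) -> float:
    """Core clock for a pin pattern, assuming 2/N means bus:core = 2:N."""
    bus_part, core_part = RATIO_BY_PINS[pins]
    return bus_clock_mhz * core_part / bus_part


print(core_clock_mhz("10000", 200))  # 1500.0
```

With a 200 MHz bus clock, the six entries map to 900, 1000, 1300, 1400, 1500, and 1600 MHz.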

Power management


Power and performance management:
P-states:

  • P0: maximum performance, maximum power (highest utilization),
  • P15 (lowest utilization),
set for all logical processors (multi-threading, multiple cores), per dependency domain (depending on distribution network for clock and power),
managed by PAL.

Performance


Performance:

  • improved integer performance,
  • very fast FP units,
  • IA-32 compatibility by emulator: comparable to x86 processors.

Thermal management


Thermal management: via on-die thermal diode:

  • Thermal Alert: thresholds set through SMBus,
    THRMALERT# pin active when threshold crossed,
  • Enhanced Thermal Management (ETM): thresholds set through SMBus,
    when maximum exceeded, entering low power mode and Correctable Machine Check Interrupt (CMCI),
    when within normal range again, after one second back to normal mode and another CMCI,
  • Thermal Trip: processor shutdown when overheated,
    THRMTRIP# pin active, reset processor to resume.

System management


System management: System Management Bus (SMBus):

  • Processor Information EEPROM (PIROM): manufacturing and features information,
    permanently write-protected:
    • processor: s-spec / QDF number, sample/production,
    • core: architecture revision, family, model, stepping/revision,
      maximum core frequency, maximum bus frequency, voltage, voltage tolerance,
    • L3 cache: size, voltage, tolerance, stepping,
    • package: cartridge revision, substrate revision,
    • part numbers: processor part number (McKinley: 80542KC; Madison, Madison 9M, and Fanwood: 80543KC; Deerfield and Fanwood LV: 80544KC; Fanwood @ 266 MHz: 80533KE; 9000 series: 80549KC),
      processor electronic signature (64 bit serial number),
    • thermal reference (upper limits: Madison @ 1.3 GHz: 107 °C; Madison @ 1.4/1.5 GHz: 105 °C; Madison @ 1.6 GHz: 113 °C; Madison 9M: 113 °C; Fanwood and Fanwood LV: 105 °C; 9000 series: 92 °C),
    • features, IA-32 features, cartridge features,
  • scratch EEPROM: for OEM system designer information,
  • thermal sensing device (A/D converter), connected to on-die thermal diode.
3.3 V ± 5% (3.14-3.47 V).

Marking


Marking:

  • Intel brand,
  • legal mark,
  • product ID,
  • Finish Process Order (FPO),
  • serial number,
  • s-spec,
  • country of origin (not for 9000 series),
  • Assembly Process Order (APO),
  • 2D matrix mark (not for McKinley).

CPUID


CPUID: 8 byte registers:

  • registers 0-4: fixed region,
  • registers 5 and further: variable region.

  • registers 0 and 1: vendor id information,
  • register 2: ignored,
  • register 3: processor implementation information:
    • bits 7:0: largest CPUID register number,
    • bits 15:8: processor revision number,
    • bits 23:16: processor model number,
    • bits 31:24: processor family number (McKinley, Madison, and Madison 9M: 0x1F; Montecito: 0x20),
    • bits 39:32: processor architecture revision number (McKinley: 0x00; Madison: 0x01; Madison 9M: 0x02; Montecito: 0x00),
    • bits 63:40: reserved,
  • register 4: processor features:
    • bit 0: long branch instruction (brl) implemented, no need to emulate by operating system (from McKinley),
    • bit 1: spontaneous deferral implemented,
    • bit 2: 16-byte atomic operations implemented,
    • bits 63:3: reserved.
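
The bit fields of registers 3 and 4 can be unpacked with shifts and masks. The Python sketch below follows the layout listed above; the function names, dictionary keys, and the sample value are invented for illustration.

```python
def decode_cpuid3(value: int) -> dict:
    """Unpack CPUID register 3 (processor implementation information)."""
    def field(high: int, low: int) -> int:
        # Extract bits high:low (inclusive) from value
        return (value >> low) & ((1 << (high - low + 1)) - 1)

    return {
        "largest_register": field(7, 0),
        "revision": field(15, 8),
        "model": field(23, 16),
        "family": field(31, 24),
        "arch_revision": field(39, 32),
    }


def decode_cpuid4(value: int) -> dict:
    """Unpack the three defined feature bits of CPUID register 4."""
    return {
        "brl": bool(value & 0b001),
        "spontaneous_deferral": bool(value & 0b010),
        "atomic_16_byte": bool(value & 0b100),
    }


# Made-up sample value: family 0x1F, model 0x01, revision 0x05, largest register 4
sample = (0x1F << 24) | (0x01 << 16) | (0x05 << 8) | 0x04
info = decode_cpuid3(sample)
print(hex(info["family"]), hex(info["model"]))  # 0x1f 0x1
```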

Family number    Model number    Processor
0x07             0x00            Itanium Merced
0x1F             0x00            Itanium 2 McKinley
0x1F             0x01            Itanium 2 Madison, Deerfield, Hondo
0x1F             0x02            Itanium 2 Madison 9M, Fanwood
0x20             0x00            Itanium 2 9000 series Montecito, Millington

CPUID return values:

0x10 L1D: 16 kbyte, 4-way set-associative, 32 byte line size
0x15 L1I: 16 kbyte, 4-way set-associative, 32 byte line size
0x1A L2: 96 kbyte, 6-way set-associative, 64 byte line size
0x88 L3: 2 Mbyte, 4-way set-associative, 64 byte line size
0x89 L3: 4 Mbyte, 4-way set-associative, 64 byte line size
0x8A L3: 8 Mbyte, 4-way set-associative, 64 byte line size
0x90 ITLB: 64 entry, fully associative, 4 kbyte - 256 Mbyte pages
0x96 DTLB0: 32 entry, fully associative, 4 kbyte - 256 Mbyte pages
0x9B DTLB1: 96 entry, fully associative, 4 kbyte - 256 Mbyte pages

With the EAX register set to 2, CPUID returns these values in the EAX, EBX, ECX, and EDX registers (bytes listed MSB - LSB):

EAX 0x00 0x15 0x10 0x00
EBX 0x00 0x00 0x88/0x89 0x00
ECX 0x00 0x9B 0x00 0x00
EDX 0x80 0x00 0x00 0x00
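
A rough Python sketch of how the four returned registers could be walked byte by byte and matched against the descriptor list above. Two assumptions are made: a 0x00 byte is treated as "no descriptor", and a register with bit 31 set is skipped as carrying no descriptors (the usual IA-32 convention, which fits the 0x80 in EDX). The table holds only a subset of the descriptors, and all names are chosen for this example.

```python
# Subset of the descriptor bytes listed above
DESCRIPTORS = {
    0x10: "L1D: 16 kbyte, 4-way set-associative, 32 byte line size",
    0x15: "L1I: 16 kbyte, 4-way set-associative, 32 byte line size",
    0x88: "L3: 2 Mbyte, 4-way set-associative, 64 byte line size",
    0x9B: "DTLB1: 96 entry, fully associative, 4 kbyte - 256 Mbyte pages",
}


def decode_leaf2(eax: int, ebx: int, ecx: int, edx: int) -> list:
    """Translate the descriptor bytes of the four registers into text."""
    found = []
    for reg in (eax, ebx, ecx, edx):
        if reg & 0x80000000:  # assumed: bit 31 set means no descriptors here
            continue
        for byte in reg.to_bytes(4, "big"):  # MSB first, as in the listing
            if byte:  # assumed: 0x00 means no descriptor
                found.append(DESCRIPTORS.get(byte, f"unknown 0x{byte:02X}"))
    return found


# Register values from the listing above, with 0x88 in EBX
for line in decode_leaf2(0x00151000, 0x00008800, 0x009B0000, 0x80000000):
    print(line)
```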

IA-32 CPUID cache return values:

0x67 L1D: 64 kbyte, 4-way set-associative, 64 byte line size
0x77 L1I: 64 kbyte, 4-way set-associative, 64 byte line size
0x7E L2: 256 kbyte, 8-way set-associative, 128 byte line size
0x8D L3: 3 Mbyte, 12-way set-associative, 128 byte line size

Market


Used by HP as PA-RISC replacement, and in High Performance Computing (HPC).


Intel Itanium 2 processor (McKinley)

Intel Itanium 2 processor (Madison)

Intel Itanium 2 DP LV processor (Deerfield)

HP Itanium 2 mx2 processor module (Hondo)

Intel Itanium 2 processor (Madison 9M)

Intel Itanium 2 DP processor (Fanwood)

Intel Itanium 2 DP LV processor (Fanwood LV)

Intel Itanium 2 9000 series Dual-Core processor (Montecito)

Intel Itanium 2 DP 9000 series Dual-Core processor (Millington)

Intel Itanium 2 DP LV 9000 series Dual-Core processor (Millington LV)
