The ChipList, by Adrian Offerman; The Processor Portal

Intel Itanium "classic" processor

Compatibility


256 data registers (general and floating-point):

  • 128 64-bit general registers (GR0 - GR127) for integer and multimedia data:
    first 32 registers (GR0 - GR31) static: available to all programs,
    rest (GR32 - GR127) stacked: allocated per procedure call as a register stack frame,
    managed by the Register Stack Engine (RSE); the current frame is described by the Current Frame Marker (CFM),
  • 128 82-bit floating-point registers (FR0 - FR127):
    first 32 registers (FR0 - FR31) static: available to all programs,
    rest (FR32 - FR127) rotating: renamed on each iteration of a software-pipelined loop.

64 1-bit predicate registers (PR0 - PR63): hold the results of compare instructions, used to execute instructions conditionally,
first 16 registers (PR0 - PR15) static: available to all programs,
rest (PR16 - PR63) rotating: renamed on each iteration of a software-pipelined loop.
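Register rotation can be illustrated with a small model. It assumes the architectural rule that the rotating part of a register file is addressed modulo its size through a register rename base (the rrb fields of the CFM), and that the loop-closing branch (for example br.ctop) decrements rrb once per iteration. The function and variable names are hypothetical, and only the floating-point and predicate ranges listed above are modelled.

```python
# Hypothetical model of Itanium register rotation (illustrative only).
FR_BASE, FR_SIZE = 32, 96  # FR32-FR127 rotate
PR_BASE, PR_SIZE = 16, 48  # PR16-PR63 rotate


def physical(logical: int, rrb: int, base: int, size: int) -> int:
    """Map a logical register number to a physical one for a given rrb."""
    if logical < base:
        return logical  # static registers are never renamed
    # Python's % returns a non-negative result for a positive size, so a
    # negative rrb (after decrements) wraps around the rotating range.
    return base + (logical - base + rrb) % size


rrb = 0
written = physical(32, rrb, FR_BASE, FR_SIZE)  # iteration i writes f32
rrb -= 1                                        # loop branch rotates registers
read = physical(33, rrb, FR_BASE, FR_SIZE)      # iteration i+1 reads f33
print(written, read)  # 32 32: the same physical register
```

This is why a software-pipelined loop needs no explicit register copies: a value produced into f32 in one iteration is simply addressed as f33 in the next.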

8 branch registers (BR0 - BR7).

128 application registers (AR0 - AR127): special-purpose data and control registers.

4 Privilege Levels (PL): 0-3.
Current Privilege Level (CPL) in PSR.cpl (Processor Status Register, PSR).

Bi-endian memory access: controlled by UM.be bit (User Mask, UM).

Memory mapped I/O.

Processor virtualization: enabled by PSR.vm bit, managed by PAL.
Virtual Machine Monitor (VMM): managing and virtualizing system resources, creating a virtual environment (Virtual Processor Descriptor, VPD).

IA-32 compatibility mode: IA-32 System Environment, equivalent to the Pentium III.
16 bit Real Mode, 16 bit VM86, 16/32 bit Protected Mode, memory segmentation.
Multimedia instruction sets: MMX, SSE.
Switch between Itanium and IA-32 instruction sets using JMPE, br.ia, and rtfi.
All interruptions handled by Itanium instruction set code.
Current execution mode in PSR.is.

PA-RISC supported through Aries emulator.

Operating system support: through the Extensible Firmware Interface (EFI).
System Abstraction Layer (SAL): platform firmware providing platform initialization, configuration, and testing, operating system boot, and run-time services (e.g. BIOS (Basic Input Output System) functions, Machine Checks, and Platform Management Interrupts (PMI, the successor to the IA-32 System Management Mode, SMM)).
Processor Abstraction Layer (PAL): firmware providing processor specific Machine Checks, initialization, PMI, power management, configuration, and error recovery.

Developer's Interface Guide for IA-64 Servers (DIG64): design guidelines for building blocks and interfaces of IA-64 systems, providing an interoperable and stable baseline hardware interface for software developers.

Cache


On-die L1 cache (Harvard architecture):

  • 16 kbyte instruction cache (L1I):
    4-way set-associative, 32 byte line size,
    1 cycle latency,
  • 16 kbyte data cache (L1D):
    4-way set-associative, 32 byte line size,
    write-through, no write-allocate,
    2 cycles latency for integers, FPs bypass the L1 data cache.

On-die, unified L2 cache:
96 kbyte,
6-way set-associative, 64 byte line size,
write back,
6 cycles minimum latency for integers, 9 cycles minimum latency for FPs,
max. 2 requests per clock (banks).
Cache coherency through MESI protocol.

L3 cache:
2 or 4 Mbyte, off-die but inside the processor cartridge, connected to the core through a dedicated cache bus,
4-way set-associative, 64 byte line size,
21 cycles minimum latency for integers, 24 cycles minimum latency for FPs,
bandwidth 16 bytes per core cycle (64 bit DDR, 128 bit bus to core),
12 Gbyte/s max. throughput.
Cache coherency through MESI protocol.
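The number of sets of each cache follows from size / (ways × line size). The sketch below computes it for the caches listed above and splits an address into tag, set index, and line offset. It assumes a conventional power-of-two set index taken directly from the low address bits, which is an illustrative simplification; the helper names are hypothetical.

```python
def geometry(size_bytes: int, ways: int, line_bytes: int) -> tuple[int, int, int]:
    """Return (sets, index_bits, offset_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    # bit_length() - 1 equals log2 for powers of two
    return sets, sets.bit_length() - 1, line_bytes.bit_length() - 1


def split(addr: int, size_bytes: int, ways: int, line_bytes: int) -> tuple[int, int, int]:
    """Split an address into (tag, set index, offset within the line)."""
    sets, index_bits, offset_bits = geometry(size_bytes, ways, line_bytes)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


KB, MB = 1024, 1024 * 1024
caches = {
    "L1D": (16 * KB, 4, 32),
    "L1I": (16 * KB, 4, 32),
    "L2": (96 * KB, 6, 64),
    "L3 (2 Mbyte)": (2 * MB, 4, 64),
    "L3 (4 Mbyte)": (4 * MB, 4, 64),
}
for name, params in caches.items():
    print(name, geometry(*params))  # e.g. L1D -> (128, 7, 5)
print(split(0x12345, *caches["L1D"]))
```

Note that the 6-way L2 still ends up with a power-of-two number of sets (256), so the same index arithmetic applies to every level.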

Translation Look-aside Buffer (TLB) and Virtual Hashed Page Table (VHPT):

  • Instruction TLB (ITLB): between instruction fetch and decode,
    64 entry, fully associative,
  • two-level data TLB: between data caches and registers:
    • L1 DTLB (DTLB1):
      32 entry, fully associative,
      10 cycles penalty at miss,
    • L2 DTLB (DTLB2):
      96 entry, fully associative,
      page size 4 kbyte - 256 Mbyte supported.
Hardware Page Walker (HPW): on TLB misses, loads VHPT entries from L2 cache / L3 cache / memory.

Advanced Load Address Table (ALAT): between L1 data cache (L1D) and DTLB, keeps track of advanced (data-speculative) loads so that a later check (ld.c, chk.a) can detect an intervening store to the same address,
32 entry, two-way set-associative.
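The ALAT's role in data speculation can be illustrated with a deliberately simplified model: an advanced load (ld.a) records its target register and address, any store overlapping a recorded address removes the entry, and a later check (ld.c or chk.a) succeeds only if the entry survived. The class below is hypothetical and ignores the 32-entry, two-way set-associative capacity, so unlike real hardware it never loses entries through replacement.

```python
class ALAT:
    """Toy model of the Advanced Load Address Table (no capacity limits)."""

    def __init__(self) -> None:
        self.entries: dict[int, tuple[int, int]] = {}  # reg -> (addr, size)

    def advanced_load(self, reg: int, addr: int, size: int = 8) -> None:
        # ld.a: remember which register was speculatively loaded from where
        self.entries[reg] = (addr, size)

    def store(self, addr: int, size: int = 8) -> None:
        # Any store overlapping a tracked address range invalidates the entry
        for reg, (a, s) in list(self.entries.items()):
            if addr < a + s and a < addr + size:
                del self.entries[reg]

    def check(self, reg: int) -> bool:
        # ld.c / chk.a: True means the speculative value is still valid;
        # False means the load must be redone (or recovery code run).
        return reg in self.entries


alat = ALAT()
alat.advanced_load(reg=4, addr=0x1000)
alat.store(addr=0x2000)          # unrelated store: entry survives
print(alat.check(4))             # True
alat.store(addr=0x1004, size=4)  # overlaps 0x1000-0x1007: entry removed
print(alat.check(4))             # False
```

The compiler can thus hoist a load above a possibly aliasing store and pay a recovery cost only when the check actually fails.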

Architecture


Pipeline: 10 stages, in-order, 6 instructions wide (two bundles per cycle).
Split issue dispersal: three instructions per 16-byte (128-bit) bundle.
Scoreboarding and non-blocking caches, to cope with latencies that cannot be determined at compile time.
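The bundle layout can be made concrete. An IA-64 bundle is 128 bits: a 5-bit template in bits 4:0, followed by three 41-bit instruction slots in bits 45:5, 86:46, and 127:87. The sketch below packs and unpacks that layout; the function names are hypothetical, the little-endian byte order is an assumption noted in the code, and interpreting the template (which unit type each slot is dispersed to) is not modelled.

```python
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template: int, slots: list[int]) -> bytes:
    """Build a 16-byte bundle from a 5-bit template and three 41-bit slots."""
    assert 0 <= template < 32 and len(slots) == 3
    value = template
    for i, slot in enumerate(slots):
        value |= (slot & SLOT_MASK) << (5 + SLOT_BITS * i)
    return value.to_bytes(16, "little")  # assumed little-endian storage


def unpack_bundle(raw: bytes) -> tuple[int, list[int]]:
    """Split a 16-byte bundle into its template and three instruction slots."""
    value = int.from_bytes(raw, "little")
    template = value & 0x1F
    slots = [(value >> (5 + SLOT_BITS * i)) & SLOT_MASK for i in range(3)]
    return template, slots


raw = pack_bundle(0x10, [0x123, 0x456, 0x789])
print(len(raw), unpack_bundle(raw))  # 16 (16, [291, 1110, 1929])
```

5 + 3 × 41 = 128, so the three slots and the template fill the 16 bytes exactly.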

17 execution units:

  • 4 Integer units (ALU, Arithmetic Logic Unit),
  • 4 Multimedia units (ALU),
  • 2 Extended Precision Floating Point (FP) units: F0, F1,
    ANSI/IEEE-754,
    FMAC (Floating-point Multiply-Accumulate): multiply and add of 82-bit floating-point values in a single instruction with a single rounding (for matrix calculations),
  • 2 Single Precision FPUs,
    each executing two calculations per clock,
  • all FP units together delivering max. throughput of 6.4 GFLOPS,
  • 2 load/store units: M0, M1,
  • 3 branch units: B0, B1, B2.
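Why a fused multiply-add matters can be shown with ordinary IEEE-754 doubles (not the 82-bit register format): with a separate multiply and add, the product is rounded before the addition, while a fused operation rounds only once. The sketch emulates the fused result with exact rational arithmetic from Python's fractions module instead of calling a hardware FMA; the function names are hypothetical.

```python
from fractions import Fraction


def unfused(a: float, b: float, c: float) -> float:
    # Two roundings: a*b is rounded to double, then the sum is rounded.
    return a * b + c


def fused(a: float, b: float, c: float) -> float:
    # One rounding: compute a*b + c exactly, then round once to double.
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a, b, c = 1 + 2.0**-30, 1 - 2.0**-30, -1.0
print(unfused(a, b, c))  # 0.0: the exact product 1 - 2**-60 rounded to 1.0
print(fused(a, b, c))    # -2**-60 (about -8.67e-19), the exact result
```

The same single-rounding property is what makes FMAC-based dot products and matrix kernels both faster and more accurate than separate multiply and add instructions.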

9 issue ports:

  • 2 memory: M0, M1,
  • 2 integer: I0, I1,
  • 2 FP: F0, F1,
  • 3 branch: B0, B1, B2,
serving the 17 execution units above.

Dynamic prefetch, branch prediction, speculative execution.

Branch prediction:
512 entry, two-level.
Branch Target Address Cache (BTAC): 64 entry.
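The two-level idea can be sketched generically: a first level records the recent outcome history of each branch, and a second level maps that history pattern to a 2-bit saturating counter that supplies the prediction. This is a textbook two-level adaptive scheme chosen for illustration; apart from the 512-entry size, the history length, table organisation, and names below are assumptions and do not describe the actual Itanium predictor layout.

```python
class TwoLevelPredictor:
    """Textbook two-level adaptive predictor (illustrative, not Merced's design)."""

    def __init__(self, entries: int = 512, history_bits: int = 4) -> None:
        self.entries = entries
        self.mask = (1 << history_bits) - 1
        self.history = [0] * entries               # level 1: per-branch history
        self.counters = [1] * (1 << history_bits)  # level 2: 2-bit counters

    def predict(self, pc: int) -> bool:
        h = self.history[pc % self.entries]
        return self.counters[h] >= 2  # counter value 2 or 3 means "taken"

    def update(self, pc: int, taken: bool) -> None:
        i = pc % self.entries
        h = self.history[i]
        c = self.counters[h]
        self.counters[h] = min(3, c + 1) if taken else max(0, c - 1)
        self.history[i] = ((h << 1) | int(taken)) & self.mask


# A loop branch taken three times, then falling through, repeatedly.
pattern = [True, True, True, False] * 100
bp = TwoLevelPredictor()
misses = []
for outcome in pattern:
    misses.append(bp.predict(0x4000) != outcome)
    bp.update(0x4000, outcome)
print(sum(misses[:8]), sum(misses[8:]))  # 6 0
```

After a short warm-up every history pattern maps to exactly one outcome, so the loop exit is predicted correctly, something a single 2-bit counter per branch cannot do.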

Interval Time Counter (ITC): register for timing ticks.
In IA-32 compatibility mode, the ITC is read as the Time Stamp Counter (TSC).

Streamlined Advanced PIC (SAPIC): based on IA-32 APIC (Advanced Programmable Interrupt Controller),
for Aborts, Interrupts, Faults, and Traps:

  • handled by the operating system: control is transferred to the Interruption Vector Table (IVT), located at the Interruption Vector Address (IVA),
  • handled by PAL firmware.
Interruption Status Register (ISR).
256 interrupt vectors:
  • 0 - 15: special, high priority,
  • 16 - 255: freely assignable.
Support for Intel 8259A interrupt controllers.

Virtual address space: 64 bit, no segmentation.
Multiple Address Space (MAS): each process has its own unique Virtual Region (flat linear address space).
8 Virtual Regions of 2^61 bytes each, selected by the Virtual Region Number (VRN, address bits 63:61) and tagged with a Region Identifier (RID): 2^24 virtual address spaces of 2^61 bytes.
4 kbyte - 256 Mbyte pages (Virtual Page Number, VPN).
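Translation starts by splitting a 64-bit virtual address: bits 63:61 give the VRN, which selects one of the eight region registers (and thus a RID); the remaining 61 bits are the offset within the region, further divided into a VPN and a page offset according to the page size. A minimal sketch with a hypothetical helper name, assuming a power-of-two page size between 4 kbyte and 256 Mbyte:

```python
def split_va(va: int, page_size: int = 4096) -> tuple[int, int, int]:
    """Return (VRN, VPN, page offset) for a 64-bit virtual address."""
    assert 0 <= va < 1 << 64
    # Any power of two in range is accepted here; real hardware supports
    # only a subset of these page sizes.
    assert page_size & (page_size - 1) == 0 and 4096 <= page_size <= 256 << 20
    vrn = va >> 61                        # bits 63:61 select the region
    region_offset = va & ((1 << 61) - 1)  # offset within a 2**61-byte region
    shift = page_size.bit_length() - 1
    return vrn, region_offset >> shift, region_offset & (page_size - 1)


print([hex(x) for x in split_va(0xE000_0000_0012_3456)])
# ['0x7', '0x123', '0x456']
print([hex(x) for x in split_va(0xE000_0000_0012_3456, page_size=16 << 20)])
# ['0x7', '0x0', '0x123456']
```

The TLB then matches on the RID from the selected region register together with the VPN, which is how different processes can use the same virtual addresses without flushing the TLB.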

Physical address space: 63 bit.
Up to 50 bits supported in page tables.

Write Coalescing (WC): streams of non-cacheable writes can be combined into a single bus write transaction.
WC Buffer (WCB): two-entry, 64 byte.

Enhanced Machine Check Architecture (EMCA): parity and ECC (Error-Correcting Code) on all major address and data busses.

44 bit address bus.
Physical addressing:

  • 32 bit: 0-4 Gbyte,
  • 36 bit: 4-64 Gbyte,
  • 44 bit: 64 Gbyte - 16 Tbyte.
Virtual addressing: 54 bit.
Page sizes: 4 kbyte - 256 Mbyte.

133 MHz DDR bus (Merced bus): 64 bit data.
Source Synchronous Signaling (SSS).
2.1 Gbyte/s max. throughput.
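The peak throughput figure follows from the bus parameters: 64 bits (8 bytes) per transfer, two transfers per clock (DDR), at 133 MHz. A quick check, with a hypothetical helper name and decimal gigabytes (10^9 bytes):

```python
def bus_bandwidth(width_bits: int, clock_hz: float, transfers_per_clock: int) -> float:
    """Peak bandwidth in bytes per second."""
    return width_bits / 8 * clock_hz * transfers_per_clock


fsb = bus_bandwidth(64, 133e6, 2)
print(f"{fsb / 1e9:.2f} Gbyte/s")  # 2.13 Gbyte/s
```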

Assisted Gunning Transceiver Logic signaling (AGTL+),
based on GTL+ bus of Intel Pentium III and Pentium III Xeon processors.
1.5 V ± 1.5 %.

Power pod connector.

Tests:

  • Built-In Self Test (BIST),
  • Test Access Port (TAP): IEEE 1149.1 (JTAG),
  • In-Target Probe (ITP): debugging interface for board integration,
    JTAG TAP, access to registers, memory, and I/O,
    ITP700 Debug Port (DP): command and control interface for ITP,
    max. 16 MHz,
  • Logic Analyzer Interface (LAI),
  • code debugging:
    Instruction and Data Breakpoint Registers (IBR, DBR),
    single stepping (through PSR.ss),
    breaks, taken branches (through PSR.tb), privileges,
    instruction and data debugging.

Processor performance monitoring and profiling:

  • Performance Monitor Configuration (PMC),
  • Performance Monitor Data Registers (PMD),
  • 4 32 bit performance counters.
Dynamic processor behaviour (instruction execution, caches, branch prediction, virtual memory translation) can be monitored with real-world operating systems, applications, and systems, and be fed back into the code generation process.

Multi-processing


SMP (Symmetric Multi-Processing): glueless up to four processors (max. 16 in IA-32 compatibility mode).
Shared memory, cache coherency through MESI protocol.

Multiplier


Multiplier (Phase-Locked Loop, PLL):
set through pins during reset:

multiplier   LINT[1]   LINT[0]   IGNNE#   A20M#
2/11         0         0         0        0
2/12         0         1         1        1
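Reading the multiplier column as the bus-to-core clock ratio (our interpretation of the table), the two pin settings give a core clock of bus clock × 11/2 and bus clock × 12/2, i.e. about 733 MHz and 800 MHz. A small sketch, assuming the 133 MHz bus is nominally 400/3 MHz and using a hypothetical pin-to-ratio map built from the table (pin values exactly as listed there):

```python
from fractions import Fraction

BUS_MHZ = Fraction(400, 3)  # nominal 133 MHz bus clock (assumption)

# (LINT[1], LINT[0], IGNNE#, A20M#) values sampled at reset -> bus/core ratio
RATIO_BY_PINS = {
    (0, 0, 0, 0): Fraction(2, 11),
    (0, 1, 1, 1): Fraction(2, 12),
}


def core_mhz(pins: tuple[int, int, int, int]) -> Fraction:
    """Core clock = bus clock divided by the bus/core ratio."""
    return BUS_MHZ / RATIO_BY_PINS[pins]


for pins in RATIO_BY_PINS:
    print(pins, round(float(core_mhz(pins))), "MHz")
```

Exact fractions avoid the rounding noise of 133.33 × 5.5 in floating point; the result is only rounded for display.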

Power management


Power and performance management:
P-states:

  • P0: maximum performance, maximum power (highest utilization),
  • P15: minimum performance, minimum power (lowest utilization),
set for all logical processors (multi-threading, multiple cores), per dependency domain (depending on distribution network for clock and power),
managed by PAL.

Performance


Performance:

  • slow integer (comparable to X86 processors),
  • very fast FP units,
  • very slow IA-32 unit,
  • L3 cache slow due to small memory bus,
resulting in poor overall performance.

Thermal management


Thermal management: via on-die thermal diode:

  • Thermal Alert: thresholds set through SMBus,
    THRMALERT# pin active when threshold crossed,
  • no Enhanced Thermal Management (ETM),
  • Thermal Trip: processor shutdown when overheated,
    THRMTRIP# pin active, reset processor to resume.

System management


System management: System Management Bus (SMBus):

  • Processor Information EEPROM (PIROM): manufacturing and features information,
    permanently write-protected:
    • processor: s-spec / QDF number, sample/production,
    • core: architecture revision, family, model, stepping/revision,
      maximum core frequency, maximum bus frequency, voltage, voltage tolerance,
    • L3 cache: size, voltage, tolerance, stepping,
    • package: cartridge revision, substrate revision,
    • part numbers: processor part number (80542KC),
      processor electronic signature (64 bit serial number),
    • thermal reference,
    • features, IA-32 features, cartridge features,
  • scratch EEPROM: for OEM system designer information,
  • thermal sensing device (A/D converter), connected to on-die thermal diode.
3.3 V ± 5% (3.14-3.47 V).

Marking


Marking:

  • Intel brand,
  • legal mark,
  • product ID,
  • Finish Process Order (FPO),
  • serial number,
  • s-spec,
  • country of origin (not for 9000 series),
  • Assembly Process Order (APO).

CPUID


CPUID: 8 byte registers:

  • registers 0-4: fixed region,
  • registers 5 and further: variable region.

  • registers 0 and 1: vendor id information,
  • register 2: ignored,
  • register 3: processor implementation information:
    • bits 7:0: largest CPUID register number,
    • bits 15:8: processor revision number,
    • bits 23:16: processor model number,
    • bits 31:24: processor family number (0x07),
    • bits 39:32: processor architecture revision number (0x00),
    • bits 63:40: reserved,
  • register 4: processor features:
    • bit 0: long branch instruction (brl) implemented, no need to emulate by operating system,
    • bit 1: spontaneous deferral implemented,
    • bit 2: 16-byte atomic operations implemented,
    • bits 63:3: reserved.
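The bit fields of CPUID registers 3 and 4 can be extracted with shifts and masks. The sketch below is hypothetical: on hardware the registers would be read with the `mov rX = cpuid[rY]` instruction, which is not shown, and the sample value is made up except for the family number 0x07 given above.

```python
def field(value: int, hi: int, lo: int) -> int:
    """Extract bits hi:lo (inclusive) of a 64-bit CPUID register value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_cpuid3(value: int) -> dict[str, int]:
    """Decode CPUID register 3 (processor implementation information)."""
    return {
        "number": field(value, 7, 0),    # largest CPUID register number
        "revision": field(value, 15, 8),
        "model": field(value, 23, 16),
        "family": field(value, 31, 24),  # 0x07 for this processor
        "archrev": field(value, 39, 32),
    }


def decode_cpuid4(value: int) -> dict[str, bool]:
    """Decode CPUID register 4 (processor features)."""
    return {
        "brl": bool(field(value, 0, 0)),  # long branch implemented
        "spontaneous_deferral": bool(field(value, 1, 1)),
        "atomic16": bool(field(value, 2, 2)),  # 16-byte atomic operations
    }


# Made-up register value: family 0x07 as above; model, revision, and
# register count chosen only for illustration.
sample3 = (0x07 << 24) | (0x00 << 16) | (0x04 << 8) | 0x04
print(decode_cpuid3(sample3))
print(decode_cpuid4(0b001))
```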

CPUID return values:

0x10 L1D: 16 kbyte, 4-way set-associative, 32 byte line size
0x15 L1I: 16 kbyte, 4-way set-associative, 32 byte line size
0x1A L2: 96 kbyte, 6-way set-associative, 64 byte line size
0x88 L3: 2 Mbyte, 4-way set-associative, 64 byte line size
0x89 L3: 4 Mbyte, 4-way set-associative, 64 byte line size
0x8A L3: 8 Mbyte, 4-way set-associative, 64 byte line size
0x90 ITLB: 64 entry, fully associative, 4 kbyte - 256 Mbyte pages
0x96 DTLB0: 32 entry, fully associative, 4 kbyte - 256 Mbyte pages
0x9B DTLB1: 96 entry, fully associative, 4 kbyte - 256 Mbyte pages

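In IA-32 compatibility mode, executing CPUID with EAX = 2 returns the descriptors in EAX, EBX, ECX, and EDX (bytes listed MSB to LSB):

register   byte 3 (MSB)   byte 2   byte 1      byte 0 (LSB)
EAX        0x00           0x15     0x10        0x00
EBX        0x00           0x00     0x88/0x89   0x00
ECX        0x00           0x9B     0x00        0x00
EDX        0x80           0x00     0x00        0x00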
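The returned registers can be decoded with the usual leaf-2 rules: the low byte of EAX is not a descriptor, a register with bit 31 set holds no valid descriptors, and 0x00 means "no descriptor". A sketch applying those rules to the register values above (using the 2 Mbyte variant 0x88 in EBX), with the descriptor meanings from the list above and 0x89 and 0x8A treated as the 4 and 8 Mbyte L3 variants; the function name is hypothetical.

```python
DESCRIPTORS = {
    0x10: "L1D: 16 kbyte, 4-way set-associative, 32 byte line size",
    0x15: "L1I: 16 kbyte, 4-way set-associative, 32 byte line size",
    0x1A: "L2: 96 kbyte, 6-way set-associative, 64 byte line size",
    0x88: "L3: 2 Mbyte, 4-way set-associative, 64 byte line size",
    0x89: "L3: 4 Mbyte, 4-way set-associative, 64 byte line size",
    0x8A: "L3: 8 Mbyte, 4-way set-associative, 64 byte line size",
    0x90: "ITLB: 64 entry, fully associative, 4 kbyte - 256 Mbyte pages",
    0x96: "DTLB0: 32 entry, fully associative, 4 kbyte - 256 Mbyte pages",
    0x9B: "DTLB1: 96 entry, fully associative, 4 kbyte - 256 Mbyte pages",
}


def leaf2_descriptors(eax: int, ebx: int, ecx: int, edx: int) -> list[int]:
    """Collect descriptor bytes from the registers returned by CPUID(EAX=2)."""
    found = []
    for name, reg in (("eax", eax), ("ebx", ebx), ("ecx", ecx), ("edx", edx)):
        if reg & 0x8000_0000:
            continue  # bit 31 set: register contains no valid descriptors
        for i in range(4):
            if name == "eax" and i == 0:
                continue  # low byte of EAX is not a descriptor
            byte = (reg >> (8 * i)) & 0xFF
            if byte:
                found.append(byte)
    return found


for d in leaf2_descriptors(0x00151000, 0x00008800, 0x009B0000, 0x80000000):
    print(hex(d), DESCRIPTORS.get(d, "unknown"))
```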

Market


Used by HP as PA-RISC replacement, and in High Performance Computing (HPC).

Only a few thousand delivered.
Succeeded by Itanium 2 in 2002.


Intel Itanium processor (Merced)
