Cluster: 4 Cores
L1 Cache: 64 KiB L1I, 32 KiB L1D
L2 Cache: 2 or 4 MiB/Cluster (17 cycle latency)
Decoders: 2× 3-wide clusters
Branch Target Cache: 5,000 entries
Reorder Buffer: 256 (Up from 208)
Allocation: 5-wide
Retire: 8-wide
Ports: 17 (Up from 10)
ALUs: 4 Integer, 2 Floating Point, 3 Vector
Store Data: 2 Integer, 2 Floating Point/Vector
Load/Store: 2 Load, 2 Store AGUs
Jump Ports: 2
ISA: Supports AVX, AVX2 and VNNI-INT8
Supports CET and VT-rp
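A minimal sketch of how software might confirm this vector ISA baseline at runtime, assuming GCC or Clang on x86-64. The AVX-VNNI check reads CPUID leaf 7, sub-leaf 1, EAX bit 4, which is my reading of the Intel SDM and should be verified; the rest uses standard compiler builtins.

```c
/* Sketch: runtime check of the AVX/AVX2/AVX-VNNI baseline noted above.
 * Assumes GCC/Clang on x86-64. The AVX-VNNI bit position (leaf 7,
 * sub-leaf 1, EAX bit 4) is an assumption to verify against the SDM. */
#include <stdio.h>
#include <cpuid.h>

static int has_avx_vnni(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (eax >> 4) & 1;   /* assumed AVX-VNNI feature bit */
}

int main(void)
{
    printf("AVX:      %s\n", __builtin_cpu_supports("avx")  ? "yes" : "no");
    printf("AVX2:     %s\n", __builtin_cpu_supports("avx2") ? "yes" : "no");
    printf("AVX-VNNI: %s\n", has_avx_vnni() ? "yes" : "no");
    return 0;
}
```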
IPC: +19% over Cypress Cove
L2 Cache: 1.25 MiB (Client), 2 MiB (Server)
Higher Average Frequency
FP16 support for AVX-512 (AVX512_FP16)
L1 iTLB: 256 4K Entries (Up from 128), 32 2M/4M Entries (Up from 16)
L2 BTB: 12K Branch Targets (Up from 5K) (Variable Size) [Enormous]
μop Cache: 4K μops (Up from 2.5K)
Decoders: 6 Simple, 1 Complex (8 μops/cycle from μop cache)
Fetch Bandwidth: 32 B/cycle (Up from 16 B)
μop Queue: 72 Entries/Thread (Up from 70), 144 Entries in Single-Thread Mode
Allocation: 6-wide
Reorder Buffer: 512 (Up from 352) [256 on Zen 3]
Ports: 12 (Up from 10)
ALUs: Merged 5 Integer, 3 Floating Point/Vector
Load/Store: 3 Load (3× 256b/2× 512b), 2 Store AGUs (see the bandwidth sketch after this block)
L1 DTLB: 96 Entries (Up from 64)
L1D Cache Fill Buffers: 16 (Up from 12)
Page Walkers: 4 (Up from 2)
Mispredict Penalty: 17 cycles (Up from 16)
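A quick worked example of the peak L1D load bandwidth implied by the 3× 256b / 2× 512b load figures above; the 5.0 GHz clock is a hypothetical example frequency, not a stated spec.

```c
/* Worked example: peak L1D load bandwidth implied by the load ports above.
 * 3x 256-bit loads/cycle or 2x 512-bit loads/cycle; the 5.0 GHz clock is
 * purely a hypothetical example frequency. */
#include <stdio.h>

int main(void)
{
    const double freq_ghz  = 5.0;               /* hypothetical clock */
    const double bytes_256 = 3 * 256 / 8.0;     /* 96 B/cycle  */
    const double bytes_512 = 2 * 512 / 8.0;     /* 128 B/cycle */

    printf("AVX2 path:    %.0f B/cycle = %.0f GB/s at %.1f GHz\n",
           bytes_256, bytes_256 * freq_ghz, freq_ghz);
    printf("AVX-512 path: %.0f B/cycle = %.0f GB/s at %.1f GHz\n",
           bytes_512, bytes_512 * freq_ghz, freq_ghz);
    return 0;
}
```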
Transparent to Software
Monitors each thread and each core’s state (Microarchitecture Telemetry)
Provides feedback to the OS scheduler to inform placement decisions
Adapts dynamically based on TDP, operating conditions and power settings
Priority Tasks on high performance cores
Background Tasks on high efficiency cores
Vector/AI Tasks prioritised for high performance cores
Tasks moved to high efficiency cores based on relative performance ordering
Spin loops are moved to high efficiency cores to reduce power consumption
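A conceptual sketch of the scheduling policy these notes describe, using entirely hypothetical types and function names; the real mechanism is hardware telemetry surfaced to the OS scheduler, not a software API like this.

```c
/* Conceptual sketch only: hypothetical types/names, not the real
 * Thread Director or OS interface. Illustrates the policy above:
 * priority and vector/AI work to P-cores, background work and spin
 * loops to E-cores, the rest by relative performance ordering. */
#include <stdbool.h>

enum core_kind { P_CORE, E_CORE };

struct thread_hint {            /* stand-in for per-thread telemetry */
    bool is_background;
    bool is_spinning;
    bool uses_vector_or_ai;     /* e.g. heavy AVX/VNNI usage */
    int  relative_perf_rank;    /* lower = gains more from a P-core */
};

static enum core_kind pick_core(const struct thread_hint *t, int p_cores_free)
{
    if (t->is_spinning || t->is_background)
        return E_CORE;                      /* save power */
    if (t->uses_vector_or_ai)
        return P_CORE;                      /* prioritised for P-cores */
    /* Otherwise hand out P-cores by relative performance ordering. */
    return (t->relative_perf_rank < p_cores_free) ? P_CORE : E_CORE;
}

int main(void)
{
    struct thread_hint video_encode = { false, false, true, 0 };
    return pick_core(&video_encode, 8) == P_CORE ? 0 : 1;
}
```

The only point of the sketch is the ordering of the rules: spin loops and background work are filtered to E-cores before relative performance ordering is consulted.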
Scales better at higher power than Xe-LP [Seems to be all points of V/f curve]
Software-first Approach
Incorporates aspects of Xe-LP, Xe-HP and Xe-HPC
1.5× Performance/Watt of DG1 (Iris Xe MAX)
1.5× frequency at iso-voltage
Process Node: TSMC N6 [+18% Density over N7]
Alchemist is currently sampling to ISVs and partners
Future GPUs: Battlemage (Xe²-HPG), Celestial (Xe³-HPG), Druid
Biggest Leap in Data Center Performance in a decade
Optimised for per-node and data center performance
Multi-tile Design for Increased Scalability
All threads have access to cache, memory and I/O on all tiles
Low-latency and high cross-section bandwidth
Coherent, shared memory spaces between Cores and Acceleration Engines
Architected for AI and microservices
69% higher performance than Ice Lake-SP in microservices
Made with the following building blocks
Separation of Infrastructure and Tenant
Infrastructure Offload
Diskless Server Architecture
Available as both ASICs and FPGAs
FPGA-based IPUs allow quicker time to market and greater flexibility for future-proofing
Working with Microsoft, Baidu, JD.com and VMware on IPUs
Customizable packet processing including bridging and networking services
Programmable through Intel OFS and DPDK
Secure Remote Update of FPGA and Firmware over PCIe
On-board Root of Trust
Supports Juniper Contrail, OVS and SRv6
Agilex FPGA with 16 GB DDR4
Integrated CPU with 16 GB DDR4
Ethernet Controller
2× 100Gb Ethernet (QSFP28/56)
16× PCIe Gen 4.0
Full-height, half-length PCIe Expansion Card
Co-designed with a Cloud Service Provider
Integrates learnings from multiple generations of SmartNICs
Designed with security and isolation in mind
Programmable Packet Processing
NVMe Storage Technology scaled from Optane
Advanced Cryptography and Compression Acceleration
P4 Studio, based on Barefoot Networks technology
Leverages and extends DPDK and SPDK
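Both IPU designs expose DPDK-based programmability; below is a minimal, generic DPDK EAL skeleton of the kind such packet-processing software starts from. It uses standard DPDK calls only and assumes nothing specific to Intel OFS or the IPU firmware.

```c
/* Minimal DPDK skeleton: initialise the EAL and count usable ports.
 * Standard DPDK API only; nothing IPU- or OFS-specific is assumed. */
#include <stdio.h>
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL initialisation failed\n");

    unsigned int ports = rte_eth_dev_count_avail();
    printf("%u DPDK-capable port(s) detected\n", ports);

    /* A real application would configure queues with
     * rte_eth_dev_configure() and enter an rx/tx loop here. */

    rte_eal_cleanup();
    return 0;
}
```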
Each Xe-Core (XC) contains 8 512b Vector Engines and 8 4096b Matrix Engines (8-deep Systolic Array)
Each Xe-Core has 512B/cycle load/store
Each Xe-Core contains 512 KiB L1 Data Cache (Software-configurable Shared Local Memory)
Each Vector Engine can execute: 32 FP64, 32 FP32 or 64 FP16 operations/clock
Each Matrix Engine can execute: 256 TF32, 512 FP16, 512 BF16 or 1024 INT8 operations/clock
Each Slice contains 16 XCs, 16 Ray Tracing (RT) units and 1 Hardware Context
RT Units can be used for Ray Traversal, Triangle Intersection and Bounding Box Intersection
Each Xe-HPC Stack contains up to 4 Slices, a large amount of shared L2 cache, 4 HBM2e Controllers, 1 Media Engine and 8 Xe Links
A 2-Stack Xe-HPC configuration combines up to 2 Stacks (linked via MDFI)
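As a worked example, the peak FP64 vector rate implied by the building blocks above; this is an upper bound derived from the published ops/clock figures, not a measured number.

```c
/* Worked example: peak FP64 vector ops/clock from the hierarchy above
 * (8 VEs/Xe-core x 32 FP64, 16 Xe-cores/slice, up to 4 slices/stack,
 * 2 stacks). Upper bound from the listed figures, not a benchmark. */
#include <stdio.h>

int main(void)
{
    const int fp64_per_ve      = 32;
    const int ve_per_xcore     = 8;
    const int xcore_per_slice  = 16;
    const int slices_per_stack = 4;
    const int stacks           = 2;

    long per_xcore = (long)fp64_per_ve * ve_per_xcore;   /* 256    */
    long per_slice = per_xcore * xcore_per_slice;        /* 4096   */
    long per_stack = per_slice * slices_per_stack;       /* 16384  */
    long two_stack = per_stack * stacks;                 /* 32768  */

    printf("FP64 ops/clock: Xe-core %ld, slice %ld, stack %ld, 2-stack %ld\n",
           per_xcore, per_slice, per_stack, two_stack);
    return 0;
}
```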
New Verification, Reliability Methodology
New Software
New Signal Integrity Techniques
New Interconnects
New Power Delivery, Packaging Technology
New I/O, Memory, IP, SoC Architecture
Features: Compute Tile, Rambo Tile, Foveros, Base Tile, HBM Tile, Xe Link Tile, Multi-Tile Package, EMIB Tile
High Speed MDFI interconnect between Xe-HPC stacks
100B Transistors
47 Active Tiles
5 Process Nodes
Goals: Leadership Performance in HPC & AI, Connectivity to scale up, Unified Programming Model with oneAPI
Challenges: Scale of Integration, Foveros Implementation, Verification Tools & Methods, Signal Integrity, Reliability & Power Delivery
Coming to customers early next year
Available Form Factors/Systems: PCIe Card, OAM, ×4 Subsystem, ×4 Subsystem with 2 Socket Sapphire Rapids