Cluster: 4 Cores
L1 Cache: 64 KiB L1I, 32 KiB L1D
L2 Cache: 2 or 4 MiB/Cluster (17 cycle latency)
Decoders: 2× 3-wide clusters
Branch Target Cache: 5,000 entries
Reorder Buffer: 256 (Up from 208)
Allocation: 5-wide
Retire: 8-wide
Ports: 17 (Up from 10)
ALUs: 4 Integer, 2 Floating Point, 3 Vector
Store Data: 2 Integer, 2 Floating Point/Vector
Load/Store: 2 Load, 2 Store AGUs
Jump Ports: 2
ISA: Supports AVX, AVX2 and VNNI-INT8
Supports CET and VT-rp
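A minimal sketch of how software might confirm this vector ISA baseline at runtime, assuming GCC or Clang on x86-64. The AVX-VNNI check reads CPUID leaf 7, sub-leaf 1, EAX bit 4, which is my reading of the Intel SDM and should be verified; the rest uses standard compiler builtins.

```c
/* Sketch: runtime check of the AVX/AVX2/AVX-VNNI baseline noted above.
 * Assumes GCC/Clang on x86-64. The AVX-VNNI bit position (leaf 7,
 * sub-leaf 1, EAX bit 4) is an assumption to verify against the SDM. */
#include <stdio.h>
#include <cpuid.h>

static int has_avx_vnni(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (eax >> 4) & 1;   /* assumed AVX-VNNI feature bit */
}

int main(void)
{
    printf("AVX:      %s\n", __builtin_cpu_supports("avx")  ? "yes" : "no");
    printf("AVX2:     %s\n", __builtin_cpu_supports("avx2") ? "yes" : "no");
    printf("AVX-VNNI: %s\n", has_avx_vnni() ? "yes" : "no");
    return 0;
}
```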
IPC: +19% over Cypress Cove
L2 Cache: 1.25 MiB (Client), 2 MiB (Server)
Higher Average Frequency
FP16 support for AVX-512 (AVX512_FP16)
L1 iTLB: 256 4K Entries (Up from 128), 32 2M/4M Entries (Up from 16)
L2 BTB: 12K Branch Targets (Up from 5K) (Variable Size) [Enormous]
μop Cache: 4K μops (Up from 2.5K)
Decoders: 6 Simple, 1 Complex (8 μops/cycle from μop cache)
Fetch Bandwidth: 32 B/cycle (Up from 16 B)
μop Queue: 72 Entries/Thread (Up from 70), 144 Entries in Single-Thread Mode
Allocation: 6-wide
Reorder Buffer: 512 (Up from 352) [256 on Zen 3]
Ports: 12 (Up from 10)
ALUs: Merged 5 Integer, 3 Floating Point/Vector
Load/Store: 3 Load (3× 256b/2× 512b), 2 Store AGUs (see the bandwidth sketch after this block)
L1 DTLB: 96 Entries (Up from 64)
L1D Cache Fill Buffers: 16 (Up from 12)
Page Walkers: 4 (Up from 2)
Mispredict Penalty: 17 cycles (Up from 16)
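A quick worked example of the peak L1D load bandwidth implied by the 3× 256b / 2× 512b load figures above; the 5.0 GHz clock is a hypothetical example frequency, not a stated spec.

```c
/* Worked example: peak L1D load bandwidth implied by the load ports above.
 * 3x 256-bit loads/cycle or 2x 512-bit loads/cycle; the 5.0 GHz clock is
 * purely a hypothetical example frequency. */
#include <stdio.h>

int main(void)
{
    const double freq_ghz  = 5.0;               /* hypothetical clock */
    const double bytes_256 = 3 * 256 / 8.0;     /* 96 B/cycle  */
    const double bytes_512 = 2 * 512 / 8.0;     /* 128 B/cycle */

    printf("AVX2 path:    %.0f B/cycle = %.0f GB/s at %.1f GHz\n",
           bytes_256, bytes_256 * freq_ghz, freq_ghz);
    printf("AVX-512 path: %.0f B/cycle = %.0f GB/s at %.1f GHz\n",
           bytes_512, bytes_512 * freq_ghz, freq_ghz);
    return 0;
}
```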
Transparent to Software
Monitors each thread and each core’s state (Microarchitecture Telemetry)
Provides feedback to the OS scheduler to inform placement decisions
Adapts dynamically based on TDP, operating conditions and power settings
Priority Tasks on high performance cores
Background Tasks on high efficiency cores
Vector/AI Tasks prioritised for high performance cores
Tasks moved to high efficiency cores based on relative performance ordering
Spin loops are moved to high efficiency cores to reduce power consumption
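A conceptual sketch of the scheduling policy these notes describe, using entirely hypothetical types and function names; the real mechanism is hardware telemetry surfaced to the OS scheduler, not a software API like this.

```c
/* Conceptual sketch only: hypothetical types/names, not the real
 * Thread Director or OS interface. Illustrates the policy above:
 * priority and vector/AI work to P-cores, background work and spin
 * loops to E-cores, the rest by relative performance ordering. */
#include <stdbool.h>

enum core_kind { P_CORE, E_CORE };

struct thread_hint {            /* stand-in for per-thread telemetry */
    bool is_background;
    bool is_spinning;
    bool uses_vector_or_ai;     /* e.g. heavy AVX/VNNI usage */
    int  relative_perf_rank;    /* lower = gains more from a P-core */
};

static enum core_kind pick_core(const struct thread_hint *t, int p_cores_free)
{
    if (t->is_spinning || t->is_background)
        return E_CORE;                      /* save power */
    if (t->uses_vector_or_ai)
        return P_CORE;                      /* prioritised for P-cores */
    /* Otherwise hand out P-cores by relative performance ordering. */
    return (t->relative_perf_rank < p_cores_free) ? P_CORE : E_CORE;
}

int main(void)
{
    struct thread_hint video_encode = { false, false, true, 0 };
    return pick_core(&video_encode, 8) == P_CORE ? 0 : 1;
}
```

The only point of the sketch is the ordering of the rules: spin loops and background work are filtered to E-cores before relative performance ordering is consulted.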
Scales better at higher power than Xe-LP [Seems to be all points of V/f curve]
Software-first Approach
Incorporates aspects of Xe-LP, Xe-HP and Xe-HPC
1.5× Performance/Watt of DG1 (Iris Xe MAX)
1.5× frequency at iso-voltage
Process Node: TSMC N6 [+18% Density over N7]
Alchemist is currently sampling to ISVs and partners
Future GPUs: Battlemage (Xe²-HPG), Celestial (Xe³-HPG), Druid
Biggest Leap in Data Center Performance in a decade
Optimised for per-node and data center performance
Multi-tile Design for Increased Scalability
All threads have access to cache, memory and I/O on all tiles
Low-latency and high cross-section bandwidth
Coherent, shared memory spaces between Cores and Acceleration Engines
Architected for AI and microservices
69% higher performance than Ice Lake-SP in microservices
Made with the following building blocks
Separation of Infrastructure and Tenant
Infrastructure Offload
Diskless Server Architecture
Available as both ASICs and FPGAs
FPGA-based IPUs allow quicker time to market and greater flexibility for future-proofing
Working with Microsoft, Baidu, JD.com and VMware on IPUs
Customizable packet processing including bridging and networking services
Programmable through Intel OFS and DPDK
Secure Remote Update of FPGA and Firmware over PCIe
On-board Root of Trust
Supports Juniper Contrail, OVS and SRv6
Agilex FPGA with 16 GB DDR4
Integrated CPU with 16 GB DDR4
Ethernet Controller
2× 100Gb Ethernet (QSFP28/56)
16× PCIe Gen 4.0
Full-height, half-length PCIe Expansion Card
Co-designed with a Cloud Service Provider
Integrates learnings from multiple generations of SmartNICs
Designed with security and isolation in mind
Programmable Packet Processing
NVMe Storage Technology scaled from Optane
Advanced Cryptography and Compression Acceleration
P4 Studio, based on Barefoot Networks technology
Leverages and extends DPDK and SPDK
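Both IPU designs expose DPDK-based programmability; below is a minimal, generic DPDK EAL skeleton of the kind such packet-processing software starts from. It uses standard DPDK calls only and assumes nothing specific to Intel OFS or the IPU firmware.

```c
/* Minimal DPDK skeleton: initialise the EAL and count usable ports.
 * Standard DPDK API only; nothing IPU- or OFS-specific is assumed. */
#include <stdio.h>
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>

int main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL initialisation failed\n");

    unsigned int ports = rte_eth_dev_count_avail();
    printf("%u DPDK-capable port(s) detected\n", ports);

    /* A real application would configure queues with
     * rte_eth_dev_configure() and enter an rx/tx loop here. */

    rte_eal_cleanup();
    return 0;
}
```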
Each Xe-Core (XC) contains 8 512b Vector Engines and 8 4096b Matrix Engines (8-deep Systolic Array)
Each Xe-Core has 512B/cycle load/store
Each Xe-Core contains 512 KiB L1 Data Cache (Software-configurable Shared Local Memory)
Each Vector Engine can execute: 32 FP64, 32 FP32 or 64 FP16 operations/clock
Each Matrix Engine can execute: 256 TF32, 512 FP16, 512 BF16 or 1024 INT8 operations/clock
Each Slice contains 16 XCs, 16 Ray Tracing (RT) units and 1 Hardware Context
RT Units can be used for Ray Traversal, Triangle Intersection and Bounding Box Intersection
Each Xe-HPC Stack contains up to 4 Slices, a large amount of shared L2 cache, 4 HBM2e Controllers, 1 Media Engine and 8 Xe Links
A 2-Stack Xe-HPC configuration combines up to 2 Stacks (linked via MDFI)
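As a worked example, the peak FP64 vector rate implied by the building blocks above; this is an upper bound derived from the published ops/clock figures, not a measured number.

```c
/* Worked example: peak FP64 vector ops/clock from the hierarchy above
 * (8 VEs/Xe-core x 32 FP64, 16 Xe-cores/slice, up to 4 slices/stack,
 * 2 stacks). Upper bound from the listed figures, not a benchmark. */
#include <stdio.h>

int main(void)
{
    const int fp64_per_ve      = 32;
    const int ve_per_xcore     = 8;
    const int xcore_per_slice  = 16;
    const int slices_per_stack = 4;
    const int stacks           = 2;

    long per_xcore = (long)fp64_per_ve * ve_per_xcore;   /* 256    */
    long per_slice = per_xcore * xcore_per_slice;        /* 4096   */
    long per_stack = per_slice * slices_per_stack;       /* 16384  */
    long two_stack = per_stack * stacks;                 /* 32768  */

    printf("FP64 ops/clock: Xe-core %ld, slice %ld, stack %ld, 2-stack %ld\n",
           per_xcore, per_slice, per_stack, two_stack);
    return 0;
}
```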
New Verification, Reliability Methodology
New Software
New Signal Integrity Techniques
New Interconnects
New Power Delivery, Packaging Technology
New I/O, Memory, IP, SoC Architecture
Features: Compute Tile, Rambo Tile, Foveros, Base Tile, HBM Tile, Xe Link Tile, Multi-Tile Package, EMIB Tile
High Speed MDFI interconnect between Xe-HPC stacks
100B Transistors
47 Active Tiles
5 Process Nodes
Goals: Leadership Performance in HPC & AI, Connectivity to scale up, Unified Programming Model with oneAPI
Challenges: Scale of Integration, Foveros Implementation, Verification Tools & Methods, Signal Integrity, Reliability & Power Delivery
Coming to customers early next year
Available Form Factors/Systems: PCIe Card, OAM, ×4 Subsystem, ×4 Subsystem with 2 Socket Sapphire Rapids