Intel Architecture Day 2021

1: Gracemont Atom Core

  • Cluster: 4 Cores

  • L1 Cache: 32 KiB L1D, 64 KiB L1I

  • L2 Cache: 2 or 4 MiB/Cluster (17 cycle latency)

  • Decoders: 2× 3-wide clusters

  • Branch Target Cache: 5,000 entries

  • Reorder Buffer: 256 (Up from 208)

  • Allocation: 5-wide

  • Retire: 8-wide

  • Ports: 17 (Up from 10)

  • ALUs: 4 Integer, 2 Floating Point, 3 Vector

  • Store Data: 2 Integer, 2 Floating Point/Vector

  • Load/Store: 2 Load, 2 Store AGUs

  • Jump Ports: 2

  • ISA: Supports AVX, AVX2 and VNNI-INT8 (see the sketch after this list)

  • Supports CET and VT-rp
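
A minimal sketch of what the VNNI-INT8 bullet above refers to: the VPDPBUSD instruction fuses a 4-wide unsigned-by-signed INT8 multiply with a 32-bit accumulate. The intrinsic below is the standard AVX-VNNI one from immintrin.h (needs a compiler flag such as -mavxvnni); it is my illustration, not code from the presentation.

```cpp
#include <immintrin.h>

// One VNNI step: for each of the 8 32-bit lanes, multiply four unsigned
// bytes of a with four signed bytes of b and add the four products into
// the running 32-bit accumulator.
__m256i vnni_int8_step(__m256i acc, __m256i a_u8, __m256i b_s8) {
    return _mm256_dpbusd_avx_epi32(acc, a_u8, b_s8);
}
```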

1.1: Performance, Power & Area

  • Die Area: ~1/4 of Skylake
  • Power: -60% iso-performance relative to Skylake
  • Performance: +40% iso-power relative to Skylake
  • Power: -80% iso-performance for 4C/4T relative to 2C/4T Skylake
  • Performance: +80% iso-power for 4C/4T relative to 2C/4T Skylake

2: Golden Cove Core

  • IPC: +19% over Cypress Cove

    • Benchmarks: SPEC CPU 2017, SYSmark 25, Crossmark, PCMark 10, WebXPRT3, Geekbench 5.4.1
  • L2 Cache: 1.25 MiB (Client), 2 MiB (Server)

  • Higher Average Frequency

  • FP16 support for AVX-512 (AVX512_FP16; see the sketch after this list)

  • L1 iTLB: 256 4K Entries (Up from 128), 32 2M/4M Entries (Up from 16)

  • L2 BTB: 12K Branch Targets (Up from 5K) (Variable Size) [Enormous]

  • μop Cache: 4K μops (Up from 2.25K)

  • Decoders: 6 Simple, 1 Complex (8 μops/cycle from μop cache)

  • Fetch Bandwidth: 32 B/cycle (Up from 16 B)

  • μop Queue: 72 Entries/Thread (Up from 70), 144 Single-Thread (Up from 70)

  • Allocation: 6-wide

  • Reorder Buffer: 512 (Up from 352) [256 on Zen 3]

  • Ports: 12 (Up from 10)

  • ALUs: Merged 5 Integer, 3 Floating Point/Vector

  • Load/Store: 3 Load (3× 256b/2× 512b), 2 Store AGUs

  • L1 DTLB: 96 Entries (Up from 64)

  • L1D Cache Fill Buffers: 16 (Up from 12)

  • Page Walkers: 4 (Up from 2)

  • Mispredict Penalty: 17 cycles (Up from 16)
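
To make the AVX512_FP16 bullet earlier in this list concrete, here is a hedged sketch of native half-precision math: 32 FP16 lanes per 512-bit register, with a fused multiply-add per instruction. The intrinsics are the standard ones from immintrin.h (-mavx512fp16); this is my illustration, not Intel's code.

```cpp
#include <immintrin.h>

// d = a * b + c across 32 packed FP16 lanes in a single FMA, with no
// intermediate conversion to FP32 required.
__m512h fma_fp16(__m512h a, __m512h b, __m512h c) {
    return _mm512_fmadd_ph(a, b, c);
}
```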

2.1: Advanced Matrix eXtensions (AMX)

  • 2048 INT8 ops/cycle (Up from 256 on VNNI-INT8)
  • 1024 BF16 ops/cycle
  • Power: roughly 1/3 that of VNNI-INT8
  • [There’s more stuff here, but I don’t understand any of it]
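
For scale: one AMX tile instruction such as the TDPBSSD below performs a 16×64 by 64×16 INT8 multiply-accumulate, i.e. 16·16·64 = 16,384 MACs (32,768 ops), which at the quoted 2,048 ops/cycle works out to roughly 16 cycles of systolic work per instruction. The sketch uses the standard AMX intrinsics from immintrin.h (-mamx-tile -mamx-int8) and is my illustration; on Linux the XTILEDATA state must also be requested from the kernel first, which is omitted here.

```cpp
#include <immintrin.h>
#include <cstdint>

void amx_int8_step(const int8_t* A, const int8_t* B, int32_t* C, int stride) {
    // Tile configuration (palette 1): three tiles of 16 rows x 64 bytes.
    // ldtilecfg requires a 64-byte-aligned 64-byte memory operand.
    struct alignas(64) TileCfg {
        uint8_t  palette = 1, start_row = 0, reserved[14] = {};
        uint16_t colsb[16] = {64, 64, 64};   // bytes per row, per tile
        uint8_t  rows[16]  = {16, 16, 16};   // rows per tile
    } cfg;
    _tile_loadconfig(&cfg);

    _tile_loadd(1, A, 64);       // tmm1 <- A (16x64 int8)
    _tile_loadd(2, B, 64);       // tmm2 <- B (64x16 int8, pre-interleaved)
    _tile_zero(0);               // tmm0 <- zeroed 16x16 int32 accumulators
    _tile_dpbssd(0, 1, 2);       // tmm0 += A * B (signed x signed INT8)
    _tile_stored(0, C, stride);  // write back the 16x16 int32 results
    _tile_release();
}
```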

3: Thread Director

  • Transparent to Software

  • Monitors each thread and each core’s state (Microarchitecture Telemetry)

  • Provides feedback to the OS for the OS scheduler to make decisions

  • Adapts dynamically based on TDP, operating conditions and power settings

  • Priority Tasks on high performance cores

  • Background Tasks on high efficiency cores

  • Vector/AI Tasks prioritised for high performance cores

  • Tasks moved to high efficiency cores based on relative performance ordering

  • Spin loops are moved to high efficiency cores to reduce power consumption

3.1: Windows 11

  • Thread Director feedback used for core parking
  • Software developers can specify quality of service attributes
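
As a concrete example of the second bullet: to my understanding, the Windows 11 mechanism for this is the power-throttling thread attribute, where marking a thread "EcoQoS" tells the scheduler it may be parked on the efficiency cores. A minimal sketch using the documented Win32 call:

```cpp
#include <windows.h>

// Hint to the Windows scheduler that the current thread is background
// work ("EcoQoS"), making it a candidate for the efficiency cores.
void mark_current_thread_eco() {
    THREAD_POWER_THROTTLING_STATE state = {};
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    state.StateMask   = THREAD_POWER_THROTTLING_EXECUTION_SPEED; // enable throttling

    SetThreadInformation(GetCurrentThread(), ThreadPowerThrottling,
                         &state, sizeof(state));
}
```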

4: Alder Lake

  • All client segments (Ultra Mobile, Mobile, Desktop)
  • Socket: LGA 1700, BGA Type3, BGA Type4 HDI
  • Made with the following building blocks
    • Golden Cove Core
    • Gracemont Core
    • Display
    • PCIe
    • Thunderbolt (TBT)
    • Gaussian Neural Accelerator (GNA) 3.0
    • Image Processing Unit (IPU)
    • GT1/GT2 Graphics (96/32 EU)
    • Last-Level Cache (LLC/L3)
    • Memory Controller

4.1: SKUs/Lineups

  • Desktop: Up to 8+8 Cores, LLC, Memory, PCIe, GT2, Display, GNA 3.0
  • Mobile: Up to 6+8 Cores, LLC, Memory, PCIe, GT1, Display, IPU, GNA 3.0, 4× TBT
  • Mobile: Up to 2+8 Cores, LLC, Memory, PCIe, GT1, Display, IPU, GNA 3.0, 2× TBT
  • Desktop: Up to 16C/24T, 30 MiB LLC

4.2: Features

  • Memory: DDR4-3200, DDR5-4800, LPDDR4X-4266, LPDDR5-5200
  • PCIe: 16× 5.0, 4× 4.0 (+4× 4.0 DMI)
  • Chipset PCIe: 12× 4.0, 16× 3.0 [Z690?]
  • I/O: Thunderbolt 4
  • Network: Wi-Fi 6E

4.3: Fabrics

  • 1000 GB/s Compute Fabric [Dual Ring Bus]
    • Dynamically adjust LLC inclusivity
  • 204 GB/s Memory Fabric [??????]
    • Dynamic Bus Width and Frequency
    • Adapt for high bandwidth, low latency or low power
  • 64 GB/s I/O Fabric

5: Alchemist, Xe-HPG & XeSS

  • Scales better at higher power than Xe-LP [Seems to be all points of V/f curve]

  • Software-first Approach

  • Takes in aspects from Xe-LP, Xe-HP and Xe-HPC

  • 1.5× Performance/Watt of DG1 (Iris Xe MAX)

  • 1.5× frequency iso-voltage

  • Process Node: TSMC N6 [+18% Density over N7]

  • Alchemist is currently sampling to ISVs and partners

  • Future GPUs: Battlemage (Xe²-HPG), Celestial (Xe³-HPG), Druid

5.1: Drivers

  • Re-architecture of memory manager and compiler
  • Improved performance of CPU-bound titles by 18%
  • Improved game load times by 25%

5.2: Xe-HPG Architecture

  • The base unit is the Xe-core (XC), replacing the Execution Unit used in Xe-LP
  • Each XC contains 16 256b Vector Engines and 16 1024b Matrix Engines
  • Each Render Slice (RS) contains 4 XCs, 4 Ray Tracing (RT) units, and fixed function units
  • Alchemist contains up to 8 RSs and a large pool of L2 Cache

5.3: Xe Super Sampling (XeSS)

  • XeSS uses motion vectors, previous frames, and the current frame to upscale the frame, after the rasterisation pipeline
  • XeSS can work with both Xe Matrix eXtensions (XMX) and DP4a (INT8 Packing) [DP4a means other GPUs can also use XeSS; see the scalar reference after this list]
  • XeSS DP4a adds about 2.5× the latency that XMX adds
  • XeSS XMX SDK will be made available to ISVs in August
  • XeSS DP4a SDK will be made available to ISVs later in 2021
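
Since DP4a is what opens XeSS up to other GPUs, here is a scalar C++ reference of the operation itself (my illustration of one common signed variant, not Intel's code): each DP4a instruction multiplies four packed INT8 pairs and accumulates the products into one 32-bit sum.

```cpp
#include <cstdint>

// Scalar model of one DP4a: treat each 32-bit input as four packed
// signed 8-bit values, multiply pairwise, and accumulate.
int32_t dp4a(uint32_t a_packed, uint32_t b_packed, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        int8_t a = static_cast<int8_t>(a_packed >> (8 * i));
        int8_t b = static_cast<int8_t>(b_packed >> (8 * i));
        acc += int32_t{a} * int32_t{b};
    }
    return acc;
}
```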

5.4: Features

  • Xe Matrix eXtensions
  • DirectX 12 Ultimate
    • Variable Rate Shading Tier 2, Mesh Shading, Sampler Feedback
  • Ray Tracing
    • Ray Traversal, Triangle Intersection and Bounding Box Intersection
    • Full Support for DXR Ray Tracing, Vulkan Ray Tracing

6: Sapphire Rapids

  • Biggest Leap in Data Center Performance in a decade

  • Optimised for per-node and data center performance

  • Multi-tile Design for Increased Scalability

    • Modular SoC architecture with modular die fabric
  • All threads have access to cache, memory and I/O on all tiles

  • Low-latency and high cross-section bandwidth

  • Coherent, shared memory spaces between Cores and Acceleration Engines

  • Architected for AI and microservices

  • 69% higher performance than Cascade Lake-SP in microservices

  • Made with the following building blocks

    • Golden Cove-X Core
    • Acceleration Engines
    • PCIe Gen 5, CXL 1.1
    • Ultra Path Interconnect (UPI) 2.0
    • DDR5, Optane, HBM [HBM2e]

6.1: Node Performance

  • High Performance Golden Cove Cores, More cores than Ice Lake-SP
  • Increased L2 and L3 Caches
  • All cores have access to all resources of all the chips
  • DDR5, Next-gen Optane [Crow Pass], PCIe 5.0
  • Modular SoC with Modular Die Fabric [MDFI]
  • Uses EMIB with a 55 μm bump pitch

6.2: Data Center Performance

  • Fast VM Migration
  • Better Telemetry
  • IO Virtualisation
  • Consistent Caching & Memory Latency [Shots fired at Naples]
  • Low Jitter to meet high SLA
  • Next-generation Optane [Crow Pass], CXL 1.1
  • Improved Security & RAS

6.3: Golden Cove-X Core

  • AMX: Tiled matrix operations for inference and training acceleration
  • FP16 [AVX512_FP16]
  • Cache Management: CLDEMOTE
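
A hedged sketch of how CLDEMOTE is meant to be used (my illustration): a producer core demotes freshly written cache lines toward the shared LLC so a consumer core or accelerator can pick them up more cheaply than via a cross-core snoop. The intrinsic is the standard _mm_cldemote from immintrin.h (-mcldemote).

```cpp
#include <immintrin.h>
#include <cstddef>

// After filling buf, hint the hardware to move its lines out of this
// core's private caches toward the shared LLC (one hint per 64 B line).
void demote_to_llc(const char* buf, std::size_t len) {
    for (std::size_t off = 0; off < len; off += 64)
        _mm_cldemote(buf + off);
}
```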

6.4: Acceleration Engines

  • Many integrated acceleration engines
  • Offload common mode tasks without kernel overhead

6.4.1: Data Streaming

  • Streaming data movement and transformation acceleration
  • Up to 4 instances per socket
  • Low latency with no memory-pinning overhead
  • Frees up to 39% more CPU core cycles

6.4.2: Quick Assist Technology

  • Cryptography and Data (De)Compression
  • Up to 400 Gb/s Symmetric Cryptography
  • Up to 160 Gb/s Compression & Decompression each
  • Fused Operations
  • 98% additional workload capacity

6.5: I/O Advancements

  • CXL 1.1: Accelerator and memory expansion
  • PCIe 5.0, Improved Connectivity: Improved DDIO and QoS
  • Improved Multi-Socket scaling via UPI 2.0
  • 4 ×24 UPI Links operating at 16 GT/s
  • New 8S-4UPI performance optimised technology

6.6: Memory & LLC

  • Over 100 MiB of L3 shared across all Cores [112.5 MiB]
  • Increased bandwidth, security and reliability with DDR5
  • 4 dual-channel memory controllers (8 total channels)
  • Intel Optane Persistent Memory 300 Series [Crow Pass]
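
Back-of-envelope peak bandwidth for the eight channels above (my arithmetic, assuming a DDR5-4800 data rate; the deck does not state one): 8 channels × 4800 MT/s × 8 B/transfer = 307.2 GB/s per socket.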

6.7: High Bandwidth Memory

  • Purpose: AI, HPC, In-memory data analytics [Notice that package size increases when the SPR-HBM is shown]
  • Significantly higher memory bandwidth, Increased capacity
  • DDR5 can be eliminated entirely depending on use-case
  • Modes: HBM Flat Mode (Flat Memory Regions), HBM Caching Mode (DRAM-backed Cache)
    • Software Visible vs Software Transparent

6.8: Microservices

  • Performance: +69% over Cascade Lake-SP per core (+36% over Ice Lake-SP)
  • Reduced latency for runtime languages (Start-up time)
  • Efficient accelerators to reduce overhead on cores
  • Optimised Networking and Data Movement
  • Improved latency of remote calls and service-mesh

7: Mount Evans & IPU

  • Separation of Infrastructure and Tenant

    • Guest can fully control the CPU, while Cloud Service Provider maintains control of the infrastructure and Root of Trust
  • Infrastructure Offload

    • Frees up CPU Capacity with specialised accelerators for Infrastructure Processing
  • Diskless Server Architecture

    • Allows Virtual Storage across a network to be used
    • Low latency, direct access skipping IPU’s cores
  • Available as both ASICs and FPGAs

  • FPGA IPUs allow for quicker time to market and more flexibility to be future-proof

  • Working with Microsoft, Baidu, JD.com and VMware on IPUs

7.1: Oak Springs Canyon

  • Agilex FPGA with 16 GB DDR4
  • Xeon with 16 GB DDR4
  • 2× 100Gb Ethernet (QSFP28/56)
  • 16× PCIe Gen 4.0
  • Hardened crypto block

7.2: Arrow Creek

  • Customizable packet processing including bridging and networking services

  • Programmable through Intel OFS and DPDK

  • Secure Remote Update of FPGA and Firmware over PCIe

  • On-board Root of Trust

  • Juniper Contrail, OVS, SRv6

  • Agilex FPGA with 16 GB DDR4

  • Integrated CPU with 16 GB DDR4

  • Ethernet Controller

  • 2× 100Gb Ethernet (QSFP28/56)

  • 16× PCIe Gen 4.0

  • Full-height, half-length PCIe Expansion Card

7.3: Mount Evans IPU ASIC

  • Co-designed with a Cloud Service Provider

  • Integrated learnings from generations of smartNICs

  • Designed with security and isolation in mind

  • Programmable Packet Processing

  • NVMe Storage Technology scaled from Optane

  • Advanced Cryptography and Compression Acceleration

  • P4 Studio, based on Barefoot's technology

  • Leverage and extend DPDK and SPDK

7.3.1: Network Subsystem

  • Supports up to 4 host Xeons with 200 Gb/s full-duplex
  • High performance RDMA over Converged Ethernet v2
  • NVMe Offload Engine (Derived from Optane controller)
  • Programmable Packet Pipeline with QoS and Telemetry
  • Inline IP Security

7.3.2: Compute Complex

  • Up to 16 Arm Neoverse N1 Cores
  • Dedicated Compute and Cache
  • 3 Memory Channels [LPDDR4, so I assume it has a 192b bus]
  • Lookaside Cryptography and Compression
  • Dedicated Management Processor

8: Ponte Vecchio & Xe-HPC

  • Each Xe-Core (XC) contains 8 512b Vector Engines and 8 4096b Matrix Engines (8-deep Systolic Array)

  • Each Xe-Core has 512B/cycle load/store

  • Each Xe-Core contains 512 KiB L1 Data Cache (Software-configurable Shared Local Memory)

  • Each Vector Engine can execute: 32 FP64, 32 FP32 or 64 FP16 operations/clock

  • Each Matrix Engine can execute: 256 TF32, 256 FP16, 512 BF16 or 1024 INT8 operations/clock

  • Each Slice contains 16 XCs, 16 Ray Tracing (RT) units and 1 Hardware Context

    • Hardware Context allows for multiple actions concurrently without software context switches (Improved Utilisation)
  • RT Units can be used for Ray Traversal, Triangle Intersection and Bounding Box Intersection

  • Each Xe-HPC Stack contains up to 4 Slices, a large amount of shared L2 cache, 4 HBM2e Controllers, 1 Media Engine and 8 Xe Links

  • The 2-Stack Xe-HPC configuration combines up to 2 such Stacks

8.1: Xe Link

  • High speed coherent Unified Fabric between GPUs
  • Load/Store, Bulk Data Transfer & Sync Semantics
  • Can fully connect up to 8 GPUs through an Embedded Switch

8.2: Ponte Vecchio

  • New Verification, Reliability Methodology

  • New Software

  • New Signal Integrity Techniques

  • New Interconnects

  • New Power Delivery, Packaging Technology

  • New I/O, Memory, IP, SoC Architecture

  • Features: Compute Tile, Rambo Tile, Foveros, Base Tile, HBM Tile, Xe Link Tile, Multi-Tile Package, EMIB Tile

  • High Speed MDFI interconnect between Xe-HPC stacks

  • 100B Transistors

  • 47 Active Tiles

  • 5 Process Nodes

  • Goals: Leadership Performance in HPC & AI, Connectivity to scale up, Unified Programming Model with oneAPI

  • Challenges: Scale of Integration, Foveros Implementation, Verification Tools & Methods, Signal Integrity, Reliability & Power Delivery

    • Had to transfer data at 1.5x speed to reduce Foveros connections
    • 2 orders of magnitude more Foveros connections than any previous Intel designs
  • Coming to customers early next year

  • Available Form Factors/Systems: PCIe Card, OAM, ×4 Subsystem, ×4 Subsystem with 2 Socket Sapphire Rapids

8.2.1: Progress of Execution

  • A0 Silicon Status
  • >45 TFLOPS FP32 [Also FP64 by extension]
  • >5 TBps Memory Fabric Bandwidth
  • >2 TBps Connectivity Bandwidth
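
As a sanity check on the FP32 figure (my arithmetic, not Intel's, combining the per-unit numbers from section 8): 2 Stacks × 4 Slices × 16 Xe-Cores × 8 Vector Engines × 32 FP32 ops/clock = 32,768 FP32 ops/clock, so >45 TFLOPS implies a sustained clock of at least 45×10¹² / 32,768 ≈ 1.37 GHz. The same arithmetic holds for FP64, since the Vector Engines run both at 32 ops/clock.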

8.2.2: Compute Tile

  • 8 Xe-Cores [1 Slice is 2 Tiles]
  • L1 Cache: 4 MiB
  • Process Node: TSMC N5 [The presenter said Node 5…]
  • Packaging: Foveros (36 μm pitch) [Second-generation Foveros]

8.2.3: Base Tile

  • L2 Cache: 144 MiB
  • Process Node: Intel 7
  • Area: 640 mm²
  • Host Interface: PCIe Gen 5.0
  • HBM2e, MDFI & EMIB
8.2.4: Xe Link Tile

  • 8 Xe Links
  • Embedded Switch with 8 ports
  • Process Node: TSMC N7
  • Up to 90 Gb/s SerDes
  • Built due to the Aurora Supercomputer contract

9: oneAPI

  • Unified Programming Model to overcome separate software stacks
  • Open and Standards-based
  • Common Hardware Abstraction Layer, Data Parallel Programming Language (DPC++; see the sketch after this list)
  • Common collection of performance libraries addressing math, deep learning, data analytics, video processing and more domains
  • Full performance from hardware (Exposes and exploits the latest features of cutting-edge hardware)
  • Cross-Architecture, Cross-Vendor: Nvidia, AMD GPUs, Arm CPUs, Huawei ASICs
  • Evolving Specification: Graph Interfaces for Deep Learning and Ray Tracing Libraries added with provisional spec of v1.1 in May 2021
  • Industry Momentum from a large quantity of End Users, National Laboratories, ISVs & OSVs, OEMs & SIs, Universities & Research Institutes, CSPs & Frameworks
  • 200K Developers, 300 Applications deployed in market, 80 HPC & AI Applications functional on Xe-HPC
  • Toolkit v2021.3 availability
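
A minimal DPC++/SYCL sketch of the programming model described above (my illustration, not code from the talk): one data-parallel kernel, written once, dispatched to whatever device the runtime selects.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<float> a(1024, 1.0f), b(1024, 2.0f);
    sycl::queue q;                                    // default device selection
    {
        sycl::buffer bufA(a), bufB(b);
        q.submit([&](sycl::handler& h) {
            sycl::accessor A(bufA, h, sycl::read_write);
            sycl::accessor B(bufB, h, sycl::read_only);
            // a[i] += b[i] across 1024 work-items, on CPU, GPU or other
            h.parallel_for(1024, [=](sycl::id<1> i) { A[i] += B[i]; });
        });
    }   // buffers write back to the vectors when they go out of scope
}
```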

9.1: oneAPI Advanced Ray Tracing (oneART)

  • Six Components: Embree, Open Image Denoise, Open Volume Kernel Library, OSPRay, OSPRay Studio, OSPRay for Hydra [Mentioned Embree runs on Apple M1]
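
Of the six components, Embree is the most self-contained to demonstrate. Below is a hedged sketch against the Embree 3 C API (rtcNewDevice and friends): build a one-triangle scene and cast a single ray at it. Illustrative only; error handling omitted.

```cpp
#include <embree3/rtcore.h>
#include <cstdio>
#include <limits>

int main() {
    RTCDevice device = rtcNewDevice(nullptr);
    RTCScene  scene  = rtcNewScene(device);

    // One triangle in the z = 1 plane.
    RTCGeometry geom = rtcNewGeometry(device, RTC_GEOMETRY_TYPE_TRIANGLE);
    float* v = (float*)rtcSetNewGeometryBuffer(geom, RTC_BUFFER_TYPE_VERTEX, 0,
                                               RTC_FORMAT_FLOAT3, 3 * sizeof(float), 3);
    unsigned* idx = (unsigned*)rtcSetNewGeometryBuffer(geom, RTC_BUFFER_TYPE_INDEX, 0,
                                                       RTC_FORMAT_UINT3, 3 * sizeof(unsigned), 1);
    v[0]=0; v[1]=0; v[2]=1;  v[3]=1; v[4]=0; v[5]=1;  v[6]=0; v[7]=1; v[8]=1;
    idx[0]=0; idx[1]=1; idx[2]=2;
    rtcCommitGeometry(geom);
    rtcAttachGeometry(scene, geom);
    rtcReleaseGeometry(geom);
    rtcCommitScene(scene);

    // Ray from the origin, straight down +z, through the triangle.
    RTCRayHit rh{};
    rh.ray.org_x = 0.2f; rh.ray.org_y = 0.2f; rh.ray.org_z = 0.0f;
    rh.ray.dir_z = 1.0f;
    rh.ray.tfar  = std::numeric_limits<float>::infinity();
    rh.hit.geomID = RTC_INVALID_GEOMETRY_ID;

    RTCIntersectContext ctx;
    rtcInitIntersectContext(&ctx);
    rtcIntersect1(scene, &ctx, &rh);
    std::printf("hit: %s (t=%f)\n",
                rh.hit.geomID != RTC_INVALID_GEOMETRY_ID ? "yes" : "no", rh.ray.tfar);

    rtcReleaseScene(scene);
    rtcReleaseDevice(device);
}
```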

10: Aurora Supercomputer

  • Aurora Blade contains dual-socket Sapphire Rapids & 6 Ponte Vecchio GPUs [Mentioned Aurora has tens of thousands of blades? I’ve heard 9000-10000]
    • [Cooling System looks extremely cool]
  • Developed in conjunction with Argonne National Laboratory, Hewlett Packard Enterprise & the US Department of Energy

11: Intel Innovation Teasers

  • October 27-28
  • Tour de Force of Technology
  • Two full days of technical keynotes, break-out sessions, hands-on demos and networking events
