Google Argos Video Coding Unit

Compression Difficulty

  • Software Encoding Time relative to H.264:
    • VP9: 10×
    • AV1: 200×
  • Software Encoding Time relative to H.264 1080p24:
    • VP9 2160p60: 100×
    • AV1 4320p60: 8000× [Why do they need 8K?]
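The resolution-scaled figures are consistent with multiplying the 10× and 200× codec factors by the increase in pixel rate over 1080p24. A quick check, assuming cost scales linearly with pixels per second (my assumption, not something the talk states):

```python
def pixel_rate(width, height, fps):
    """Pixels per second for a given resolution and frame rate."""
    return width * height * fps

BASELINE = pixel_rate(1920, 1080, 24)  # H.264 1080p24 reference point

def relative_cost(codec_factor, width, height, fps):
    """Codec factor scaled by pixel rate relative to 1080p24 (linear-scaling assumption)."""
    return codec_factor * pixel_rate(width, height, fps) / BASELINE

vp9_2160p60 = relative_cost(10, 3840, 2160, 60)   # 10x codec factor, 10x pixel rate
av1_4320p60 = relative_cost(200, 7680, 4320, 60)  # 200x codec factor, 40x pixel rate
print(vp9_2160p60, av1_4320p60)
```

Both results line up with the 100× and 8000× figures in the list.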

Reasons for VCU Development

  • Existing hardware encoders use 5× more bits for a given quality
  • Desired features were unavailable:
    • Full implementation of H.264 and VP9 encoding
    • Single and Multi-Output Transcoding (SOT, MOT)
    • Speed vs Quality Tuning
    • Live Streaming & Offline Transcoding
    • Full access to software control algorithms
  • Reducing YouTube’s computing cycles dramatically
  • Balance between quality, performance, flexibility, and cost

Video Encoder Core

  • Hardware Acceleration for video encoding
    • H.264 2160p60
    • VP9 2160p60
  • Interconnects:
    • 256b AXI Bus
    • APB Control Bus
  • Reads up to 4 frames and writes 1
  • Frame buffer compression to allow for greater scalability in the chip

Pre-Processing Engine

  • Independent usage is possible
  • Capabilities:
    • Colour space conversion
    • Cropping
    • Scaling
    • Rotation
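To make the colour space conversion step concrete, here is a minimal per-pixel sketch. The talk does not say which matrices the hardware supports; full-range BT.601 RGB to YCbCr is used purely as a familiar example, and `rgb_to_ycbcr` is a name invented for illustration.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 RGB -> YCbCr for 8-bit samples (illustrative choice of matrix)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

    def clamp(value):
        # Round to the nearest integer and keep the result in the 8-bit range.
        return max(0, min(255, round(value)))

    return clamp(y), clamp(cb), clamp(cr)

white = rgb_to_ycbcr(255, 255, 255)
black = rgb_to_ycbcr(0, 0, 0)
red = rgb_to_ycbcr(255, 0, 0)
```

Grey-scale inputs land on neutral chroma (128, 128), which is a handy sanity check for any such matrix.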

Temporal Filter

  • Alternate Frame Generation in VP9 encoding
  • Separate operation (Prevents usage of other resources)

Motion-Search & Rate-Distortion Optimisation Engine

  • Adjustable motion-search window
  • Adjustable number of RDO candidates
  • Allows for speed/quality trade-off
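The window-size knob is easy to picture with a toy full-search block matcher. This is a software sketch only, using sum of absolute differences as the cost; the talk does not describe the hardware's actual search strategy, and all names here are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )

def motion_search(cur, ref, cx, cy, size, window):
    """Full search within +/- window pixels; returns (cost, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(cur, ref, cx, cy, rx, ry, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best

# Toy frames: a bright 2x2 patch moves 3 pixels to the right from ref to cur.
ref = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
for j in (2, 3):
    for i in (1, 2):
        ref[j][i] = 200
    for i in (4, 5):
        cur[j][i] = 200

small = motion_search(cur, ref, 4, 2, 2, window=1)  # window too small to reach the match
large = motion_search(cur, ref, 4, 2, 2, window=3)  # finds the exact match at dx=-3
```

The larger window evaluates 49 candidates instead of 9 and finds a zero-cost match, which is the speed/quality trade-off in miniature.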

Reconstruction & Entropy Coding

  • RD-Optimal Quantisation
  • PSNR Calculation
  • First Pass Statistics Collection
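For reference, PSNR compares the reconstructed frame against the source via mean squared error. A minimal software sketch for 8-bit samples follows; the hardware presumably accumulates the squared error on the fly, but that is my guess rather than something stated in the talk.

```python
import math

def psnr(original, reconstructed, max_value=255):
    """PSNR in dB between two equally sized 2-D arrays of samples."""
    count = 0
    squared_error = 0
    for row_o, row_r in zip(original, reconstructed):
        for o, r in zip(row_o, row_r):
            squared_error += (o - r) ** 2
            count += 1
    if squared_error == 0:
        return math.inf  # identical frames
    mse = squared_error / count
    return 10 * math.log10(max_value ** 2 / mse)

source = [[100, 100], [100, 100]]
off_by_one = [[101, 99], [101, 99]]  # every sample differs by 1, so MSE = 1
identical_db = psnr(source, source)
off_by_one_db = psnr(source, off_by_one)
```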

Registers

  • Programmable registers allowing coding quality fine-tuning
  • Controlled by software algorithms (such as Rate control)
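As an illustration of software steering the encoder through its parameters, here is a deliberately naive rate-control step. Everything in it is hypothetical: the function name, the 10% tolerance band, and the 0-51 range (borrowed from H.264's QP range). Google's actual rate-control algorithms are not described in the talk.

```python
def next_qp(qp, bits_used, bits_target, step=1, qp_min=0, qp_max=51):
    """Raise QP after a frame overshoots its bit budget, lower it after an undershoot."""
    if bits_used > bits_target * 1.1:
        qp += step  # too many bits: quantise more coarsely
    elif bits_used < bits_target * 0.9:
        qp -= step  # bits to spare: quantise more finely
    return max(qp_min, min(qp_max, qp))

over = next_qp(30, bits_used=120_000, bits_target=100_000)
under = next_qp(30, bits_used=80_000, bits_target=100_000)
on_target = next_qp(30, bits_used=100_000, bits_target=100_000)
capped = next_qp(51, bits_used=200_000, bits_target=100_000)
```

In a real deployment the returned value would be written to the core before the next frame; because this logic lives in software, it can be changed without touching the hardware.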

High-level Synthesis/Design Flow

  • In use for 10 years by codec team
  • Designed VCU Core with Catapult (C++ HLS flow from Siemens)
  • Instrumental in VCU development
  • Enabled software/hardware co-design
  • Allowed for very fast design iteration

Benefits of C++

  • Separate algorithmic model is not required (Single source of truth)
  • Bit-exact results between model and RTL
  • 5-10× less code written, maintained and reviewed
  • Able to use software development tools
    • Address & memory sanitiser
    • Distributed computing
  • 7-8 orders of magnitude higher testing throughput
  • 99% of functional bugs are found before simulation

Time for a Better Product

  • Team can work on high-value problems
    • Compiler can deal with cycle-for-cycle design
    • No need to deal with block internal timing bugs
  • Able to try a large number of algorithms and architectures
  • Able to add features & improvements late in the process
  • Trivial technology scaling
    • Compiler can create a new data path for a new clock target and technology with the same C++ code

VCU ASIC and System

  • Accelerators needed to address cost/performance gap caused by the end of Moore’s Law
  • Designed for data centers
    • Deployed only in clusters
    • Heterogeneous clusters of CPU and VCU systems
  • Maximises utilisation globally
    • Diverse use-cases across many regions worldwide
    • Support of fungible workloads
  • Optimised for deployment at scale
    • Tolerance of chip- and core-level errors
    • Reduced disruption from changes and failures
  • Designed for agility and adaptability
    • HLS
    • Software control
    • Use-cases and patterns vary

Chip Design Goals

  • Maximum Utilisation
    • Few jobs can use the full chip
    • Isolated userspace queues
  • Maximum userspace control
    • Video rate control, quality and performance controlled with software
    • Simple firmware work-items (DMA Data, run-on-core, etc.)
  • Memory latency, average and peak bandwidth optimised to serve the cores

ASIC Layout

Argos VCU ASIC

  • Various components connected by NoC
    • Microcontroller
    • PCI Express
    • DMA
    • 3 Decoder Cores
    • 10 Encoder Cores
    • 4 LPDDR4 Controllers
  • Microcontroller is attached to peripherals (to handle low-speed I/O)
  • 8 GB LPDDR4-3200 with ECC (6 LPDDR4 modules) [Weird ratio for ECC]

NoC Topology

  • Supports bursty traffic and uniform access to memory for software simplicity
  • 3 Interconnected Routers (~50 GB/s per Router)
    • Microcontroller, PCIe, DMA, 3 Decoder Cores, 2 Encoder Cores
    • 8 Encoder Cores
    • 4 LPDDR4 Controllers (12 GB/s each)
  • Router (8 GB/s) connecting PCIe egress to Microcontroller and DMA
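A back-of-the-envelope check on the bandwidth figures above. The per-encoder share is only a rough upper bound under the assumption that the encoder cores alone split memory bandwidth evenly; in practice decoders, DMA and PCIe traffic compete as well.

```python
# Figures taken from the notes above.
MEMORY_CONTROLLERS = 4
GB_PER_S_PER_CONTROLLER = 12
ROUTER_GB_PER_S = 50  # approximate ("~50 GB/s per Router")
ENCODER_CORES = 10

memory_bandwidth = MEMORY_CONTROLLERS * GB_PER_S_PER_CONTROLLER
router_headroom = ROUTER_GB_PER_S - memory_bandwidth
# Assumption: an equal split across encoder cores, ignoring all other traffic.
per_encoder_share = memory_bandwidth / ENCODER_CORES

print(memory_bandwidth, router_headroom, per_encoder_share)
```

The four controllers together supply 48 GB/s, so a single ~50 GB/s router is sized to carry close to the full memory bandwidth.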

Firmware

  • Only controls work dispatch and isolation
  • Userspace control of codec, parameters and dependencies

System

  • Maximise VCUs per system to maximise performance/TCO$
  • 20 VCUs per system

Performance

Argos VCU Performance

  • 1 VCU matches a dual-socket Skylake system in H.264 (at much lower power)
  • 1 VCU matches five dual-socket Skylake systems in VP9
  • 20 VCUs replace multiple racks of CPUs in VP9
  • Great performance scaling from 1 to 8 to 20 VCUs
    • H.264: +5% (8×), +4.5% (20×) [Possibly limited by decoding]
    • VP9: -0.5% (8×), -0.6% (20×)
  • Hardware decoding limits single-output transcoding performance
  • Multi-output transcoding speed is 1.2-1.3× single-output transcoding speed
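To put the "multiple racks" claim in numbers, the per-VCU equivalences can be multiplied out for a 20-VCU system. This sketch reads the scaling percentages as the change in per-VCU throughput relative to a single VCU, which is my interpretation of the slide rather than something the talk spelled out.

```python
VCUS_PER_SYSTEM = 20
# Dual-socket Skylake systems matched by one VCU, from the notes above.
SYSTEMS_PER_VCU = {"H.264": 1, "VP9": 5}
# Scaling figures at 20 VCUs, read as change in per-VCU throughput (assumption).
SCALING_AT_20 = {"H.264": 0.045, "VP9": -0.006}

def equivalent_systems(codec):
    """Dual-socket Skylake systems matched by one 20-VCU system."""
    return VCUS_PER_SYSTEM * SYSTEMS_PER_VCU[codec] * (1 + SCALING_AT_20[codec])

h264_equivalent = equivalent_systems("H.264")
vp9_equivalent = equivalent_systems("VP9")
print(round(h264_equivalent, 1), round(vp9_equivalent, 1))
```

Under that reading, one VCU system stands in for roughly 100 dual-socket servers on VP9 and roughly 21 on H.264.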

Software/Hardware Co-design

  • Tuning after deployment [Software capabilities are impressive]
    • Quality improvement with parameter and software tuning (No firmware or driver updates)
    • Opportunistic software decoding to reduce impact of limited hardware decoding
  • Failure Management & Recovery
    • Non-persistent Errors
    • Software can retry after core errors
    • Datacenter can retry after queue errors
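The two retry levels can be sketched as a bounded retry loop that escalates when it gives up. `CoreError`, `run_with_retry` and `flaky_encode` are hypothetical names for illustration; the real error-reporting interface is not described in the talk.

```python
class CoreError(Exception):
    """Hypothetical transient error reported by an encoder or decoder core."""

def run_with_retry(work, max_attempts=3):
    """Run work(); retry on CoreError, re-raising once attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return work(), attempt
        except CoreError:
            if attempt == max_attempts:
                # Escalate so a higher layer (e.g. the cluster scheduler) can requeue.
                raise

# Simulated work item that fails twice before succeeding.
failures = iter([True, True, False])

def flaky_encode():
    if next(failures):
        raise CoreError("transient core fault")
    return "encoded-chunk"

result, attempts = run_with_retry(flaky_encode)
```

Because the errors are non-persistent, rerunning the same work item is safe, and only repeated failures need to be surfaced to the datacenter level.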
