Sequential Flow

The Hardware Pipeline of Machine Learning

Machine learning isn't just about algorithms and data—it's fundamentally dependent on the physical hardware that powers computation. Understanding the sequential flow of how data moves through ML infrastructure reveals why certain hardware choices matter and how each component contributes to the final result.

Every AI model you interact with—from ChatGPT to image generators—relies on a carefully orchestrated symphony of hardware components working in perfect sequence.

The ML Infrastructure Pipeline


  • Step 1. Data: storage, ingestion, and preprocessing
  • Step 2. CPUs: general-purpose orchestration and control
  • Step 3. Accelerators: GPUs, TPUs, and NPUs for parallel computation
  • Step 4. Interconnects & Networking: high-speed data transfer between components
  • Step 5. Supporting Infrastructure: power, cooling, and orchestration systems

System Architecture Overview

[System architecture diagram: data storage feeds the CPU control plane, which drives GPU/TPU/NPU accelerators over the network interconnect, all supported by power and cooling infrastructure. Labeled flows: batch loading, task distribution, gradient sync, and system support.]

Data: The Foundation

Storage systems, data pipelines, and preprocessing infrastructure

Data is the lifeblood of machine learning. Before any computation occurs, massive datasets must be stored, organized, and efficiently delivered to compute resources. Modern ML training often requires petabytes of data, demanding specialized storage infrastructure.

Storage Technologies

ML workloads use a hierarchy of storage systems, each optimized for different access patterns:

  • NVMe SSDs: ~7 GB/s local high-speed storage
  • Object storage: petabyte scale (S3, GCS, Azure Blob)
  • Distributed file systems: 100+ GB/s (Lustre, GPFS, WekaFS)
  • Data loaders: parallel I/O with prefetching and caching

The Data Pipeline

Raw data must be transformed before training. Typical stages include (a minimal sketch follows the list):

  • Ingestion: Collecting data from various sources (sensors, web, databases)
  • Cleaning: Removing duplicates, handling missing values, filtering noise
  • Transformation: Normalization, tokenization, feature engineering
  • Batching: Organizing data into mini-batches for efficient GPU utilization
  • Augmentation: Creating variations to improve model generalization
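
As a rough illustration of the cleaning, transformation, and batching stages, here is a minimal Python sketch; the normalization scheme, batch size, and random stand-in data are arbitrary choices for the example, not recommendations.

    import numpy as np

    def preprocess(samples: np.ndarray) -> np.ndarray:
        """Clean and normalize raw feature vectors."""
        # Cleaning: drop rows containing missing values (NaNs)
        samples = samples[~np.isnan(samples).any(axis=1)]
        # Transformation: standardize each feature to zero mean, unit variance
        mean = samples.mean(axis=0)
        std = samples.std(axis=0) + 1e-8
        return (samples - mean) / std

    def batch(samples: np.ndarray, batch_size: int = 64):
        """Yield mini-batches sized for efficient accelerator utilization."""
        for start in range(0, len(samples), batch_size):
            yield samples[start:start + batch_size]

    raw = np.random.rand(1_000, 16)          # stand-in for ingested data
    for mini_batch in batch(preprocess(raw)):
        pass                                 # each mini_batch would be sent to the GPU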

Why Data I/O Matters

A common bottleneck in ML training is data starvation—when GPUs sit idle waiting for data. Modern systems use techniques like prefetching (loading next batches while current ones train), caching hot data in memory, and parallel data loading across multiple CPU cores.
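
In PyTorch, for example, these techniques map directly onto arguments of the standard DataLoader. A minimal sketch, where the toy dataset, batch size, and worker count are placeholders to be tuned for the actual hardware:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def main():
        # Toy dataset standing in for a real preprocessed corpus
        dataset = TensorDataset(torch.randn(10_000, 128),
                                torch.randint(0, 10, (10_000,)))

        loader = DataLoader(
            dataset,
            batch_size=256,
            shuffle=True,
            num_workers=8,        # parallel data loading across CPU cores
            pin_memory=True,      # page-locked memory speeds host-to-GPU copies
            prefetch_factor=4,    # each worker keeps 4 batches queued ahead of the GPU
            persistent_workers=True,
        )
        for features, labels in loader:
            # features.to("cuda", non_blocking=True) would overlap the copy with compute
            pass

    if __name__ == "__main__":   # required because the workers use multiprocessing
        main()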

CPUs: The Orchestrators

General-purpose processors that coordinate the entire ML pipeline

While GPUs handle the heavy parallel computation, CPUs (Central Processing Units) remain essential for orchestrating ML workflows. They handle data preprocessing, model deployment, system management, and all tasks requiring complex branching logic.

CPU Architecture for ML

Modern server CPUs are optimized for throughput and memory bandwidth:

  • Core count: 64-128 cores (AMD EPYC, Intel Xeon)
  • Memory channels: 8-12 channels, supporting up to 6 TB of RAM
  • PCIe lanes: 128+ lanes connecting GPUs and NVMe storage
  • Cache: 256-768 MB of L3 cache per socket

CPU Responsibilities in ML

  • Data management: loading, preprocessing, augmentation
  • Orchestration: job scheduling, GPU coordination
  • Inference: lightweight models, edge deployment

Vector Extensions

Modern CPUs include vector processing units that can accelerate certain ML operations (a quick way to check what a machine supports is sketched after this list):

  • AVX-512: Intel's 512-bit vector instructions for matrix operations
  • AMX: Intel's Advanced Matrix Extensions for int8/bf16 matrix multiply
  • SVE/SVE2: ARM's Scalable Vector Extension for data-parallel operations
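
One quick way to see which of these extensions a machine exposes is to read the CPU flags the kernel reports; a minimal, Linux-only sketch:

    # Linux-only sketch: list the vector-extension flags the kernel reports.
    flags = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith(("flags", "Features")):   # "flags" on x86, "Features" on ARM
                flags.update(line.split(":", 1)[1].split())

    for feature in ("avx2", "avx512f", "amx_tile", "sve"):
        print(f"{feature:>9}: {'present' if feature in flags else 'absent'}")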

While GPUs dominate training, CPUs handle ~30% of total ML infrastructure compute, especially for data preprocessing, feature engineering, and serving lightweight models.

Accelerators: The Powerhouses

GPUs, TPUs, and NPUs that perform the heavy lifting of ML computation

Accelerators are specialized processors designed for the massively parallel computations that dominate machine learning. While CPUs excel at sequential tasks, accelerators can perform thousands of operations simultaneously.

GPU (Graphics Processing Unit)

Originally designed for rendering graphics, GPUs have become the workhorse of ML training due to their thousands of parallel cores.

NVIDIA H100 (SXM) at a glance:

  • Compute: 3,958 TFLOPS of FP8 Tensor Core performance
  • Memory: 80 GB HBM3 with 3.35 TB/s of bandwidth
  • CUDA cores: 16,896 parallel processing units
  • Tensor cores: 528 matrix multiplication units

TPU (Tensor Processing Unit)

Google's custom ASIC designed specifically for neural network workloads. TPUs optimize for the specific operations in ML rather than general graphics.

Feature               TPU v4             TPU v5e
Peak TFLOPS (BF16)    275                197
HBM memory            32 GB              16 GB
Memory bandwidth      1.2 TB/s           819 GB/s
Interconnect          ICI (inter-chip)   ICI
Best for              Large models       Cost-efficient training

NPU (Neural Processing Unit)

NPUs are specialized accelerators optimized for inference (running trained models) rather than training. They prioritize energy efficiency and low latency.

  • Apple Neural Engine: 16-core NPU in M-series chips, 15.8 TOPS
  • Qualcomm Hexagon: Mobile NPU for on-device AI
  • Intel NPU: Integrated in Core Ultra processors
  • Google Edge TPU: Coral devices for edge inference

Why Accelerators Matter

Training GPT-4 required an estimated 25,000 NVIDIA A100 GPUs running for months. Without accelerators, this would take decades on CPUs. The key advantage is parallelism—while a CPU might have 64 cores, a GPU has thousands of cores optimized for the matrix multiplications that dominate neural networks.
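
To make that difference concrete, here is a minimal PyTorch sketch that times the same matrix multiplication on a CPU and a GPU; it assumes a CUDA-capable GPU is available, and the absolute numbers will vary widely across hardware.

    import time
    import torch

    n = 4096
    a, b = torch.randn(n, n), torch.randn(n, n)

    # CPU matrix multiply
    t0 = time.perf_counter()
    a @ b
    cpu_s = time.perf_counter() - t0

    # GPU matrix multiply (requires a CUDA-capable device)
    a_gpu, b_gpu = a.cuda(), b.cuda()
    a_gpu @ b_gpu                      # warm-up: triggers CUDA context and kernel setup
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    a_gpu @ b_gpu
    torch.cuda.synchronize()           # kernels launch asynchronously; wait before timing
    gpu_s = time.perf_counter() - t0

    print(f"CPU: {cpu_s:.3f} s   GPU: {gpu_s:.4f} s   speedup: ~{cpu_s / gpu_s:.0f}x")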

Memory Bandwidth: The Hidden Bottleneck

  • DDR5 system RAM: ~90 GB/s
  • GDDR6X (consumer GPUs): ~1 TB/s
  • HBM3 (H100): ~3.35 TB/s
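
A back-of-the-envelope calculation shows why this hierarchy matters for inference: in single-stream autoregressive generation, every generated token requires streaming the model's weights from memory, so bandwidth caps the token rate. The parameter count and precision below are illustrative assumptions.

    # Illustrative assumption: a 70B-parameter model stored in FP16 (2 bytes/param)
    weight_bytes = 70e9 * 2

    for name, bandwidth_gbs in [("DDR5", 90), ("GDDR6X", 1000), ("HBM3", 3350)]:
        seconds_per_pass = weight_bytes / (bandwidth_gbs * 1e9)
        # Reading all weights once per token caps the rate at 1 / seconds_per_pass
        print(f"{name:>7}: {seconds_per_pass * 1e3:7.1f} ms per pass "
              f"(~{1 / seconds_per_pass:5.1f} tokens/s upper bound)")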

Interconnects & Networking

High-speed connections that enable distributed training at scale

Modern ML models are too large for a single GPU. Training requires distributing work across hundreds or thousands of accelerators. The interconnect fabric determines how fast these devices can share gradients and synchronize—often becoming the bottleneck in large-scale training.

Interconnect Hierarchy

Data flows through multiple layers of connectivity, each with different bandwidth and latency characteristics:

  • Intra-node: NVLink, NVSwitch (900 GB/s per GPU)
  • Inter-node: InfiniBand, RoCE (400 Gb/s per link)
  • Data center: Ethernet fabric (spine-leaf topology)

Key Technologies

Technology        Bandwidth   Latency   Use case
NVLink 4.0        900 GB/s    ~1 μs     GPU-to-GPU within a server
InfiniBand NDR    400 Gb/s    ~1.5 μs   Server-to-server
PCIe 5.0          128 GB/s    ~2 μs     CPU-to-GPU, NVMe
400G Ethernet     400 Gb/s    ~5 μs     Data center networking

Distributed Training Patterns

Different parallelism strategies have different communication requirements (a minimal data-parallel sketch follows the list):

  • Data Parallelism: Each GPU has full model copy, synchronizes gradients. High bandwidth needed for gradient all-reduce.
  • Model Parallelism: Model split across GPUs. Requires low latency for layer-to-layer communication.
  • Pipeline Parallelism: Different layers on different GPUs. Moderate bandwidth, latency-tolerant.
  • Tensor Parallelism: Individual operations split across GPUs. Very low latency required.
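
As a sketch of the first pattern, here is a single data-parallel training step in PyTorch using DistributedDataParallel; the model, optimizer, and batch are placeholders, and the script assumes it is launched with torchrun so the rank environment variables are set.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
        model = DDP(model, device_ids=[local_rank])  # full model copy per GPU

        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        x = torch.randn(64, 1024, device="cuda")
        loss = model(x).square().mean()
        loss.backward()      # DDP all-reduces (averages) gradients across GPUs here
        opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched as, for example, torchrun --nproc_per_node=8 train.py, each process drives one GPU, and the gradient all-reduce described in the first bullet happens inside loss.backward(), overlapped with the rest of backpropagation.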

The All-Reduce Operation

In distributed training, GPUs must average their gradients after each batch. With 1000 GPUs and a 175B parameter model like GPT-3, this means transferring 700GB of gradient data every few seconds. Efficient all-reduce algorithms (ring, tree, hierarchical) and fast interconnects are critical.
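
Those numbers can be sanity-checked with a short calculation; the ring all-reduce traffic formula is standard, while the per-GPU bandwidth is an assumption chosen for illustration.

    params = 175e9                  # GPT-3-scale model
    grad_bytes = params * 4         # FP32 gradients, 4 bytes each -> ~700 GB
    n_gpus = 1000

    # A ring all-reduce moves about 2 * (N - 1) / N of the data per GPU
    per_gpu_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    link_gbs = 50                   # assumed effective per-GPU network bandwidth, GB/s
    print(f"gradients: {grad_bytes / 1e9:.0f} GB, "
          f"per-GPU traffic: {per_gpu_bytes / 1e9:.0f} GB, "
          f"time: {per_gpu_bytes / (link_gbs * 1e9):.1f} s per sync")

Even at tens of GB/s per GPU this naive estimate runs to tens of seconds, which is why real systems overlap communication with backpropagation, use reduced-precision gradients, and reduce hierarchically over NVLink within each node before crossing the network.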

Supporting Infrastructure

Power delivery, cooling, and orchestration that keeps everything running

Behind every AI cluster is a massive infrastructure investment. A single NVIDIA H100 GPU consumes 700 W, and an 8-GPU server draws 10+ kW for compute alone, before cooling and support systems. At scale, these requirements become a defining constraint.

Power Infrastructure

  • GPU power draw: 700 W (NVIDIA H100 SXM)
  • 8-GPU server: 10+ kW (DGX H100 system)
  • AI data center: 100+ MW (large training clusters)
  • PUE target: 1.1-1.2 (Power Usage Effectiveness)
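
These figures compose in a straightforward way; the arithmetic below is a sketch in which the per-server overhead and PUE are rough assumptions rather than measurements.

    gpu_w = 700                    # NVIDIA H100 SXM board power
    gpus_per_server = 8
    server_overhead_w = 4600       # assumed CPUs, NVSwitches, NICs, storage, fans
    pue = 1.15                     # assumed facility Power Usage Effectiveness

    server_it_w = gpu_w * gpus_per_server + server_overhead_w   # ~10.2 kW per server
    facility_w_per_server = server_it_w * pue                   # power drawn from the grid

    servers = 1250                 # ~10,000 GPUs
    cluster_mw = servers * facility_w_per_server / 1e6
    print(f"per server: {server_it_w / 1e3:.1f} kW IT load, "
          f"{facility_w_per_server / 1e3:.1f} kW at the meter; "
          f"10k-GPU cluster: ~{cluster_mw:.0f} MW")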

Cooling Systems

Heat removal is one of the biggest challenges in AI infrastructure. Modern approaches include:

  • Air Cooling: Traditional CRAC units, hot/cold aisle containment. Limited to ~20kW per rack.
  • Rear-Door Heat Exchangers: Water-cooled doors that capture heat at the rack. 30-40kW per rack.
  • Direct Liquid Cooling: Cold plates attached directly to CPUs/GPUs. 80+ kW per rack.
  • Immersion Cooling: Components submerged in dielectric fluid. 100+ kW per rack, near-silent operation.

Orchestration & Management

Running thousands of GPUs requires sophisticated software infrastructure:

  • Job scheduling: Slurm, Kubernetes (resource allocation)
  • Monitoring: Prometheus, Grafana (health and metrics)
  • ML platform: MLflow, Weights & Biases (experiment tracking)

Reliability & Fault Tolerance

At scale, hardware failures are constant. A 10,000-GPU cluster might see multiple GPU failures per day. Systems must handle several concerns (a minimal checkpointing sketch follows the list):

  • Checkpointing: Saving model state periodically to resume after failures
  • Automatic failover: Detecting and routing around failed components
  • Graceful degradation: Continuing training with reduced capacity
  • Health monitoring: Predictive maintenance using GPU telemetry
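
As a minimal sketch of the first item, periodic checkpointing and resume in PyTorch; the path, interval, and model are placeholders, and a real system would also save the data-loader position and RNG state, ideally to shared storage.

    import os
    import torch

    CKPT = "checkpoint.pt"                     # placeholder path

    model = torch.nn.Linear(512, 512)          # placeholder model
    opt = torch.optim.AdamW(model.parameters())

    # Resume if an earlier run left a checkpoint behind
    start_step = 0
    if os.path.exists(CKPT):
        state = torch.load(CKPT)
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["optimizer"])
        start_step = state["step"] + 1

    for step in range(start_step, 10_000):
        loss = model(torch.randn(32, 512)).square().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

        if step % 1000 == 0:                   # periodic checkpoint
            torch.save({"model": model.state_dict(),
                        "optimizer": opt.state_dict(),
                        "step": step}, CKPT)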

The True Cost of AI

Training a large language model isn't just about GPU hours. Consider: a 100MW data center costs $500M+ to build, consumes $50M/year in electricity, and requires constant maintenance. The infrastructure supporting AI is a massive, ongoing investment that shapes who can participate in frontier research.