Sequential Flow

The Hardware Pipeline of Machine Learning

Machine learning isn't just about algorithms and data—it's fundamentally dependent on the physical hardware that powers computation. Understanding the sequential flow of how data moves through ML infrastructure reveals why certain hardware choices matter and how each component contributes to the final result.

Every AI model you interact with—from ChatGPT to image generators—relies on a carefully orchestrated symphony of hardware components working in perfect sequence.

The ML Infrastructure Pipeline


  • Step 1. Data: storage, ingestion, and preprocessing
  • Step 2. CPUs: general-purpose orchestration and control
  • Step 3. Accelerators: GPUs, TPUs, and NPUs for parallel computation
  • Step 4. Interconnects & Networking: high-speed data transfer between components
  • Step 5. Supporting Infrastructure: power, cooling, and orchestration systems

System Architecture Overview

[System architecture diagram: data storage feeds the CPU control plane, which drives GPU/TPU/NPU accelerators over the network interconnect, all supported by power and cooling infrastructure. Labeled flows: batch loading, task distribution, gradient sync, and system support.]

Data: The Foundation

Storage systems, data pipelines, and preprocessing infrastructure

Data is the lifeblood of machine learning. Before any computation occurs, massive datasets must be stored, organized, and efficiently delivered to compute resources. Modern ML training often requires petabytes of data, demanding specialized storage infrastructure.

Storage Technologies

ML workloads use a hierarchy of storage systems, each optimized for different access patterns:

  • NVMe SSDs: ~7 GB/s local high-speed storage
  • Object storage: petabyte scale (S3, GCS, Azure Blob)
  • Distributed file systems: 100+ GB/s (Lustre, GPFS, WekaFS)
  • Data loaders: parallel I/O with prefetching and caching

The Data Pipeline

Raw data must be transformed before training. Typical stages include (a minimal sketch follows the list):

  • Ingestion: Collecting data from various sources (sensors, web, databases)
  • Cleaning: Removing duplicates, handling missing values, filtering noise
  • Transformation: Normalization, tokenization, feature engineering
  • Batching: Organizing data into mini-batches for efficient GPU utilization
  • Augmentation: Creating variations to improve model generalization
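
As a rough illustration of the cleaning, transformation, and batching stages, here is a minimal Python sketch; the normalization scheme, batch size, and random stand-in data are arbitrary choices for the example, not recommendations.

    import numpy as np

    def preprocess(samples: np.ndarray) -> np.ndarray:
        """Clean and normalize raw feature vectors."""
        # Cleaning: drop rows containing missing values (NaNs)
        samples = samples[~np.isnan(samples).any(axis=1)]
        # Transformation: standardize each feature to zero mean, unit variance
        mean = samples.mean(axis=0)
        std = samples.std(axis=0) + 1e-8
        return (samples - mean) / std

    def batch(samples: np.ndarray, batch_size: int = 64):
        """Yield mini-batches sized for efficient accelerator utilization."""
        for start in range(0, len(samples), batch_size):
            yield samples[start:start + batch_size]

    raw = np.random.rand(1_000, 16)          # stand-in for ingested data
    for mini_batch in batch(preprocess(raw)):
        pass                                 # each mini_batch would be sent to the GPU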

Why Data I/O Matters

A common bottleneck in ML training is data starvation—when GPUs sit idle waiting for data. Modern systems use techniques like prefetching (loading next batches while current ones train), caching hot data in memory, and parallel data loading across multiple CPU cores.
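
In PyTorch, for example, these techniques map directly onto arguments of the standard DataLoader. A minimal sketch, where the toy dataset, batch size, and worker count are placeholders to be tuned for the actual hardware:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def main():
        # Toy dataset standing in for a real preprocessed corpus
        dataset = TensorDataset(torch.randn(10_000, 128),
                                torch.randint(0, 10, (10_000,)))

        loader = DataLoader(
            dataset,
            batch_size=256,
            shuffle=True,
            num_workers=8,        # parallel data loading across CPU cores
            pin_memory=True,      # page-locked memory speeds host-to-GPU copies
            prefetch_factor=4,    # each worker keeps 4 batches queued ahead of the GPU
            persistent_workers=True,
        )
        for features, labels in loader:
            # features.to("cuda", non_blocking=True) would overlap the copy with compute
            pass

    if __name__ == "__main__":   # required because the workers use multiprocessing
        main()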

CPUs: The Orchestrators

General-purpose processors that coordinate the entire ML pipeline

While GPUs handle the heavy parallel computation, CPUs (Central Processing Units) remain essential for orchestrating ML workflows. They handle data preprocessing, model deployment, system management, and all tasks requiring complex branching logic.

CPU Architecture for ML

Modern server CPUs are optimized for throughput and memory bandwidth:

  • Core count: 64-128 cores (AMD EPYC, Intel Xeon)
  • Memory channels: 8-12 channels, supporting up to 6 TB of RAM
  • PCIe lanes: 128+ lanes connecting GPUs and NVMe storage
  • Cache: 256-768 MB of L3 cache per socket

CPU Responsibilities in ML

  • Data management: loading, preprocessing, augmentation
  • Orchestration: job scheduling, GPU coordination
  • Inference: lightweight models, edge deployment

Vector Extensions

Modern CPUs include vector processing units that can accelerate certain ML operations (a quick way to check what a machine supports is sketched after this list):

  • AVX-512: Intel's 512-bit vector instructions for matrix operations
  • AMX: Intel's Advanced Matrix Extensions for int8/bf16 matrix multiply
  • SVE/SVE2: ARM's Scalable Vector Extension for data-parallel operations
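
One quick way to see which of these extensions a machine exposes is to read the CPU flags the kernel reports; a minimal, Linux-only sketch:

    # Linux-only sketch: list the vector-extension flags the kernel reports.
    flags = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith(("flags", "Features")):   # "flags" on x86, "Features" on ARM
                flags.update(line.split(":", 1)[1].split())

    for feature in ("avx2", "avx512f", "amx_tile", "sve"):
        print(f"{feature:>9}: {'present' if feature in flags else 'absent'}")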

While GPUs dominate training, CPUs handle ~30% of total ML infrastructure compute, especially for data preprocessing, feature engineering, and serving lightweight models.

Accelerators: The Powerhouses

GPUs, TPUs, and NPUs that perform the heavy lifting of ML computation

Accelerators are specialized processors designed for the massively parallel computations that dominate machine learning. While CPUs excel at sequential tasks, accelerators can perform thousands of operations simultaneously.

GPU (Graphics Processing Unit)

Originally designed for rendering graphics, GPUs have become the workhorse of ML training due to their thousands of parallel cores.

NVIDIA H100 (SXM) at a glance:

  • Compute: 3,958 TFLOPS of FP8 Tensor Core performance
  • Memory: 80 GB HBM3 with 3.35 TB/s of bandwidth
  • CUDA cores: 16,896 parallel processing units
  • Tensor cores: 528 matrix multiplication units

TPU (Tensor Processing Unit)

Google's custom ASIC designed specifically for neural network workloads. TPUs optimize for the specific operations in ML rather than general graphics.

Feature               TPU v4             TPU v5e
Peak TFLOPS (BF16)    275                197
HBM memory            32 GB              16 GB
Memory bandwidth      1.2 TB/s           819 GB/s
Interconnect          ICI (inter-chip)   ICI
Best for              Large models       Cost-efficient training

NPU (Neural Processing Unit)

NPUs are specialized accelerators optimized for inference (running trained models) rather than training. They prioritize energy efficiency and low latency.

  • Apple Neural Engine: 16-core NPU in M-series chips, 15.8 TOPS
  • Qualcomm Hexagon: Mobile NPU for on-device AI
  • Intel NPU: Integrated in Core Ultra processors
  • Google Edge TPU: Coral devices for edge inference

Why Accelerators Matter

Training GPT-4 required an estimated 25,000 NVIDIA A100 GPUs running for months. Without accelerators, this would take decades on CPUs. The key advantage is parallelism—while a CPU might have 64 cores, a GPU has thousands of cores optimized for the matrix multiplications that dominate neural networks.
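
To make that difference concrete, here is a minimal PyTorch sketch that times the same matrix multiplication on a CPU and a GPU; it assumes a CUDA-capable GPU is available, and the absolute numbers will vary widely across hardware.

    import time
    import torch

    n = 4096
    a, b = torch.randn(n, n), torch.randn(n, n)

    # CPU matrix multiply
    t0 = time.perf_counter()
    a @ b
    cpu_s = time.perf_counter() - t0

    # GPU matrix multiply (requires a CUDA-capable device)
    a_gpu, b_gpu = a.cuda(), b.cuda()
    a_gpu @ b_gpu                      # warm-up: triggers CUDA context and kernel setup
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    a_gpu @ b_gpu
    torch.cuda.synchronize()           # kernels launch asynchronously; wait before timing
    gpu_s = time.perf_counter() - t0

    print(f"CPU: {cpu_s:.3f} s   GPU: {gpu_s:.4f} s   speedup: ~{cpu_s / gpu_s:.0f}x")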

Memory Bandwidth: The Hidden Bottleneck

  • DDR5 system RAM: ~90 GB/s
  • GDDR6X (consumer GPUs): ~1 TB/s
  • HBM3 (H100): ~3.35 TB/s
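
A back-of-the-envelope calculation shows why this hierarchy matters for inference: in single-stream autoregressive generation, every generated token requires streaming the model's weights from memory, so bandwidth caps the token rate. The parameter count and precision below are illustrative assumptions.

    # Illustrative assumption: a 70B-parameter model stored in FP16 (2 bytes/param)
    weight_bytes = 70e9 * 2

    for name, bandwidth_gbs in [("DDR5", 90), ("GDDR6X", 1000), ("HBM3", 3350)]:
        seconds_per_pass = weight_bytes / (bandwidth_gbs * 1e9)
        # Reading all weights once per token caps the rate at 1 / seconds_per_pass
        print(f"{name:>7}: {seconds_per_pass * 1e3:7.1f} ms per pass "
              f"(~{1 / seconds_per_pass:5.1f} tokens/s upper bound)")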

Interconnects & Networking

High-speed connections that enable distributed training at scale

Modern ML models are too large for a single GPU. Training requires distributing work across hundreds or thousands of accelerators. The interconnect fabric determines how fast these devices can share gradients and synchronize—often becoming the bottleneck in large-scale training.

Interconnect Hierarchy

Data flows through multiple layers of connectivity, each with different bandwidth and latency characteristics:

  • Intra-node: NVLink, NVSwitch (900 GB/s per GPU)
  • Inter-node: InfiniBand, RoCE (400 Gb/s per link)
  • Data center: Ethernet fabric (spine-leaf topology)

Key Technologies

Technology        Bandwidth   Latency   Use case
NVLink 4.0        900 GB/s    ~1 μs     GPU-to-GPU within a server
InfiniBand NDR    400 Gb/s    ~1.5 μs   Server-to-server
PCIe 5.0          128 GB/s    ~2 μs     CPU-to-GPU, NVMe
400G Ethernet     400 Gb/s    ~5 μs     Data center networking

Distributed Training Patterns

Different parallelism strategies have different communication requirements (a minimal data-parallel sketch follows the list):

  • Data Parallelism: Each GPU has full model copy, synchronizes gradients. High bandwidth needed for gradient all-reduce.
  • Model Parallelism: Model split across GPUs. Requires low latency for layer-to-layer communication.
  • Pipeline Parallelism: Different layers on different GPUs. Moderate bandwidth, latency-tolerant.
  • Tensor Parallelism: Individual operations split across GPUs. Very low latency required.
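
As a sketch of the first pattern, here is a single data-parallel training step in PyTorch using DistributedDataParallel; the model, optimizer, and batch are placeholders, and the script assumes it is launched with torchrun so the rank environment variables are set.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
        model = DDP(model, device_ids=[local_rank])  # full model copy per GPU

        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        x = torch.randn(64, 1024, device="cuda")
        loss = model(x).square().mean()
        loss.backward()      # DDP all-reduces (averages) gradients across GPUs here
        opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched as, for example, torchrun --nproc_per_node=8 train.py, each process drives one GPU, and the gradient all-reduce described in the first bullet happens inside loss.backward(), overlapped with the rest of backpropagation.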

The All-Reduce Operation

In distributed training, GPUs must average their gradients after each batch. With 1000 GPUs and a 175B parameter model like GPT-3, this means transferring 700GB of gradient data every few seconds. Efficient all-reduce algorithms (ring, tree, hierarchical) and fast interconnects are critical.
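
Those numbers can be sanity-checked with a short calculation; the ring all-reduce traffic formula is standard, while the per-GPU bandwidth is an assumption chosen for illustration.

    params = 175e9                  # GPT-3-scale model
    grad_bytes = params * 4         # FP32 gradients, 4 bytes each -> ~700 GB
    n_gpus = 1000

    # A ring all-reduce moves about 2 * (N - 1) / N of the data per GPU
    per_gpu_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    link_gbs = 50                   # assumed effective per-GPU network bandwidth, GB/s
    print(f"gradients: {grad_bytes / 1e9:.0f} GB, "
          f"per-GPU traffic: {per_gpu_bytes / 1e9:.0f} GB, "
          f"time: {per_gpu_bytes / (link_gbs * 1e9):.1f} s per sync")

Even at tens of GB/s per GPU this naive estimate runs to tens of seconds, which is why real systems overlap communication with backpropagation, use reduced-precision gradients, and reduce hierarchically over NVLink within each node before crossing the network.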

Supporting Infrastructure

Power delivery, cooling, and orchestration that keeps everything running

Behind every AI cluster is a massive infrastructure investment. A single NVIDIA H100 GPU consumes 700 W, and an 8-GPU server draws 10+ kW for compute alone, before cooling and support systems. At scale, these requirements become a defining constraint.

Power Infrastructure

  • GPU power draw: 700 W (NVIDIA H100 SXM)
  • 8-GPU server: 10+ kW (DGX H100 system)
  • AI data center: 100+ MW (large training clusters)
  • PUE target: 1.1-1.2 (Power Usage Effectiveness)
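
These figures compose in a straightforward way; the arithmetic below is a sketch in which the per-server overhead and PUE are rough assumptions rather than measurements.

    gpu_w = 700                    # NVIDIA H100 SXM board power
    gpus_per_server = 8
    server_overhead_w = 4600       # assumed CPUs, NVSwitches, NICs, storage, fans
    pue = 1.15                     # assumed facility Power Usage Effectiveness

    server_it_w = gpu_w * gpus_per_server + server_overhead_w   # ~10.2 kW per server
    facility_w_per_server = server_it_w * pue                   # power drawn from the grid

    servers = 1250                 # ~10,000 GPUs
    cluster_mw = servers * facility_w_per_server / 1e6
    print(f"per server: {server_it_w / 1e3:.1f} kW IT load, "
          f"{facility_w_per_server / 1e3:.1f} kW at the meter; "
          f"10k-GPU cluster: ~{cluster_mw:.0f} MW")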

Cooling Systems

Heat removal is one of the biggest challenges in AI infrastructure. Modern approaches include:

  • Air Cooling: Traditional CRAC units, hot/cold aisle containment. Limited to ~20kW per rack.
  • Rear-Door Heat Exchangers: Water-cooled doors that capture heat at the rack. 30-40kW per rack.
  • Direct Liquid Cooling: Cold plates attached directly to CPUs/GPUs. 80+ kW per rack.
  • Immersion Cooling: Components submerged in dielectric fluid. 100+ kW per rack, near-silent operation.

Orchestration & Management

Running thousands of GPUs requires sophisticated software infrastructure:

  • Job scheduling: Slurm, Kubernetes (resource allocation)
  • Monitoring: Prometheus, Grafana (health and metrics)
  • ML platform: MLflow, Weights & Biases (experiment tracking)

Reliability & Fault Tolerance

At scale, hardware failures are constant. A 10,000-GPU cluster might see multiple GPU failures per day. Systems must handle several concerns (a minimal checkpointing sketch follows the list):

  • Checkpointing: Saving model state periodically to resume after failures
  • Automatic failover: Detecting and routing around failed components
  • Graceful degradation: Continuing training with reduced capacity
  • Health monitoring: Predictive maintenance using GPU telemetry
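
As a minimal sketch of the first item, periodic checkpointing and resume in PyTorch; the path, interval, and model are placeholders, and a real system would also save the data-loader position and RNG state, ideally to shared storage.

    import os
    import torch

    CKPT = "checkpoint.pt"                     # placeholder path

    model = torch.nn.Linear(512, 512)          # placeholder model
    opt = torch.optim.AdamW(model.parameters())

    # Resume if an earlier run left a checkpoint behind
    start_step = 0
    if os.path.exists(CKPT):
        state = torch.load(CKPT)
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["optimizer"])
        start_step = state["step"] + 1

    for step in range(start_step, 10_000):
        loss = model(torch.randn(32, 512)).square().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

        if step % 1000 == 0:                   # periodic checkpoint
            torch.save({"model": model.state_dict(),
                        "optimizer": opt.state_dict(),
                        "step": step}, CKPT)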

The True Cost of AI

Training a large language model isn't just about GPU hours. Consider: a 100MW data center costs $500M+ to build, consumes $50M/year in electricity, and requires constant maintenance. The infrastructure supporting AI is a massive, ongoing investment that shapes who can participate in frontier research.