Machine learning isn't just about algorithms and data—it's fundamentally dependent on the physical hardware that powers computation. Understanding the sequential flow of how data moves through ML infrastructure reveals why certain hardware choices matter and how each component contributes to the final result.
Every AI model you interact with—from ChatGPT to image generators—relies on a carefully orchestrated symphony of hardware components working in perfect sequence.
That sequence breaks down into five stages:

- Data pipeline: storage, ingestion, and preprocessing
- CPUs: general-purpose orchestration and control
- Accelerators: GPUs, TPUs, NPUs for parallel computation
- Interconnects: high-speed data transfer between components
- Infrastructure: power, cooling, and orchestration systems
Storage systems, data pipelines, and preprocessing infrastructure
Data is the lifeblood of machine learning. Before any computation occurs, massive datasets must be stored, organized, and efficiently delivered to compute resources. Modern ML training often requires petabytes of data, demanding specialized storage infrastructure.
ML workloads use a hierarchy of storage systems, each optimized for different access patterns:
- NVMe SSDs: local high-speed storage for the active working set
- Object storage: S3, GCS, Azure Blob for durable, petabyte-scale datasets
- Parallel file systems: Lustre, GPFS, WekaFS for high-throughput shared access
- Memory tier: prefetching & caching of hot data close to the accelerators (sketched below)
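The caching tier is simple in concept. Here is a minimal sketch of the idea in Python; `fetch_fn` is a hypothetical helper standing in for whatever client actually pulls a shard from S3, GCS, or Azure Blob.

```python
import pathlib

CACHE_DIR = pathlib.Path("/tmp/ml-dataset-cache")  # stand-in for a local NVMe drive

def cached_path(name: str, fetch_fn) -> pathlib.Path:
    """Return a local copy of a dataset shard, fetching it only on a cache miss.

    `fetch_fn(name, destination)` is a hypothetical helper that pulls the shard
    from object storage (S3, GCS, Azure Blob) down to `destination`.
    """
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local = CACHE_DIR / name
    if not local.exists():                 # cache miss: go to the slower tier
        tmp = local.with_suffix(".part")
        fetch_fn(name, tmp)
        tmp.rename(local)                  # publish into the cache once complete
    return local                           # cache hit: read at NVMe speed
```

Training jobs call `cached_path` for each shard: the first epoch pays object-storage latency, later epochs read from local NVMe.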
Raw data must be transformed before training. This includes:

- Decoding and parsing raw formats (images, text, audio, logs)
- Cleaning, filtering, and deduplication
- Normalization and feature scaling
- Tokenization for text, or augmentation for images
- Shuffling and batching into the shapes the model expects
A common bottleneck in ML training is data starvation—when GPUs sit idle waiting for data. Modern systems use techniques like prefetching (loading next batches while current ones train), caching hot data in memory, and parallel data loading across multiple CPU cores.
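As an illustration of those techniques, here is a minimal PyTorch-style sketch; the dataset class and its sizes are placeholders, but the `DataLoader` arguments (`num_workers`, `prefetch_factor`, `pin_memory`) are the knobs that keep accelerators fed.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ExampleDataset(Dataset):
    """Hypothetical dataset; stands in for decoding records from disk or object storage."""
    def __init__(self, num_samples=2_048):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Simulate CPU-side preprocessing (decode, normalize, augment).
        x = torch.randn(3, 224, 224)
        y = idx % 10
        return x, y

loader = DataLoader(
    ExampleDataset(),
    batch_size=256,
    shuffle=True,
    num_workers=8,          # parallel data loading across CPU cores
    prefetch_factor=4,      # each worker keeps batches ready ahead of time
    pin_memory=True,        # enables faster, asynchronous host-to-GPU copies
    persistent_workers=True,
)

for x, y in loader:
    # The training step would run here; with enough workers the GPU never starves.
    pass
```

If GPU utilization drops while the loader is active, increasing `num_workers` or moving heavy preprocessing offline are common first steps.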
General-purpose processors that coordinate the entire ML pipeline
While GPUs handle the heavy parallel computation, CPUs (Central Processing Units) remain essential for orchestrating ML workflows. They handle data preprocessing, model deployment, system management, and all tasks requiring complex branching logic.
Modern server CPUs are optimized for throughput and memory bandwidth:
- Server platforms: AMD EPYC, Intel Xeon, with dozens of cores per socket
- Memory capacity: up to 6 TB of RAM support per server
- PCIe lanes: connecting GPUs & NVMe storage to the host
- Cache: large L3 caches per socket to keep hot data close to the cores
Modern CPUs also include vector processing units, such as AVX-512 and Intel AMX, that can accelerate certain ML operations.
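As a rough illustration (not tied to any particular instruction set), a vectorized NumPy call dispatches to a BLAS kernel that uses these vector units, while an equivalent pure-Python loop processes one element at a time:

```python
import time
import numpy as np

a = np.random.rand(2048, 2048).astype(np.float32)
b = np.random.rand(2048, 2048).astype(np.float32)

# Vectorized path: NumPy hands the work to a BLAS library that uses the
# CPU's vector units where available.
t0 = time.perf_counter()
c = a @ b
print(f"vectorized matmul: {time.perf_counter() - t0:.3f}s")

# Naive scalar path for a single output row, to show the gap in per-element work.
t0 = time.perf_counter()
row = [sum(a[0, k] * b[k, j] for k in range(2048)) for j in range(2048)]
print(f"pure-Python single row: {time.perf_counter() - t0:.3f}s")
```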
While GPUs dominate training, CPUs handle ~30% of total ML infrastructure compute, especially for data preprocessing, feature engineering, and serving lightweight models.
GPUs, TPUs, and NPUs that perform the heavy lifting of ML computation
Accelerators are specialized processors designed for the massively parallel computations that dominate machine learning. While CPUs excel at sequential tasks, accelerators can perform thousands of operations simultaneously.
Originally designed for rendering graphics, GPUs have become the workhorse of ML training due to their thousands of parallel cores. Key characteristics of a modern data-center GPU such as the NVIDIA H100:

- FP8 Tensor Core performance for low-precision training and inference
- HBM3 memory with 3.35 TB/s of bandwidth
- Thousands of CUDA cores: general-purpose parallel processing units
- Tensor Cores: dedicated matrix multiplication units
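Here is a minimal PyTorch sketch of the kind of work a GPU spends most of its time on, assuming a CUDA-capable device is available; mixed precision lets the framework route the matrix multiplication through the Tensor Cores.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Two large matrices; matmuls like this dominate transformer training.
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

# Mixed precision lets the hardware use its low-precision matrix units
# (BF16/FP16/FP8 paths, depending on the GPU generation and framework support).
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    c = a @ b

print(c.shape, c.dtype)
```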
Google's custom ASIC designed specifically for neural network workloads. TPUs optimize for the specific operations in ML rather than general graphics.
| Feature | TPU v4 | TPU v5e |
|---|---|---|
| Peak TFLOPS (BF16) | 275 | 197 |
| HBM Memory | 32 GB | 16 GB |
| Memory Bandwidth | 1.2 TB/s | 819 GB/s |
| Interconnect | ICI (Inter-Chip Interconnect) | ICI |
| Best For | Large models | Cost-efficient training |
NPUs are specialized accelerators optimized for inference (running trained models) rather than training. They prioritize energy efficiency and low latency.
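NPUs typically execute low-precision (INT8) versions of trained models. As a rough stand-in, this sketch applies post-training dynamic quantization in PyTorch, which converts Linear layers to INT8 weights; it runs on the CPU here, since deploying to a real NPU goes through vendor-specific toolchains.

```python
import torch
import torch.nn as nn

# Small stand-in model; any trained model with Linear layers works similarly.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
).eval()

# Post-training dynamic quantization: weights are stored as INT8 and
# activations are quantized on the fly, mirroring the low-precision
# arithmetic that inference accelerators favor.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)
```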
Training GPT-4 required an estimated 25,000 NVIDIA A100 GPUs running for months. Without accelerators, this would take decades on CPUs. The key advantage is parallelism—while a CPU might have 64 cores, a GPU has thousands of cores optimized for the matrix multiplications that dominate neural networks.
High-speed connections that enable distributed training at scale
Modern ML models are too large for a single GPU. Training requires distributing work across hundreds or thousands of accelerators. The interconnect fabric determines how fast these devices can share gradients and synchronize—often becoming the bottleneck in large-scale training.
Data flows through multiple layers of connectivity, each with different bandwidth and latency characteristics:
| Technology | Bandwidth | Latency | Use Case |
|---|---|---|---|
| NVLink 4.0 | 900 GB/s | ~1 μs | GPU-to-GPU within server |
| InfiniBand NDR | 400 Gb/s | ~1.5 μs | Server-to-server |
| PCIe 5.0 | 128 GB/s | ~2 μs | CPU-to-GPU, NVMe |
| 400G Ethernet | 400 Gb/s | ~5 μs | Data center networking |
Different parallelism strategies have different communication requirements:

- Data parallelism: every GPU holds a full model copy, and gradients are all-reduced after each step
- Tensor (model) parallelism: individual layers are split across GPUs, so activations are exchanged inside every layer and demand the fastest links
- Pipeline parallelism: groups of layers live on different GPUs, and activations flow between stages at micro-batch boundaries
In distributed training, GPUs must average their gradients after each batch. With 1000 GPUs and a 175B parameter model like GPT-3, this means transferring 700GB of gradient data every few seconds. Efficient all-reduce algorithms (ring, tree, hierarchical) and fast interconnects are critical.
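Here is a minimal sketch of that gradient averaging with `torch.distributed`, assuming the script is launched with `torchrun` so the rank and world-size environment variables are already set; the `gloo` backend is used so it runs without GPUs, whereas real clusters would use `nccl`.

```python
import torch
import torch.distributed as dist

# Assumes a launch like `torchrun --nproc_per_node=4 allreduce_demo.py`,
# which sets RANK / WORLD_SIZE / MASTER_ADDR for us.
dist.init_process_group(backend="gloo")   # use "nccl" on GPU clusters
rank = dist.get_rank()
world_size = dist.get_world_size()

# Stand-in for the flattened local gradients each worker holds after backward().
local_grads = torch.full((1_000_000,), float(rank))

# All-reduce sums the tensors across workers in place; dividing by the
# world size yields the averaged gradient every worker then applies.
dist.all_reduce(local_grads, op=dist.ReduceOp.SUM)
local_grads /= world_size

if rank == 0:
    print(f"averaged gradient value: {local_grads[0].item():.2f}")

dist.destroy_process_group()
```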
Power delivery, cooling, and orchestration that keeps everything running
Behind every AI cluster is a massive infrastructure investment. A single NVIDIA H100 GPU can draw 700W, and an 8-GPU server needs 10+ kW just for compute, before cooling and support systems. At scale, these requirements become a defining constraint.
Typical power figures at each level of the stack:

- NVIDIA H100 SXM: 700 W per GPU
- DGX H100 system: roughly 10 kW for a single 8-GPU server
- Large training clusters: tens of megawatts of sustained draw
- PUE (Power Usage Effectiveness): the ratio of total facility power to IT power, which captures cooling and distribution overhead
Heat removal is one of the biggest challenges in AI infrastructure. Modern approaches include:

- Air cooling with hot-aisle/cold-aisle containment for lower-density racks
- Direct-to-chip liquid cooling, with cold plates mounted on GPUs and CPUs
- Rear-door heat exchangers that capture heat at the rack
- Full immersion cooling for the highest-density deployments
Running thousands of GPUs requires sophisticated software infrastructure:

- Cluster schedulers such as Slurm or Kubernetes to allocate jobs across nodes
- Distributed training frameworks and collective communication libraries such as NCCL
- Telemetry, health monitoring, and alerting for every GPU, NIC, and link
- Automated checkpointing and job restart
At scale, hardware failures are constant. A 10,000 GPU cluster might see multiple GPU failures per day. Systems must handle:

- Detecting and isolating failed GPUs, NICs, and links
- Checkpointing training state so jobs resume instead of restarting from scratch (see the sketch below)
- Rescheduling work onto healthy nodes with minimal lost progress
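A minimal sketch of the checkpoint-and-resume pattern in PyTorch; the path, model, and interval are all illustrative, and production systems checkpoint to shared or object storage instead of a local file.

```python
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"  # illustrative path; real clusters write to shared storage

model = nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
start_step = 0

# Resume if a previous run was interrupted by a hardware failure.
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_step = ckpt["step"] + 1

for step in range(start_step, 10_000):
    loss = model(torch.randn(32, 1024)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodically persist enough state to restart without losing progress.
    if step % 1000 == 0:
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            CKPT_PATH,
        )
```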
Training a large language model isn't just about GPU hours. Consider: a 100MW data center costs $500M+ to build, consumes $50M/year in electricity, and requires constant maintenance. The infrastructure supporting AI is a massive, ongoing investment that shapes who can participate in frontier research.
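As a sanity check on that electricity figure, assuming an illustrative industrial rate of about $0.06 per kWh (the rate is an assumption; real contracts vary):

```python
# Rough sanity check of the annual electricity figure for a 100 MW facility.
facility_mw = 100                      # 100 MW data center (from the text above)
hours_per_year = 24 * 365              # 8,760 hours
price_per_mwh = 60                     # assumed ~$0.06/kWh industrial rate

annual_mwh = facility_mw * hours_per_year    # 876,000 MWh
annual_cost = annual_mwh * price_per_mwh     # about $52.6M
print(f"{annual_cost / 1e6:.1f} million USD per year")
```

That lands in the same ballpark as the $50M/year figure above.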