Parallel filesystems for AI and DeepSeek 3FS¶
Scope: parallel/distributed filesystems for AI clusters (Lustre, IBM Storage Scale/GPFS, WekaFS) and DeepSeek's open-source Fire-Flyer File System (3FS): why training and checkpoint I/O needs aggregate bandwidth and parallelism, the disaggregated-NVMe + RDMA design of 3FS, and matching the filesystem to checkpoint and dataset access patterns.
flowchart LR
subgraph CLIENTS["Compute nodes (GPU)"]
C1["3FS client (FUSE hf3fs_fuse / USRBIO)"]
end
subgraph CTRL["Control plane"]
CM["Cluster manager"]
MD["Metadata service (stateless, FoundationDB KV)"]
end
subgraph DATA["Storage plane (disaggregated NVMe)"]
S1["Storage service + NVMe (CRAQ chain)"]
S2["Storage service + NVMe (CRAQ chain)"]
end
C1 -- "RDMA (IB / RoCE)" --> S1
C1 -- "RDMA (IB / RoCE)" --> S2
C1 -. "metadata ops" .-> MD
CM -. "membership / config" .-> MD
CM -. "membership / config" .-> S1
What it is¶
A parallel (distributed) filesystem spreads one logical namespace across many storage servers so that a single large file is striped into chunks living on different servers, and many clients read those chunks concurrently. This is the standard answer to feeding multi-node GPU training: NFS from a single server is convenient but becomes a throughput bottleneck once many nodes read at once, so production AI clusters move to parallel filesystems such as Lustre or GPFS (IBM Storage Scale), or cloud caches like Amazon FSx for Lustre.12 In Lustre, the striping targets are Object Storage Targets (OSTs); striping a file across N OSTs multiplies its read throughput: stripe across 4 OSTs at 500 MB/s each and you reach a theoretical peak of ~2 GB/s for that file.3
DeepSeek 3FS (Fire-Flyer File System) is an open-source, AI-codesigned parallel filesystem built from the observation that AI workloads issue massive numbers of random reads, which makes conventional read caching ineffective or counterproductive. 3FS eliminates that caching and uses direct file I/O so every request goes straight to the NVMe device, minimizing kernel page-cache involvement and host-memory copies.4 It has four components (cluster manager, metadata service, storage service, client) interconnected over an RDMA fabric (InfiniBand or RoCE) to minimize CPU involvement and host-side copies.5 In the released implementation the metadata services are stateless and backed by a transactional KV store (FoundationDB), the storage layer uses Chain Replication with Apportioned Queries (CRAQ) for strong consistency, and the client exposes a FUSE mount (hf3fs_fuse) plus an asynchronous zero-copy API (USRBIO).6
Why it matters¶
The economics are simple: an idle GPU waiting on storage is the most expensive idle in the building, and at scale the data-staging phase, not compute, is the thing that starves it.7 Aggregate demand grows with the cluster. The book's worked figure: if each GPU needs ~200 MB/s of training data, a 72-GPU GB200/GB300 NVL72 rack needs 14–20 GB/s of aggregate storage throughput just to keep the rack busy; multimodal samples push that far higher.8 One NFS server cannot serve that. Parallelism in the filesystem is not optional at ultrascale.2
3FS demonstrates how far rethinking the storage layer can go. The released 3FS repo reports a 180-node cluster reaching ~6.6 TiB/s aggregate read throughput under a stress test with concurrent training workloads, plus a GraySort run sorting 110.5 TiB in 30 min 14 s (3.66 TiB/min) and a KVCache lookup path at up to 40 GiB/s.9 (The book frames the headline differently, attributing ~6.6 TB/s to a 68-node AI-HPC cluster with 10×16 TB NVMe and dual-100 Gb/s NICs while concurrently serving 1.4 TB/s of background traffic, and contrasts it with Ceph's ~1.1 TB/s on similar hardware; the 68-node configuration matches DeepSeek's earlier Fire-Flyer AI-HPC paper rather than the open-source 3FS release.109) Treat any quoted number as a starting hypothesis and measure on your own fabric. Aggregate read throughput depends on node count, NVMe per node, NIC generation, and access pattern.
When it is needed (and when not)¶
Reach for a parallel filesystem (Lustre / Storage Scale / WekaFS / 3FS) when:
- You train across multiple nodes and want one shared namespace at high aggregate bandwidth. Beyond a few nodes, a single NFS server is the bottleneck.2
- The access pattern is random-read heavy (e.g., shuffled LLM samples). 3FS was built precisely for this, where page-cache caching backfires.4
- You need a GDS (kernel) path into GPU HBM: only GDS-enabled kernel filesystem clients or specifically integrated parallel filesystems (NVMe, NVMe-oF, BeeGFS, WekaFS, IBM Storage Scale, VAST) deliver direct transfers. See GPUDirect Storage (GDS).11
Prefer something simpler, or be aware of limits, when:
- Modest scale (a few nodes): NFS with tuned mount options is fine and far less operational burden.2
- You expect 3FS's FUSE client to give you a GDS path. It will not. A FUSE/user-space filesystem cannot deliver GDS because GDS requires kernel-level filesystem integration with
O_DIRECTsemantics. 3FS feeds GPUs via RDMA transfers, not a GDS kernel path; if you need a true GDS path, use a GDS-enabled kernel client (NVMe, NVMe-oF, BeeGFS, WekaFS, IBM Storage Scale, VAST).12 - You are tempted to build your own: 3FS shows the upside, but a custom filesystem is a large upfront and ongoing maintenance investment. Most teams should start with an existing distributed filesystem or object store and tune it.13
- Object stores (S3) are not filesystems: naive reads during training are slow. Stage onto local NVMe or front S3 with a cache (FSx for Lustre over S3), and use parallel multithreaded range-GET tools (
s5cmd, AWS S3 C++ SDK).14
How: implement, integrate, maintain¶
Match the layout to the access pattern¶
GPUs and storage both measure far higher throughput on large sequential reads than small random reads. Pack millions of small samples into a few large shards (Arrow, TFRecord, Parquet, WebDataset tar) so one read pulls many samples; tune read size toward ~1 MB rather than 4 KB to amortize per-read overhead.15 For multinode data-parallel training, shard the dataset per node so each rank reads its own slice (e.g., presplit 100 TB across 10 nodes as 10 TB each) to avoid redundant network reads. PyTorch's DistributedSampler coordinates per-epoch unique slices. See data-loading pipeline tuning.16
Stripe and tune Lustre¶
Stripe large dataset files across multiple OSTs to aggregate their bandwidth. -c sets the stripe count (-1 = all available OSTs), -S sets the stripe size (default 1 MB):173
# Stripe new files in this directory across 8 OSTs, 1 MiB stripe size
lfs setstripe -c 8 -S 1m /mnt/lustre/dataset
# Inspect the stripe layout actually applied to a file
lfs getstripe /mnt/lustre/dataset/shard-00001.tar
Monitor for hot storage nodes during training (Lustre Monitoring Tool lmt, or vendor dashboards). A node running hot is almost always a sharding imbalance (more reads landing on fewer OSTs), so rebalance the stripe layout.18
Tune NFS when you must use it¶
Use the largest request sizes and drop access-time updates on the client mount:19
# /etc/fstab — 1 MiB read/write blocks, no atime, attribute caching
nfs-server:/data /mnt/data nfs \
rsize=1048576,wsize=1048576,noatime,async,actimeo=60,lookupcache=pos 0 0
Ensure the NFS backend is NVMe (optionally RAID 0) and the server has multiple fast NICs; split the dataset across multiple NFS servers if one saturates.19
Local NVMe and the kernel I/O path¶
Whatever the parallel filesystem, keep data physically close: local NVMe in the same node, or at least the same rack over NVMe-oF. On local NVMe servers, XFS is common; mount with noatime and verify the block scheduler and read-ahead:2021
# none or mq-deadline is recommended for high-performance NVMe
cat /sys/block/nvme0n1/queue/scheduler
# Bump read-ahead for large sequential streaming (default ~128 KB)
sudo blockdev --setra 4096 /dev/nvme0n1 # 4096 * 512 B = 2 MiB
Stripe multiple SSDs with RAID 0 if a single device cannot saturate the GPUs, and ensure enough PCIe lanes so the drives are not link-bottlenecked.21
Run 3FS (open-source release)¶
3FS is Linux-based and POSIX-compatible via FUSE. The released components and a smallpond demo are in the repo; the high-throughput path uses the USRBIO asynchronous zero-copy API over RDMA rather than the FUSE mount.6 A FUSE mount looks conventional:
# Mount a configured 3FS cluster (FUSE client)
hf3fs_fuse --cfg /opt/3fs/etc/hf3fs_fuse_main.toml /3fs/mount
Confirm RDMA (InfiniBand or RoCE) is healthy on every client and storage node before trusting throughput; 3FS depends on it to bypass host-side copies. See RDMA / RoCE tuning and fabric bring-up & benchmarking.56
Checkpoint vs dataset access patterns¶
Dataset reads are random-read-heavy and benefit from the direct/striped paths above. Checkpoint writes are a different pattern: bursty, large, write-heavy. For RDMA-accelerated checkpoint writes the filesystem must explicitly support RDMA writes for GDS. WekaFS, for example, ships GDS-aware plugins for both read and write over RDMA.22 Note that the CUDA process-checkpoint path (cuda-checkpoint/CRIU) does not DMA directly from GPU memory to storage; the driver first copies device memory to host, then CRIU persists it, so it complements, not replaces, your framework's sharded checkpoint files. See runbook: checkpoint recovery.23
Reduce I/O by replicating or compressing¶
To eliminate network reads entirely, replicate the dataset onto each node's local storage (brute-force but effective, at the cost of disk).24 Alternatively store data compressed (JPEG, Arrow/Parquet) and decompress on the fly: this trades CPU/GPU cycles for I/O bandwidth, worthwhile only when I/O is the bottleneck and the decompression step does not itself become the bottleneck. Blackwell GPUs can decompress LZ4/Snappy/Deflate batches in-pipeline using the on-die Decompression Engine. See GPU decompression engine.25
References¶
Book — Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 5, "GPU-Based Storage I/O Optimizations", sections "Fast Storage and Data Locality", "DeepSeek's Fire-Flyer File System", and "Distributed, Parallel Filesystems and Object Stores".
Official documentation:
- DeepSeek 3FS (Fire-Flyer File System), source and design notes: https://github.com/deepseek-ai/3FS
- DeepSeek Fire-Flyer AI-HPC paper (arXiv 2408.14158): https://arxiv.org/abs/2408.14158
- Lustre
lfs setstripe(1)/ file striping: https://wiki.lustre.org/Configuring_Lustre_File_Striping - Lustre project: https://www.lustre.org/
- IBM Storage Scale (GPFS): https://www.ibm.com/products/storage-scale
- WEKA (WekaFS): https://www.weka.io/
- NVIDIA GPUDirect Storage: https://docs.nvidia.com/gpudirect-storage/index.html
- Amazon FSx for Lustre performance: https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html
Related: GPUDirect Storage (GDS) · storage & data platform · data-loading pipeline tuning · NVIDIA DALI pipeline · GPU decompression engine · RDMA / RoCE tuning · fabric bring-up & benchmarking · runbook: checkpoint recovery · Glossary
-
Fregly, Ch. 5: "a common setup is to use a shared filesystem like an NFS server, or a parallel filesystem like Lustre, GPFS, Ceph, etc... these filesystems can become a bottleneck if not configured properly." ↩
-
Fregly, Ch. 5: "a single NFS server can easily become a throughput bottleneck if many nodes are reading at once... For larger training clusters, a single NFS server—even a high-end implementation—is likely to become a bottleneck. This is why parallel filesystems and cloud storage caches like Amazon FSx for Lustre are preferred." ↩↩↩↩
-
Fregly, Ch. 5: "A Lustre setup... has multiple Object Storage Targets (OSTs) that serve data. By striping files across OSTs, you can multiply your throughput... you might stripe your Arrow, TFRecord, or Parquet files across 4 OSTs. If each OST gives 500 MB/s, you can achieve a theoretical peak read throughput of 2 GB/s." ↩↩
-
Fregly, Ch. 5: "DeepSeek created a custom, open source filesystem called Fire-Flyer File System (3FS)... AI workloads perform massive numbers of random reads. These random reads make conventional read data caching ineffective—and even counterproductive... By eliminating caching and employing direct file I/O, 3FS ensures that every request goes straight to the NVMe SSD device... 3FS minimizes kernel page-cache involvement and host memory copies during reads." ↩↩
-
Fregly, Ch. 5: "3FS consists of four key components: cluster manager, metadata service, storage service, and client. These are interconnected over an RDMA-capable fabric like InfiniBand or RoCE to minimize CPU involvement and host-side copies... Metadata is sharded and replicated across multiple nodes for scale-out performance. Data paths bypass the OS page cache entirely." ↩↩
-
DeepSeek 3FS repository (https://github.com/deepseek-ai/3FS): metadata services are "stateless metadata services backed by a transactional key-value store (e.g., FoundationDB)"; storage uses "Chain Replication with Apportioned Queries (CRAQ) for strong consistency"; FUSE client
hf3fs_fuseplus the USRBIO asynchronous zero-copy API; test cluster used 2x200 Gbps InfiniBand NICs per node. ↩↩↩ -
Fregly, Ch. 5: "If the storage pipeline is slow, the GPUs will starve and sit idle." 3FS "helps prevent the data-staging phase from becoming the bottleneck—and helps keep GPU utilization high across both training and inference workloads." ↩
-
Fregly, Ch. 5: "An NVIDIA Grace Blackwell GB200/GB300 NVL72 rack with 72 Blackwell GPUs... If each GPU needs 200 MB/s of training data to stay busy, this can require 14–20 GB/s of aggregate storage throughput to keep all 72 GPUs busy." ↩
-
DeepSeek 3FS repository (https://github.com/deepseek-ai/3FS): a 180-node cluster reached ~6.6 TiB/s aggregate read throughput in a stress test with concurrent training workloads; GraySort sorted 110.5 TiB in 30 min 14 s (3.66 TiB/min); KVCache read throughput up to 40 GiB/s. ↩↩
-
Fregly, Ch. 5: "results up to 7.3 TB/s in their environment. In another benchmark, a large 3FS cluster achieved aggregated read throughput on the order of 6.6 TB/s using a 68-node AI-HPC cluster with 10 x 16 TB NVMe SSDs and dual 100 Gb/s... while concurrently serving background workloads at an additional 1.4 TB/s. This reported 3FS throughput, 6.6 TB/s, far exceeds Ceph's ~1.1 TB/s on similar hardware." The 68-node figure corresponds to DeepSeek's Fire-Flyer AI-HPC paper (arXiv 2408.14158); the open-source 3FS release reports the 180-node ~6.6 TiB/s result. ↩
-
Fregly, Ch. 5: "GDS support depends on the filesystem and RDMA-capable stack... select parallel filesystems such as BeeGFS, WekaFS, VAST, IBM Storage Scale, and others that integrate with nvidia-fs." ↩
-
Fregly, Ch. 5: "If a file system is implemented using FUSE in user space, it will not be able to deliver a GDS path because GDS requires kernel-level filesystem integration with O_DIRECT semantics... To feed data directly into GPU pipelines, DeepSeek integrates RDMA-based transfers in 3FS. If you require a true GDS path, use a GDS-enabled kernel filesystem client such as NVMe, NVMe-oF, BeeGFS, WekaFS, IBM Storage Scale, or VAST." ↩
-
Fregly, Ch. 5: "Building your own filesystem is an advanced technique that requires a lot of upfront investment and ongoing maintenance. Instead, it's more likely that you will start with an existing distributed filesystem or object store." ↩
-
Fregly, Ch. 5: "Object storage like Amazon S3 is not a typical filesystem... The solution often involves staging data on local NVMe SSD storage—or using a caching layer on top of object storage (e.g., Amazon FSx for Lustre on top of S3). Tools like s5cmd and aws s3 cp let you download data... use highly parallel, optimized data-transfer tools such as the AWS S3 C++ SDK and multithreaded utilities like s5cmd." ↩
-
Fregly, Ch. 5: "storage measures much higher throughput for large, sequential reads than for small, random reads... store them in a few large binary (e.g., Arrow, TFRecord, or Parquet) files... WebDataset tar files... reading in 1 MB chunks will yield better throughput than 4 KB chunks." ↩
-
Fregly, Ch. 5: "shard the dataset across nodes so that each node primarily reads a subset of data from its local disk... if you have 100 TB of data and 10 nodes, you might presplit 10 TB to each node's local storage... PyTorch's DistributedSampler will coordinate workers such that each process gets a unique slice of data per epoch." ↩
-
Lustre Wiki, Configuring Lustre File Striping /
lfs-setstripe(1): "-c (Stripe Count): ...-1 means stripe over all available OSTs"; "-S (Stripe Size): ...0 means use the system default (usually 1 MB) otherwise use k, m or g"; examplelfs setstripe -S 128k -c 2 /mnt/lustre/file1. https://wiki.lustre.org/Configuring_Lustre_File_Striping ↩ -
Fregly, Ch. 5: "lfs setstripe is used on Lustre to set striping for a large dataset across 4 or 8 OSTs... Monitor the filesystem's I/O during training, using tools like lmt for Lustre... You'll be looking to see if individual nodes in the storage cluster are hot... most likely a sharding issue." ↩
-
Fregly, Ch. 5: "NFS also has tuning parameters like rsize/wsize... use the max value (e.g., 1 MB)... mount your NFS client with rsize=1048576,wsize=1048576,noatime,async... add actimeo=60,lookupcache=pos... ensure the server has multiple fast NICs... consider using multiple NFS servers to split up the dataset." ↩↩
-
Fregly, Ch. 5: "XFS is common on Linux NVMe servers. You should mount it with noatime to eliminate costly access-time updates on each read." ↩
-
Fregly, Ch. 5: "Modern kernels use the none or mq-deadline multiqueue scheduler by default for NVMe... verify and set the scheduler using /sys/block/nvme*/queue/scheduler... read ahead setting in /sys/block/
/queue/read_ahead_kb... done using blockdev --setra... multiple SSDs can be striped using RAID 0... make sure you have enough lanes (e.g., PCIe)." ↩↩ -
Fregly, Ch. 5: "For RDMA-accelerated writes, the filesystem must support RDMA writes for GDS. WekaFS... offer a parallel filesystem that ships with GDS-aware plugins for both read and write workloads over RDMA." ↩
-
Fregly, Ch. 5, "Checkpointing GPU State with cuda-checkpoint": "Unlike data ingestion with GDS, the checkpoint path does not DMA directly from GPU memory to storage. Instead, the device memory image is first brought into host memory by the driver during suspend. CRIU then persists that process memory... Use this to complement, not replace, your framework's state-dict or sharded checkpoint files." ↩
-
Fregly, Ch. 5: "To eliminate network reads entirely, you can, in some cases, choose to replicate the dataset onto each node in the compute cluster... an admittedly brute-force but relatively common and very effective solution." ↩
-
Fregly, Ch. 5: "store data compressed on the filesystem or object store—and decompress them on the fly... Modern GPUs add an on-die Decompression Engine supporting formats such as LZ4, Snappy, and Deflate... Blackwell GPUs can decompress them in-pipeline... The key is still to make sure that the decompression time does not replace I/O as the bottleneck." ↩