AI Infrastructure Knowledge Base

A comprehensive, citable knowledge base for deploying, operating, and optimising GPU clusters: NVIDIA Blackwell (B300 / GB300 NVL72), InfiniBand and RoCE fabrics, Kubernetes, k3s, Ray and Slurm, distributed training (FSDP, DDP, ZeRO, tensor and pipeline parallelism, DiLoCo), RL post-training (GRPO, DPO, SFT/LoRA) with verl, slime and SkyRL, LLM inference serving and disaggregation, observability, SRE and MLOps.
https://ai-infrastructure.net/
setloop.io
Thu, 02 Jul 2026 22:42:12 -0000

KV Cache Token Eviction
Why most KV-cache eviction methods fail in production: FlashAttention never exposes attention scores, and paged allocators free only empty blocks…
https://ai-infrastructure.net/kv-cache-token-eviction/
Thu, 02 Jul 2026 19:51:48 +0000

KV Compression No Savings
Diagnose why an enabled KV-cache eviction or compression method shows no memory savings in vLLM: wrong gauge, eager-attention fallback, or block-granular…
https://ai-infrastructure.net/runbook-kv-compression-no-savings/
Thu, 02 Jul 2026 19:51:48 +0000

cuTile Rust: Safe Tile Kernels
cuTile Rust extends Rust's ownership discipline to GPU kernels: mutable outputs are partitioned into disjoint sub-tensors before launch, kernel launches…
https://ai-infrastructure.net/cutile-rust/
Thu, 02 Jul 2026 19:50:08 +0000

Disaggregation Rate Matching (When It Pays)
When prefill/decode disaggregation actually pays, from NVIDIA's systematic study of hundreds of thousands of simulated design points: prefill-heavy…
https://ai-infrastructure.net/disaggregation-rate-matching/
Thu, 02 Jul 2026 19:50:08
+0000

Legate Sparse: Distributed scipy.sparse
Legate Sparse distributes and accelerates unmodified scipy.sparse programs across CPU and GPU clusters on the Legion runtime, composing with cuPyNumeric…
https://ai-infrastructure.net/legate-sparse/
Thu, 02 Jul 2026 19:50:08 +0000

Evaluating Speculative Decoding (SPEED-Bench)
Speculative decoding speedups are data-dependent: acceptance rates vary by domain, batch size shifts the optimal draft length, and random-token…
https://ai-infrastructure.net/speculative-decoding-evaluation/
Thu, 02 Jul 2026 19:50:08 +0000

Automated Harness Optimization
A worked case study of the Meta-Harness loop on Harvey's Legal Agent Benchmark: an LLM proposer rewrites the harness around a frozen open model, a…
https://ai-infrastructure.net/automated-harness-optimization/
Thu, 02 Jul 2026 19:20:29 +0000

Kubernetes Network Drivers (DRA Networking)
The Kubernetes Network Driver model replaces the CNI + device-plugin composition with DRA ResourceClaims and NRI runtime hooks: declarative…
https://ai-infrastructure.net/kubernetes-network-drivers/
Thu, 02 Jul 2026 19:20:29 +0000

LLM Inference Efficiency (Convergence Map)
How models, hardware, and serving algorithms converge on LLM inference efficiency: prefill vs decode roofline placement, the bandwidth ladder, FP8/FP4…
https://ai-infrastructure.net/llm-inference-efficiency/
Thu, 02 Jul 2026 19:20:29 +0000

Loop Engineering
Loop engineering is the layer above the
harness: scheduled, self-feeding loops that discover work, hand it to agents, verify it with an independent…
https://ai-infrastructure.net/loop-engineering/
Thu, 02 Jul 2026 19:20:29 +0000

Multi-Agent Collaboration (TradingAgents)
How role-specialized LLM agent teams collaborate through structured reports and bounded natural-language debate, using TradingAgents (arXiv 2412.20138)…
https://ai-infrastructure.net/multi-agent-collaboration/
Thu, 02 Jul 2026 19:20:29 +0000

NeMo AutoModel (MoE Fine-Tuning)
NVIDIA NeMo AutoModel subclasses Transformers v5 AutoModelForCausalLM and adds Expert Parallelism on a dedicated moe_mesh, DeepEP fused all-to-all…
https://ai-infrastructure.net/nemo-automodel/
Thu, 02 Jul 2026 19:20:29 +0000

OpenHands Agent Platform
OpenHands (f.k.a.
OpenDevin) as an agent platform: the event-stream architecture, the step-function agent abstraction, the Docker-sandboxed runtime with…
https://ai-infrastructure.net/openhands-platform/
Thu, 02 Jul 2026 19:20:29 +0000

Skill Optimization (SkillOpt)
SkillOpt treats the agent skill document as the trainable external state of a frozen model: an optimizer model turns scored rollouts into bounded…
https://ai-infrastructure.net/skill-optimization/
Thu, 02 Jul 2026 19:20:29 +0000

Time-Series Foundation Models (TimesFM)
TimesFM is a 200M-parameter decoder-only foundation model that forecasts unseen time-series zero-shot: input patching, longer output patches for fewer…
https://ai-infrastructure.net/time-series-foundation-models/
Thu, 02 Jul 2026 19:20:29 +0000

Chat Rendering & Loss Masking
The renderer layer that sits between structured chat messages and token sequences in every post-training stack: how conversations become supervised…
https://ai-infrastructure.net/chat-rendering-loss-masking/
Thu, 02 Jul 2026 17:01:12 +0000

Cybersecurity Agent Evaluation
How to measure what AI agents can do in security, offensively and defensively: real-world-grounded, sandboxed benchmarks scored on outcomes and…
https://ai-infrastructure.net/cyber-agent-evaluation/
Thu, 02 Jul 2026 17:01:12 +0000

LoRA Hyperparameter Scaling
Empirically calibrated rules for setting LoRA post-training hyperparameters: the 10x learning-rate multiplier over full fine-tuning, hidden-size LR…
https://ai-infrastructure.net/lora-hyperparameter-scaling/
Thu, 02 Jul 2026 17:01:12
+0000

Tinker (Training-as-a-Service)
Thinking Machines' managed fine-tuning API (tinker) and its open-source recipe library (tinker-cookbook): the four training primitives, the multi-tenant…
https://ai-infrastructure.net/rllib-tinker/
Thu, 02 Jul 2026 17:01:12 +0000

Model Weight Loading (Inference Engines)
How an inference engine turns safetensors on disk into a running model: the config architectures field and model registry, PyTorch dotted parameter…
https://ai-infrastructure.net/engine-weight-loading/
Thu, 02 Jul 2026 14:55:24 +0000

FMware Performance Engineering (SPE)
Making foundation-model-powered software (FMware) meet throughput and latency SLOs instead of treating performance as a post-deployment afterthought: the…
https://ai-infrastructure.net/fmware-performance-engineering/
Thu, 02 Jul 2026 14:55:24 +0000

Local coding agents
Run a coding agent fully locally on open-weight models: serve the model with Ollama or vLLM behind an OpenAI-compatible endpoint, point a coding harness…
https://ai-infrastructure.net/local-coding-agents/
Thu, 02 Jul 2026 14:55:24 +0000

Looped & Recurrent-Depth Transformers
Weight-tied transformer blocks applied iteratively to refine a latent state: iterative depth as a scaling axis orthogonal to model size and data…
https://ai-infrastructure.net/looped-transformers/
Thu, 02 Jul 2026 14:55:24 +0000

Muon Optimizer & Distributed Muon (DMuon)
Muon is a matrix-aware optimizer that orthogonalizes each weight-matrix gradient with a
Newton-Schulz iteration; it is ~2x more compute-efficient than…
https://ai-infrastructure.net/muon-optimizer/
Thu, 02 Jul 2026 14:55:24 +0000

Delta Weight Sync (Sparse Weight Transfer)
Only ~1-3% of weights change per RL step, so shipping just the delta cuts trainer-to-rollout weight-sync traffic ~100x, losslessly and bit-identically…
https://ai-infrastructure.net/delta-weight-sync/
Thu, 02 Jul 2026 08:45:53 +0000

GRPO Variants & Training Tricks
The fixes that make GRPO work at scale: DAPO's clip-higher, dynamic sampling, token-level loss and overlong shaping; Dr. GRPO's bias fixes; GSPO/GMPO…
https://ai-infrastructure.net/grpo-variants/
Thu, 02 Jul 2026 06:45:15 +0000

LLM Benchmarks (Anatomy & Metrics)
How LLM capability benchmarks are built and read: task formats, the metrics (accuracy, a validated pass@k estimator, calibration, discrimination), the…
https://ai-infrastructure.net/llm-benchmarks/
Thu, 02 Jul 2026 06:45:15 +0000

RL Scaling Laws
How RL post-training compute scales: sigmoidal reward-vs-compute curves (ScaleRL), power-law fits across model sizes, what sets the asymptote versus the…
https://ai-infrastructure.net/rl-scaling-laws/
Thu, 02 Jul 2026 06:45:15 +0000

Rollout Redundancy (Prompt Dedup & Cascade Attention)
Group-sampling RL (GRPO/PPO) makes the prompt massively shared across rollouts; exploit it twice: prompt deduplication in the training forward/backward…
https://ai-infrastructure.net/rl-rollout-redundancy/
Thu, 02 Jul 2026 06:11:01 +0000

RLSD (RL + Self-Distillation)
RLSD fuses RLVR and privileged-context self-distillation: the verifiable reward sets each update's direction while a token-level self-distillation signal…
https://ai-infrastructure.net/rlsd/
Thu, 02 Jul 2026 06:11:01 +0000

vLLM: GLM-5.2-FP8
A vLLM reference template for serving zai-org/GLM-5.2-FP8, Z.ai's current flagship long-horizon agentic coding and reasoning model: what GLM-5.2 is (a…
https://ai-infrastructure.net/cookbook-vllm-glm-5-2/
Thu, 02 Jul 2026 05:54:44 +0000

Experiment Tracking & Model Registry
The MLOps backbone for finetuning/post-training: track every run's params/metrics/artifacts, version models in a registry with promotion stages, and…
https://ai-infrastructure.net/experiment-tracking-model-registry/
Thu, 02 Jul 2026 05:54:44 +0000

LLM Evaluation Harness & Eval Gate
Measure a post-trained model reproducibly: benchmark suites, the harnesses that run them (lm-evaluation-harness, lighteval), decontamination for honest…
https://ai-infrastructure.net/llm-evaluation-harness/
Thu, 02 Jul 2026 05:54:44 +0000

Model Merging (SLERP/TIES/DARE)
Combine multiple fine-tuned checkpoints into one model with no training: task vectors, interference resolution (TIES, DARE), interpolation (SLERP, model…
https://ai-infrastructure.net/model-merging/
Thu, 02 Jul 2026 05:54:44 +0000

Multi-LoRA / Adapter Serving
Serve hundreds of LoRA adapters over one shared base model: heterogeneous batching across adapters (S-LoRA, Punica), adapter paging, and the vLLM…
https://ai-infrastructure.net/multi-lora-serving/
Thu, 02 Jul 2026 05:54:44 +0000

Synthetic Data Generation
Generate finetuning data with LLMs: teacher/distillation data, instruction synthesis (Self-Instruct, Evol-Instruct, Magpie), and AI feedback, plus the…
https://ai-infrastructure.net/synthetic-data-generation/
Thu, 02 Jul 2026 05:54:44 +0000

Training-Data Curation & Decontamination
Turn raw or synthetic data into a training set that helps: exact/fuzzy/semantic deduplication, quality filtering, benchmark decontamination, and dataset…
https://ai-infrastructure.net/training-data-curation/
Thu, 02 Jul 2026 05:54:44 +0000

Changelog
What is new in the AI Infrastructure Knowledge Base: new pages and notable updates, newest first, so additions are easy to find.
https://ai-infrastructure.net/changelog/
Wed, 01 Jul 2026 20:24:21 +0000

Autonomous Experimentation Loops
Closed-loop autonomous ML experimentation: an LLM proposes hyperparameters or code changes, a bounded trial runs, an evaluator scores it, the loop keeps…
https://ai-infrastructure.net/autonomous-experimentation-loops/
Wed, 01 Jul 2026 19:56:21 +0000

Evaluation Integrity & Anti-Gaming
Protect the evaluator from the optimizer: the frozen-vs-mutable boundary, reward hacking and Goodhart's law, sandbox enforcement, and held-out integrity…
https://ai-infrastructure.net/evaluation-integrity-anti-gaming/
Wed, 01 Jul 2026 19:56:21 +0000

Learning-Curve Extrapolation & Early Stopping
Kill doomed training trials early: forecast the final metric from a partial learning curve, and multi-fidelity
bandits (Successive Halving, Hyperband…
https://ai-infrastructure.net/learning-curve-extrapolation-early-stopping/
Wed, 01 Jul 2026 19:56:21 +0000

LLM Request Routing (MoM)
Route each LLM request to the right model in a heterogeneous pool: predictive vs cascading routing, decision signals (semantic, preference-learned…
https://ai-infrastructure.net/llm-request-routing/
Wed, 01 Jul 2026 19:56:21 +0000

On-Policy Distillation
On-policy distillation post-training: the student trains on its own sampled rollouts, graded per-token by a teacher (reverse KL). GKD, the dense-reward…
https://ai-infrastructure.net/on-policy-distillation/
Wed, 01 Jul 2026 19:56:21 +0000

RLVR (Verifiable Rewards)
RLVR post-training: reward an LLM from a deterministic verifier (answer match, unit tests, format, proof) instead of a learned reward model.
Verifier…
https://ai-infrastructure.net/rlvr/
Wed, 01 Jul 2026 19:56:21 +0000

vLLM Semantic Router
The vLLM Semantic Router: an Envoy External Processor that classifies each request with Rust/Candle BERT models and routes it across a Mixture-of-Models…
https://ai-infrastructure.net/vllm-semantic-router/
Wed, 01 Jul 2026 19:56:21 +0000

vLLM: DeepSeek-V3.2-Exp
A vLLM reference template for serving deepseek-ai/DeepSeek-V3.2-Exp, DeepSeek's sparse-attention model: what DeepSeek Sparse Attention (DSA) changes, why…
https://ai-infrastructure.net/cookbook-vllm-deepseek-v3-2/
Wed, 01 Jul 2026 19:13:32 +0000

vLLM: MiniMax-M2
A vLLM reference template for serving MiniMaxAI/MiniMax-M2, MiniMax's efficient MoE agent and reasoning model: what it is (230B total / 10B active…
https://ai-infrastructure.net/cookbook-vllm-minimax-m2/
Wed, 01 Jul 2026 19:13:32 +0000

vLLM: small models on consumer GPUs
Running the current small open-weight models (roughly 1B to 32B) on a single consumer or workstation GPU with vLLM: which model fits which card, the VRAM…
https://ai-infrastructure.net/cookbook-vllm-consumer-gpu/
Wed, 01 Jul 2026 17:06:53 +0000

Dynamic & Fractional GPU Sharing
Sharing a GPU by real, changing demand instead of a fixed partition.
Covers fractional allocation (a memory ceiling plus a compute share) with schedulers…
https://ai-infrastructure.net/dynamic-fractional-gpu-sharing/
Wed, 01 Jul 2026 16:37:12 +0000

Use as an agent skill
Install ai-infrastructure.net as a reusable agent skill so any AI agent can use this GPU and AI-infrastructure knowledge base as a cited source of truth.
https://ai-infrastructure.net/agent-skill/
Mon, 29 Jun 2026 21:26:12 +0000

Governing Self-Modifying Agents
When an agent edits its own prompt, tools, or middleware at runtime, govern the optimizer: change contracts, two-track promotion, shadow evaluation, and…
https://ai-infrastructure.net/agent-governance-self-modifying/
Mon, 29 Jun 2026 21:11:53 +0000

Identity & Access
Under whose authority an agent acts: a credential broker that restores the user's identity to the backend, plus zero-trust, ABAC, and SPIFFE workload…
https://ai-infrastructure.net/agent-identity-access/
Mon, 29 Jun 2026 21:11:53 +0000

Intent Verification
Confirm an agent action matches what the user actually asked using an out-of-band signed-intent attestation, because an in-chat confirmation can be…
https://ai-infrastructure.net/agent-intent-verification/
Mon, 29 Jun 2026 21:11:53 +0000

Agentic Loop Economics
Why prefix caching dominates agentic loop cost and latency: prefill O(N^2), byte-identical KV-cache reuse, and the harness moves that keep or break the…
https://ai-infrastructure.net/agent-loop-economics/
Mon, 29 Jun 2026 21:11:53 +0000

Policy Engine
Decide whether an agent action may run: a deny-by-default policy engine in the in-process tool hook, named Cedar rules, one policy across runtimes, and…
https://ai-infrastructure.net/agent-policy-engine/
Mon, 29 Jun 2026 21:11:53 +0000

Self-Improving Harnesses
The agent harness as an optimization target: searched, ablated, and self-edited, and lifting control flow from the prompt into an explicit program graph.
https://ai-infrastructure.net/agent-self-improving-harness/
Mon, 29 Jun 2026 21:11:53 +0000

Offensive AI & Arms Race
How AI shifts the offense-defense balance in software security: the discovery-versus-construction split, capability that scales with inference budget…
https://ai-infrastructure.net/offensive-ai-security/
Mon, 29 Jun 2026 21:11:53 +0000

Container Image Supply-Chain Provenance
Ensuring the container images that run on your GPU nodes are exactly what you built and authorized.
The chain from build to registry to node: digest…
https://ai-infrastructure.net/container-image-provenance/
Mon, 29 Jun 2026 18:39:32 +0000

Context & Memory
Manage an agent's working set: context engineering, the storage-versus-presentation split, hierarchical reduction (compaction then summarization)…
https://ai-infrastructure.net/agent-context-memory/
Mon, 29 Jun 2026 18:22:44 +0000

Evaluating Agents
Evaluate agents on output, components, and trajectory: build datasets, do error analysis into a failure taxonomy, write PASS/FAIL rubrics, use…
https://ai-infrastructure.net/agent-evaluation/
Mon, 29 Jun 2026 18:22:44 +0000

Harness Architecture
The harness around an LLM (context management, tool dispatch, error recovery, state, and memory) and why it drives agent reliability more than the model…
https://ai-infrastructure.net/agent-harness-architecture/
Mon, 29 Jun 2026 18:22:44 +0000

The Agent Loop
The think-act-observe cycle at the core of every agent: ReAct, the run/step/think/act control flow, termination and loop guards, and when a loop beats a…
https://ai-infrastructure.net/agent-loop/
Mon, 29 Jun 2026 18:22:44 +0000

Agent Observability
Tracing the agent inference path: capture the full prompt and response on every model call, span the trajectory with OpenTelemetry GenAI conventions, and…
https://ai-infrastructure.net/agent-observability/
Mon, 29 Jun 2026 18:22:44 +0000

Orchestration & Control Plane
Treat agent orchestration as a control plane, not a message bus: a
decide() gate chain for authorization, mutation, budget, retries, and identity that…
https://ai-infrastructure.net/agent-orchestration-control-plane/
Mon, 29 Jun 2026 18:22:44 +0000

Planning & Reasoning
Give agents time to think: ReAct's limits on complex tasks, explicit planning and task decomposition, reflection and failure recovery, and the…
https://ai-infrastructure.net/agent-planning-reasoning/
Mon, 29 Jun 2026 18:22:44 +0000

Sandboxing & Isolation
Run agent-generated code and tool calls safely: the isolation spectrum from containers to gVisor and Firecracker microVMs, real container-escape CVEs…
https://ai-infrastructure.net/agent-sandboxing-isolation/
Mon, 29 Jun 2026 18:22:44 +0000

Threat Model
Why agency turns model flaws into system compromise: the lethal trifecta, lost identity and intent, the OWASP LLM Top 10 and MITRE ATLAS, and the shift…
https://ai-infrastructure.net/agent-security-threat-model/
Mon, 29 Jun 2026 18:22:44 +0000

Tools & Function Calling
How agents act: the five-step tool-calling mechanism, JSON-Schema function definitions, tool-design rules, the Model Context Protocol, and code execution…
https://ai-infrastructure.net/agent-tools-function-calling/
Mon, 29 Jun 2026 18:22:44 +0000

Start here
Build and secure LLM agents: the agent loop, tools and function calling, context and memory, harness architecture, orchestration, observability…
https://ai-infrastructure.net/agentic-systems-index/
Mon, 29 Jun 2026 18:22:44 +0000

Agentic AIOps & Autonomous Operations
Using software (increasingly LLM agents) to run the incident lifecycle on a GPU cluster: detect an anomaly, localize the faulty component, perform…
https://ai-infrastructure.net/aiops-agentic-operations/
Mon, 29 Jun 2026 18:22:44 +0000

Distributed Training as a Platform Service
Running elastic, multi-worker distributed training as a managed service on a Kubernetes GPU cluster, the platform layer that schedules workers, brings up…
https://ai-infrastructure.net/distributed-training-platform-service/
Mon, 29 Jun 2026 18:22:44 +0000

GPU Confidential Computing & Attestation
Running a workload on an NVIDIA GPU so that its code, weights, and data are protected in use from the host, the hypervisor, and the cloud operator.
This…
https://ai-infrastructure.net/gpu-confidential-computing/
Mon, 29 Jun 2026 18:22:44 +0000

GPU Platform Split-Plane Architecture
The reference architecture for a GPU-on-demand platform (a neocloud, an internal GPU cloud, or a multi-provider GPU service) that splits into a…
https://ai-infrastructure.net/gpu-platform-split-plane-architecture/
Mon, 29 Jun 2026 18:22:44 +0000

K8s Pod Networking over WireGuard
Making a real Kubernetes pod network work when the nodes are not on a shared L2 fabric but stitched together by a WireGuard overlay, the specific…
https://ai-infrastructure.net/kubernetes-networking-wireguard-hybrid/
Mon, 29 Jun 2026 18:22:44 +0000

Operator for GPU Allocation
The custom Kubernetes operator pattern that turns declarative GPU-workload intents (“lease 8×H100 for this tenant”, “run this served deployment”…
https://ai-infrastructure.net/kubernetes-operator-gpu-orchestration/
Mon, 29 Jun 2026 18:22:44 +0000

Continuous NCCL Fabric Benchmarking
Treating inter-node collective bandwidth as a standing, monitored signal rather than a one-time bring-up check.
A long-lived service periodically…
https://ai-infrastructure.net/nccl-fabric-benchmarking-service/
Mon, 29 Jun 2026 18:22:44 +0000

Overlay & Mesh Networking (Geo-Distributed)
The encrypted overlay network that stitches GPU nodes living behind different providers' NATs and firewalls (across regions, clouds, or a decentralized…
https://ai-infrastructure.net/overlay-mesh-networking-distributed-gpu/
Mon, 29 Jun 2026 18:22:44 +0000

Prompt-Injection Defense
Defend agents against direct, indirect, and streaming prompt injection: detector ensembles with honest accuracy limits, two-checkpoint streaming…
https://ai-infrastructure.net/prompt-injection-defense/
Mon, 29 Jun 2026 18:22:44 +0000

Remote GPU Verification (Rented Hardware)
How to verify that a GPU you do not own (a rented neocloud instance, a node in a decentralized marketplace, a provider you cannot physically inspect)…
https://ai-infrastructure.net/remote-gpu-verification/
Mon, 29 Jun 2026 18:22:44 +0000

Agentic & Tool-Use RL
RL post-training for agentic, tool-using LLMs: multi-turn ReAct rollouts, tool-output loss masking, trajectory rewards, and the systems cost of tool…
https://ai-infrastructure.net/agentic-rl/
Mon, 29 Jun 2026 15:32:55 +0000

Recipe: Memory-Efficient GRPO Post-Training
Run GRPO RL post-training on one node: a 4-bit QLoRA base, FP8 vLLM rollouts, a verifiable reward, and group-relative advantages, with apply and verify.
https://ai-infrastructure.net/recipe-grpo-posttraining/
Mon, 29 Jun 2026 08:50:46 +0000

PyTorch Custom CUDA Extensions
Ship a hand-written CUDA kernel into PyTorch via torch.utils.cpp_extension or pybind11, wrapped as a torch.autograd.Function with forward and backward.
https://ai-infrastructure.net/pytorch-cuda-extensions/
Mon, 29 Jun 2026 08:12:35 +0000

Quantization for Inference
Quantize LLM weights and activations for GPU serving with INT8, INT4, FP8 and NVFP4, plus GPTQ, AWQ, SmoothQuant and NF4 QLoRA for cheaper inference.
https://ai-infrastructure.net/quantization-inference/
Mon, 29 Jun 2026 08:12:35 +0000

Tensor Core Programming
Program NVIDIA Tensor Cores from CUDA with the WMMA fragment API, Hopper WGMMA warp-group MMA in inline PTX, and Blackwell TCGen05 at a high level.
https://ai-infrastructure.net/tensor-core-programming/
Mon, 29 Jun 2026 08:12:35 +0000

Async & Disaggregated RL Systems
Scale LLM RL post-training with asynchronous, disaggregated rollout and training: off-policy staleness, truncated importance sampling, and weight sync.
https://ai-infrastructure.net/async-rl-systems/
Mon, 29 Jun 2026 07:43:49 +0000

Rejection Sampling & Best-of-N
Rejection sampling fine-tuning and Best-of-N for LLM post-training, the simple bridge from SFT to RL. Generate N completions, score, SFT on the best.
https://ai-infrastructure.net/rejection-sampling-bon/
Mon, 29 Jun 2026 07:43:49 +0000

Reward Model Training
Train a reward model for RLHF with the Bradley-Terry log-sigmoid loss and a scalar head, plus ORM, PRM, and generative LLM-as-a-judge variants in TRL.
https://ai-infrastructure.net/reward-model-training/
Mon, 29 Jun 2026 07:43:49 +0000

PPO
Actor-critic RL for LLM RLHF, a trainable policy updated on a clipped surrogate objective, with per-token advantages from a separately-trained value…
https://ai-infrastructure.net/rl-ppo/
Mon, 29 Jun 2026 07:43:49 +0000

Reward Design for RL
Design rewards for LLM RL post-training: verifiable rewards (RLVR), reward models, shaping and normalization, and how to detect and prevent reward hacking.
https://ai-infrastructure.net/reward-design-rl-posttraining/
Mon, 29 Jun 2026 07:09:11 +0000

Burn-Rate Alerting Rules
A reusable multi-window, multi-burn-rate SLO alerting pattern: how to derive burn rate from an error budget, the fast+slow window pairing, the Prometheus…
https://ai-infrastructure.net/alerting-burn-rate-rules/
Sun, 28 Jun 2026 15:17:25 +0000

Build-vs-Rent Cost Model
A concrete TCO model to decide owning vs renting GPUs: capex (GPUs, network, facility) plus opex (power, cooling, staff) against a cloud GPU-hour rate…
https://ai-infrastructure.net/build-vs-rent-gpu-cost-model/
Sun, 28 Jun 2026 15:17:25 +0000

GPU Capacity Planning
Sizing a GPU fleet by building a demand model from workloads, setting target
utilization and headroom, fitting the power/cooling envelope, and folding in… https://ai-infrastructure.net/gpu-capacity-planning/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-capacity-planning/ GPU Consumption Models On-demand vs reserved/committed vs spot/preemptible vs colocation/owned for GPUs. The price/risk tradeoffs, which workload shape each fits (bursty vs… https://ai-infrastructure.net/gpu-consumption-models/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-consumption-models/ GPU Provider Landscape the four categories you can rent GPUs from (hyperscalers, GPU neoclouds, decentralized/marketplace, and second-hand/distressed capacity), what each is… https://ai-infrastructure.net/gpu-provider-landscape/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-provider-landscape/ KubeRay (Ray on Kubernetes) running Ray on Kubernetes with the KubeRay operator, covering install, the RayCluster / RayJob / RayService CRDs, exposing GPUs and RDMA to Ray pods… https://ai-infrastructure.net/kuberay-integration/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/kuberay-integration/ Orchestration Decision Guide choosing the orchestrator for GPU work, Slurm vs Kubernetes vs Ray (and hybrids), by workload shape (batch HPC vs services vs Python-native), team, and… https://ai-infrastructure.net/orchestration-decision-guide/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/orchestration-decision-guide/ Playbook: End-to-End Bring-Up the ordered, step-by-step playbook that takes a freshly built cluster to its first running workload (facility - fabric proof - nodes - K8s GPU stack… https://ai-infrastructure.net/playbook-end-to-end-workload-bringup/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base 
https://ai-infrastructure.net/playbook-end-to-end-workload-bringup/ Ray on Slurm Composing Ray with a Slurm allocation: starting a Ray head and workers inside one sbatch job, address/port wiring, GPU binding, clean teardown, and when… https://ai-infrastructure.net/ray-on-slurm/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/ray-on-slurm/ Recipe: DiLoCo (geo-distributed) a standalone recipe to run low-communication DiLoCo training across datacenters or poorly-connected workers, covering inner/outer optimizer config, sync… https://ai-infrastructure.net/recipe-diloco-geo-distributed/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/recipe-diloco-geo-distributed/ Recipe: Fabric Validation (nccl-tests) a standalone, executable recipe to validate the cluster fabric with nccl-tests as a Kubeflow MPIJob, covering the manifest, how to apply it, the… https://ai-infrastructure.net/recipe-fabric-validation-nccl-tests/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/recipe-fabric-validation-nccl-tests/ Recipe: FSDP (single datacenter) a standalone recipe to run FSDP2 (fully_shard) training inside one NVLink/InfiniBand datacenter: the launcher and config (sharding granularity, mixed… https://ai-infrastructure.net/recipe-fsdp-single-dc/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/recipe-fsdp-single-dc/ Recipe: Gang-Scheduled Training a standalone recipe to launch a gang-scheduled distributed-training smoke job (Volcano Job + torchrun): the manifest, apply/verify, an MFU sanity check… https://ai-infrastructure.net/recipe-gang-scheduled-training/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/recipe-gang-scheduled-training/ Recipe: vLLM Inference Deployment A model-agnostic recipe to deploy a vLLM OpenAI-compatible server on Kubernetes: 
Deployment, Service, HPA, model and token config, and health checks. https://ai-infrastructure.net/recipe-vllm-inference-deployment/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/recipe-vllm-inference-deployment/ Driver / Module Load Failure recover a node where the NVIDIA kernel modules fail to load or refuse to bind. Either modprobe nvidia errors, or nvidia-smi returns Failed to initialize… https://ai-infrastructure.net/runbook-driver-module-load-failure/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-driver-module-load-failure/ Inference KV-Cache OOM Stabilize an LLM server thrashing on KV-cache pressure: size KV memory, cap concurrency, tune block allocation, and stop preemption and recompute loops. https://ai-infrastructure.net/runbook-inference-kv-cache-oom/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-inference-kv-cache-oom/ NVLink Visibility / P2P Failure diagnose GPUs that cannot see each other over NVLink: nvidia-smi nvlink --status shows links inactive, CUDA P2P access is disabled, and collectives… https://ai-infrastructure.net/runbook-nvlink-visibility-failure/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-nvlink-visibility-failure/ PCIe / P2P Bandwidth Regression investigate a PCIe link trained down (lower gen/width) or P2P blocked by ACS (H2D/D2H/P2P bandwidth far below expected) and restore full bandwidth. 
https://ai-infrastructure.net/runbook-pcie-p2p-bandwidth-regression/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-pcie-p2p-bandwidth-regression/ Scheduler: GPU Job Pending diagnose a Kubernetes (Pending) or Slurm (PD) GPU job that never starts (insufficient allocatable GPUs, taints/affinity, MIG/profile mismatch, quota, or… https://ai-infrastructure.net/runbook-scheduler-pending-gpu-job/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-scheduler-pending-gpu-job/ Training OOM Triage CUDA out-of-memory in distributed training: tell a true OOM from fragmentation, then shrink the working set or fix the allocator and resume. https://ai-infrastructure.net/runbook-training-oom/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-training-oom/ SLOs: Cluster & Fabric operational SLOs for the cluster substrate (allocatable-GPU health, fabric link health and bandwidth, node readiness, and thermal headroom) expressed as… https://ai-infrastructure.net/slo-cluster-fabric/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/slo-cluster-fabric/ SLOs: Inference Serving Define and measure inference serving SLOs: TTFT, TPOT/ITL, throughput, and error rate, with concrete PromQL SLIs and target-setting guidance. 
https://ai-infrastructure.net/slo-inference-serving/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/slo-inference-serving/ SLOs: Training Platform training-platform SLOs distinct from inference (scheduler queue wait, job success rate, goodput/MFU, checkpoint success, and infra-failure rate), with… https://ai-infrastructure.net/slo-training-platform/ Sun, 28 Jun 2026 15:17:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/slo-training-platform/ AI-Assisted Optimization using AI to optimize AI systems: LLM/agent-driven kernel generation and autotuning, AI-discovered algorithms (AlphaTensor-style faster GEMM), automated… https://ai-infrastructure.net/ai-driven-performance-optimization/ Sun, 28 Jun 2026 10:12:50 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/ai-driven-performance-optimization/ Grace CPU the NVIDIA Grace CPU (a 72-core Arm Neoverse V2 server processor with high-bandwidth on-package LPDDR5X) and its role next to the GPU inside a Grace… https://ai-infrastructure.net/grace-cpu/ Sun, 28 Jun 2026 10:12:50 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/grace-cpu/ Mechanical Sympathy & Codesign the book's organizing principle: write software with the grain of the hardware (mechanical sympathy), and exploit the virtuous cycle where hardware… https://ai-infrastructure.net/mechanical-sympathy-codesign/ Sun, 28 Jun 2026 10:12:50 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/mechanical-sympathy-codesign/ GPU Roadmap (Rubin, Feynman) The datacenter GPU/platform cadence, Hopper to Blackwell to Blackwell Ultra (B300/GB300) to Vera Rubin to Feynman, covering what each generation changes… https://ai-infrastructure.net/nvidia-gpu-roadmap/ Sun, 28 Jun 2026 10:12:50 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/nvidia-gpu-roadmap/ Scaling to 100T Parameters The systems thesis for the next order of magnitude, 
covering the memory wall, sparse MoE holding per-token FLOPs flat while capacity grows, rack-scale… https://ai-infrastructure.net/scaling-100t-parameters/ Sun, 28 Jun 2026 10:12:50 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/scaling-100t-parameters/ Constrained Decoding Force model output to a grammar, JSON schema, or regex by masking next-token logits to only the tokens a compiled FSM / pushdown automaton allows at each… https://ai-infrastructure.net/constrained-decoding/ Sun, 28 Jun 2026 10:03:32 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/constrained-decoding/ Continuous Batching Internals how a modern serving scheduler (vLLM, SGLang) iterates via token-level continuous (in-flight) batching that admits and retires requests every step, the… https://ai-infrastructure.net/continuous-batching-internals/ Sun, 28 Jun 2026 10:03:32 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/continuous-batching-internals/ Expert Parallelism (Inference) sharding MoE experts across GPUs (expert parallelism, EP), the all-to-all dispatch/combine it forces at every MoE layer, keeping that all-to-all on… https://ai-infrastructure.net/expert-parallelism-inference/ Sun, 28 Jun 2026 10:03:32 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/expert-parallelism-inference/ Inference Parallelism Strategies choosing a parallelism layout for serving (distinct from training), tensor parallelism inside the NVLink domain for latency, pipeline parallelism across… https://ai-infrastructure.net/inference-parallelism-strategies/ Sun, 28 Jun 2026 10:03:32 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/inference-parallelism-strategies/ QoS & Admission Control protecting inference SLOs under load. 
Priority classes and per-request SLO targets, admission control / load shedding when the queue grows, latency-aware… https://ai-infrastructure.net/inference-qos-admission-control/ Sun, 28 Jun 2026 10:03:32 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/inference-qos-admission-control/ KV Cache Management managing the KV cache that dominates decode memory, covering PagedAttention block tables to kill fragmentation, prefix caching to reuse shared prefixes… https://ai-infrastructure.net/kv-cache-management/ Sun, 28 Jun 2026 10:03:32 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/kv-cache-management/ KV Cache Transfer (NIXL) moving the KV cache from prefill workers to decode workers (and across memory/storage tiers) in disaggregated serving, using NVIDIA's Inference Xfer… https://ai-infrastructure.net/kv-cache-transfer-nixl/ Sun, 28 Jun 2026 10:03:32 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/kv-cache-transfer-nixl/ MoE Routing & Load Balancing The gating/router (top-k selection) in a Mixture-of-Experts layer, why uneven expert load wastes GPUs under expert parallelism (the straggler effect)… https://ai-infrastructure.net/moe-routing-load-balancing/ Sun, 28 Jun 2026 10:03:32 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/moe-routing-load-balancing/ MoE Sparse Scaling why sparse MoE decouples total parameters from per-token FLOPs: the routing math, shared vs routed experts, and the memory-to-fit vs compute-to-run… https://ai-infrastructure.net/moe-sparsity-scaling/ Sun, 28 Jun 2026 10:03:32 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/moe-sparsity-scaling/ Speculative Decoding accelerating the decode stage by proposing several tokens cheaply (small draft model, n-gram/suffix lookup, or EAGLE/Medusa/MTP heads) and verifying them… https://ai-infrastructure.net/speculative-decoding/ Sun, 28 Jun 2026 10:03:32 +0000 AI Infrastructure Knowledge Base 
https://ai-infrastructure.net/speculative-decoding/ Data Loading Pipeline Tuning Keeping GPUs fed from the host input pipeline: sizing num_workers and prefetch_factor, pin_memory plus non_blocking H2D copies, persistent_workers… https://ai-infrastructure.net/data-loading-pipeline-tuning/ Sun, 28 Jun 2026 09:32:58 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/data-loading-pipeline-tuning/ Parallel FS & DeepSeek 3FS parallel/distributed filesystems for AI clusters (Lustre, IBM Storage Scale/GPFS, WekaFS) and DeepSeek's open-source Fire-Flyer File System (3FS): why… https://ai-infrastructure.net/deepseek-3fs-filesystem/ Sun, 28 Jun 2026 09:32:58 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/deepseek-3fs-filesystem/ GPU Decompression (nvCOMP) trading cheap GPU cycles for scarce storage/PCIe bandwidth by storing data compressed on disk and decompressing it on the GPU. Covers the Blackwell… https://ai-infrastructure.net/gpu-decompression-engine/ Sun, 28 Jun 2026 09:32:58 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-decompression-engine/ GPUDirect Storage (GDS) the direct DMA path from NVMe/NVMe-oF/RDMA-NAS into GPU HBM that bypasses the CPU bounce buffer, via the cuFile API and the nvidia-fs kernel module… https://ai-infrastructure.net/gpudirect-storage-gds/ Sun, 28 Jun 2026 09:32:58 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpudirect-storage-gds/ NVIDIA DALI Offloading decode and augmentation (image/video/audio) onto the GPU with an NVIDIA DALI pipeline to remove the CPU preprocessing bottleneck: the… https://ai-infrastructure.net/nvidia-dali-pipeline/ Sun, 28 Jun 2026 09:32:58 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/nvidia-dali-pipeline/ BlueField DPUs NVIDIA BlueField-3 offloading networking, storage, and security off the host CPU: line-rate RDMA/RoCE, NVMe over Fabrics, data-path isolation for secure… 
https://ai-infrastructure.net/bluefield-dpu-networking/ Sun, 28 Jun 2026 09:15:31 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/bluefield-dpu-networking/ Comms-Compute Overlap Hiding collective latency behind compute so the GPU stops waiting on the network: DDP gradient bucketing (wait-free backprop), FSDP all-gather prefetch… https://ai-infrastructure.net/comms-compute-overlap/ Sun, 28 Jun 2026 09:15:31 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/comms-compute-overlap/ NCCL Collectives & Algorithms how NCCL implements all-reduce, all-gather, reduce-scatter, and broadcast, how it selects an algorithm (Ring/Tree/CollNet/NVLS) and protocol… https://ai-infrastructure.net/nccl-collectives-algorithms/ Sun, 28 Jun 2026 09:15:31 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/nccl-collectives-algorithms/ NVSHMEM GPU-Initiated Comms NVSHMEM's PGAS one-sided model (GPU threads issuing put/get directly from kernel code with the CPU off the critical path) for fine-grained compute/comm… https://ai-infrastructure.net/nvshmem-gpu-communication/ Sun, 28 Jun 2026 09:15:31 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/nvshmem-gpu-communication/ RDMA & RoCE Tuning tuning RDMA over Converged Ethernet (RoCEv2) for GPU clusters, covering GPUDirect RDMA (NIC-to-HBM direct DMA), the lossless fabric (PFC + ECN/DCQCN)… https://ai-infrastructure.net/rdma-roce-tuning/ Sun, 28 Jun 2026 09:15:31 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/rdma-roce-tuning/ SHARP In-Network Reduction NVIDIA SHARP offloading all-reduce / reduce-scatter / all-gather into the InfiniBand switch ASIC (and NVLink SHARP / NVLS in the NVSwitch) to halve… https://ai-infrastructure.net/sharp-in-network-reduction/ Sun, 28 Jun 2026 09:15:31 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/sharp-in-network-reduction/ GPU Containerization Performance Reaching near-bare-metal GPU throughput 
inside containers via host-driver/container-CUDA splitting, OverlayFS I/O avoidance, slim images, and rootless… https://ai-infrastructure.net/gpu-containerization-performance/ Sun, 28 Jun 2026 09:04:49 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-containerization-performance/ Power, Clocks & Thermal Tuning controlling GPU clocks, power draw, and heat for deterministic benchmarks and better performance-per-watt: locking core/memory clocks (nvidia-smi -lgc /… https://ai-infrastructure.net/gpu-power-thermal-tuning/ Sun, 28 Jun 2026 09:04:49 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-power-thermal-tuning/ Topology-Aware K8s Scheduling making Kubernetes co-locate the GPUs of a multi-GPU pod on the same NVLink/NUMA domain and align the pod's CPUs and memory to that domain, via the… https://ai-infrastructure.net/kubernetes-topology-aware-scheduling/ Sun, 28 Jun 2026 09:04:49 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/kubernetes-topology-aware-scheduling/ Linux OS & Kernel Tuning host kernel/OS knobs that keep GPUs fed, namely vm.swappiness/swapoff, transparent hugepages (training vs inference), the performance CPU governor and… https://ai-infrastructure.net/linux-os-tuning-gpu-nodes/ Sun, 28 Jun 2026 09:04:49 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/linux-os-tuning-gpu-nodes/ Activation Checkpointing & Offloading trade compute and bandwidth for HBM capacity. 
Recompute activations in the backward pass (activation/gradient checkpointing), and offload activations… https://ai-infrastructure.net/activation-checkpointing-offloading/ Sun, 28 Jun 2026 08:59:06 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/activation-checkpointing-offloading/ Attention APIs (SDPA, FlexAttention) scaled_dot_product_attention backend selection and forcing, FlexAttention for custom masks/biases compiled to fused kernels, and how both map onto the… https://ai-infrastructure.net/pytorch-attention-apis/ Sun, 28 Jun 2026 08:59:06 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/pytorch-attention-apis/ Caching Allocator Tuning PyTorch's native CUDA caching allocator and how to tune it with PYTORCH_ALLOC_CONF (expandable_segments, max_split_size_mb, garbage_collection_threshold… https://ai-infrastructure.net/pytorch-cuda-memory-allocator/ Sun, 28 Jun 2026 08:59:06 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/pytorch-cuda-memory-allocator/ Performance Regression CI standing up automated performance regression tests for PyTorch training/inference: capturing step-time, throughput, MFU, and peak-memory baselines… https://ai-infrastructure.net/pytorch-perf-regression-ci/ Sun, 28 Jun 2026 08:59:06 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/pytorch-perf-regression-ci/ PyTorch/XLA Backend lazy-tensor tracing into an XLA/HLO graph, the XLA fusion/layout compiler, mark_step/torch_xla.sync() graph boundaries, per-shape-signature compilation… https://ai-infrastructure.net/pytorch-xla-backend/ Sun, 28 Jun 2026 08:59:06 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/pytorch-xla-backend/ torch.compile (Capture & Backends) how torch.compile turns eager PyTorch into fused kernels, covering TorchDynamo bytecode capture, AOTAutograd, the TorchInductor (Triton) backend, the… https://ai-infrastructure.net/torch-compile/ Sun, 28 Jun 2026 08:59:06 +0000 AI 
Infrastructure Knowledge Base https://ai-infrastructure.net/torch-compile/ Compute Sanitizer compute-sanitizer and its four tools (memcheck, racecheck, initcheck, synccheck) for catching out-of-bounds accesses, shared-memory data races… https://ai-infrastructure.net/cuda-compute-sanitizer/ Sun, 28 Jun 2026 08:48:41 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cuda-compute-sanitizer/ Stream-Ordered Allocator cudaMallocAsync/cudaFreeAsync and the per-device memory pool behind them: stream-ordered allocation that reuses freed blocks without a device-wide sync… https://ai-infrastructure.net/cuda-stream-ordered-allocator/ Sun, 28 Jun 2026 08:48:41 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cuda-stream-ordered-allocator/ Unified Memory & NVLink-C2C a single pointer addressable from CPU and GPU (cudaMallocManaged), on-demand page migration and the page-fault stalls it causes, defusing those stalls… https://ai-infrastructure.net/cuda-unified-memory/ Sun, 28 Jun 2026 08:48:41 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cuda-unified-memory/ CUTLASS: Templated GEMM CUTLASS as the open, templated C++ library for high-performance GEMM/conv on Tensor Cores, its tiling/pipelining abstractions (CuTe, CollectiveBuilder)… https://ai-infrastructure.net/cutlass-gemm/ Sun, 28 Jun 2026 08:48:41 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cutlass-gemm/ Dynamic Parallelism & Device Launch launching work from the device, via CUDA Dynamic Parallelism (a kernel launches kernels) and device graph launch (a kernel launches a preinstantiated… https://ai-infrastructure.net/dynamic-parallelism-device-launch/ Sun, 28 Jun 2026 08:48:41 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/dynamic-parallelism-device-launch/ Inline PTX & SASS Tuning dropping to inline PTX (asm volatile) for instructions the compiler will not emit, reading the real SASS with cuobjdump / nvdisasm to verify 
what… https://ai-infrastructure.net/inline-ptx-sass/ Sun, 28 Jun 2026 08:48:41 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/inline-ptx-sass/ Instruction-Level Parallelism instruction-level parallelism (ILP) on GPUs, using independent instructions within a thread to hide latency at low occupancy. Covers loop unrolling and… https://ai-infrastructure.net/instruction-level-parallelism-gpu/ Sun, 28 Jun 2026 08:48:41 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/instruction-level-parallelism-gpu/ Triton: Python GPU Kernels Triton's block-based programming model for writing fused GPU kernels in Python (the language torch.compile emits), its autotuner, and when a Triton… https://ai-infrastructure.net/openai-triton/ Sun, 28 Jun 2026 08:48:41 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/openai-triton/ Persistent & Megakernels launch one grid sized to the GPU and loop over a work queue to amortize per-launch overhead and keep SMs resident (persistent kernels); fuse a whole… https://ai-infrastructure.net/persistent-kernels-megakernels/ Sun, 28 Jun 2026 08:48:41 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/persistent-kernels-megakernels/ Thread Block Clusters & DSMEM the Hopper/Blackwell thread block cluster (an optional grouping level above the block that co-schedules a set of blocks on SMs within one GPC) and… https://ai-infrastructure.net/thread-block-clusters-dsmem/ Sun, 28 Jun 2026 08:48:41 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/thread-block-clusters-dsmem/ Warp Specialization & Pipelining splitting a thread block into producer (load) and consumer (compute) warps and software-pipelining the stages with the CUDA Pipeline API and async copies… https://ai-infrastructure.net/warp-specialization-pipelining/ Sun, 28 Jun 2026 08:48:41 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/warp-specialization-pipelining/ CUDA Graphs amortizing 
per-kernel CPU launch overhead by capturing a fixed pipeline of kernels, copies, and events once and replaying it as a single submission… https://ai-infrastructure.net/cuda-graphs/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cuda-graphs/ Occupancy Tuning occupancy as active warps versus the SM hardware maximum, the three resource limiters (registers/thread, shared memory/block, block size), theoretical… https://ai-infrastructure.net/cuda-occupancy-tuning/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cuda-occupancy-tuning/ CUDA Streams & Concurrency CUDA streams as the unit of inter-kernel concurrency: the legacy default stream (stream 0) versus the per-thread default stream (PTDS) versus explicit… https://ai-infrastructure.net/cuda-streams-concurrency/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cuda-streams-concurrency/ FlashAttention & MLA IO-aware exact attention (FlashAttention tiling, online softmax, versions 2 and 3) and DeepSeek's Multi-Head Latent Attention (MLA), which compresses the… https://ai-infrastructure.net/flashattention-mla/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/flashattention-mla/ Goodput: Useful Throughput goodput as the north-star efficiency metric for GPU clusters, the useful work per unit time normalized to peak, how to compute it, why it beats raw FLOPS… https://ai-infrastructure.net/goodput-ai-systems/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/goodput-ai-systems/ Execution Model (SM, Warp, SIMT) How an NVIDIA GPU actually executes a kernel: streaming multiprocessors, the 32-thread warp, SIMT lockstep, the thread/block/grid hierarchy, warp… https://ai-infrastructure.net/gpu-execution-model-simt/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base 
https://ai-infrastructure.net/gpu-execution-model-simt/ GPU Memory Hierarchy the on-chip-to-off-chip memory tiers of an NVIDIA GPU SM -- registers, shared memory/L1, the read-only/constant caches, L2, and HBM global memory… https://ai-infrastructure.net/gpu-memory-hierarchy/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-memory-hierarchy/ Kernel Fusion fusing a chain of operations into one CUDA kernel to raise arithmetic intensity, eliminate HBM round-trips for intermediate tensors, and remove… https://ai-infrastructure.net/kernel-fusion/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/kernel-fusion/ Memory Coalescing how a warp's 32 lanes should address global (HBM) memory -- contiguous and aligned so the hardware coalesces lane requests into the fewest cache-line… https://ai-infrastructure.net/memory-coalescing/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/memory-coalescing/ Nsight Profiling Workflow the profile-driven workflow that finds the bottleneck with Nsight Systems (timeline / system view), drills into the offending kernel with Nsight Compute… https://ai-infrastructure.net/nsight-profiling-workflow/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/nsight-profiling-workflow/ NUMA & CPU Pinning binding CPU threads, host memory, and PyTorch DataLoader workers to the GPU's local NUMA node so input pipelines never pay a cross-node hop. 
Covers… https://ai-infrastructure.net/numa-cpu-pinning/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/numa-cpu-pinning/ Roofline & Arithmetic Intensity arithmetic intensity (FLOPs per byte), the roofline envelope of peak-compute and peak-bandwidth ceilings, the ridge point that separates memory-bound… https://ai-infrastructure.net/roofline-arithmetic-intensity/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/roofline-arithmetic-intensity/ Shared Memory & Tiling using on-chip shared memory as a software-managed cache, covering tiling for data reuse (tiled GEMM), the 32-bank conflict model, and padding/swizzling… https://ai-infrastructure.net/shared-memory-tiling/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/shared-memory-tiling/ Tensor Cores & Mixed Precision how Tensor Cores and reduced-precision formats (TF32, BF16/FP16, FP8, NVFP4, INT8) raise arithmetic intensity and throughput, why accumulation precision… https://ai-infrastructure.net/tensor-cores-mixed-precision/ Sun, 28 Jun 2026 08:14:53 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/tensor-cores-mixed-precision/ vLLM: DeepSeek-R1 a production-oriented vLLM reference template for serving deepseek-ai/DeepSeek-R1 as an OpenAI-compatible endpoint: what the model is, why and when to… https://ai-infrastructure.net/cookbook-vllm-deepseek-r1/ Thu, 25 Jun 2026 17:46:06 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cookbook-vllm-deepseek-r1/ vLLM: GLM-4.7-FP8 a vLLM reference template for serving zai-org/GLM-4.7-FP8: what GLM-4.7 is, why and when to use it for coding and agentic workloads, how to launch the… https://ai-infrastructure.net/cookbook-vllm-glm-4-7/ Thu, 25 Jun 2026 17:46:06 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cookbook-vllm-glm-4-7/ vLLM: Kimi K2 a vLLM reference template for serving 
moonshotai/Kimi-K2-Instruct: what Kimi K2 is, why and when to use it for agentic/tool workloads, how to size the… https://ai-infrastructure.net/cookbook-vllm-kimi-k2/ Thu, 25 Jun 2026 17:46:06 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cookbook-vllm-kimi-k2/ vLLM: Llama 4 Maverick a vLLM reference template for serving meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8: what Llama 4 Maverick is, why and when to use it, how to handle… https://ai-infrastructure.net/cookbook-vllm-llama-4-maverick/ Thu, 25 Jun 2026 17:46:06 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cookbook-vllm-llama-4-maverick/ vLLM: Qwen3-235B-A22B a vLLM reference template for serving Qwen/Qwen3-235B-A22B-Instruct-2507: what the model is, why and when to use it, how to deploy the 235B/22B MoE… https://ai-infrastructure.net/cookbook-vllm-qwen3-235b/ Thu, 25 Jun 2026 17:46:06 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cookbook-vllm-qwen3-235b/ Helm: DRA Driver install nvidia-dra-driver-gpu via Helm so the kubelet plugin publishes a ResourceSlice per node and pods can claim GPUs through Dynamic Resource… https://ai-infrastructure.net/helm-dra-driver/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/helm-dra-driver/ Helm: GPU Operator helm repo add nvidia + helm install gpu-operator, the load-bearing chart values (driver.enabled, driver.version, mig.strategy, toolkit.enabled… https://ai-infrastructure.net/helm-gpu-operator/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/helm-gpu-operator/ Helm: Kueue Quota install Kueue (release manifest or Helm OCI chart), model fair-share GPU quota with ResourceFlavor + ClusterQueue + LocalQueue, share spare capacity… https://ai-infrastructure.net/helm-kueue-quota/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/helm-kueue-quota/ Helm: Network Operator 
helm install network-operator with the RDMA shared device plugin and the secondary-network path, so pods get an IB/RoCE device and GPUDirect RDMA engages… https://ai-infrastructure.net/helm-network-operator/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/helm-network-operator/ Helm: Volcano Scheduler install Volcano via Helm; configure the scheduler/controller/admission, stand up queues, and make gang (minMember) scheduling place distributed jobs… https://ai-infrastructure.net/helm-volcano-scheduler/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/helm-volcano-scheduler/ Inventory & Variables the inventory model that the bring-up roles read: the gpu_nodes group, the group_vars//host_vars/ layout, the per-tier variables (gpu_tier… https://ai-infrastructure.net/inventory-node-model/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/inventory-node-model/ Manifest: DCGM Exporter deploy dcgm-exporter (via the GPU Operator or standalone), wire a Prometheus ServiceMonitor/scrape config, understand the DCGM_FI_DEV_/DCGM_FI_PROF_… https://ai-infrastructure.net/manifest-dcgm-exporter/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/manifest-dcgm-exporter/ Manifest: DRA ResourceClaim a ResourceClaimTemplate that selects a GPU by attribute/capacity via a CEL expression (e.g. memory = 40Gi), a Pod that consumes it through pod-level… https://ai-infrastructure.net/manifest-dra-resourceclaim/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/manifest-dra-resourceclaim/ Manifest: GPU Operator ClusterPolicy the ClusterPolicy CRD (apiVersion: nvidia.com/v1) that is the single source of truth for an NVIDIA GPU Operator install. 
It covers the driver, toolkit… https://ai-infrastructure.net/manifest-gpu-operator-clusterpolicy/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/manifest-gpu-operator-clusterpolicy/ Manifest: Kueue ClusterQueue the ResourceFlavor + ClusterQueue + LocalQueue triad that fences nvidia.com/gpu into team quota, plus a Job labelled to a LocalQueue and the kubectl get… https://ai-infrastructure.net/manifest-kueue-clusterqueue/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/manifest-kueue-clusterqueue/ Manifest: MIG Mode driving MIG declaratively through the GPU Operator's mig-manager, covering the nvidia.com/mig.config node label, the default-mig-parted-config ConfigMap… https://ai-infrastructure.net/manifest-mig-mode/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/manifest-mig-mode/ Manifest: NicClusterPolicy the NicClusterPolicy CRD that drives the NVIDIA Network Operator. 
It wires the OFED/DOCA driver, the RDMA shared device plugin (resourceName, ifNames)… https://ai-infrastructure.net/manifest-nic-cluster-policy/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/manifest-nic-cluster-policy/ Manifest: Time-Slicing the NVIDIA k8s device-plugin ConfigMap (replicas under sharing.timeSlicing.resources), wiring it through the GPU Operator via devicePlugin.config, the… https://ai-infrastructure.net/manifest-time-slicing/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/manifest-time-slicing/ Manifest: Volcano Job a Volcano Job (batch.volcano.sh/v1alpha1) that gang-schedules a multi-pod GPU training run (minAvailable, multiple tasks, schedulerName: volcano) so… https://ai-infrastructure.net/manifest-volcano-job/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/manifest-volcano-job/ Site Playbook the site.yml that orchestrates a fleet bring-up: host selection, privilege escalation, staged role order (base_tuning - acs_disable - rdma_fabric OFED… https://ai-infrastructure.net/playbook-site-bring-up/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/playbook-site-bring-up/ RBAC for Operators the ServiceAccounts, ClusterRoles and ClusterRoleBindings that the GPU Operator, Network Operator, DRA driver and Kueue install and run as. What each is… https://ai-infrastructure.net/rbac-gpu-operators/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/rbac-gpu-operators/ Role: base_tuning host prep before any NVIDIA package lands. 
Blacklist nouveau, set the GRUB kernel cmdline (IOMMU mode per platform, pci=realloc for large GPU BARs), pin… https://ai-infrastructure.net/role-base-tuning/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/role-base-tuning/ Role: mig enable MIG mode and lay out one requested profile per GPU via nvidia-smi mig, wrapped so the role is idempotent. It reads current state (nvidia-smi… https://ai-infrastructure.net/role-mig-configuration/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/role-mig-configuration/ Role: nvidia_stack install the NVIDIA GPU software stack on a prepared node, namely the kernel driver (tier-aware package, branch-pinned), the CUDA toolkit… https://ai-infrastructure.net/role-nvidia-stack/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/role-nvidia-stack/ Role: rdma_fabric install the DOCA-OFED host stack, load nvidia_peermem for GPUDirect RDMA, and write /etc/nccl.conf defaults (NCCL_IB_HCA, NCCL_IB_GID_INDEX… https://ai-infrastructure.net/role-rdma-fabric/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/role-rdma-fabric/ Role: validate fail the play if a node is not fit to take work. 
The validate role is the last role in site.yml: it asserts GPUs enumerate, persistence and (on NVSwitch… https://ai-infrastructure.net/role-validate-health/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/role-validate-health/ PCIe ACS-Disable Service a systemd one-shot that clears the PCIe ACS Control redirect bits on every bridge at boot (setpci loop) so GPU/NIC peer-to-peer (GPUDirect P2P/RDMA) is… https://ai-infrastructure.net/service-acs-disable/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/service-acs-disable/ Smoke Tests a consolidated acceptance suite for a freshly built GPU platform. It runs a CUDA pod running nvidia-smi, an nccl-tests Job over RDMA (GDRDMA confirmed… https://ai-infrastructure.net/smoke-tests-gpu-platform/ Thu, 25 Jun 2026 15:21:25 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/smoke-tests-gpu-platform/ Bare-Metal Provisioning & PXE network boot for fleet-scale OS install: the PXE/iPXE chain, the DHCP options that drive it (next-server, filename, client-arch, HTTPClient), TFTP vs… https://ai-infrastructure.net/bare-metal-pxe/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/bare-metal-pxe/ Container Toolkit & CDI how the NVIDIA Container Toolkit and Container Device Interface (CDI) expose host GPUs to OCI containers: install, runtime wiring (Docker/containerd)… https://ai-infrastructure.net/container-toolkit/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/container-toolkit/ CUDA Driver the CUDA driver (libcuda.so, the user-mode half of the NVIDIA GPU driver), its Driver API vs the runtime API, the driver version vs the CUDA version… https://ai-infrastructure.net/cuda-driver/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cuda-driver/ CUDA Libraries the math and 
collective-communication libraries that sit between the CUDA runtime and the frameworks (cuBLAS, cuDNN, NCCL, CUTLASS, cuFFT): how they are… https://ai-infrastructure.net/cuda-libraries/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cuda-libraries/ CUDA Toolkit & Runtime the difference between the CUDA Toolkit (compiler, headers, static libs), the CUDA runtime that ships inside applications, and the driver underneath… https://ai-infrastructure.net/cuda-toolkit-runtime/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cuda-toolkit-runtime/ Diagnostics & Validation the tooling that proves a GPU node is healthy enough to take work (dcgmi diag run levels, DCGM health watches, nvbandwidth, gpu-burn, and nvidia-smi… https://ai-infrastructure.net/diagnostics-tools/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/diagnostics-tools/ Driver Support by Tier which of NVIDIA's four driver families a given GPU class runs (datacenter/Tesla, GeForce, RTX Enterprise Production Branch, DGX OS) and which platform… https://ai-infrastructure.net/driver-by-tier/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/driver-by-tier/ Driver Versions & Branches NVIDIA data center (Tesla) GPU driver branches, Long Term Support (LTSB) vs Production, their support windows, how to choose and pin a fleet to one… https://ai-infrastructure.net/driver-versions-branches/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/driver-versions-branches/ ECC ECC on GPU memory across the tiers: what carries it, how nvidia-smi toggles and reports it, the volatile-vs-aggregate counters, and row remapping on… https://ai-infrastructure.net/ecc-support/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/ecc-support/ Fabric Manager Run NVIDIA 
Fabric Manager (nv-fabricmanager) on NVSwitch systems: what it does, how it is versioned with the driver, and how to operate and debug it. https://ai-infrastructure.net/fabric-manager/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/fabric-manager/ GPU Firmware & GSP the on-board firmware a GPU carries, namely the VBIOS and the GSP (GPU System Processor) firmware the kernel driver loads at init. How to read it, where… https://ai-infrastructure.net/gpu-firmware-gsp/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-firmware-gsp/ Frameworks (PyTorch/JAX/TensorRT) how ML frameworks ship their own CUDA/cuDNN/NCCL inside wheels and containers, why this decouples the application stack from the host stack, the driver… https://ai-infrastructure.net/gpu-frameworks/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-frameworks/ GPU Health Gating keep jobs off bad GPUs. 
The control layer that runs a health check, turns a verdict into a scheduler state change, and stops work landing on a degraded… https://ai-infrastructure.net/gpu-health-gating/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-health-gating/ MIG Partitioning hardware partitioning of a single datacenter/workstation GPU into isolated GPU instances: profiles, the nvidia-smi mig lifecycle, isolation guarantees… https://ai-infrastructure.net/gpu-partitioning-mig/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-partitioning-mig/ MPS (Multi-Process Service) CUDA Multi-Process Service, software space-sharing that lets many cooperative processes run CUDA kernels concurrently on one GPU through a shared server… https://ai-infrastructure.net/gpu-partitioning-mps/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-partitioning-mps/ Image & Config Management keeping a GPU fleet bit-for-bit consistent: golden OS images with pinned driver/CUDA/firmware baselines, the immutable-image vs config-management… https://ai-infrastructure.net/image-management/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/image-management/ Install & Lifecycle how the NVIDIA driver and surrounding stack get onto a GPU node and stay correct over its life. 
Covers apt network repo vs runfile, open vs proprietary… https://ai-infrastructure.net/install-lifecycle/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/install-lifecycle/ IPMI (Legacy OOB) operating a BMC over the network with ipmitool -I lanplus (chassis power, sensors, SEL, LAN and user config), plus the protocol's security weaknesses… https://ai-infrastructure.net/ipmi-protocol/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/ipmi-protocol/ Kernel Modules the kernel-space half of the GPU driver on a single Linux node: the five nvidia modules, the open-vs-proprietary flavor choice, and the install-time… https://ai-infrastructure.net/kernel-modules/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/kernel-modules/ nvidia-smi Reference nvidia-smi (the NVIDIA System Management Interface) as a working instrument: inspection (-q, --query-gpu ... --format=csv), live monitoring (dmon, pmon)… https://ai-infrastructure.net/nvidia-smi-reference/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/nvidia-smi-reference/ NVSwitch & NVLink the GPU-to-GPU interconnect (NVLink links, NVSwitch ASICs, the Fabric Manager that fuses them into one all-to-all domain) across both an 8-GPU baseboard… https://ai-infrastructure.net/nvswitch-nvlink/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/nvswitch-nvlink/ Out-of-Band Management & BMC the baseboard management controller (BMC) and the lights-out plane it serves (remote power, serial-over-LAN console, sensor telemetry, firmware update… https://ai-infrastructure.net/oob-management-bmc/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/oob-management-bmc/ OOB Network Infrastructure the physically separate management network that carries lights-out control traffic: 
dedicated 1G switches, the management VLAN/subnet, BMC addressing via… https://ai-infrastructure.net/oob-network-infra/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/oob-network-infra/ Persistence Mode keeping the NVIDIA kernel driver initialized between jobs on a headless GPU node: the nvidia-persistenced daemon (preferred) versus the deprecated… https://ai-infrastructure.net/persistence-mode/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/persistence-mode/ Provisioning Tooling the bare-metal provisioning systems that turn racked GPU nodes into a fleet of identical, schedulable machines: Canonical MAAS, Warewulf, xCAT, OpenStack… https://ai-infrastructure.net/provisioning-tools/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/provisioning-tools/ Redfish (Modern OOB) DMTF Redfish, the REST/JSON-over-HTTPS out-of-band management API exposed by a server's BMC. 
What the resource model is, how to drive… https://ai-infrastructure.net/redfish-protocol/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/redfish-protocol/ ECC Toggle Recovery recover a datacenter GPU whose ECC mode was toggled but is stuck with Current disagreeing with Pending, or that needs a reset/reboot after enabling ECC… https://ai-infrastructure.net/runbook-ecc-toggle-recovery/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-ecc-toggle-recovery/ Fabric Manager Failure nvidia-fabricmanager is inactive or aborting on an NVSwitch system (HGX/DGX 8-GPU baseboard, GB200/GB300 NVL72), so GPUs do not form their NVLink domain… https://ai-infrastructure.net/runbook-fabric-manager-failure/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-fabric-manager-failure/ GSP Firmware / Driver Mismatch recover a node where a partial driver change left the kernel modules and the GSP (GPU System Processor) firmware on different branches, so nvidia-smi… https://ai-infrastructure.net/runbook-gsp-firmware-mismatch/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-gsp-firmware-mismatch/ Image Drift Across Fleet converge a GPU fleet back to one pinned software baseline when non-reproducible failures trace to nodes running different driver / CUDA / GSP-firmware /… https://ai-infrastructure.net/runbook-image-drift/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-image-drift/ Kernel Upgrade — GPU Missing a single node that booted into a new kernel and now shows no GPUs (nvidia-smi: No devices were found) because the NVIDIA kernel module was never rebuilt… https://ai-infrastructure.net/runbook-kernel-gpu-missing/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base 
https://ai-infrastructure.net/runbook-kernel-gpu-missing/ Stale MIG State a node whose actual MIG geometry (nvidia-smi -L / nvidia-smi mig -lgi) has drifted from what the scheduler believes: pods stuck Pending on a MIG… https://ai-infrastructure.net/runbook-mig-state-stale/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-mig-state-stale/ OOB / BMC Unreachable a node has hung and its baseboard management controller (BMC) does not answer (no ping, ipmitool and Redfish both time out), so there is no lights-out… https://ai-infrastructure.net/runbook-oob-unreachable/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-oob-unreachable/ Persistence Mode / Clock Bounce a node (or fleet) showing slow job starts, bouncing clocks and non-reproducible benchmarks because the NVIDIA driver is de-initializing idle GPUs… https://ai-infrastructure.net/runbook-persistence-mode/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-persistence-mode/ Topology-Unaware Scheduling a tightly-coupled training job runs but crawls because its ranks landed scattered across the spine instead of rail-local on the fewest leaf switches, so… https://ai-infrastructure.net/runbook-topology-scheduling/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/runbook-topology-scheduling/ Slurm Topology Placement making Slurm pack a tightly-coupled job onto the fewest, closest network leaves so its collectives stay rail-local. 
Covers topology.conf with the… https://ai-infrastructure.net/slurm-topology-placement/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/slurm-topology-placement/ Slurm vs Kubernetes a decision guide for picking the workload manager on a GPU cluster, batch HPC (Slurm) versus service orchestration (Kubernetes), across gang scheduling… https://ai-infrastructure.net/slurm-vs-kubernetes/ Thu, 25 Jun 2026 13:50:44 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/slurm-vs-kubernetes/ Fabric Bring-Up, Validation & Benchmarking the shared procedure for bringing a GPU interconnect up, proving it healthy, and benchmarking it to line rate (InfiniBand, RoCE, and NVLink) before any… https://ai-infrastructure.net/fabric-bringup-benchmarking/ Thu, 25 Jun 2026 13:05:27 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/fabric-bringup-benchmarking/ Recipes & manifests (index) the index of runnable recipes (Ansible playbooks, Kubernetes/Helm manifests, telemetry stacks, and workload bring-up cookbooks). 
Where operational… https://ai-infrastructure.net/recipes-index/ Thu, 25 Jun 2026 13:05:27 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/recipes-index/ Vendor Sourcing & Procurement turning a validated bill of materials into placed orders and received, asset-tagged hardware: who to buy from per region and component tier, what an RFQ… https://ai-infrastructure.net/vendor-sourcing-procurement/ Thu, 25 Jun 2026 13:05:27 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/vendor-sourcing-procurement/ DGX Spark (GB10 Desktop) NVIDIA's GB10 Grace Blackwell desktop AI computer (formerly “Project DIGITS”), a single-board Arm + Blackwell system with 128 GB of unified memory, two… https://ai-infrastructure.net/dgx-spark/ Wed, 24 Jun 2026 11:19:14 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/dgx-spark/ DGX & HGX Systems NVIDIA's turnkey AI systems (DGX) and the OEM baseboards they share their silicon with (HGX). What makes a DGX operationally different from a self-built… https://ai-infrastructure.net/dgx-systems/ Wed, 24 Jun 2026 11:19:14 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/dgx-systems/ Ampere (A100 / A30 / A40) the Ampere datacenter generation (GA100/GA10x dies, TSMC 7nm / Samsung 8nm, 2020): the A100, A30, A40, and A10, their interconnect and MIG behaviour, and… https://ai-infrastructure.net/gpu-ampere/ Wed, 24 Jun 2026 11:19:14 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-ampere/ Generations & Families Compare NVIDIA GPU generations for clusters: Ampere, Hopper, and Blackwell datacenter GPUs, plus RTX and DGX, and what changed each generation. 
https://ai-infrastructure.net/gpu-generations/ Wed, 24 Jun 2026 11:19:14 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-generations/ Hopper (H100 / H200 / GH200) the Hopper datacenter generation (GH100 die, TSMC 4N, 2022-2024): the H100 and H200 GPUs, the GH200 Grace Hopper superchip, their HGX/DGX systems, and… https://ai-infrastructure.net/gpu-hopper/ Wed, 24 Jun 2026 11:19:14 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-hopper/ RTX Consumer & Workstation (5090 / 4090 / RTX PRO) NVIDIA's consumer (GeForce RTX 50/40) and professional workstation/server (RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S) GPUs, and how they differ… https://ai-infrastructure.net/gpu-rtx-workstation/ Wed, 24 Jun 2026 11:19:14 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-rtx-workstation/ Overview Ansible playbooks to take a freshly-imaged GPU node from bare OS to ready for Kubernetes or Slurm: driver stack, Fabric Manager, InfiniBand, tuning. https://ai-infrastructure.net/ansible-node-fabric-bringup/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/ansible-node-fabric-bringup/ BOM Validation Validate a GPU cluster bill of materials before procurement: catch SKU, NVLink, power, cooling, and networking errors while they are cheap to fix. 
https://ai-infrastructure.net/bom-validation/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/bom-validation/ Cloud, Neoclouds & Cost overview and decision index for GPU beyond the owned hall: hyperscaler instances, the neoclouds, decentralized/permissionless GPU, and the economics that… https://ai-infrastructure.net/cloud-neoclouds-cost/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cloud-neoclouds-cost/ k3s k3s as a single-binary, CNCF-conformant Kubernetes distribution for edge, small, CI, and dev clusters: the same API as full Kubernetes at a… https://ai-infrastructure.net/cluster-k3s/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cluster-k3s/ Kubernetes Kubernetes as the orchestration technology under a GPU platform (its objects, control loops, and CRD model) and how training, inference, and fine-tuning… https://ai-infrastructure.net/cluster-kubernetes/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cluster-kubernetes/ Orchestration Overview overview, decision, and index page for the orchestration layer that decides what runs where: the HPC batch world (Slurm), the cloud-native world… https://ai-infrastructure.net/cluster-orchestration/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cluster-orchestration/ Ray Ray as a Python-native distributed runtime for GPU clusters: tasks, actors, the head/worker model, and the Train/Serve/Data/RLlib libraries that make it… https://ai-infrastructure.net/cluster-ray/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cluster-ray/ Slurm Slurm as the HPC batch workload manager for GPU clusters, covering partitions, gang scheduling, GRES, topology-aware placement, and multi-node training… https://ai-infrastructure.net/cluster-slurm/ 
Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/cluster-slurm/ Commissioning & Acceptance Commission a GPU cluster from racked hardware to production sign-off: the bring-up sequence and the acceptance tests that prove readiness. https://ai-infrastructure.net/commissioning-acceptance/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/commissioning-acceptance/ Datacentre Physical Readiness reading datacentre drawings and confirming the facility can actually host the cluster. Power, UPS, cooling, airflow, weight, and the schematics that… https://ai-infrastructure.net/datacentre-physical/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/datacentre-physical/ Disaggregated Inference splitting LLM inference into separate prefill and decode pools that scale independently, the KV-cache transfer that connects them, and how to run it with… https://ai-infrastructure.net/disaggregated-inference/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/disaggregated-inference/ FSDP / DiLoCo Recipes an index / decision page for the two scale-out paradigms, FSDP for high-bandwidth single-DC sharding and DiLoCo for low-communication / geo-distributed… https://ai-infrastructure.net/distributed-training-recipes/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/distributed-training-recipes/ Distributed Training Platform the frameworks and mechanics of training across many GPUs: launchers, parallelism strategies, the libraries, numerics, checkpoint/resume, fault… https://ai-infrastructure.net/distributed-training/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/distributed-training/ Fine-tuning & Post-training adapting open-weight models through supervised fine-tuning, parameter-efficient LoRA/QLoRA, preference 
optimisation (DPO), and reinforcement learning… https://ai-infrastructure.net/finetuning-posttraining/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/finetuning-posttraining/ SFT & LoRA/QLoRA supervised fine-tuning on demonstrations, and the parameter-efficient adapters, LoRA (low-rank) and QLoRA (4-bit base), that make it fit large models on… https://ai-infrastructure.net/ft-sft-lora/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/ft-sft-lora/ Glossary concise definitions of the terms used across the knowledge base. Reference glossary, not a single WHAT/WHY/WHEN/HOW topic. https://ai-infrastructure.net/glossary/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/glossary/ Start here Navigate the AI infrastructure knowledge base by role and topic: hardware, fabric, Kubernetes, Slurm, training, inference, operations, and recipes. https://ai-infrastructure.net/gpu-cluster-kb-index/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-cluster-kb-index/ GPU Performance & Health tuning the collective-communication and GPU layer, and monitoring GPU health, where interconnect saturation and telemetry decide whether the hardware is… https://ai-infrastructure.net/gpu-performance-health/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-performance-health/ Overview & Node Admin the software stack on a single GPU node and the day-to-day administration of it. 
Driver, CUDA, the management interfaces, GPU partitioning, and the… https://ai-infrastructure.net/gpu-software-stack/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/gpu-software-stack/ Inference Serving & Optimization Serve LLMs in production: vLLM and SGLang, continuous batching, KV cache, quantization, and disaggregated prefill/decode for cost and latency. https://ai-infrastructure.net/inference-serving/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/inference-serving/ Containers & Kubernetes for GPUs Run GPU workloads on Kubernetes: the NVIDIA device plugin and GPU Operator, DRA, MIG partitioning, time-slicing, and GPUDirect RDMA in pods. https://ai-infrastructure.net/kubernetes-gpu/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/kubernetes-gpu/ Overview Helm manifests to turn a Kubernetes cluster into a GPU platform: GPU Operator, Network/RDMA operator, DRA, sharing models, and gang scheduling. https://ai-infrastructure.net/kubernetes-helm-gpu-platform/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/kubernetes-helm-gpu-platform/ HPC Networking Fabric Design and validate the GPU cluster network fabric: InfiniBand vs RoCE, NVLink and NVSwitch, topologies, and bandwidth validation with nccl-tests. https://ai-infrastructure.net/networking-fabric/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/networking-fabric/ Blackwell Datacenter (B200/B300, GB200/GB300) NVIDIA Blackwell datacenter platform: B200 and B300 GPUs, GB200/GB300 Grace-Blackwell superchips, and the GB300 NVL72 rack for AI training. 
https://ai-infrastructure.net/nvidia-blackwell-platform/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/nvidia-blackwell-platform/ Observability & Monitoring See what GPUs and the cluster are doing: the metrics that matter, DCGM telemetry, profiling, logging, and alerting for GPU cluster operations. https://ai-infrastructure.net/observability-monitoring/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/observability-monitoring/ Operational Runbooks Index the index of operational runbooks. Each recurring use-case is its own page with trigger, pre-checks, procedure, verification, and rollback. Where the… https://ai-infrastructure.net/operational-runbooks/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/operational-runbooks/ Performance Optimization & Tuning Tune GPU cluster performance end to end: the roofline method, NCCL and PCIe, NUMA pinning, kernel optimization, and finding low-MFU bottlenecks. https://ai-infrastructure.net/performance-optimization/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/performance-optimization/ Overview Turn bare-metal nodes into a schedulable GPU cluster: out-of-band/BMC, PXE imaging, health gating, and choosing Slurm vs Kubernetes for scheduling. https://ai-infrastructure.net/provisioning-scheduling/ Wed, 24 Jun 2026 10:27:21 +0000 AI Infrastructure Knowledge Base https://ai-infrastructure.net/provisioning-scheduling/ Reliability, RAS & Failure Modes what goes wrong with GPUs at scale and how it is detected, classified, and remediated. 
XID/SXID errors, ECC, HBM row remapping, thermal and bus failures…
https://ai-infrastructure.net/reliability-ras/
Wed, 24 Jun 2026 10:27:21 +0000

DPO
Offline preference alignment, training a policy directly on (chosen, rejected) pairs against a frozen reference, with no reward model and no rollouts…
https://ai-infrastructure.net/rl-dpo/
Wed, 24 Jun 2026 10:27:21 +0000

GRPO
Critic-free reinforcement learning for LLMs, sampling a group of completions per prompt, scoring each against an (often verifiable) reward, and updating…
https://ai-infrastructure.net/rl-grpo/
Wed, 24 Jun 2026 10:27:21 +0000

RL Libraries Overview
Comparison/selection overview and index for the open-source RL post-training libraries: how the systems are structured, which inference and training…
https://ai-infrastructure.net/rl-libraries-llms/
Wed, 24 Jun 2026 10:27:21 +0000

NeMo-RL
NVIDIA's open-source RL post-training library in the NeMo stack: Ray-orchestrated, with FSDP2/Megatron training and vLLM rollouts, built for scalable…
https://ai-infrastructure.net/rllib-nemo-rl/
Wed, 24 Jun 2026 10:27:21 +0000

OpenRLHF
OpenRLHF, a community Ray + vLLM RLHF/agentic-RL framework on DeepSpeed, an early pioneer of asynchronous RL execution with strong reward-model support…
https://ai-infrastructure.net/rllib-openrlhf/
Wed, 24 Jun 2026 10:27:21 +0000

SkyRL
SkyRL, a modular full-stack RL library for LLMs from the Berkeley Sky Computing Lab (NovaSky), built for multi-turn agentic RL with flexible…
https://ai-infrastructure.net/rllib-skyrl/
Wed, 24 Jun 2026 10:27:21 +0000

slime
Tsinghua / Z.ai's decoupled, async-first RL post-training framework, the RL stack behind the GLM model family.
https://ai-infrastructure.net/rllib-slime/
Wed, 24 Jun 2026 10:27:21 +0000

TRL
Hugging Face's Transformer Reinforcement Learning library, the HF-native post-training stack (SFT/DPO/GRPO and more) built on the Trainer + Accelerate…
https://ai-infrastructure.net/rllib-trl/
Wed, 24 Jun 2026 10:27:21 +0000

verl
ByteDance's flexible, high-performance RL post-training library for LLMs, built on the HybridFlow programming model.
https://ai-infrastructure.net/rllib-verl/
Wed, 24 Jun 2026 10:27:21 +0000

Add GPU Capacity
Safely add GPU capacity (new nodes or scale-up) to a running cluster: burn-in, fabric and health validation, then admit to scheduling, with a rollback…
https://ai-infrastructure.net/runbook-capacity-add/
Wed, 24 Jun 2026 10:27:21 +0000

Checkpoint Recovery / Resume
Recover and resume a distributed training job from its last good checkpoint after a crash, preemption, or hardware fault.
https://ai-infrastructure.net/runbook-checkpoint-recovery/
Wed, 24 Jun 2026 10:27:21 +0000

Rolling Driver / CUDA Upgrade
The longform procedure for rolling a GPU fleet to a new NVIDIA driver branch (with Fabric Manager and CUDA moved in step), one node at a time behind…
https://ai-infrastructure.net/runbook-driver-upgrade/
Wed, 24 Jun 2026 10:27:21 +0000

GPU Fault — Drain, Reset, RMA
Drain, reset, and (if needed) RMA a faulted GPU after an XID/ECC alert, returning the node to service safely.
https://ai-infrastructure.net/runbook-gpu-fault-rma/
Wed, 24 Jun 2026 10:27:21 +0000

Inference SLO Breach
Diagnose and remediate an inference SLO breach (TTFT/TPOT burn-rate) without taking the service down.
https://ai-infrastructure.net/runbook-inference-slo-breach/
Wed, 24 Jun 2026 10:27:21 +0000

Training MFU Regression
Localize and fix a training throughput (MFU) regression that has fallen below baseline.
https://ai-infrastructure.net/runbook-mfu-regression/
Wed, 24 Jun 2026 10:27:21 +0000

NCCL Hang / Collective Stall
Diagnose and clear an NCCL hang or collective stall: when step time goes to infinity with no XID and the whole world-size blocks on a collective.
https://ai-infrastructure.net/runbook-nccl-hang/
Wed, 24 Jun 2026 10:27:21 +0000

Thermal / Cooling Emergency
Respond to a thermal or cooling emergency (GPU throttling or a CDU alarm) to protect hardware and restore service.
https://ai-infrastructure.net/runbook-thermal-emergency/
Wed, 24 Jun 2026 10:27:21 +0000

Security, Isolation & Multi-tenancy
Securing GPU infrastructure and isolating tenants. Hardware isolation (MIG, vGPU), Blackwell confidential computing, the out-of-band/firmware attack…
https://ai-infrastructure.net/security-multitenancy/
Wed, 24 Jun 2026 10:27:21 +0000

Overview
Pick and serve open-weight LLMs with vLLM: DeepSeek, Kimi K2, GLM, Qwen, and Llama, with model selection and links to per-model deployment recipes.
https://ai-infrastructure.net/serving-oss-models/
Wed, 24 Jun 2026 10:27:21 +0000

SLO / SLI Catalog
Index and decision page for service-level indicators and objectives across GPU services. It frames the SLI → SLO → error-budget → burn-rate paradigm…
https://ai-infrastructure.net/slo-sli-catalog/
Wed, 24 Jun 2026 10:27:21 +0000

SRE, Platform & MLOps Practices
The operating-excellence layer over the whole stack: SLOs and error budgets, GitOps and policy-as-code, IaC, and the MLOps lifecycle. The practices that…
https://ai-infrastructure.net/sre-platform-mlops-practices/
Wed, 24 Jun 2026 10:27:21 +0000

Storage & Data Platform
Feed the GPUs: parallel filesystems, object storage, local scratch, checkpoint strategy, GPUDirect Storage, and a data-loading path that keeps them busy.
https://ai-infrastructure.net/storage-data-platform/
Wed, 24 Jun 2026 10:27:21 +0000

Telemetry, Monitoring & Alerting
Build a GPU monitoring stack: DCGM exporter to Prometheus to Grafana and Alertmanager, with scrape config, dashboards, and PromQL alert rules.
https://ai-infrastructure.net/telemetry-monitoring-alerting/
Wed, 24 Jun 2026 10:27:21 +0000

DDP (Distributed Data Parallel)
PyTorch DistributedDataParallel, which replicates the model on every GPU, shards the data, and all-reduces gradients each step. The simplest, fastest…
https://ai-infrastructure.net/train-ddp/
Wed, 24 Jun 2026 10:27:21 +0000

DeepSpeed & ZeRO
The ZeRO family of sharded-data-parallel optimisations (stages 1/2/3 + CPU/NVMe offload) and the deepspeed launcher, as one of the memory-scaling…
https://ai-infrastructure.net/train-deepspeed-zero/
Wed, 24 Jun 2026 10:27:21 +0000

DiLoCo (Low-Communication)
A two-level optimisation method that lets workers train mostly independently and synchronise rarely. It is the low-bandwidth alternative to per-step data…
https://ai-infrastructure.net/train-diloco/
Wed, 24 Jun 2026 10:27:21 +0000

FSDP (Fully Sharded Data Parallel)
PyTorch FSDP2 (fully_shard), sharding parameters, gradients, and optimizer state across ranks to train models too large for DDP, and how that scales…
https://ai-infrastructure.net/train-fsdp/
Wed, 24 Jun 2026 10:27:21 +0000

Pipeline Parallelism
Splitting a model's layer stack into stages across devices/nodes and streaming micro-batches through them.
The third axis of distributed training's 3D…
https://ai-infrastructure.net/train-pipeline-parallel/
Wed, 24 Jun 2026 10:27:21 +0000

Tensor Parallelism
Megatron-style intra-layer model parallelism (splitting a single layer's matmuls across GPUs) as one axis of the parallelism stack in distributed training.
https://ai-infrastructure.net/train-tensor-parallel/
Wed, 24 Jun 2026 10:27:21 +0000

Troubleshooting (symptom → fix)
A triage index. When something breaks, match the symptom to the runbook that fixes it. This page is the dispatcher: the detailed step-by-step HOW lives…
https://ai-infrastructure.net/troubleshooting-runbook/
Wed, 24 Jun 2026 10:27:21 +0000

Workload & Bring-Up Recipes
Index and decision overview for runnable GPU-cluster workloads (fabric validation, distributed training, inference serving) and the order they run in…
https://ai-infrastructure.net/workload-bringup-recipes/
Wed, 24 Jun 2026 10:27:21 +0000

Home
Deploy and operate GPU clusters: NVIDIA GPUs, DGX/HGX, InfiniBand/RoCE, Kubernetes, Slurm, distributed training, and LLM inference serving.
https://ai-infrastructure.net/
Wed, 24 Jun 2026 09:43:35 +0000