<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="rss.xsl"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    
    <title>AI Infrastructure Knowledge Base</title>
    <description>A comprehensive, citable knowledge base for deploying, operating, and optimising GPU clusters: NVIDIA Blackwell (B300 / GB300 NVL72), InfiniBand and RoCE fabrics, Kubernetes, k3s, Ray and Slurm, distributed training (FSDP, DDP, ZeRO, tensor and pipeline parallelism, DiLoCo), RL post-training (GRPO, DPO, SFT/LoRA) with verl, slime and SkyRL, LLM inference serving and disaggregation, observability, SRE and MLOps.</description>
    <link>https://ai-infrastructure.net/</link>
    <atom:link href="https://ai-infrastructure.net/feed_rss_created.xml" rel="self" type="application/rss+xml" />

    
    <managingEditor>setloop.io</managingEditor>
    
    <language>en</language>

    
    <pubDate>Thu, 02 Jul 2026 22:42:12 -0000</pubDate>
    <lastBuildDate>Thu, 02 Jul 2026 22:42:12 -0000</lastBuildDate>
    <ttl>1440</ttl>

    
    <generator>MkDocs RSS plugin - v1.19.0</generator>

    
    
    

    
    
    <item>
      <title>KV Cache Token Eviction</title>
      
      
      
      
      <description>Why most KV-cache eviction methods fail in production: FlashAttention never exposes attention scores, and paged allocators free only empty blocks…</description>
      <link>https://ai-infrastructure.net/kv-cache-token-eviction/</link>
      <pubDate>Thu, 02 Jul 2026 19:51:48 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/kv-cache-token-eviction/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/kv-cache-token-eviction.png" type="image/png" length="84338" />
      
    </item>
    
    <item>
      <title>KV Compression No Savings</title>
      
      
      
      
      <description>Diagnose why an enabled KV-cache eviction or compression method shows no memory savings in vLLM: wrong gauge, eager-attention fallback, or block-granular…</description>
      <link>https://ai-infrastructure.net/runbook-kv-compression-no-savings/</link>
      <pubDate>Thu, 02 Jul 2026 19:51:48 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-kv-compression-no-savings/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-kv-compression-no-savings.png" type="image/png" length="77215" />
      
    </item>
    
    <item>
      <title>cuTile Rust: Safe Tile Kernels</title>
      
      
      
      
      <description>cuTile Rust extends Rust&#39;s ownership discipline to GPU kernels: mutable outputs are partitioned into disjoint sub-tensors before launch, kernel launches…</description>
      <link>https://ai-infrastructure.net/cutile-rust/</link>
      <pubDate>Thu, 02 Jul 2026 19:50:08 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cutile-rust/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cutile-rust.png" type="image/png" length="73981" />
      
    </item>
    
    <item>
      <title>Disaggregation Rate Matching (When It Pays)</title>
      
      
      
      
      <description>When prefill/decode disaggregation actually pays, from NVIDIA&#39;s systematic study of hundreds of thousands of simulated design points: prefill-heavy…</description>
      <link>https://ai-infrastructure.net/disaggregation-rate-matching/</link>
      <pubDate>Thu, 02 Jul 2026 19:50:08 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/disaggregation-rate-matching/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/disaggregation-rate-matching.png" type="image/png" length="79483" />
      
    </item>
    
    <item>
      <title>Legate Sparse: Distributed scipy.sparse</title>
      
      
      
      
      <description>Legate Sparse distributes and accelerates unmodified scipy.sparse programs across CPU and GPU clusters on the Legion runtime, composing with cuPyNumeric…</description>
      <link>https://ai-infrastructure.net/legate-sparse/</link>
      <pubDate>Thu, 02 Jul 2026 19:50:08 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/legate-sparse/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/legate-sparse.png" type="image/png" length="77335" />
      
    </item>
    
    <item>
      <title>Evaluating Speculative Decoding (SPEED-Bench)</title>
      
      
      
      
      <description>Speculative decoding speedups are data-dependent: acceptance rates vary by domain, batch size shifts the optimal draft length, and random-token…</description>
      <link>https://ai-infrastructure.net/speculative-decoding-evaluation/</link>
      <pubDate>Thu, 02 Jul 2026 19:50:08 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/speculative-decoding-evaluation/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/speculative-decoding-evaluation.png" type="image/png" length="86035" />
      
    </item>
    
    <item>
      <title>Automated Harness Optimization</title>
      
      
      
      
      <description>A worked case study of the Meta-Harness loop on Harvey&#39;s Legal Agent Benchmark: an LLM proposer rewrites the harness around a frozen open model, a…</description>
      <link>https://ai-infrastructure.net/automated-harness-optimization/</link>
      <pubDate>Thu, 02 Jul 2026 19:20:29 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/automated-harness-optimization/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/automated-harness-optimization.png" type="image/png" length="82596" />
      
    </item>
    
    <item>
      <title>Kubernetes Network Drivers (DRA Networking)</title>
      
      
      
      
      <description>The Kubernetes Network Driver model replaces the CNI + device-plugin composition with DRA ResourceClaims and NRI runtime hooks: declarative…</description>
      <link>https://ai-infrastructure.net/kubernetes-network-drivers/</link>
      <pubDate>Thu, 02 Jul 2026 19:20:29 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/kubernetes-network-drivers/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/kubernetes-network-drivers.png" type="image/png" length="89661" />
      
    </item>
    
    <item>
      <title>LLM Inference Efficiency (Convergence Map)</title>
      
      
      
      
      <description>How models, hardware, and serving algorithms converge on LLM inference efficiency: prefill vs decode roofline placement, the bandwidth ladder, FP8/FP4…</description>
      <link>https://ai-infrastructure.net/llm-inference-efficiency/</link>
      <pubDate>Thu, 02 Jul 2026 19:20:29 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/llm-inference-efficiency/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/llm-inference-efficiency.png" type="image/png" length="73064" />
      
    </item>
    
    <item>
      <title>Loop Engineering</title>
      
      
      
      
      <description>Loop engineering is the layer above the harness: scheduled, self-feeding loops that discover work, hand it to agents, verify it with an independent…</description>
      <link>https://ai-infrastructure.net/loop-engineering/</link>
      <pubDate>Thu, 02 Jul 2026 19:20:29 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/loop-engineering/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/loop-engineering.png" type="image/png" length="75799" />
      
    </item>
    
    <item>
      <title>Multi-Agent Collaboration (TradingAgents)</title>
      
      
      
      
      <description>How role-specialized LLM agent teams collaborate through structured reports and bounded natural-language debate, using TradingAgents (arXiv 2412.20138)…</description>
      <link>https://ai-infrastructure.net/multi-agent-collaboration/</link>
      <pubDate>Thu, 02 Jul 2026 19:20:29 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/multi-agent-collaboration/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/multi-agent-collaboration.png" type="image/png" length="84702" />
      
    </item>
    
    <item>
      <title>NeMo AutoModel (MoE Fine-Tuning)</title>
      
      
      
      
      <description>NVIDIA NeMo AutoModel subclasses Transformers v5 AutoModelForCausalLM and adds Expert Parallelism on a dedicated moe_mesh, DeepEP fused all-to-all…</description>
      <link>https://ai-infrastructure.net/nemo-automodel/</link>
      <pubDate>Thu, 02 Jul 2026 19:20:29 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/nemo-automodel/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/nemo-automodel.png" type="image/png" length="78015" />
      
    </item>
    
    <item>
      <title>OpenHands Agent Platform</title>
      
      
      
      
      <description>OpenHands (f.k.a. OpenDevin) as an agent platform: the event-stream architecture, the step-function agent abstraction, the Docker-sandboxed runtime with…</description>
      <link>https://ai-infrastructure.net/openhands-platform/</link>
      <pubDate>Thu, 02 Jul 2026 19:20:29 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/openhands-platform/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/openhands-platform.png" type="image/png" length="86224" />
      
    </item>
    
    <item>
      <title>Skill Optimization (SkillOpt)</title>
      
      
      
      
      <description>SkillOpt treats the agent skill document as the trainable external state of a frozen model: an optimizer model turns scored rollouts into bounded…</description>
      <link>https://ai-infrastructure.net/skill-optimization/</link>
      <pubDate>Thu, 02 Jul 2026 19:20:29 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/skill-optimization/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/skill-optimization.png" type="image/png" length="81872" />
      
    </item>
    
    <item>
      <title>Time-Series Foundation Models (TimesFM)</title>
      
      
      
      
      <description>TimesFM is a 200M-parameter decoder-only foundation model that forecasts unseen time-series zero-shot: input patching, longer output patches for fewer…</description>
      <link>https://ai-infrastructure.net/time-series-foundation-models/</link>
      <pubDate>Thu, 02 Jul 2026 19:20:29 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/time-series-foundation-models/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/time-series-foundation-models.png" type="image/png" length="76268" />
      
    </item>
    
    <item>
      <title>Chat Rendering &amp; Loss Masking</title>
      
      
      
      
      <description>The renderer layer that sits between structured chat messages and token sequences in every post-training stack: how conversations become supervised…</description>
      <link>https://ai-infrastructure.net/chat-rendering-loss-masking/</link>
      <pubDate>Thu, 02 Jul 2026 17:01:12 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/chat-rendering-loss-masking/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/chat-rendering-loss-masking.png" type="image/png" length="69131" />
      
    </item>
    
    <item>
      <title>Cybersecurity Agent Evaluation</title>
      
      
      
      
      <description>How to measure what AI agents can do in security, offensively and defensively: real-world-grounded, sandboxed benchmarks scored on outcomes and…</description>
      <link>https://ai-infrastructure.net/cyber-agent-evaluation/</link>
      <pubDate>Thu, 02 Jul 2026 17:01:12 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cyber-agent-evaluation/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cyber-agent-evaluation.png" type="image/png" length="77286" />
      
    </item>
    
    <item>
      <title>LoRA Hyperparameter Scaling</title>
      
      
      
      
      <description>Empirically calibrated rules for setting LoRA post-training hyperparameters: the 10x learning-rate multiplier over full fine-tuning, hidden-size LR…</description>
      <link>https://ai-infrastructure.net/lora-hyperparameter-scaling/</link>
      <pubDate>Thu, 02 Jul 2026 17:01:12 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/lora-hyperparameter-scaling/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/lora-hyperparameter-scaling.png" type="image/png" length="67912" />
      
    </item>
    
    <item>
      <title>Tinker (Training-as-a-Service)</title>
      
      
      
      
      <description>Thinking Machines&#39; managed fine-tuning API (tinker) and its open-source recipe library (tinker-cookbook): the four training primitives, the multi-tenant…</description>
      <link>https://ai-infrastructure.net/rllib-tinker/</link>
      <pubDate>Thu, 02 Jul 2026 17:01:12 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rllib-tinker/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rllib-tinker.png" type="image/png" length="62883" />
      
    </item>
    
    <item>
      <title>Model Weight Loading (Inference Engines)</title>
      
      
      
      
      <description>How an inference engine turns safetensors on disk into a running model: the config architectures field and model registry, PyTorch dotted parameter…</description>
      <link>https://ai-infrastructure.net/engine-weight-loading/</link>
      <pubDate>Thu, 02 Jul 2026 14:55:24 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/engine-weight-loading/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/engine-weight-loading.png" type="image/png" length="75630" />
      
    </item>
    
    <item>
      <title>FMware Performance Engineering (SPE)</title>
      
      
      
      
      <description>Making foundation-model-powered software (FMware) meet throughput and latency SLOs instead of treating performance as a post-deployment afterthought: the…</description>
      <link>https://ai-infrastructure.net/fmware-performance-engineering/</link>
      <pubDate>Thu, 02 Jul 2026 14:55:24 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/fmware-performance-engineering/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/fmware-performance-engineering.png" type="image/png" length="70502" />
      
    </item>
    
    <item>
      <title>Local Coding Agents</title>
      
      
      
      
      <description>Run a coding agent fully locally on open-weight models: serve the model with Ollama or vLLM behind an OpenAI-compatible endpoint, point a coding harness…</description>
      <link>https://ai-infrastructure.net/local-coding-agents/</link>
      <pubDate>Thu, 02 Jul 2026 14:55:24 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/local-coding-agents/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/local-coding-agents.png" type="image/png" length="67206" />
      
    </item>
    
    <item>
      <title>Looped &amp; Recurrent-Depth Transformers</title>
      
      
      
      
      <description>Weight-tied transformer blocks applied iteratively to refine a latent state: iterative depth as a scaling axis orthogonal to model size and data…</description>
      <link>https://ai-infrastructure.net/looped-transformers/</link>
      <pubDate>Thu, 02 Jul 2026 14:55:24 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/looped-transformers/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/looped-transformers.png" type="image/png" length="72574" />
      
    </item>
    
    <item>
      <title>Muon Optimizer &amp; Distributed Muon (DMuon)</title>
      
      
      
      
      <description>Muon is a matrix-aware optimizer that orthogonalizes each weight-matrix gradient with a Newton-Schulz iteration; it is ~2x more compute-efficient than…</description>
      <link>https://ai-infrastructure.net/muon-optimizer/</link>
      <pubDate>Thu, 02 Jul 2026 14:55:24 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/muon-optimizer/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/muon-optimizer.png" type="image/png" length="82187" />
      
    </item>
    
    <item>
      <title>Delta Weight Sync (Sparse Weight Transfer)</title>
      
      
      
      
      <description>Only ~1-3% of weights change per RL step, so shipping just the delta cuts trainer-to-rollout weight-sync traffic ~100x, losslessly and bit-identically…</description>
      <link>https://ai-infrastructure.net/delta-weight-sync/</link>
      <pubDate>Thu, 02 Jul 2026 08:45:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/delta-weight-sync/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/delta-weight-sync.png" type="image/png" length="68235" />
      
    </item>
    
    <item>
      <title>GRPO Variants &amp; Training Tricks</title>
      
      
      
      
      <description>The fixes that make GRPO work at scale: DAPO&#39;s clip-higher, dynamic sampling, token-level loss and overlong shaping; Dr. GRPO&#39;s bias fixes; GSPO/GMPO…</description>
      <link>https://ai-infrastructure.net/grpo-variants/</link>
      <pubDate>Thu, 02 Jul 2026 06:45:15 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/grpo-variants/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/grpo-variants.png" type="image/png" length="71489" />
      
    </item>
    
    <item>
      <title>LLM Benchmarks (Anatomy &amp; Metrics)</title>
      
      
      
      
      <description>How LLM capability benchmarks are built and read: task formats, the metrics (accuracy, a validated pass@k estimator, calibration, discrimination), the…</description>
      <link>https://ai-infrastructure.net/llm-benchmarks/</link>
      <pubDate>Thu, 02 Jul 2026 06:45:15 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/llm-benchmarks/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/llm-benchmarks.png" type="image/png" length="83672" />
      
    </item>
    
    <item>
      <title>RL Scaling Laws</title>
      
      
      
      
      <description>How RL post-training compute scales: sigmoidal reward-vs-compute curves (ScaleRL), power-law fits across model sizes, what sets the asymptote versus the…</description>
      <link>https://ai-infrastructure.net/rl-scaling-laws/</link>
      <pubDate>Thu, 02 Jul 2026 06:45:15 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rl-scaling-laws/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rl-scaling-laws.png" type="image/png" length="63419" />
      
    </item>
    
    <item>
      <title>Rollout Redundancy (Prompt Dedup &amp; Cascade Attention)</title>
      
      
      
      
      <description>Group-sampling RL (GRPO/PPO) makes the prompt massively shared across rollouts; exploit it twice: prompt deduplication in the training forward/backward…</description>
      <link>https://ai-infrastructure.net/rl-rollout-redundancy/</link>
      <pubDate>Thu, 02 Jul 2026 06:11:01 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rl-rollout-redundancy/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rl-rollout-redundancy.png" type="image/png" length="79945" />
      
    </item>
    
    <item>
      <title>RLSD (RL + Self-Distillation)</title>
      
      
      
      
      <description>RLSD fuses RLVR and privileged-context self-distillation: the verifiable reward sets each update&#39;s direction while a token-level self-distillation signal…</description>
      <link>https://ai-infrastructure.net/rlsd/</link>
      <pubDate>Thu, 02 Jul 2026 06:11:01 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rlsd/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rlsd.png" type="image/png" length="72463" />
      
    </item>
    
    <item>
      <title>vLLM: GLM-5.2-FP8</title>
      
      
      
      
      <description>A vLLM reference template for serving zai-org/GLM-5.2-FP8, Z.ai&#39;s current flagship long-horizon agentic coding and reasoning model: what GLM-5.2 is (a…</description>
      <link>https://ai-infrastructure.net/cookbook-vllm-glm-5-2/</link>
      <pubDate>Thu, 02 Jul 2026 05:54:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cookbook-vllm-glm-5-2/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cookbook-vllm-glm-5-2.png" type="image/png" length="55085" />
      
    </item>
    
    <item>
      <title>Experiment Tracking &amp; Model Registry</title>
      
      
      
      
      <description>The MLOps backbone for finetuning/post-training: track every run&#39;s params/metrics/artifacts, version models in a registry with promotion stages, and…</description>
      <link>https://ai-infrastructure.net/experiment-tracking-model-registry/</link>
      <pubDate>Thu, 02 Jul 2026 05:54:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/experiment-tracking-model-registry/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/experiment-tracking-model-registry.png" type="image/png" length="87132" />
      
    </item>
    
    <item>
      <title>LLM Evaluation Harness &amp; Eval Gate</title>
      
      
      
      
      <description>Measure a post-trained model reproducibly: benchmark suites, the harnesses that run them (lm-evaluation-harness, lighteval), decontamination for honest…</description>
      <link>https://ai-infrastructure.net/llm-evaluation-harness/</link>
      <pubDate>Thu, 02 Jul 2026 05:54:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/llm-evaluation-harness/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/llm-evaluation-harness.png" type="image/png" length="65755" />
      
    </item>
    
    <item>
      <title>Model Merging (SLERP/TIES/DARE)</title>
      
      
      
      
      <description>Combine multiple fine-tuned checkpoints into one model with no training: task vectors, interference resolution (TIES, DARE), interpolation (SLERP, model…</description>
      <link>https://ai-infrastructure.net/model-merging/</link>
      <pubDate>Thu, 02 Jul 2026 05:54:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/model-merging/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/model-merging.png" type="image/png" length="73325" />
      
    </item>
    
    <item>
      <title>Multi-LoRA / Adapter Serving</title>
      
      
      
      
      <description>Serve hundreds of LoRA adapters over one shared base model: heterogeneous batching across adapters (S-LoRA, Punica), adapter paging, and the vLLM…</description>
      <link>https://ai-infrastructure.net/multi-lora-serving/</link>
      <pubDate>Thu, 02 Jul 2026 05:54:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/multi-lora-serving/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/multi-lora-serving.png" type="image/png" length="76102" />
      
    </item>
    
    <item>
      <title>Synthetic Data Generation</title>
      
      
      
      
      <description>Generate finetuning data with LLMs: teacher/distillation data, instruction synthesis (Self-Instruct, Evol-Instruct, Magpie), and AI feedback, plus the…</description>
      <link>https://ai-infrastructure.net/synthetic-data-generation/</link>
      <pubDate>Thu, 02 Jul 2026 05:54:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/synthetic-data-generation/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/synthetic-data-generation.png" type="image/png" length="72032" />
      
    </item>
    
    <item>
      <title>Training-Data Curation &amp; Decontamination</title>
      
      
      
      
      <description>Turn raw or synthetic data into a training set that helps: exact/fuzzy/semantic deduplication, quality filtering, benchmark decontamination, and dataset…</description>
      <link>https://ai-infrastructure.net/training-data-curation/</link>
      <pubDate>Thu, 02 Jul 2026 05:54:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/training-data-curation/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/training-data-curation.png" type="image/png" length="71599" />
      
    </item>
    
    <item>
      <title>Changelog</title>
      
      
      
      
      <description>What is new in the AI Infrastructure Knowledge Base: new pages and notable updates, newest first, so additions are easy to find.</description>
      <link>https://ai-infrastructure.net/changelog/</link>
      <pubDate>Wed, 01 Jul 2026 20:24:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/changelog/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/changelog.png" type="image/png" length="74948" />
      
    </item>
    
    <item>
      <title>Autonomous Experimentation Loops</title>
      
      
      
      
      <description>Closed-loop autonomous ML experimentation: an LLM proposes hyperparameters or code changes, a bounded trial runs, an evaluator scores it, the loop keeps…</description>
      <link>https://ai-infrastructure.net/autonomous-experimentation-loops/</link>
      <pubDate>Wed, 01 Jul 2026 19:56:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/autonomous-experimentation-loops/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/autonomous-experimentation-loops.png" type="image/png" length="71755" />
      
    </item>
    
    <item>
      <title>Evaluation Integrity &amp; Anti-Gaming</title>
      
      
      
      
      <description>Protect the evaluator from the optimizer: the frozen-vs-mutable boundary, reward hacking and Goodhart&#39;s law, sandbox enforcement, and held-out integrity…</description>
      <link>https://ai-infrastructure.net/evaluation-integrity-anti-gaming/</link>
      <pubDate>Wed, 01 Jul 2026 19:56:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/evaluation-integrity-anti-gaming/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/evaluation-integrity-anti-gaming.png" type="image/png" length="74537" />
      
    </item>
    
    <item>
      <title>Learning-Curve Extrapolation &amp; Early Stopping</title>
      
      
      
      
      <description>Kill doomed training trials early: forecast the final metric from a partial learning curve, and multi-fidelity bandits (Successive Halving, Hyperband…</description>
      <link>https://ai-infrastructure.net/learning-curve-extrapolation-early-stopping/</link>
      <pubDate>Wed, 01 Jul 2026 19:56:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/learning-curve-extrapolation-early-stopping/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/learning-curve-extrapolation-early-stopping.png" type="image/png" length="71340" />
      
    </item>
    
    <item>
      <title>LLM Request Routing (MoM)</title>
      
      
      
      
      <description>Route each LLM request to the right model in a heterogeneous pool: predictive vs cascading routing, decision signals (semantic, preference-learned…</description>
      <link>https://ai-infrastructure.net/llm-request-routing/</link>
      <pubDate>Wed, 01 Jul 2026 19:56:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/llm-request-routing/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/llm-request-routing.png" type="image/png" length="75268" />
      
    </item>
    
    <item>
      <title>On-Policy Distillation</title>
      
      
      
      
      <description>On-policy distillation post-training: the student trains on its own sampled rollouts, graded per-token by a teacher (reverse KL). GKD, the dense-reward…</description>
      <link>https://ai-infrastructure.net/on-policy-distillation/</link>
      <pubDate>Wed, 01 Jul 2026 19:56:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/on-policy-distillation/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/on-policy-distillation.png" type="image/png" length="86462" />
      
    </item>
    
    <item>
      <title>RLVR (Verifiable Rewards)</title>
      
      
      
      
      <description>RLVR post-training: reward an LLM from a deterministic verifier (answer match, unit tests, format, proof) instead of a learned reward model. Verifier…</description>
      <link>https://ai-infrastructure.net/rlvr/</link>
      <pubDate>Wed, 01 Jul 2026 19:56:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rlvr/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rlvr.png" type="image/png" length="74314" />
      
    </item>
    
    <item>
      <title>vLLM Semantic Router</title>
      
      
      
      
      <description>The vLLM Semantic Router: an Envoy External Processor that classifies each request with Rust/Candle BERT models and routes it across a Mixture-of-Models…</description>
      <link>https://ai-infrastructure.net/vllm-semantic-router/</link>
      <pubDate>Wed, 01 Jul 2026 19:56:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/vllm-semantic-router/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/vllm-semantic-router.png" type="image/png" length="78476" />
      
    </item>
    
    <item>
      <title>vLLM: DeepSeek-V3.2-Exp</title>
      
      
      
      
      <description>A vLLM reference template for serving deepseek-ai/DeepSeek-V3.2-Exp, DeepSeek&#39;s sparse-attention model: what DeepSeek Sparse Attention (DSA) changes, why…</description>
      <link>https://ai-infrastructure.net/cookbook-vllm-deepseek-v3-2/</link>
      <pubDate>Wed, 01 Jul 2026 19:13:32 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cookbook-vllm-deepseek-v3-2/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cookbook-vllm-deepseek-v3-2.png" type="image/png" length="60375" />
      
    </item>
    
    <item>
      <title>vLLM: MiniMax-M2</title>
      
      
      
      
      <description>A vLLM reference template for serving MiniMaxAI/MiniMax-M2, MiniMax&#39;s efficient MoE agent and reasoning model: what it is (230B total / 10B active…</description>
      <link>https://ai-infrastructure.net/cookbook-vllm-minimax-m2/</link>
      <pubDate>Wed, 01 Jul 2026 19:13:32 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cookbook-vllm-minimax-m2/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cookbook-vllm-minimax-m2.png" type="image/png" length="51654" />
      
    </item>
    
    <item>
      <title>vLLM: small models on consumer GPUs</title>
      
      
      
      
      <description>Running the current small open-weight models (roughly 1B to 32B) on a single consumer or workstation GPU with vLLM: which model fits which card, the VRAM…</description>
      <link>https://ai-infrastructure.net/cookbook-vllm-consumer-gpu/</link>
      <pubDate>Wed, 01 Jul 2026 17:06:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cookbook-vllm-consumer-gpu/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cookbook-vllm-consumer-gpu.png" type="image/png" length="74663" />
      
    </item>
    
    <item>
      <title>Dynamic &amp; Fractional GPU Sharing</title>
      
      
      
      
      <description>Sharing a GPU by real, changing demand instead of a fixed partition. Covers fractional allocation (a memory ceiling plus a compute share) with schedulers…</description>
      <link>https://ai-infrastructure.net/dynamic-fractional-gpu-sharing/</link>
      <pubDate>Wed, 01 Jul 2026 16:37:12 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/dynamic-fractional-gpu-sharing/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/dynamic-fractional-gpu-sharing.png" type="image/png" length="74016" />
      
    </item>
    
    <item>
      <title>Use as an agent skill</title>
      
      
      
      
      <description>Install ai-infrastructure.net as a reusable agent skill so any AI agent can use this GPU and AI-infrastructure knowledge base as a cited source of truth.</description>
      <link>https://ai-infrastructure.net/agent-skill/</link>
      <pubDate>Mon, 29 Jun 2026 21:26:12 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-skill/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-skill.png" type="image/png" length="71962" />
      
    </item>
    
    <item>
      <title>Governing Self-Modifying Agents</title>
      
      
      
      
      <description>When an agent edits its own prompt, tools, or middleware at runtime, govern the optimizer: change contracts, two-track promotion, shadow evaluation, and…</description>
      <link>https://ai-infrastructure.net/agent-governance-self-modifying/</link>
      <pubDate>Mon, 29 Jun 2026 21:11:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-governance-self-modifying/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-governance-self-modifying.png" type="image/png" length="75469" />
      
    </item>
    
    <item>
      <title>Identity &amp; Access</title>
      
      
      
      
      <description>Under whose authority an agent acts: a credential broker that restores the user&#39;s identity to the backend, plus zero-trust, ABAC, and SPIFFE workload…</description>
      <link>https://ai-infrastructure.net/agent-identity-access/</link>
      <pubDate>Mon, 29 Jun 2026 21:11:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-identity-access/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-identity-access.png" type="image/png" length="64463" />
      
    </item>
    
    <item>
      <title>Intent Verification</title>
      
      
      
      
      <description>Confirm an agent action matches what the user actually asked using an out-of-band signed-intent attestation, because an in-chat confirmation can be…</description>
      <link>https://ai-infrastructure.net/agent-intent-verification/</link>
      <pubDate>Mon, 29 Jun 2026 21:11:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-intent-verification/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-intent-verification.png" type="image/png" length="59977" />
      
    </item>
    
    <item>
      <title>Agentic Loop Economics</title>
      
      
      
      
      <description>Why prefix caching dominates agentic loop cost and latency: prefill O(N^2), byte-identical KV-cache reuse, and the harness moves that keep or break the…</description>
      <link>https://ai-infrastructure.net/agent-loop-economics/</link>
      <pubDate>Mon, 29 Jun 2026 21:11:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-loop-economics/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-loop-economics.png" type="image/png" length="65548" />
      
    </item>
    
    <item>
      <title>Policy Engine</title>
      
      
      
      
      <description>Decide whether an agent action may run: a deny-by-default policy engine in the in-process tool hook, named Cedar rules, one policy across runtimes, and…</description>
      <link>https://ai-infrastructure.net/agent-policy-engine/</link>
      <pubDate>Mon, 29 Jun 2026 21:11:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-policy-engine/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-policy-engine.png" type="image/png" length="60721" />
      
    </item>
    
    <item>
      <title>Self-Improving Harnesses</title>
      
      
      
      
      <description>The agent harness as an optimization target: searched, ablated, and self-edited, and lifting control flow from the prompt into an explicit program graph.</description>
      <link>https://ai-infrastructure.net/agent-self-improving-harness/</link>
      <pubDate>Mon, 29 Jun 2026 21:11:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-self-improving-harness/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-self-improving-harness.png" type="image/png" length="61760" />
      
    </item>
    
    <item>
      <title>Offensive AI &amp; Arms Race</title>
      
      
      
      
      <description>How AI shifts the offense-defense balance in software security: the discovery-versus-construction split, capability that scales with inference budget…</description>
      <link>https://ai-infrastructure.net/offensive-ai-security/</link>
      <pubDate>Mon, 29 Jun 2026 21:11:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/offensive-ai-security/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/offensive-ai-security.png" type="image/png" length="75941" />
      
    </item>
    
    <item>
      <title>Container Image Supply-Chain Provenance</title>
      
      
      
      
      <description>Ensuring the container images that run on your GPU nodes are exactly what you built and authorized. The chain from build to registry to node: digest…</description>
      <link>https://ai-infrastructure.net/container-image-provenance/</link>
      <pubDate>Mon, 29 Jun 2026 18:39:32 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/container-image-provenance/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/container-image-provenance.png" type="image/png" length="77844" />
      
    </item>
    
    <item>
      <title>Context &amp; Memory</title>
      
      
      
      
      <description>Manage an agent&#39;s working set: context engineering, the storage-versus-presentation split, hierarchical reduction (compaction then summarization)…</description>
      <link>https://ai-infrastructure.net/agent-context-memory/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-context-memory/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-context-memory.png" type="image/png" length="62736" />
      
    </item>
    
    <item>
      <title>Evaluating Agents</title>
      
      
      
      
      <description>Evaluate agents on output, components, and trajectory: build datasets, turn error analysis into a failure taxonomy, write PASS/FAIL rubrics, use…</description>
      <link>https://ai-infrastructure.net/agent-evaluation/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-evaluation/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-evaluation.png" type="image/png" length="57435" />
      
    </item>
    
    <item>
      <title>Harness Architecture</title>
      
      
      
      
      <description>The harness around an LLM (context management, tool dispatch, error recovery, state, and memory) and why it drives agent reliability more than the model…</description>
      <link>https://ai-infrastructure.net/agent-harness-architecture/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-harness-architecture/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-harness-architecture.png" type="image/png" length="61488" />
      
    </item>
    
    <item>
      <title>The Agent Loop</title>
      
      
      
      
      <description>The think-act-observe cycle at the core of every agent: ReAct, the run/step/think/act control flow, termination and loop guards, and when a loop beats a…</description>
      <link>https://ai-infrastructure.net/agent-loop/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-loop/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-loop.png" type="image/png" length="55138" />
      
    </item>
    
    <item>
      <title>Agent Observability</title>
      
      
      
      
      <description>Tracing the agent inference path: capture the full prompt and response on every model call, span the trajectory with OpenTelemetry GenAI conventions, and…</description>
      <link>https://ai-infrastructure.net/agent-observability/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-observability/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-observability.png" type="image/png" length="59356" />
      
    </item>
    
    <item>
      <title>Orchestration &amp; Control Plane</title>
      
      
      
      
      <description>Treat agent orchestration as a control plane, not a message bus: a decide() gate chain for authorization, mutation, budget, retries, and identity that…</description>
      <link>https://ai-infrastructure.net/agent-orchestration-control-plane/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-orchestration-control-plane/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-orchestration-control-plane.png" type="image/png" length="75375" />
      
    </item>
    
    <item>
      <title>Planning &amp; Reasoning</title>
      
      
      
      
      <description>Give agents time to think: ReAct&#39;s limits on complex tasks, explicit planning and task decomposition, reflection and failure recovery, and the…</description>
      <link>https://ai-infrastructure.net/agent-planning-reasoning/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-planning-reasoning/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-planning-reasoning.png" type="image/png" length="65078" />
      
    </item>
    
    <item>
      <title>Sandboxing &amp; Isolation</title>
      
      
      
      
      <description>Run agent-generated code and tool calls safely: the isolation spectrum from containers to gVisor and Firecracker microVMs, real container-escape CVEs…</description>
      <link>https://ai-infrastructure.net/agent-sandboxing-isolation/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-sandboxing-isolation/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-sandboxing-isolation.png" type="image/png" length="70019" />
      
    </item>
    
    <item>
      <title>Threat Model</title>
      
      
      
      
      <description>Why agency turns model flaws into system compromise: the lethal trifecta, lost identity and intent, the OWASP LLM Top 10 and MITRE ATLAS, and the shift…</description>
      <link>https://ai-infrastructure.net/agent-security-threat-model/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-security-threat-model/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-security-threat-model.png" type="image/png" length="68395" />
      
    </item>
    
    <item>
      <title>Tools &amp; Function Calling</title>
      
      
      
      
      <description>How agents act: the five-step tool-calling mechanism, JSON-Schema function definitions, tool-design rules, the Model Context Protocol, and code execution…</description>
      <link>https://ai-infrastructure.net/agent-tools-function-calling/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agent-tools-function-calling/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agent-tools-function-calling.png" type="image/png" length="63035" />
      
    </item>
    
    <item>
      <title>Start here</title>
      
      
      
      
      <description>Build and secure LLM agents: the agent loop, tools and function calling, context and memory, harness architecture, orchestration, observability…</description>
      <link>https://ai-infrastructure.net/agentic-systems-index/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agentic-systems-index/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agentic-systems-index.png" type="image/png" length="83483" />
      
    </item>
    
    <item>
      <title>Agentic AIOps &amp; Autonomous Operations</title>
      
      
      
      
      <description>Using software (increasingly LLM agents) to run the incident lifecycle on a GPU cluster: detect an anomaly, localize the faulty component, perform…</description>
      <link>https://ai-infrastructure.net/aiops-agentic-operations/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/aiops-agentic-operations/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/aiops-agentic-operations.png" type="image/png" length="81121" />
      
    </item>
    
    <item>
      <title>Distributed Training as a Platform Service</title>
      
      
      
      
      <description>Running elastic, multi-worker distributed training as a managed service on a Kubernetes GPU cluster: the platform layer that schedules workers, brings up…</description>
      <link>https://ai-infrastructure.net/distributed-training-platform-service/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/distributed-training-platform-service/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/distributed-training-platform-service.png" type="image/png" length="72006" />
      
    </item>
    
    <item>
      <title>GPU Confidential Computing &amp; Attestation</title>
      
      
      
      
      <description>Running a workload on an NVIDIA GPU so that its code, weights, and data are protected in use from the host, the hypervisor, and the cloud operator. This…</description>
      <link>https://ai-infrastructure.net/gpu-confidential-computing/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-confidential-computing/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-confidential-computing.png" type="image/png" length="80805" />
      
    </item>
    
    <item>
      <title>GPU Platform Split-Plane Architecture</title>
      
      
      
      
      <description>The reference architecture for a GPU-on-demand platform (a neocloud, an internal GPU cloud, or a multi-provider GPU service) that splits into a…</description>
      <link>https://ai-infrastructure.net/gpu-platform-split-plane-architecture/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-platform-split-plane-architecture/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-platform-split-plane-architecture.png" type="image/png" length="69373" />
      
    </item>
    
    <item>
      <title>K8s Pod Networking over WireGuard</title>
      
      
      
      
      <description>Making a real Kubernetes pod network work when the nodes are not on a shared L2 fabric but stitched together by a WireGuard overlay: the specific…</description>
      <link>https://ai-infrastructure.net/kubernetes-networking-wireguard-hybrid/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/kubernetes-networking-wireguard-hybrid/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/kubernetes-networking-wireguard-hybrid.png" type="image/png" length="74328" />
      
    </item>
    
    <item>
      <title>Operator for GPU Allocation</title>
      
      
      
      
      <description>The custom Kubernetes operator pattern that turns declarative GPU-workload intents (“lease 8×H100 for this tenant”, “run this served deployment”…</description>
      <link>https://ai-infrastructure.net/kubernetes-operator-gpu-orchestration/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/kubernetes-operator-gpu-orchestration/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/kubernetes-operator-gpu-orchestration.png" type="image/png" length="65096" />
      
    </item>
    
    <item>
      <title>Continuous NCCL Fabric Benchmarking</title>
      
      
      
      
      <description>Treating inter-node collective bandwidth as a standing, monitored signal rather than a one-time bring-up check. A long-lived service periodically…</description>
      <link>https://ai-infrastructure.net/nccl-fabric-benchmarking-service/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/nccl-fabric-benchmarking-service/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/nccl-fabric-benchmarking-service.png" type="image/png" length="70480" />
      
    </item>
    
    <item>
      <title>Overlay &amp; Mesh Networking (Geo-Distributed)</title>
      
      
      
      
      <description>The encrypted overlay network that stitches GPU nodes living behind different providers&#39; NATs and firewalls (across regions, clouds, or a decentralized…</description>
      <link>https://ai-infrastructure.net/overlay-mesh-networking-distributed-gpu/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/overlay-mesh-networking-distributed-gpu/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/overlay-mesh-networking-distributed-gpu.png" type="image/png" length="85275" />
      
    </item>
    
    <item>
      <title>Prompt-Injection Defense</title>
      
      
      
      
      <description>Defend agents against direct, indirect, and streaming prompt injection: detector ensembles with honest accuracy limits, two-checkpoint streaming…</description>
      <link>https://ai-infrastructure.net/prompt-injection-defense/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/prompt-injection-defense/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/prompt-injection-defense.png" type="image/png" length="58321" />
      
    </item>
    
    <item>
      <title>Remote GPU Verification (Rented Hardware)</title>
      
      
      
      
      <description>How to verify that a GPU you do not own (a rented neocloud instance, a node in a decentralized marketplace, a provider you cannot physically inspect)…</description>
      <link>https://ai-infrastructure.net/remote-gpu-verification/</link>
      <pubDate>Mon, 29 Jun 2026 18:22:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/remote-gpu-verification/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/remote-gpu-verification.png" type="image/png" length="76319" />
      
    </item>
    
    <item>
      <title>Agentic &amp; Tool-Use RL</title>
      
      
      
      
      <description>RL post-training for agentic, tool-using LLMs: multi-turn ReAct rollouts, tool-output loss masking, trajectory rewards, and the systems cost of tool…</description>
      <link>https://ai-infrastructure.net/agentic-rl/</link>
      <pubDate>Mon, 29 Jun 2026 15:32:55 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/agentic-rl/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/agentic-rl.png" type="image/png" length="71505" />
      
    </item>
    
    <item>
      <title>Recipe: Memory-Efficient GRPO Post-Training</title>
      
      
      
      
      <description>Run GRPO RL post-training on one node: a 4-bit QLoRA base, FP8 vLLM rollouts, a verifiable reward, and group-relative advantages, with apply and verify.</description>
      <link>https://ai-infrastructure.net/recipe-grpo-posttraining/</link>
      <pubDate>Mon, 29 Jun 2026 08:50:46 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/recipe-grpo-posttraining/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/recipe-grpo-posttraining.png" type="image/png" length="81677" />
      
    </item>
    
    <item>
      <title>PyTorch Custom CUDA Extensions</title>
      
      
      
      
      <description>Ship a hand-written CUDA kernel into PyTorch via torch.utils.cpp_extension or pybind11, wrapped as a torch.autograd.Function with forward and backward.</description>
      <link>https://ai-infrastructure.net/pytorch-cuda-extensions/</link>
      <pubDate>Mon, 29 Jun 2026 08:12:35 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/pytorch-cuda-extensions/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/pytorch-cuda-extensions.png" type="image/png" length="86724" />
      
    </item>
    
    <item>
      <title>Quantization for Inference</title>
      
      
      
      
      <description>Quantize LLM weights and activations for GPU serving with INT8, INT4, FP8 and NVFP4, plus GPTQ, AWQ, SmoothQuant and NF4 QLoRA for cheaper inference.</description>
      <link>https://ai-infrastructure.net/quantization-inference/</link>
      <pubDate>Mon, 29 Jun 2026 08:12:35 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/quantization-inference/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/quantization-inference.png" type="image/png" length="81520" />
      
    </item>
    
    <item>
      <title>Tensor Core Programming</title>
      
      
      
      
      <description>Program NVIDIA Tensor Cores from CUDA with the WMMA fragment API, Hopper WGMMA warp-group MMA in inline PTX, and Blackwell TCGen05 at a high level.</description>
      <link>https://ai-infrastructure.net/tensor-core-programming/</link>
      <pubDate>Mon, 29 Jun 2026 08:12:35 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/tensor-core-programming/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/tensor-core-programming.png" type="image/png" length="83779" />
      
    </item>
    
    <item>
      <title>Async &amp; Disaggregated RL Systems</title>
      
      
      
      
      <description>Scale LLM RL post-training with asynchronous, disaggregated rollout and training: off-policy staleness, truncated importance sampling, and weight sync.</description>
      <link>https://ai-infrastructure.net/async-rl-systems/</link>
      <pubDate>Mon, 29 Jun 2026 07:43:49 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/async-rl-systems/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/async-rl-systems.png" type="image/png" length="84152" />
      
    </item>
    
    <item>
      <title>Rejection Sampling &amp; Best-of-N</title>
      
      
      
      
      <description>Rejection sampling fine-tuning and Best-of-N for LLM post-training, the simple bridge from SFT to RL. Generate N completions, score, SFT on the best.</description>
      <link>https://ai-infrastructure.net/rejection-sampling-bon/</link>
      <pubDate>Mon, 29 Jun 2026 07:43:49 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rejection-sampling-bon/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rejection-sampling-bon.png" type="image/png" length="86018" />
      
    </item>
    
    <item>
      <title>Reward Model Training</title>
      
      
      
      
      <description>Train a reward model for RLHF with the Bradley-Terry log-sigmoid loss and a scalar head, plus ORM, PRM, and generative LLM-as-a-judge variants in TRL.</description>
      <link>https://ai-infrastructure.net/reward-model-training/</link>
      <pubDate>Mon, 29 Jun 2026 07:43:49 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/reward-model-training/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/reward-model-training.png" type="image/png" length="74348" />
      
    </item>
    
    <item>
      <title>PPO</title>
      
      
      
      
      <description>Actor-critic RL for LLM RLHF: a trainable policy updated on a clipped surrogate objective, with per-token advantages from a separately-trained value…</description>
      <link>https://ai-infrastructure.net/rl-ppo/</link>
      <pubDate>Mon, 29 Jun 2026 07:43:49 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rl-ppo/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rl-ppo.png" type="image/png" length="48441" />
      
    </item>
    
    <item>
      <title>Reward Design for RL</title>
      
      
      
      
      <description>Design rewards for LLM RL post-training: verifiable rewards (RLVR), reward models, shaping and normalization, and how to detect and prevent reward hacking.</description>
      <link>https://ai-infrastructure.net/reward-design-rl-posttraining/</link>
      <pubDate>Mon, 29 Jun 2026 07:09:11 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/reward-design-rl-posttraining/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/reward-design-rl-posttraining.png" type="image/png" length="97092" />
      
    </item>
    
    <item>
      <title>Burn-Rate Alerting Rules</title>
      
      
      
      
      <description>A reusable multi-window, multi-burn-rate SLO alerting pattern: how to derive burn rate from an error budget, the fast+slow window pairing, the Prometheus…</description>
      <link>https://ai-infrastructure.net/alerting-burn-rate-rules/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/alerting-burn-rate-rules/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/alerting-burn-rate-rules.png" type="image/png" length="62723" />
      
    </item>
    
    <item>
      <title>Build-vs-Rent Cost Model</title>
      
      
      
      
      <description>A concrete TCO model to decide owning vs renting GPUs: capex (GPUs, network, facility) plus opex (power, cooling, staff) against a cloud GPU-hour rate…</description>
      <link>https://ai-infrastructure.net/build-vs-rent-gpu-cost-model/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/build-vs-rent-gpu-cost-model/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/build-vs-rent-gpu-cost-model.png" type="image/png" length="69913" />
      
    </item>
    
    <item>
      <title>GPU Capacity Planning</title>
      
      
      
      
      <description>Sizing a GPU fleet by building a demand model from workloads, setting target utilization and headroom, fitting the power/cooling envelope, and folding in…</description>
      <link>https://ai-infrastructure.net/gpu-capacity-planning/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-capacity-planning/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-capacity-planning.png" type="image/png" length="62046" />
      
    </item>
    
    <item>
      <title>GPU Consumption Models</title>
      
      
      
      
      <description>On-demand vs reserved/committed vs spot/preemptible vs colocation/owned for GPUs. The price/risk tradeoffs, which workload shape each fits (bursty vs…</description>
      <link>https://ai-infrastructure.net/gpu-consumption-models/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-consumption-models/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-consumption-models.png" type="image/png" length="69075" />
      
    </item>
    
    <item>
      <title>GPU Provider Landscape</title>
      
      
      
      
      <description>The four categories you can rent GPUs from (hyperscalers, GPU neoclouds, decentralized/marketplace, and second-hand/distressed capacity), what each is…</description>
      <link>https://ai-infrastructure.net/gpu-provider-landscape/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-provider-landscape/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-provider-landscape.png" type="image/png" length="65490" />
      
    </item>
    
    <item>
      <title>KubeRay (Ray on Kubernetes)</title>
      
      
      
      
      <description>Running Ray on Kubernetes with the KubeRay operator, covering install, the RayCluster / RayJob / RayService CRDs, exposing GPUs and RDMA to Ray pods…</description>
      <link>https://ai-infrastructure.net/kuberay-integration/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/kuberay-integration/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/kuberay-integration.png" type="image/png" length="68999" />
      
    </item>
    
    <item>
      <title>Orchestration Decision Guide</title>
      
      
      
      
      <description>Choosing the orchestrator for GPU work: Slurm vs Kubernetes vs Ray (and hybrids), by workload shape (batch HPC vs services vs Python-native), team, and…</description>
      <link>https://ai-infrastructure.net/orchestration-decision-guide/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/orchestration-decision-guide/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/orchestration-decision-guide.png" type="image/png" length="70800" />
      
    </item>
    
    <item>
      <title>Playbook: End-to-End Bring-Up</title>
      
      
      
      
      <description>The ordered, step-by-step playbook that takes a freshly built cluster to its first running workload (facility - fabric proof - nodes - K8s GPU stack…</description>
      <link>https://ai-infrastructure.net/playbook-end-to-end-workload-bringup/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/playbook-end-to-end-workload-bringup/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/playbook-end-to-end-workload-bringup.png" type="image/png" length="64596" />
      
    </item>
    
    <item>
      <title>Ray on Slurm</title>
      
      
      
      
      <description>Composing Ray with a Slurm allocation: starting a Ray head and workers inside one sbatch job, address/port wiring, GPU binding, clean teardown, and when…</description>
      <link>https://ai-infrastructure.net/ray-on-slurm/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/ray-on-slurm/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/ray-on-slurm.png" type="image/png" length="58507" />
      
    </item>
    
    <item>
      <title>Recipe: DiLoCo (geo-distributed)</title>
      
      
      
      
      <description>A standalone recipe to run low-communication DiLoCo training across datacenters or poorly-connected workers, covering inner/outer optimizer config, sync…</description>
      <link>https://ai-infrastructure.net/recipe-diloco-geo-distributed/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/recipe-diloco-geo-distributed/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/recipe-diloco-geo-distributed.png" type="image/png" length="69651" />
      
    </item>
    
    <item>
      <title>Recipe: Fabric Validation (nccl-tests)</title>
      
      
      
      
      <description>A standalone, executable recipe to validate the cluster fabric with nccl-tests as a Kubeflow MPIJob, covering the manifest, how to apply it, the…</description>
      <link>https://ai-infrastructure.net/recipe-fabric-validation-nccl-tests/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/recipe-fabric-validation-nccl-tests/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/recipe-fabric-validation-nccl-tests.png" type="image/png" length="69727" />
      
    </item>
    
    <item>
      <title>Recipe: FSDP (single datacenter)</title>
      
      
      
      
      <description>A standalone recipe to run FSDP2 (fully_shard) training inside one NVLink/InfiniBand datacenter: the launcher and config (sharding granularity, mixed…</description>
      <link>https://ai-infrastructure.net/recipe-fsdp-single-dc/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/recipe-fsdp-single-dc/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/recipe-fsdp-single-dc.png" type="image/png" length="71392" />
      
    </item>
    
    <item>
      <title>Recipe: Gang-Scheduled Training</title>
      
      
      
      
      <description>A standalone recipe to launch a gang-scheduled distributed-training smoke job (Volcano Job + torchrun): the manifest, apply/verify, an MFU sanity check…</description>
      <link>https://ai-infrastructure.net/recipe-gang-scheduled-training/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/recipe-gang-scheduled-training/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/recipe-gang-scheduled-training.png" type="image/png" length="70048" />
      
    </item>
    
    <item>
      <title>Recipe: vLLM Inference Deployment</title>
      
      
      
      
      <description>A model-agnostic recipe to deploy a vLLM OpenAI-compatible server on Kubernetes: Deployment, Service, HPA, model and token config, and health checks.</description>
      <link>https://ai-infrastructure.net/recipe-vllm-inference-deployment/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/recipe-vllm-inference-deployment/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/recipe-vllm-inference-deployment.png" type="image/png" length="73793" />
      
    </item>
    
    <item>
      <title>Driver / Module Load Failure</title>
      
      
      
      
      <description>Recover a node where the NVIDIA kernel modules fail to load or refuse to bind: either modprobe nvidia errors, or nvidia-smi returns Failed to initialize…</description>
      <link>https://ai-infrastructure.net/runbook-driver-module-load-failure/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-driver-module-load-failure/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-driver-module-load-failure.png" type="image/png" length="60928" />
      
    </item>
    
    <item>
      <title>Inference KV-Cache OOM</title>
      
      
      
      
      <description>Stabilize an LLM server thrashing on KV-cache pressure: size KV memory, cap concurrency, tune block allocation, and stop preemption and recompute loops.</description>
      <link>https://ai-infrastructure.net/runbook-inference-kv-cache-oom/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-inference-kv-cache-oom/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-inference-kv-cache-oom.png" type="image/png" length="77054" />
      
    </item>
    
    <item>
      <title>NVLink Visibility / P2P Failure</title>
      
      
      
      
      <description>Diagnose GPUs that cannot see each other over NVLink: nvidia-smi nvlink --status shows links inactive, CUDA P2P access is disabled, and collectives…</description>
      <link>https://ai-infrastructure.net/runbook-nvlink-visibility-failure/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-nvlink-visibility-failure/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-nvlink-visibility-failure.png" type="image/png" length="62447" />
      
    </item>
    
    <item>
      <title>PCIe / P2P Bandwidth Regression</title>
      
      
      
      
      <description>Investigate a PCIe link trained down (lower gen/width) or P2P blocked by ACS, where H2D/D2H/P2P bandwidth is far below expected, and restore full bandwidth.</description>
      <link>https://ai-infrastructure.net/runbook-pcie-p2p-bandwidth-regression/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-pcie-p2p-bandwidth-regression/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-pcie-p2p-bandwidth-regression.png" type="image/png" length="73911" />
      
    </item>
    
    <item>
      <title>Scheduler: GPU Job Pending</title>
      
      
      
      
      <description>Diagnose a Kubernetes (Pending) or Slurm (PD) GPU job that never starts (insufficient allocatable GPUs, taints/affinity, MIG/profile mismatch, quota, or…</description>
      <link>https://ai-infrastructure.net/runbook-scheduler-pending-gpu-job/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-scheduler-pending-gpu-job/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-scheduler-pending-gpu-job.png" type="image/png" length="66440" />
      
    </item>
    
    <item>
      <title>Training OOM</title>
      
      
      
      
      <description>Triage CUDA out-of-memory in distributed training: tell a true OOM from fragmentation, then shrink the working set or fix the allocator and resume.</description>
      <link>https://ai-infrastructure.net/runbook-training-oom/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-training-oom/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-training-oom.png" type="image/png" length="77889" />
      
    </item>
    
    <item>
      <title>SLOs: Cluster &amp; Fabric</title>
      
      
      
      
      <description>Operational SLOs for the cluster substrate (allocatable-GPU health, fabric link health and bandwidth, node readiness, and thermal headroom) expressed as…</description>
      <link>https://ai-infrastructure.net/slo-cluster-fabric/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/slo-cluster-fabric/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/slo-cluster-fabric.png" type="image/png" length="62429" />
      
    </item>
    
    <item>
      <title>SLOs: Inference Serving</title>
      
      
      
      
      <description>Define and measure inference serving SLOs: TTFT, TPOT/ITL, throughput, and error rate, with concrete PromQL SLIs and target-setting guidance.</description>
      <link>https://ai-infrastructure.net/slo-inference-serving/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/slo-inference-serving/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/slo-inference-serving.png" type="image/png" length="71181" />
      
    </item>
    
    <item>
      <title>SLOs: Training Platform</title>
      
      
      
      
      <description>Training-platform SLOs distinct from inference (scheduler queue wait, job success rate, goodput/MFU, checkpoint success, and infra-failure rate), with…</description>
      <link>https://ai-infrastructure.net/slo-training-platform/</link>
      <pubDate>Sun, 28 Jun 2026 15:17:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/slo-training-platform/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/slo-training-platform.png" type="image/png" length="63750" />
      
    </item>
    
    <item>
      <title>AI-Assisted Optimization</title>
      
      
      
      
      <description>Using AI to optimize AI systems: LLM/agent-driven kernel generation and autotuning, AI-discovered algorithms (AlphaTensor-style faster GEMM), automated…</description>
      <link>https://ai-infrastructure.net/ai-driven-performance-optimization/</link>
      <pubDate>Sun, 28 Jun 2026 10:12:50 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/ai-driven-performance-optimization/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/ai-driven-performance-optimization.png" type="image/png" length="62125" />
      
    </item>
    
    <item>
      <title>Grace CPU</title>
      
      
      
      
      <description>The NVIDIA Grace CPU (a 72-core Arm Neoverse V2 server processor with high-bandwidth on-package LPDDR5X) and its role next to the GPU inside a Grace…</description>
      <link>https://ai-infrastructure.net/grace-cpu/</link>
      <pubDate>Sun, 28 Jun 2026 10:12:50 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/grace-cpu/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/grace-cpu.png" type="image/png" length="57770" />
      
    </item>
    
    <item>
      <title>Mechanical Sympathy &amp; Codesign</title>
      
      
      
      
      <description>The book&#39;s organizing principle: write software with the grain of the hardware (mechanical sympathy), and exploit the virtuous cycle where hardware…</description>
      <link>https://ai-infrastructure.net/mechanical-sympathy-codesign/</link>
      <pubDate>Sun, 28 Jun 2026 10:12:50 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/mechanical-sympathy-codesign/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/mechanical-sympathy-codesign.png" type="image/png" length="76589" />
      
    </item>
    
    <item>
      <title>GPU Roadmap (Rubin, Feynman)</title>
      
      
      
      
      <description>The datacenter GPU/platform cadence, Hopper to Blackwell to Blackwell Ultra (B300/GB300) to Vera Rubin to Feynman, covering what each generation changes…</description>
      <link>https://ai-infrastructure.net/nvidia-gpu-roadmap/</link>
      <pubDate>Sun, 28 Jun 2026 10:12:50 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/nvidia-gpu-roadmap/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/nvidia-gpu-roadmap.png" type="image/png" length="71320" />
      
    </item>
    
    <item>
      <title>Scaling to 100T Parameters</title>
      
      
      
      
      <description>The systems thesis for the next order of magnitude, covering the memory wall, sparse MoE holding per-token FLOPs flat while capacity grows, rack-scale…</description>
      <link>https://ai-infrastructure.net/scaling-100t-parameters/</link>
      <pubDate>Sun, 28 Jun 2026 10:12:50 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/scaling-100t-parameters/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/scaling-100t-parameters.png" type="image/png" length="64963" />
      
    </item>
    
    <item>
      <title>Constrained Decoding</title>
      
      
      
      
      <description>Force model output to a grammar, JSON schema, or regex by masking next-token logits to only the tokens a compiled FSM / pushdown automaton allows at each…</description>
      <link>https://ai-infrastructure.net/constrained-decoding/</link>
      <pubDate>Sun, 28 Jun 2026 10:03:32 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/constrained-decoding/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/constrained-decoding.png" type="image/png" length="83542" />
      
    </item>
    
    <item>
      <title>Continuous Batching Internals</title>
      
      
      
      
      <description>How a modern serving scheduler (vLLM, SGLang) iterates via token-level continuous (in-flight) batching that admits and retires requests every step, the…</description>
      <link>https://ai-infrastructure.net/continuous-batching-internals/</link>
      <pubDate>Sun, 28 Jun 2026 10:03:32 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/continuous-batching-internals/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/continuous-batching-internals.png" type="image/png" length="68523" />
      
    </item>
    
    <item>
      <title>Expert Parallelism (Inference)</title>
      
      
      
      
      <description>Sharding MoE experts across GPUs (expert parallelism, EP), the all-to-all dispatch/combine it forces at every MoE layer, keeping that all-to-all on…</description>
      <link>https://ai-infrastructure.net/expert-parallelism-inference/</link>
      <pubDate>Sun, 28 Jun 2026 10:03:32 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/expert-parallelism-inference/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/expert-parallelism-inference.png" type="image/png" length="64402" />
      
    </item>
    
    <item>
      <title>Inference Parallelism Strategies</title>
      
      
      
      
      <description>Choosing a parallelism layout for serving (distinct from training): tensor parallelism inside the NVLink domain for latency, pipeline parallelism across…</description>
      <link>https://ai-infrastructure.net/inference-parallelism-strategies/</link>
      <pubDate>Sun, 28 Jun 2026 10:03:32 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/inference-parallelism-strategies/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/inference-parallelism-strategies.png" type="image/png" length="67332" />
      
    </item>
    
    <item>
      <title>QoS &amp; Admission Control</title>
      
      
      
      
      <description>Protecting inference SLOs under load. Priority classes and per-request SLO targets, admission control / load shedding when the queue grows, latency-aware…</description>
      <link>https://ai-infrastructure.net/inference-qos-admission-control/</link>
      <pubDate>Sun, 28 Jun 2026 10:03:32 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/inference-qos-admission-control/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/inference-qos-admission-control.png" type="image/png" length="67742" />
      
    </item>
    
    <item>
      <title>KV Cache Management</title>
      
      
      
      
      <description>Managing the KV cache that dominates decode memory, covering PagedAttention block tables to kill fragmentation, prefix caching to reuse shared prefixes…</description>
      <link>https://ai-infrastructure.net/kv-cache-management/</link>
      <pubDate>Sun, 28 Jun 2026 10:03:32 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/kv-cache-management/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/kv-cache-management.png" type="image/png" length="61821" />
      
    </item>
    
    <item>
      <title>KV Cache Transfer (NIXL)</title>
      
      
      
      
      <description>Moving the KV cache from prefill workers to decode workers (and across memory/storage tiers) in disaggregated serving, using NVIDIA&#39;s Inference Xfer…</description>
      <link>https://ai-infrastructure.net/kv-cache-transfer-nixl/</link>
      <pubDate>Sun, 28 Jun 2026 10:03:32 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/kv-cache-transfer-nixl/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/kv-cache-transfer-nixl.png" type="image/png" length="66179" />
      
    </item>
    
    <item>
      <title>MoE Routing &amp; Load Balancing</title>
      
      
      
      
      <description>The gating/router (top-k selection) in a Mixture-of-Experts layer, why uneven expert load wastes GPUs under expert parallelism (the straggler effect)…</description>
      <link>https://ai-infrastructure.net/moe-routing-load-balancing/</link>
      <pubDate>Sun, 28 Jun 2026 10:03:32 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/moe-routing-load-balancing/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/moe-routing-load-balancing.png" type="image/png" length="76062" />
      
    </item>
    
    <item>
      <title>MoE Sparse Scaling</title>
      
      
      
      
      <description>Why sparse MoE decouples total parameters from per-token FLOPs: the routing math, shared vs routed experts, and the memory-to-fit vs compute-to-run…</description>
      <link>https://ai-infrastructure.net/moe-sparsity-scaling/</link>
      <pubDate>Sun, 28 Jun 2026 10:03:32 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/moe-sparsity-scaling/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/moe-sparsity-scaling.png" type="image/png" length="64222" />
      
    </item>
    
    <item>
      <title>Speculative Decoding</title>
      
      
      
      
      <description>Accelerating the decode stage by proposing several tokens cheaply (small draft model, n-gram/suffix lookup, or EAGLE/Medusa/MTP heads) and verifying them…</description>
      <link>https://ai-infrastructure.net/speculative-decoding/</link>
      <pubDate>Sun, 28 Jun 2026 10:03:32 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/speculative-decoding/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/speculative-decoding.png" type="image/png" length="66485" />
      
    </item>
    
    <item>
      <title>Data Loading Pipeline Tuning</title>
      
      
      
      
      <description>Keeping GPUs fed from the host input pipeline: sizing num_workers and prefetch_factor, pin_memory plus non_blocking H2D copies, persistent_workers…</description>
      <link>https://ai-infrastructure.net/data-loading-pipeline-tuning/</link>
      <pubDate>Sun, 28 Jun 2026 09:32:58 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/data-loading-pipeline-tuning/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/data-loading-pipeline-tuning.png" type="image/png" length="60465" />
      
    </item>
    
    <item>
      <title>Parallel FS &amp; DeepSeek 3FS</title>
      
      
      
      
      <description>Parallel/distributed filesystems for AI clusters (Lustre, IBM Storage Scale/GPFS, WekaFS) and DeepSeek&#39;s open-source Fire-Flyer File System (3FS): why…</description>
      <link>https://ai-infrastructure.net/deepseek-3fs-filesystem/</link>
      <pubDate>Sun, 28 Jun 2026 09:32:58 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/deepseek-3fs-filesystem/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/deepseek-3fs-filesystem.png" type="image/png" length="65987" />
      
    </item>
    
    <item>
      <title>GPU Decompression (nvCOMP)</title>
      
      
      
      
      <description>Trading cheap GPU cycles for scarce storage/PCIe bandwidth by storing data compressed on disk and decompressing it on the GPU. Covers the Blackwell…</description>
      <link>https://ai-infrastructure.net/gpu-decompression-engine/</link>
      <pubDate>Sun, 28 Jun 2026 09:32:58 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-decompression-engine/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-decompression-engine.png" type="image/png" length="71505" />
      
    </item>
    
    <item>
      <title>GPUDirect Storage (GDS)</title>
      
      
      
      
      <description>The direct DMA path from NVMe/NVMe-oF/RDMA-NAS into GPU HBM that bypasses the CPU bounce buffer, via the cuFile API and the nvidia-fs kernel module…</description>
      <link>https://ai-infrastructure.net/gpudirect-storage-gds/</link>
      <pubDate>Sun, 28 Jun 2026 09:32:58 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpudirect-storage-gds/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpudirect-storage-gds.png" type="image/png" length="71810" />
      
    </item>
    
    <item>
      <title>NVIDIA DALI</title>
      
      
      
      
      <description>Offloading decode and augmentation (image/video/audio) onto the GPU with an NVIDIA DALI pipeline to remove the CPU preprocessing bottleneck: the…</description>
      <link>https://ai-infrastructure.net/nvidia-dali-pipeline/</link>
      <pubDate>Sun, 28 Jun 2026 09:32:58 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/nvidia-dali-pipeline/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/nvidia-dali-pipeline.png" type="image/png" length="75232" />
      
    </item>
    
    <item>
      <title>BlueField DPUs</title>
      
      
      
      
      <description>NVIDIA BlueField-3 offloading networking, storage, and security off the host CPU: line-rate RDMA/RoCE, NVMe over Fabrics, data-path isolation for secure…</description>
      <link>https://ai-infrastructure.net/bluefield-dpu-networking/</link>
      <pubDate>Sun, 28 Jun 2026 09:15:31 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/bluefield-dpu-networking/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/bluefield-dpu-networking.png" type="image/png" length="58313" />
      
    </item>
    
    <item>
      <title>Comms-Compute Overlap</title>
      
      
      
      
      <description>Hiding collective latency behind compute so the GPU stops waiting on the network: DDP gradient bucketing (wait-free backprop), FSDP all-gather prefetch…</description>
      <link>https://ai-infrastructure.net/comms-compute-overlap/</link>
      <pubDate>Sun, 28 Jun 2026 09:15:31 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/comms-compute-overlap/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/comms-compute-overlap.png" type="image/png" length="66277" />
      
    </item>
    
    <item>
      <title>NCCL Collectives &amp; Algorithms</title>
      
      
      
      
      <description>How NCCL implements all-reduce, all-gather, reduce-scatter, and broadcast, how it selects an algorithm (Ring/Tree/CollNet/NVLS) and protocol…</description>
      <link>https://ai-infrastructure.net/nccl-collectives-algorithms/</link>
      <pubDate>Sun, 28 Jun 2026 09:15:31 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/nccl-collectives-algorithms/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/nccl-collectives-algorithms.png" type="image/png" length="65266" />
      
    </item>
    
    <item>
      <title>NVSHMEM GPU-Initiated Comms</title>
      
      
      
      
      <description>NVSHMEM&#39;s PGAS one-sided model (GPU threads issuing put/get directly from kernel code with the CPU off the critical path) for fine-grained compute/comm…</description>
      <link>https://ai-infrastructure.net/nvshmem-gpu-communication/</link>
      <pubDate>Sun, 28 Jun 2026 09:15:31 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/nvshmem-gpu-communication/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/nvshmem-gpu-communication.png" type="image/png" length="69618" />
      
    </item>
    
    <item>
      <title>RDMA &amp; RoCE Tuning</title>
      
      
      
      
      <description>Tuning RDMA over Converged Ethernet (RoCEv2) for GPU clusters, covering GPUDirect RDMA (NIC-to-HBM direct DMA), the lossless fabric (PFC + ECN/DCQCN)…</description>
      <link>https://ai-infrastructure.net/rdma-roce-tuning/</link>
      <pubDate>Sun, 28 Jun 2026 09:15:31 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rdma-roce-tuning/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rdma-roce-tuning.png" type="image/png" length="63727" />
      
    </item>
    
    <item>
      <title>SHARP In-Network Reduction</title>
      
      
      
      
      <description>NVIDIA SHARP offloading all-reduce / reduce-scatter / all-gather into the InfiniBand switch ASIC (and NVLink SHARP / NVLS in the NVSwitch) to halve…</description>
      <link>https://ai-infrastructure.net/sharp-in-network-reduction/</link>
      <pubDate>Sun, 28 Jun 2026 09:15:31 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/sharp-in-network-reduction/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/sharp-in-network-reduction.png" type="image/png" length="69898" />
      
    </item>
    
    <item>
      <title>GPU Containerization Performance</title>
      
      
      
      
      <description>Reaching near-bare-metal GPU throughput inside containers via host-driver/container-CUDA splitting, OverlayFS I/O avoidance, slim images, and rootless…</description>
      <link>https://ai-infrastructure.net/gpu-containerization-performance/</link>
      <pubDate>Sun, 28 Jun 2026 09:04:49 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-containerization-performance/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-containerization-performance.png" type="image/png" length="66829" />
      
    </item>
    
    <item>
      <title>Power, Clocks &amp; Thermal Tuning</title>
      
      
      
      
      <description>Controlling GPU clocks, power draw, and heat for deterministic benchmarks and better performance-per-watt: locking core/memory clocks (nvidia-smi -lgc /…</description>
      <link>https://ai-infrastructure.net/gpu-power-thermal-tuning/</link>
      <pubDate>Sun, 28 Jun 2026 09:04:49 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-power-thermal-tuning/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-power-thermal-tuning.png" type="image/png" length="63757" />
      
    </item>
    
    <item>
      <title>Topology-Aware K8s Scheduling</title>
      
      
      
      
      <description>Making Kubernetes co-locate the GPUs of a multi-GPU pod on the same NVLink/NUMA domain and align the pod&#39;s CPUs and memory to that domain, via the…</description>
      <link>https://ai-infrastructure.net/kubernetes-topology-aware-scheduling/</link>
      <pubDate>Sun, 28 Jun 2026 09:04:49 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/kubernetes-topology-aware-scheduling/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/kubernetes-topology-aware-scheduling.png" type="image/png" length="77579" />
      
    </item>
    
    <item>
      <title>Linux OS &amp; Kernel Tuning</title>
      
      
      
      
      <description>Host kernel/OS knobs that keep GPUs fed, namely vm.swappiness/swapoff, transparent hugepages (training vs inference), the performance CPU governor and…</description>
      <link>https://ai-infrastructure.net/linux-os-tuning-gpu-nodes/</link>
      <pubDate>Sun, 28 Jun 2026 09:04:49 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/linux-os-tuning-gpu-nodes/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/linux-os-tuning-gpu-nodes.png" type="image/png" length="62937" />
      
    </item>
    
    <item>
      <title>Activation Checkpointing &amp; Offloading</title>
      
      
      
      
      <description>Trade compute and bandwidth for HBM capacity. Recompute activations in the backward pass (activation/gradient checkpointing), and offload activations…</description>
      <link>https://ai-infrastructure.net/activation-checkpointing-offloading/</link>
      <pubDate>Sun, 28 Jun 2026 08:59:06 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/activation-checkpointing-offloading/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/activation-checkpointing-offloading.png" type="image/png" length="80779" />
      
    </item>
    
    <item>
      <title>Attention APIs (SDPA, FlexAttention)</title>
      
      
      
      
      <description>scaled_dot_product_attention backend selection and forcing, FlexAttention for custom masks/biases compiled to fused kernels, and how both map onto the…</description>
      <link>https://ai-infrastructure.net/pytorch-attention-apis/</link>
      <pubDate>Sun, 28 Jun 2026 08:59:06 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/pytorch-attention-apis/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/pytorch-attention-apis.png" type="image/png" length="71104" />
      
    </item>
    
    <item>
      <title>Caching Allocator Tuning</title>
      
      
      
      
      <description>PyTorch&#39;s native CUDA caching allocator and how to tune it with PYTORCH_ALLOC_CONF (expandable_segments, max_split_size_mb, garbage_collection_threshold…</description>
      <link>https://ai-infrastructure.net/pytorch-cuda-memory-allocator/</link>
      <pubDate>Sun, 28 Jun 2026 08:59:06 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/pytorch-cuda-memory-allocator/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/pytorch-cuda-memory-allocator.png" type="image/png" length="66957" />
      
    </item>
    
    <item>
      <title>Performance Regression CI</title>
      
      
      
      
      <description>Standing up automated performance regression tests for PyTorch training/inference: capturing step-time, throughput, MFU, and peak-memory baselines…</description>
      <link>https://ai-infrastructure.net/pytorch-perf-regression-ci/</link>
      <pubDate>Sun, 28 Jun 2026 08:59:06 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/pytorch-perf-regression-ci/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/pytorch-perf-regression-ci.png" type="image/png" length="63433" />
      
    </item>
    
    <item>
      <title>PyTorch/XLA Backend</title>
      
      
      
      
      <description>Lazy-tensor tracing into an XLA/HLO graph, the XLA fusion/layout compiler, mark_step/torch_xla.sync() graph boundaries, per-shape-signature compilation…</description>
      <link>https://ai-infrastructure.net/pytorch-xla-backend/</link>
      <pubDate>Sun, 28 Jun 2026 08:59:06 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/pytorch-xla-backend/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/pytorch-xla-backend.png" type="image/png" length="67881" />
      
    </item>
    
    <item>
      <title>torch.compile (Capture &amp; Backends)</title>
      
      
      
      
      <description>How torch.compile turns eager PyTorch into fused kernels, covering TorchDynamo bytecode capture, AOTAutograd, the TorchInductor (Triton) backend, the…</description>
      <link>https://ai-infrastructure.net/torch-compile/</link>
      <pubDate>Sun, 28 Jun 2026 08:59:06 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/torch-compile/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/torch-compile.png" type="image/png" length="78759" />
      
    </item>
    
    <item>
      <title>Compute Sanitizer</title>
      
      
      
      
      <description>compute-sanitizer and its four tools (memcheck, racecheck, initcheck, synccheck) for catching out-of-bounds accesses, shared-memory data races…</description>
      <link>https://ai-infrastructure.net/cuda-compute-sanitizer/</link>
      <pubDate>Sun, 28 Jun 2026 08:48:41 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cuda-compute-sanitizer/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cuda-compute-sanitizer.png" type="image/png" length="58334" />
      
    </item>
    
    <item>
      <title>Stream-Ordered Allocator</title>
      
      
      
      
      <description>cudaMallocAsync/cudaFreeAsync and the per-device memory pool behind them: stream-ordered allocation that reuses freed blocks without a device-wide sync…</description>
      <link>https://ai-infrastructure.net/cuda-stream-ordered-allocator/</link>
      <pubDate>Sun, 28 Jun 2026 08:48:41 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cuda-stream-ordered-allocator/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cuda-stream-ordered-allocator.png" type="image/png" length="61972" />
      
    </item>
    
    <item>
      <title>Unified Memory &amp; NVLink-C2C</title>
      
      
      
      
      <description>A single pointer addressable from CPU and GPU (cudaMallocManaged), on-demand page migration and the page-fault stalls it causes, defusing those stalls…</description>
      <link>https://ai-infrastructure.net/cuda-unified-memory/</link>
      <pubDate>Sun, 28 Jun 2026 08:48:41 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cuda-unified-memory/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cuda-unified-memory.png" type="image/png" length="65570" />
      
    </item>
    
    <item>
      <title>CUTLASS: Templated GEMM</title>
      
      
      
      
      <description>CUTLASS as the open, templated C++ library for high-performance GEMM/conv on Tensor Cores, its tiling/pipelining abstractions (CuTe, CollectiveBuilder)…</description>
      <link>https://ai-infrastructure.net/cutlass-gemm/</link>
      <pubDate>Sun, 28 Jun 2026 08:48:41 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cutlass-gemm/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cutlass-gemm.png" type="image/png" length="65270" />
      
    </item>
    
    <item>
      <title>Dynamic Parallelism &amp; Device Launch</title>
      
      
      
      
      <description>Launching work from the device, via CUDA Dynamic Parallelism (a kernel launches kernels) and device graph launch (a kernel launches a preinstantiated…</description>
      <link>https://ai-infrastructure.net/dynamic-parallelism-device-launch/</link>
      <pubDate>Sun, 28 Jun 2026 08:48:41 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/dynamic-parallelism-device-launch/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/dynamic-parallelism-device-launch.png" type="image/png" length="71740" />
      
    </item>
    
    <item>
      <title>Inline PTX &amp; SASS Tuning</title>
      
      
      
      
      <description>Dropping to inline PTX (asm volatile) for instructions the compiler will not emit, reading the real SASS with cuobjdump / nvdisasm to verify what…</description>
      <link>https://ai-infrastructure.net/inline-ptx-sass/</link>
      <pubDate>Sun, 28 Jun 2026 08:48:41 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/inline-ptx-sass/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/inline-ptx-sass.png" type="image/png" length="61404" />
      
    </item>
    
    <item>
      <title>Instruction-Level Parallelism</title>
      
      
      
      
      <description>Instruction-level parallelism (ILP) on GPUs, using independent instructions within a thread to hide latency at low occupancy. Covers loop unrolling and…</description>
      <link>https://ai-infrastructure.net/instruction-level-parallelism-gpu/</link>
      <pubDate>Sun, 28 Jun 2026 08:48:41 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/instruction-level-parallelism-gpu/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/instruction-level-parallelism-gpu.png" type="image/png" length="61797" />
      
    </item>
    
    <item>
      <title>Triton: Python GPU Kernels</title>
      
      
      
      
      <description>Triton&#39;s block-based programming model for writing fused GPU kernels in Python (the language torch.compile emits), its autotuner, and when a Triton…</description>
      <link>https://ai-infrastructure.net/openai-triton/</link>
      <pubDate>Sun, 28 Jun 2026 08:48:41 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/openai-triton/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/openai-triton.png" type="image/png" length="62711" />
      
    </item>
    
    <item>
      <title>Persistent &amp; Megakernels</title>
      
      
      
      
      <description>Launch one grid sized to the GPU and loop over a work queue to amortize per-launch overhead and keep SMs resident (persistent kernels); fuse a whole…</description>
      <link>https://ai-infrastructure.net/persistent-kernels-megakernels/</link>
      <pubDate>Sun, 28 Jun 2026 08:48:41 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/persistent-kernels-megakernels/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/persistent-kernels-megakernels.png" type="image/png" length="64977" />
      
    </item>
    
    <item>
      <title>Thread Block Clusters &amp; DSMEM</title>
      
      
      
      
      <description>The Hopper/Blackwell thread block cluster (an optional grouping level above the block that co-schedules a set of blocks on SMs within one GPC) and…</description>
      <link>https://ai-infrastructure.net/thread-block-clusters-dsmem/</link>
      <pubDate>Sun, 28 Jun 2026 08:48:41 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/thread-block-clusters-dsmem/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/thread-block-clusters-dsmem.png" type="image/png" length="70929" />
      
    </item>
    
    <item>
      <title>Warp Specialization &amp; Pipelining</title>
      
      
      
      
      <description>Splitting a thread block into producer (load) and consumer (compute) warps and software-pipelining the stages with the CUDA Pipeline API and async copies…</description>
      <link>https://ai-infrastructure.net/warp-specialization-pipelining/</link>
      <pubDate>Sun, 28 Jun 2026 08:48:41 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/warp-specialization-pipelining/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/warp-specialization-pipelining.png" type="image/png" length="72849" />
      
    </item>
    
    <item>
      <title>CUDA Graphs</title>
      
      
      
      
      <description>Amortizing per-kernel CPU launch overhead by capturing a fixed pipeline of kernels, copies, and events once and replaying it as a single submission…</description>
      <link>https://ai-infrastructure.net/cuda-graphs/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cuda-graphs/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cuda-graphs.png" type="image/png" length="59520" />
      
    </item>
    
    <item>
      <title>Occupancy Tuning</title>
      
      
      
      
      <description>Occupancy as active warps versus the SM hardware maximum, the three resource limiters (registers/thread, shared memory/block, block size), theoretical…</description>
      <link>https://ai-infrastructure.net/cuda-occupancy-tuning/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cuda-occupancy-tuning/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cuda-occupancy-tuning.png" type="image/png" length="57997" />
      
    </item>
    
    <item>
      <title>CUDA Streams &amp; Concurrency</title>
      
      
      
      
      <description>CUDA streams as the unit of inter-kernel concurrency: the legacy default stream (stream 0) versus the per-thread default stream (PTDS) versus explicit…</description>
      <link>https://ai-infrastructure.net/cuda-streams-concurrency/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cuda-streams-concurrency/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cuda-streams-concurrency.png" type="image/png" length="71416" />
      
    </item>
    
    <item>
      <title>FlashAttention &amp; MLA</title>
      
      
      
      
      <description>IO-aware exact attention (FlashAttention tiling, online softmax, versions 2 and 3) and DeepSeek&#39;s Multi-Head Latent Attention (MLA), which compresses the…</description>
      <link>https://ai-infrastructure.net/flashattention-mla/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/flashattention-mla/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/flashattention-mla.png" type="image/png" length="62026" />
      
    </item>
    
    <item>
      <title>Goodput: Useful Throughput</title>
      
      
      
      
      <description>goodput (useful work per unit time, normalized to peak) as the north-star efficiency metric for GPU clusters: how to compute it, why it beats raw FLOPS…</description>
      <link>https://ai-infrastructure.net/goodput-ai-systems/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/goodput-ai-systems/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/goodput-ai-systems.png" type="image/png" length="63794" />
      
    </item>
    
    <item>
      <title>Execution Model (SM, Warp, SIMT)</title>
      
      
      
      
      <description>How an NVIDIA GPU actually executes a kernel: streaming multiprocessors, the 32-thread warp, SIMT lockstep, the thread/block/grid hierarchy, warp…</description>
      <link>https://ai-infrastructure.net/gpu-execution-model-simt/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-execution-model-simt/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-execution-model-simt.png" type="image/png" length="74235" />
      
    </item>
    
    <item>
      <title>GPU Memory Hierarchy</title>
      
      
      
      
      <description>the on-chip-to-off-chip memory tiers of an NVIDIA GPU SM -- registers, shared memory/L1, the read-only/constant caches, L2, and HBM global memory…</description>
      <link>https://ai-infrastructure.net/gpu-memory-hierarchy/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-memory-hierarchy/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-memory-hierarchy.png" type="image/png" length="63725" />
      
    </item>
    
    <item>
      <title>Kernel Fusion</title>
      
      
      
      
      <description>fusing a chain of operations into one CUDA kernel to raise arithmetic intensity, eliminate HBM round-trips for intermediate tensors, and remove…</description>
      <link>https://ai-infrastructure.net/kernel-fusion/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/kernel-fusion/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/kernel-fusion.png" type="image/png" length="53081" />
      
    </item>
    
    <item>
      <title>Memory Coalescing</title>
      
      
      
      
      <description>how a warp&#39;s 32 lanes should address global (HBM) memory -- contiguous and aligned so the hardware coalesces lane requests into the fewest cache-line…</description>
      <link>https://ai-infrastructure.net/memory-coalescing/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/memory-coalescing/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/memory-coalescing.png" type="image/png" length="61156" />
      
    </item>
    
    <item>
      <title>Nsight Profiling Workflow</title>
      
      
      
      
      <description>the profile-driven workflow that finds the bottleneck with Nsight Systems (timeline / system view), drills into the offending kernel with Nsight Compute…</description>
      <link>https://ai-infrastructure.net/nsight-profiling-workflow/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/nsight-profiling-workflow/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/nsight-profiling-workflow.png" type="image/png" length="63235" />
      
    </item>
    
    <item>
      <title>NUMA &amp; CPU Pinning</title>
      
      
      
      
      <description>binding CPU threads, host memory, and PyTorch DataLoader workers to the GPU&#39;s local NUMA node so input pipelines never pay a cross-node hop. Covers…</description>
      <link>https://ai-infrastructure.net/numa-cpu-pinning/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/numa-cpu-pinning/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/numa-cpu-pinning.png" type="image/png" length="59345" />
      
    </item>
    
    <item>
      <title>Roofline &amp; Arithmetic Intensity</title>
      
      
      
      
      <description>arithmetic intensity (FLOPs per byte), the roofline envelope of peak-compute and peak-bandwidth ceilings, the ridge point that separates memory-bound…</description>
      <link>https://ai-infrastructure.net/roofline-arithmetic-intensity/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/roofline-arithmetic-intensity/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/roofline-arithmetic-intensity.png" type="image/png" length="65369" />
      
    </item>
    
    <item>
      <title>Shared Memory &amp; Tiling</title>
      
      
      
      
      <description>using on-chip shared memory as a software-managed cache, covering tiling for data reuse (tiled GEMM), the 32-bank conflict model, and padding/swizzling…</description>
      <link>https://ai-infrastructure.net/shared-memory-tiling/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/shared-memory-tiling/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/shared-memory-tiling.png" type="image/png" length="64027" />
      
    </item>
    
    <item>
      <title>Tensor Cores &amp; Mixed Precision</title>
      
      
      
      
      <description>how Tensor Cores and reduced-precision formats (TF32, BF16/FP16, FP8, NVFP4, INT8) raise arithmetic intensity and throughput, why accumulation precision…</description>
      <link>https://ai-infrastructure.net/tensor-cores-mixed-precision/</link>
      <pubDate>Sun, 28 Jun 2026 08:14:53 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/tensor-cores-mixed-precision/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/tensor-cores-mixed-precision.png" type="image/png" length="70056" />
      
    </item>
    
    <item>
      <title>vLLM: DeepSeek-R1</title>
      
      
      
      
      <description>a production-oriented vLLM reference template for serving deepseek-ai/DeepSeek-R1 as an OpenAI-compatible endpoint: what the model is, why and when to…</description>
      <link>https://ai-infrastructure.net/cookbook-vllm-deepseek-r1/</link>
      <pubDate>Thu, 25 Jun 2026 17:46:06 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cookbook-vllm-deepseek-r1/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cookbook-vllm-deepseek-r1.png" type="image/png" length="59050" />
      
    </item>
    
    <item>
      <title>vLLM: GLM-4.7-FP8</title>
      
      
      
      
      <description>a vLLM reference template for serving zai-org/GLM-4.7-FP8: what GLM-4.7 is, why and when to use it for coding and agentic workloads, how to launch the…</description>
      <link>https://ai-infrastructure.net/cookbook-vllm-glm-4-7/</link>
      <pubDate>Thu, 25 Jun 2026 17:46:06 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cookbook-vllm-glm-4-7/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cookbook-vllm-glm-4-7.png" type="image/png" length="54020" />
      
    </item>
    
    <item>
      <title>vLLM: Kimi K2</title>
      
      
      
      
      <description>a vLLM reference template for serving moonshotai/Kimi-K2-Instruct: what Kimi K2 is, why and when to use it for agentic/tool workloads, how to size the…</description>
      <link>https://ai-infrastructure.net/cookbook-vllm-kimi-k2/</link>
      <pubDate>Thu, 25 Jun 2026 17:46:06 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cookbook-vllm-kimi-k2/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cookbook-vllm-kimi-k2.png" type="image/png" length="48743" />
      
    </item>
    
    <item>
      <title>vLLM: Llama 4 Maverick</title>
      
      
      
      
      <description>a vLLM reference template for serving meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8: what Llama 4 Maverick is, why and when to use it, how to handle…</description>
      <link>https://ai-infrastructure.net/cookbook-vllm-llama-4-maverick/</link>
      <pubDate>Thu, 25 Jun 2026 17:46:06 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cookbook-vllm-llama-4-maverick/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cookbook-vllm-llama-4-maverick.png" type="image/png" length="57286" />
      
    </item>
    
    <item>
      <title>vLLM: Qwen3-235B-A22B</title>
      
      
      
      
      <description>a vLLM reference template for serving Qwen/Qwen3-235B-A22B-Instruct-2507: what the model is, why and when to use it, how to deploy the 235B/22B MoE…</description>
      <link>https://ai-infrastructure.net/cookbook-vllm-qwen3-235b/</link>
      <pubDate>Thu, 25 Jun 2026 17:46:06 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cookbook-vllm-qwen3-235b/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cookbook-vllm-qwen3-235b.png" type="image/png" length="64073" />
      
    </item>
    
    <item>
      <title>Helm: DRA Driver</title>
      
      
      
      
      <description>install nvidia-dra-driver-gpu via Helm so the kubelet plugin publishes a ResourceSlice per node and pods can claim GPUs through Dynamic Resource…</description>
      <link>https://ai-infrastructure.net/helm-dra-driver/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/helm-dra-driver/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/helm-dra-driver.png" type="image/png" length="54798" />
      
    </item>
    
    <item>
      <title>Helm: GPU Operator</title>
      
      
      
      
      <description>helm repo add nvidia + helm install gpu-operator, the load-bearing chart values (driver.enabled, driver.version, mig.strategy, toolkit.enabled…</description>
      <link>https://ai-infrastructure.net/helm-gpu-operator/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/helm-gpu-operator/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/helm-gpu-operator.png" type="image/png" length="57762" />
      
    </item>
    
    <item>
      <title>Helm: Kueue Quota</title>
      
      
      
      
      <description>install Kueue (release manifest or Helm OCI chart), model fair-share GPU quota with ResourceFlavor + ClusterQueue + LocalQueue, share spare capacity…</description>
      <link>https://ai-infrastructure.net/helm-kueue-quota/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/helm-kueue-quota/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/helm-kueue-quota.png" type="image/png" length="57725" />
      
    </item>
    
    <item>
      <title>Helm: Network Operator</title>
      
      
      
      
      <description>helm install network-operator with the RDMA shared device plugin and the secondary-network path, so pods get an IB/RoCE device and GPUDirect RDMA engages…</description>
      <link>https://ai-infrastructure.net/helm-network-operator/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/helm-network-operator/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/helm-network-operator.png" type="image/png" length="66049" />
      
    </item>
    
    <item>
      <title>Helm: Volcano Scheduler</title>
      
      
      
      
      <description>install Volcano via Helm; configure the scheduler/controller/admission, stand up queues, and make gang (minMember) scheduling place distributed jobs…</description>
      <link>https://ai-infrastructure.net/helm-volcano-scheduler/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/helm-volcano-scheduler/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/helm-volcano-scheduler.png" type="image/png" length="58792" />
      
    </item>
    
    <item>
      <title>Inventory &amp; Variables</title>
      
      
      
      
      <description>the inventory model that the bring-up roles read: the gpu_nodes group, the group_vars/ and host_vars/ layout, the per-tier variables (gpu_tier…</description>
      <link>https://ai-infrastructure.net/inventory-node-model/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/inventory-node-model/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/inventory-node-model.png" type="image/png" length="62823" />
      
    </item>
    
    <item>
      <title>Manifest: DCGM Exporter</title>
      
      
      
      
      <description>deploy dcgm-exporter (via the GPU Operator or standalone), wire a Prometheus ServiceMonitor/scrape config, understand the DCGM_FI_DEV_/DCGM_FI_PROF_…</description>
      <link>https://ai-infrastructure.net/manifest-dcgm-exporter/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/manifest-dcgm-exporter/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/manifest-dcgm-exporter.png" type="image/png" length="69125" />
      
    </item>
    
    <item>
      <title>Manifest: DRA ResourceClaim</title>
      
      
      
      
      <description>a ResourceClaimTemplate that selects a GPU by attribute/capacity via a CEL expression (e.g. memory = 40Gi), a Pod that consumes it through pod-level…</description>
      <link>https://ai-infrastructure.net/manifest-dra-resourceclaim/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/manifest-dra-resourceclaim/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/manifest-dra-resourceclaim.png" type="image/png" length="69632" />
      
    </item>
    
    <item>
      <title>Manifest: GPU Operator ClusterPolicy</title>
      
      
      
      
      <description>the ClusterPolicy CRD (apiVersion: nvidia.com/v1) that is the single source of truth for an NVIDIA GPU Operator install. It covers the driver, toolkit…</description>
      <link>https://ai-infrastructure.net/manifest-gpu-operator-clusterpolicy/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/manifest-gpu-operator-clusterpolicy/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/manifest-gpu-operator-clusterpolicy.png" type="image/png" length="83063" />
      
    </item>
    
    <item>
      <title>Manifest: Kueue ClusterQueue</title>
      
      
      
      
      <description>the ResourceFlavor + ClusterQueue + LocalQueue triad that fences nvidia.com/gpu into team quota, plus a Job labelled to a LocalQueue and the kubectl get…</description>
      <link>https://ai-infrastructure.net/manifest-kueue-clusterqueue/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/manifest-kueue-clusterqueue/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/manifest-kueue-clusterqueue.png" type="image/png" length="65607" />
      
    </item>
    
    <item>
      <title>Manifest: MIG Mode</title>
      
      
      
      
      <description>driving MIG declaratively through the GPU Operator&#39;s mig-manager, covering the nvidia.com/mig.config node label, the default-mig-parted-config ConfigMap…</description>
      <link>https://ai-infrastructure.net/manifest-mig-mode/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/manifest-mig-mode/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/manifest-mig-mode.png" type="image/png" length="61039" />
      
    </item>
    
    <item>
      <title>Manifest: NicClusterPolicy</title>
      
      
      
      
      <description>the NicClusterPolicy CRD that drives the NVIDIA Network Operator. It wires the OFED/DOCA driver, the RDMA shared device plugin (resourceName, ifNames)…</description>
      <link>https://ai-infrastructure.net/manifest-nic-cluster-policy/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/manifest-nic-cluster-policy/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/manifest-nic-cluster-policy.png" type="image/png" length="70251" />
      
    </item>
    
    <item>
      <title>Manifest: Time-Slicing</title>
      
      
      
      
      <description>the NVIDIA k8s device-plugin ConfigMap (replicas under sharing.timeSlicing.resources), wiring it through the GPU Operator via devicePlugin.config, the…</description>
      <link>https://ai-infrastructure.net/manifest-time-slicing/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/manifest-time-slicing/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/manifest-time-slicing.png" type="image/png" length="67609" />
      
    </item>
    
    <item>
      <title>Manifest: Volcano Job</title>
      
      
      
      
      <description>a Volcano Job (batch.volcano.sh/v1alpha1) that gang-schedules a multi-pod GPU training run (minAvailable, multiple tasks, schedulerName: volcano) so…</description>
      <link>https://ai-infrastructure.net/manifest-volcano-job/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/manifest-volcano-job/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/manifest-volcano-job.png" type="image/png" length="58152" />
      
    </item>
    
    <item>
      <title>Site Playbook</title>
      
      
      
      
      <description>the site.yml that orchestrates a fleet bring-up: host selection, privilege escalation, staged role order (base_tuning - acs_disable - rdma_fabric OFED…</description>
      <link>https://ai-infrastructure.net/playbook-site-bring-up/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/playbook-site-bring-up/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/playbook-site-bring-up.png" type="image/png" length="53305" />
      
    </item>
    
    <item>
      <title>RBAC for Operators</title>
      
      
      
      
      <description>the ServiceAccounts, ClusterRoles and ClusterRoleBindings that the GPU Operator, Network Operator, DRA driver and Kueue install and run as. What each is…</description>
      <link>https://ai-infrastructure.net/rbac-gpu-operators/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rbac-gpu-operators/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rbac-gpu-operators.png" type="image/png" length="66227" />
      
    </item>
    
    <item>
      <title>Role: base_tuning</title>
      
      
      
      
      <description>host prep before any NVIDIA package lands. Blacklist nouveau, set the GRUB kernel cmdline (IOMMU mode per platform, pci=realloc for large GPU BARs), pin…</description>
      <link>https://ai-infrastructure.net/role-base-tuning/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/role-base-tuning/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/role-base-tuning.png" type="image/png" length="61349" />
      
    </item>
    
    <item>
      <title>Role: mig</title>
      
      
      
      
      <description>enable MIG mode and lay out one requested profile per GPU via nvidia-smi mig, wrapped so the role is idempotent. It reads current state (nvidia-smi…</description>
      <link>https://ai-infrastructure.net/role-mig-configuration/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/role-mig-configuration/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/role-mig-configuration.png" type="image/png" length="52769" />
      
    </item>
    
    <item>
      <title>Role: nvidia_stack</title>
      
      
      
      
      <description>install the NVIDIA GPU software stack on a prepared node, namely the kernel driver (tier-aware package, branch-pinned), the CUDA toolkit…</description>
      <link>https://ai-infrastructure.net/role-nvidia-stack/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/role-nvidia-stack/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/role-nvidia-stack.png" type="image/png" length="58132" />
      
    </item>
    
    <item>
      <title>Role: rdma_fabric</title>
      
      
      
      
      <description>install the DOCA-OFED host stack, load nvidia_peermem for GPUDirect RDMA, and write /etc/nccl.conf defaults (NCCL_IB_HCA, NCCL_IB_GID_INDEX…</description>
      <link>https://ai-infrastructure.net/role-rdma-fabric/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/role-rdma-fabric/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/role-rdma-fabric.png" type="image/png" length="58601" />
      
    </item>
    
    <item>
      <title>Role: validate</title>
      
      
      
      
      <description>fail the play if a node is not fit to take work. The validate role is the last role in site.yml: it asserts GPUs enumerate, persistence and (on NVSwitch…</description>
      <link>https://ai-infrastructure.net/role-validate-health/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/role-validate-health/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/role-validate-health.png" type="image/png" length="52195" />
      
    </item>
    
    <item>
      <title>PCIe ACS-Disable Service</title>
      
      
      
      
      <description>a systemd one-shot that clears the PCIe ACS Control redirect bits on every bridge at boot (setpci loop) so GPU/NIC peer-to-peer (GPUDirect P2P/RDMA) is…</description>
      <link>https://ai-infrastructure.net/service-acs-disable/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/service-acs-disable/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/service-acs-disable.png" type="image/png" length="66413" />
      
    </item>
    
    <item>
      <title>Smoke Tests</title>
      
      
      
      
      <description>a consolidated acceptance suite for a freshly built GPU platform. It launches a CUDA pod that runs nvidia-smi, an nccl-tests Job over RDMA (GDRDMA confirmed…</description>
      <link>https://ai-infrastructure.net/smoke-tests-gpu-platform/</link>
      <pubDate>Thu, 25 Jun 2026 15:21:25 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/smoke-tests-gpu-platform/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/smoke-tests-gpu-platform.png" type="image/png" length="52563" />
      
    </item>
    
    <item>
      <title>Bare-Metal Provisioning &amp; PXE</title>
      
      
      
      
      <description>network boot for fleet-scale OS install: the PXE/iPXE chain, the DHCP options that drive it (next-server, filename, client-arch, HTTPClient), TFTP vs…</description>
      <link>https://ai-infrastructure.net/bare-metal-pxe/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/bare-metal-pxe/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/bare-metal-pxe.png" type="image/png" length="66938" />
      
    </item>
    
    <item>
      <title>Container Toolkit &amp; CDI</title>
      
      
      
      
      <description>how the NVIDIA Container Toolkit and Container Device Interface (CDI) expose host GPUs to OCI containers: install, runtime wiring (Docker/containerd)…</description>
      <link>https://ai-infrastructure.net/container-toolkit/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/container-toolkit/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/container-toolkit.png" type="image/png" length="62705" />
      
    </item>
    
    <item>
      <title>CUDA Driver</title>
      
      
      
      
      <description>the CUDA driver (libcuda.so, the user-mode half of the NVIDIA GPU driver), its Driver API vs the runtime API, the driver version vs the CUDA version…</description>
      <link>https://ai-infrastructure.net/cuda-driver/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cuda-driver/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cuda-driver.png" type="image/png" length="54723" />
      
    </item>
    
    <item>
      <title>CUDA Libraries</title>
      
      
      
      
      <description>the math and collective-communication libraries that sit between the CUDA runtime and the frameworks (cuBLAS, cuDNN, NCCL, CUTLASS, cuFFT): how they are…</description>
      <link>https://ai-infrastructure.net/cuda-libraries/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cuda-libraries/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cuda-libraries.png" type="image/png" length="57028" />
      
    </item>
    
    <item>
      <title>CUDA Toolkit &amp; Runtime</title>
      
      
      
      
      <description>the difference between the CUDA Toolkit (compiler, headers, static libs), the CUDA runtime that ships inside applications, and the driver underneath…</description>
      <link>https://ai-infrastructure.net/cuda-toolkit-runtime/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cuda-toolkit-runtime/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cuda-toolkit-runtime.png" type="image/png" length="62601" />
      
    </item>
    
    <item>
      <title>Diagnostics &amp; Validation</title>
      
      
      
      
      <description>the tooling that proves a GPU node is healthy enough to take work (dcgmi diag run levels, DCGM health watches, nvbandwidth, gpu-burn, and nvidia-smi…</description>
      <link>https://ai-infrastructure.net/diagnostics-tools/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/diagnostics-tools/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/diagnostics-tools.png" type="image/png" length="70024" />
      
    </item>
    
    <item>
      <title>Driver Support by Tier</title>
      
      
      
      
      <description>which of NVIDIA&#39;s four driver families a given GPU class runs (datacenter/Tesla, GeForce, RTX Enterprise Production Branch, DGX OS) and which platform…</description>
      <link>https://ai-infrastructure.net/driver-by-tier/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/driver-by-tier/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/driver-by-tier.png" type="image/png" length="62395" />
      
    </item>
    
    <item>
      <title>Driver Versions &amp; Branches</title>
      
      
      
      
      <description>NVIDIA data center (Tesla) GPU driver branches: Long Term Support Branch (LTSB) vs Production Branch, their support windows, how to choose and pin a fleet to one…</description>
      <link>https://ai-infrastructure.net/driver-versions-branches/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/driver-versions-branches/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/driver-versions-branches.png" type="image/png" length="70146" />
      
    </item>
    
    <item>
      <title>ECC</title>
      
      
      
      
      <description>ECC on GPU memory across the tiers: what carries it, how nvidia-smi toggles and reports it, the volatile-vs-aggregate counters, and row remapping on…</description>
      <link>https://ai-infrastructure.net/ecc-support/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/ecc-support/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/ecc-support.png" type="image/png" length="45499" />
      
    </item>
    
    <item>
      <title>Fabric Manager</title>
      
      
      
      
      <description>Run NVIDIA Fabric Manager (nv-fabricmanager) on NVSwitch systems: what it does, how it is versioned with the driver, and how to operate and debug it.</description>
      <link>https://ai-infrastructure.net/fabric-manager/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/fabric-manager/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/fabric-manager.png" type="image/png" length="84378" />
      
    </item>
    
    <item>
      <title>GPU Firmware &amp; GSP</title>
      
      
      
      
      <description>the on-board firmware a GPU carries, namely the VBIOS and the GSP (GPU System Processor) firmware the kernel driver loads at init. How to read it, where…</description>
      <link>https://ai-infrastructure.net/gpu-firmware-gsp/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-firmware-gsp/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-firmware-gsp.png" type="image/png" length="65593" />
      
    </item>
    
    <item>
      <title>Frameworks (PyTorch/JAX/TensorRT)</title>
      
      
      
      
      <description>how ML frameworks ship their own CUDA/cuDNN/NCCL inside wheels and containers, why this decouples the application stack from the host stack, the driver…</description>
      <link>https://ai-infrastructure.net/gpu-frameworks/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-frameworks/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-frameworks.png" type="image/png" length="74089" />
      
    </item>
    
    <item>
      <title>GPU Health Gating</title>
      
      
      
      
      <description>keep jobs off bad GPUs. The control layer that runs a health check, turns a verdict into a scheduler state change, and stops work landing on a degraded…</description>
      <link>https://ai-infrastructure.net/gpu-health-gating/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-health-gating/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-health-gating.png" type="image/png" length="55342" />
      
    </item>
    
    <item>
      <title>MIG Partitioning</title>
      
      
      
      
      <description>hardware partitioning of a single datacenter/workstation GPU into isolated GPU instances: profiles, the nvidia-smi mig lifecycle, isolation guarantees…</description>
      <link>https://ai-infrastructure.net/gpu-partitioning-mig/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-partitioning-mig/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-partitioning-mig.png" type="image/png" length="56788" />
      
    </item>
    
    <item>
      <title>MPS (Multi-Process Service)</title>
      
      
      
      
      <description>CUDA Multi-Process Service, software space-sharing that lets many cooperative processes run CUDA kernels concurrently on one GPU through a shared server…</description>
      <link>https://ai-infrastructure.net/gpu-partitioning-mps/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-partitioning-mps/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-partitioning-mps.png" type="image/png" length="70801" />
      
    </item>
    
    <item>
      <title>Image &amp; Config Management</title>
      
      
      
      
      <description>keeping a GPU fleet bit-for-bit consistent: golden OS images with pinned driver/CUDA/firmware baselines, the immutable-image vs config-management…</description>
      <link>https://ai-infrastructure.net/image-management/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/image-management/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/image-management.png" type="image/png" length="70288" />
      
    </item>
    
    <item>
      <title>Install &amp; Lifecycle</title>
      
      
      
      
      <description>how the NVIDIA driver and surrounding stack get onto a GPU node and stay correct over its life. Covers apt network repo vs runfile, open vs proprietary…</description>
      <link>https://ai-infrastructure.net/install-lifecycle/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/install-lifecycle/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/install-lifecycle.png" type="image/png" length="60385" />
      
    </item>
    
    <item>
      <title>IPMI (Legacy OOB)</title>
      
      
      
      
      <description>operating a BMC over the network with ipmitool -I lanplus (chassis power, sensors, SEL, LAN and user config), plus the protocol&#39;s security weaknesses…</description>
      <link>https://ai-infrastructure.net/ipmi-protocol/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/ipmi-protocol/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/ipmi-protocol.png" type="image/png" length="66238" />
      
    </item>
    
    <item>
      <title>Kernel Modules</title>
      
      
      
      
      <description>the kernel-space half of the GPU driver on a single Linux node: the five nvidia modules, the open-vs-proprietary flavor choice, and the install-time…</description>
      <link>https://ai-infrastructure.net/kernel-modules/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/kernel-modules/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/kernel-modules.png" type="image/png" length="55656" />
      
    </item>
    
    <item>
      <title>nvidia-smi Reference</title>
      
      
      
      
      <description>nvidia-smi (the NVIDIA System Management Interface) as a working instrument: inspection (-q, --query-gpu ... --format=csv), live monitoring (dmon, pmon)…</description>
      <link>https://ai-infrastructure.net/nvidia-smi-reference/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/nvidia-smi-reference/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/nvidia-smi-reference.png" type="image/png" length="60137" />
      
    </item>
    
    <item>
      <title>NVSwitch &amp; NVLink</title>
      
      
      
      
      <description>the GPU-to-GPU interconnect (NVLink links, NVSwitch ASICs, the Fabric Manager that fuses them into one all-to-all domain) across both an 8-GPU baseboard…</description>
      <link>https://ai-infrastructure.net/nvswitch-nvlink/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/nvswitch-nvlink/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/nvswitch-nvlink.png" type="image/png" length="59104" />
      
    </item>
    
    <item>
      <title>Out-of-Band Management &amp; BMC</title>
      
      
      
      
      <description>the baseboard management controller (BMC) and the lights-out plane it serves (remote power, serial-over-LAN console, sensor telemetry, firmware update…</description>
      <link>https://ai-infrastructure.net/oob-management-bmc/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/oob-management-bmc/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/oob-management-bmc.png" type="image/png" length="72366" />
      
    </item>
    
    <item>
      <title>OOB Network Infrastructure</title>
      
      
      
      
      <description>the physically separate management network that carries lights-out control traffic: dedicated 1G switches, the management VLAN/subnet, BMC addressing via…</description>
      <link>https://ai-infrastructure.net/oob-network-infra/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/oob-network-infra/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/oob-network-infra.png" type="image/png" length="63908" />
      
    </item>
    
    <item>
      <title>Persistence Mode</title>
      
      
      
      
      <description>keeping the NVIDIA kernel driver initialized between jobs on a headless GPU node: the nvidia-persistenced daemon (preferred) versus the deprecated…</description>
      <link>https://ai-infrastructure.net/persistence-mode/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/persistence-mode/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/persistence-mode.png" type="image/png" length="58757" />
      
    </item>
    
    <item>
      <title>Provisioning Tooling</title>
      
      
      
      
      <description>the bare-metal provisioning systems that turn racked GPU nodes into a fleet of identical, schedulable machines: Canonical MAAS, Warewulf, xCAT, OpenStack…</description>
      <link>https://ai-infrastructure.net/provisioning-tools/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/provisioning-tools/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/provisioning-tools.png" type="image/png" length="57748" />
      
    </item>
    
    <item>
      <title>Redfish (Modern OOB)</title>
      
      
      
      
      <description>DMTF Redfish, the REST/JSON-over-HTTPS out-of-band management API exposed by a server&#39;s BMC. What the resource model is, how to drive…</description>
      <link>https://ai-infrastructure.net/redfish-protocol/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/redfish-protocol/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/redfish-protocol.png" type="image/png" length="68967" />
      
    </item>
    
    <item>
      <title>ECC Toggle Recovery</title>
      
      
      
      
      <description>recover a datacenter GPU whose ECC mode was toggled but is stuck with Current disagreeing with Pending, or that needs a reset/reboot after enabling ECC…</description>
      <link>https://ai-infrastructure.net/runbook-ecc-toggle-recovery/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-ecc-toggle-recovery/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-ecc-toggle-recovery.png" type="image/png" length="63440" />
      
    </item>
    
    <item>
      <title>Fabric Manager Failure</title>
      
      
      
      
      <description>nvidia-fabricmanager is inactive or aborting on an NVSwitch system (HGX/DGX 8-GPU baseboard, GB200/GB300 NVL72), so GPUs do not form their NVLink domain…</description>
      <link>https://ai-infrastructure.net/runbook-fabric-manager-failure/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-fabric-manager-failure/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-fabric-manager-failure.png" type="image/png" length="63934" />
      
    </item>
    
    <item>
      <title>GSP Firmware / Driver Mismatch</title>
      
      
      
      
      <description>recover a node where a partial driver change left the kernel modules and the GSP (GPU System Processor) firmware on different branches, so nvidia-smi…</description>
      <link>https://ai-infrastructure.net/runbook-gsp-firmware-mismatch/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-gsp-firmware-mismatch/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-gsp-firmware-mismatch.png" type="image/png" length="69949" />
      
    </item>
    
    <item>
      <title>Image Drift Across Fleet</title>
      
      
      
      
      <description>converge a GPU fleet back to one pinned software baseline when non-reproducible failures trace to nodes running different driver / CUDA / GSP-firmware /…</description>
      <link>https://ai-infrastructure.net/runbook-image-drift/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-image-drift/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-image-drift.png" type="image/png" length="62798" />
      
    </item>
    
    <item>
      <title>Kernel Upgrade — GPU Missing</title>
      
      
      
      
      <description>a single node that booted into a new kernel and now shows no GPUs (nvidia-smi: No devices were found) because the NVIDIA kernel module was never rebuilt…</description>
      <link>https://ai-infrastructure.net/runbook-kernel-gpu-missing/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-kernel-gpu-missing/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-kernel-gpu-missing.png" type="image/png" length="67632" />
      
    </item>
    
    <item>
      <title>Stale MIG State</title>
      
      
      
      
      <description>a node whose actual MIG geometry (nvidia-smi -L / nvidia-smi mig -lgi) has drifted from what the scheduler believes: pods stuck Pending on a MIG…</description>
      <link>https://ai-infrastructure.net/runbook-mig-state-stale/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-mig-state-stale/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-mig-state-stale.png" type="image/png" length="54012" />
      
    </item>
    
    <item>
      <title>OOB / BMC Unreachable</title>
      
      
      
      
      <description>a node has hung and its baseboard management controller (BMC) does not answer (no ping, ipmitool and Redfish both time out), so there is no lights-out…</description>
      <link>https://ai-infrastructure.net/runbook-oob-unreachable/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-oob-unreachable/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-oob-unreachable.png" type="image/png" length="60948" />
      
    </item>
    
    <item>
      <title>Persistence Mode / Clock Bounce</title>
      
      
      
      
      <description>a node (or fleet) showing slow job starts, bouncing clocks and non-reproducible benchmarks because the NVIDIA driver is de-initializing idle GPUs…</description>
      <link>https://ai-infrastructure.net/runbook-persistence-mode/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-persistence-mode/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-persistence-mode.png" type="image/png" length="68334" />
      
    </item>
    
    <item>
      <title>Topology-Unaware Scheduling</title>
      
      
      
      
      <description>a tightly-coupled training job runs but crawls because its ranks landed scattered across the spine instead of rail-local on the fewest leaf switches, so…</description>
      <link>https://ai-infrastructure.net/runbook-topology-scheduling/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-topology-scheduling/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-topology-scheduling.png" type="image/png" length="69115" />
      
    </item>
    
    <item>
      <title>Slurm Topology Placement</title>
      
      
      
      
      <description>making Slurm pack a tightly-coupled job onto the fewest, closest network leaves so its collectives stay rail-local. Covers topology.conf with the…</description>
      <link>https://ai-infrastructure.net/slurm-topology-placement/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/slurm-topology-placement/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/slurm-topology-placement.png" type="image/png" length="62824" />
      
    </item>
    
    <item>
      <title>Slurm vs Kubernetes</title>
      
      
      
      
      <description>a decision guide for picking the workload manager on a GPU cluster: batch HPC (Slurm) versus service orchestration (Kubernetes), compared across gang scheduling…</description>
      <link>https://ai-infrastructure.net/slurm-vs-kubernetes/</link>
      <pubDate>Thu, 25 Jun 2026 13:50:44 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/slurm-vs-kubernetes/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/slurm-vs-kubernetes.png" type="image/png" length="61459" />
      
    </item>
    
    <item>
      <title>Fabric Bring-Up, Validation &amp; Benchmarking</title>
      
      
      
      
      <description>the shared procedure for bringing a GPU interconnect up, proving it healthy, and benchmarking it to line rate (InfiniBand, RoCE, and NVLink) before any…</description>
      <link>https://ai-infrastructure.net/fabric-bringup-benchmarking/</link>
      <pubDate>Thu, 25 Jun 2026 13:05:27 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/fabric-bringup-benchmarking/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/fabric-bringup-benchmarking.png" type="image/png" length="76147" />
      
    </item>
    
    <item>
      <title>Recipes &amp; manifests (index)</title>
      
      
      
      
      <description>the index of runnable recipes (Ansible playbooks, Kubernetes/Helm manifests, telemetry stacks, and workload bring-up cookbooks). Where operational…</description>
      <link>https://ai-infrastructure.net/recipes-index/</link>
      <pubDate>Thu, 25 Jun 2026 13:05:27 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/recipes-index/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/recipes-index.png" type="image/png" length="66475" />
      
    </item>
    
    <item>
      <title>Vendor Sourcing &amp; Procurement</title>
      
      
      
      
      <description>turning a validated bill of materials into placed orders and received, asset-tagged hardware: who to buy from per region and component tier, what an RFQ…</description>
      <link>https://ai-infrastructure.net/vendor-sourcing-procurement/</link>
      <pubDate>Thu, 25 Jun 2026 13:05:27 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/vendor-sourcing-procurement/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/vendor-sourcing-procurement.png" type="image/png" length="69650" />
      
    </item>
    
    <item>
      <title>DGX Spark (GB10 Desktop)</title>
      
      
      
      
      <description>NVIDIA&#39;s GB10 Grace Blackwell desktop AI computer (formerly “Project DIGITS”), a single-board Arm + Blackwell system with 128 GB of unified memory, two…</description>
      <link>https://ai-infrastructure.net/dgx-spark/</link>
      <pubDate>Wed, 24 Jun 2026 11:19:14 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/dgx-spark/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/dgx-spark.png" type="image/png" length="75882" />
      
    </item>
    
    <item>
      <title>DGX &amp; HGX Systems</title>
      
      
      
      
      <description>NVIDIA&#39;s turnkey AI systems (DGX) and the OEM baseboards they share their silicon with (HGX). What makes a DGX operationally different from a self-built…</description>
      <link>https://ai-infrastructure.net/dgx-systems/</link>
      <pubDate>Wed, 24 Jun 2026 11:19:14 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/dgx-systems/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/dgx-systems.png" type="image/png" length="64279" />
      
    </item>
    
    <item>
      <title>Ampere (A100 / A30 / A40)</title>
      
      
      
      
      <description>the Ampere datacenter generation (GA100/GA10x dies, TSMC 7nm / Samsung 8nm, 2020): the A100, A30, A40, and A10, their interconnect and MIG behaviour, and…</description>
      <link>https://ai-infrastructure.net/gpu-ampere/</link>
      <pubDate>Wed, 24 Jun 2026 11:19:14 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-ampere/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-ampere.png" type="image/png" length="68814" />
      
    </item>
    
    <item>
      <title>Generations &amp; Families</title>
      
      
      
      
      <description>Compare NVIDIA GPU generations for clusters: Ampere, Hopper, and Blackwell datacenter GPUs, plus RTX and DGX, and what changed each generation.</description>
      <link>https://ai-infrastructure.net/gpu-generations/</link>
      <pubDate>Wed, 24 Jun 2026 11:19:14 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-generations/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-generations.png" type="image/png" length="81358" />
      
    </item>
    
    <item>
      <title>Hopper (H100 / H200 / GH200)</title>
      
      
      
      
      <description>The Hopper datacenter generation (GH100 die, TSMC 4N, 2022-2024): the H100 and H200 GPUs, the GH200 Grace Hopper superchip, their HGX/DGX systems, and…</description>
      <link>https://ai-infrastructure.net/gpu-hopper/</link>
      <pubDate>Wed, 24 Jun 2026 11:19:14 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-hopper/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-hopper.png" type="image/png" length="64258" />
      
    </item>
    
    <item>
      <title>RTX Consumer &amp; Workstation (5090 / 4090 / RTX PRO)</title>
      
      
      
      
      <description>NVIDIA&#39;s consumer (GeForce RTX 50/40) and professional workstation/server (RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S) GPUs, and how they differ…</description>
      <link>https://ai-infrastructure.net/gpu-rtx-workstation/</link>
      <pubDate>Wed, 24 Jun 2026 11:19:14 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-rtx-workstation/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-rtx-workstation.png" type="image/png" length="100057" />
      
    </item>
    
    <item>
      <title>Overview</title>
      
      
      
      
      <description>Ansible playbooks to take a freshly imaged GPU node from bare OS to ready for Kubernetes or Slurm: driver stack, Fabric Manager, InfiniBand, tuning.</description>
      <link>https://ai-infrastructure.net/ansible-node-fabric-bringup/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/ansible-node-fabric-bringup/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/ansible-node-fabric-bringup.png" type="image/png" length="85660" />
      
    </item>
    
    <item>
      <title>BOM Validation</title>
      
      
      
      
      <description>Validate a GPU cluster bill of materials before procurement: catch SKU, NVLink, power, cooling, and networking errors while they are cheap to fix.</description>
      <link>https://ai-infrastructure.net/bom-validation/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/bom-validation/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/bom-validation.png" type="image/png" length="79140" />
      
    </item>
    
    <item>
      <title>Cloud, Neoclouds &amp; Cost</title>
      
      
      
      
      <description>Overview and decision index for GPU capacity beyond the owned hall: hyperscaler instances, the neoclouds, decentralized/permissionless GPU, and the economics that…</description>
      <link>https://ai-infrastructure.net/cloud-neoclouds-cost/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cloud-neoclouds-cost/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cloud-neoclouds-cost.png" type="image/png" length="63942" />
      
    </item>
    
    <item>
      <title>k3s</title>
      
      
      
      
      <description>k3s as a single-binary, CNCF-conformant Kubernetes distribution for edge, small, CI, and dev clusters: the same API as full Kubernetes at a…</description>
      <link>https://ai-infrastructure.net/cluster-k3s/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cluster-k3s/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cluster-k3s.png" type="image/png" length="51152" />
      
    </item>
    
    <item>
      <title>Kubernetes</title>
      
      
      
      
      <description>Kubernetes as the orchestration technology under a GPU platform (its objects, control loops, and CRD model) and how training, inference, and fine-tuning…</description>
      <link>https://ai-infrastructure.net/cluster-kubernetes/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cluster-kubernetes/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cluster-kubernetes.png" type="image/png" length="54082" />
      
    </item>
    
    <item>
      <title>Orchestration Overview</title>
      
      
      
      
      <description>Overview, decision, and index page for the orchestration layer that decides what runs where: the HPC batch world (Slurm), the cloud-native world…</description>
      <link>https://ai-infrastructure.net/cluster-orchestration/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cluster-orchestration/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cluster-orchestration.png" type="image/png" length="66305" />
      
    </item>
    
    <item>
      <title>Ray</title>
      
      
      
      
      <description>Ray as a Python-native distributed runtime for GPU clusters: tasks, actors, the head/worker model, and the Train/Serve/Data/RLlib libraries that make it…</description>
      <link>https://ai-infrastructure.net/cluster-ray/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cluster-ray/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cluster-ray.png" type="image/png" length="48197" />
      
    </item>
    
    <item>
      <title>Slurm</title>
      
      
      
      
      <description>Slurm as the HPC batch workload manager for GPU clusters, covering partitions, gang scheduling, GRES, topology-aware placement, and multi-node training…</description>
      <link>https://ai-infrastructure.net/cluster-slurm/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/cluster-slurm/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/cluster-slurm.png" type="image/png" length="49747" />
      
    </item>
    
    <item>
      <title>Commissioning &amp; Acceptance</title>
      
      
      
      
      <description>Commission a GPU cluster from racked hardware to production sign-off: the bring-up sequence and the acceptance tests that prove readiness.</description>
      <link>https://ai-infrastructure.net/commissioning-acceptance/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/commissioning-acceptance/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/commissioning-acceptance.png" type="image/png" length="80971" />
      
    </item>
    
    <item>
      <title>Datacentre Physical Readiness</title>
      
      
      
      
      <description>Reading datacentre drawings and confirming the facility can actually host the cluster: power, UPS, cooling, airflow, weight, and the schematics that…</description>
      <link>https://ai-infrastructure.net/datacentre-physical/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/datacentre-physical/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/datacentre-physical.png" type="image/png" length="68838" />
      
    </item>
    
    <item>
      <title>Disaggregated Inference</title>
      
      
      
      
      <description>Splitting LLM inference into separate prefill and decode pools that scale independently, the KV-cache transfer that connects them, and how to run it with…</description>
      <link>https://ai-infrastructure.net/disaggregated-inference/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/disaggregated-inference/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/disaggregated-inference.png" type="image/png" length="58996" />
      
    </item>
    
    <item>
      <title>FSDP / DiLoCo Recipes</title>
      
      
      
      
      <description>An index / decision page for the two scale-out paradigms: FSDP for high-bandwidth single-DC sharding and DiLoCo for low-communication / geo-distributed…</description>
      <link>https://ai-infrastructure.net/distributed-training-recipes/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/distributed-training-recipes/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/distributed-training-recipes.png" type="image/png" length="64205" />
      
    </item>
    
    <item>
      <title>Distributed Training Platform</title>
      
      
      
      
      <description>The frameworks and mechanics of training across many GPUs: launchers, parallelism strategies, the libraries, numerics, checkpoint/resume, fault…</description>
      <link>https://ai-infrastructure.net/distributed-training/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/distributed-training/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/distributed-training.png" type="image/png" length="61642" />
      
    </item>
    
    <item>
      <title>Fine-tuning &amp; Post-training</title>
      
      
      
      
      <description>Adapting open-weight models through supervised fine-tuning, parameter-efficient LoRA/QLoRA, preference optimisation (DPO), and reinforcement learning…</description>
      <link>https://ai-infrastructure.net/finetuning-posttraining/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/finetuning-posttraining/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/finetuning-posttraining.png" type="image/png" length="62714" />
      
    </item>
    
    <item>
      <title>SFT &amp; LoRA/QLoRA</title>
      
      
      
      
      <description>Supervised fine-tuning on demonstrations, and the parameter-efficient adapters, LoRA (low-rank) and QLoRA (4-bit base), that make it fit large models on…</description>
      <link>https://ai-infrastructure.net/ft-sft-lora/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/ft-sft-lora/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/ft-sft-lora.png" type="image/png" length="61390" />
      
    </item>
    
    <item>
      <title>Glossary</title>
      
      
      
      
      <description>Concise definitions of the terms used across the knowledge base. A reference glossary, not a single WHAT/WHY/WHEN/HOW topic.</description>
      <link>https://ai-infrastructure.net/glossary/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/glossary/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/glossary.png" type="image/png" length="50557" />
      
    </item>
    
    <item>
      <title>Start here</title>
      
      
      
      
      <description>Navigate the AI infrastructure knowledge base by role and topic: hardware, fabric, Kubernetes, Slurm, training, inference, operations, and recipes.</description>
      <link>https://ai-infrastructure.net/gpu-cluster-kb-index/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-cluster-kb-index/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-cluster-kb-index.png" type="image/png" length="87497" />
      
    </item>
    
    <item>
      <title>GPU Performance &amp; Health</title>
      
      
      
      
      <description>Tuning the collective-communication and GPU layer, and monitoring GPU health, where interconnect saturation and telemetry decide whether the hardware is…</description>
      <link>https://ai-infrastructure.net/gpu-performance-health/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-performance-health/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-performance-health.png" type="image/png" length="64576" />
      
    </item>
    
    <item>
      <title>Overview &amp; Node Admin</title>
      
      
      
      
      <description>The software stack on a single GPU node and its day-to-day administration: driver, CUDA, the management interfaces, GPU partitioning, and the…</description>
      <link>https://ai-infrastructure.net/gpu-software-stack/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/gpu-software-stack/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/gpu-software-stack.png" type="image/png" length="64387" />
      
    </item>
    
    <item>
      <title>Inference Serving &amp; Optimization</title>
      
      
      
      
      <description>Serve LLMs in production: vLLM and SGLang, continuous batching, KV cache, quantization, and disaggregated prefill/decode for cost and latency.</description>
      <link>https://ai-infrastructure.net/inference-serving/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/inference-serving/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/inference-serving.png" type="image/png" length="80026" />
      
    </item>
    
    <item>
      <title>Containers &amp; Kubernetes for GPUs</title>
      
      
      
      
      <description>Run GPU workloads on Kubernetes: the NVIDIA device plugin and GPU Operator, DRA, MIG partitioning, time-slicing, and GPUDirect RDMA in pods.</description>
      <link>https://ai-infrastructure.net/kubernetes-gpu/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/kubernetes-gpu/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/kubernetes-gpu.png" type="image/png" length="76552" />
      
    </item>
    
    <item>
      <title>Overview</title>
      
      
      
      
      <description>Helm manifests to turn a Kubernetes cluster into a GPU platform: GPU Operator, Network/RDMA operator, DRA, sharing models, and gang scheduling.</description>
      <link>https://ai-infrastructure.net/kubernetes-helm-gpu-platform/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/kubernetes-helm-gpu-platform/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/kubernetes-helm-gpu-platform.png" type="image/png" length="79172" />
      
    </item>
    
    <item>
      <title>HPC Networking Fabric</title>
      
      
      
      
      <description>Design and validate the GPU cluster network fabric: InfiniBand vs RoCE, NVLink and NVSwitch, topologies, and bandwidth validation with nccl-tests.</description>
      <link>https://ai-infrastructure.net/networking-fabric/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/networking-fabric/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/networking-fabric.png" type="image/png" length="80893" />
      
    </item>
    
    <item>
      <title>Blackwell Datacenter (B200/B300, GB200/GB300)</title>
      
      
      
      
      <description>NVIDIA Blackwell datacenter platform: B200 and B300 GPUs, GB200/GB300 Grace-Blackwell superchips, and the GB300 NVL72 rack for AI training.</description>
      <link>https://ai-infrastructure.net/nvidia-blackwell-platform/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/nvidia-blackwell-platform/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/nvidia-blackwell-platform.png" type="image/png" length="95684" />
      
    </item>
    
    <item>
      <title>Observability &amp; Monitoring</title>
      
      
      
      
      <description>See what GPUs and the cluster are doing: the metrics that matter, DCGM telemetry, profiling, logging, and alerting for GPU cluster operations.</description>
      <link>https://ai-infrastructure.net/observability-monitoring/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/observability-monitoring/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/observability-monitoring.png" type="image/png" length="79655" />
      
    </item>
    
    <item>
      <title>Operational Runbooks Index</title>
      
      
      
      
      <description>The index of operational runbooks. Each recurring use case is its own page with trigger, pre-checks, procedure, verification, and rollback. Where the…</description>
      <link>https://ai-infrastructure.net/operational-runbooks/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/operational-runbooks/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/operational-runbooks.png" type="image/png" length="66698" />
      
    </item>
    
    <item>
      <title>Performance Optimization &amp; Tuning</title>
      
      
      
      
      <description>Tune GPU cluster performance end to end: the roofline method, NCCL and PCIe, NUMA pinning, kernel optimization, and finding low-MFU bottlenecks.</description>
      <link>https://ai-infrastructure.net/performance-optimization/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/performance-optimization/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/performance-optimization.png" type="image/png" length="73167" />
      
    </item>
    
    <item>
      <title>Overview</title>
      
      
      
      
      <description>Turn bare-metal nodes into a schedulable GPU cluster: out-of-band/BMC, PXE imaging, health gating, and choosing Slurm vs Kubernetes for scheduling.</description>
      <link>https://ai-infrastructure.net/provisioning-scheduling/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/provisioning-scheduling/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/provisioning-scheduling.png" type="image/png" length="81437" />
      
    </item>
    
    <item>
      <title>Reliability, RAS &amp; Failure Modes</title>
      
      
      
      
      <description>What goes wrong with GPUs at scale and how it is detected, classified, and remediated: XID/SXID errors, ECC, HBM row remapping, thermal and bus failures…</description>
      <link>https://ai-infrastructure.net/reliability-ras/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/reliability-ras/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/reliability-ras.png" type="image/png" length="70718" />
      
    </item>
    
    <item>
      <title>DPO</title>
      
      
      
      
      <description>Offline preference alignment: training a policy directly on (chosen, rejected) pairs against a frozen reference, with no reward model and no rollouts…</description>
      <link>https://ai-infrastructure.net/rl-dpo/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rl-dpo/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rl-dpo.png" type="image/png" length="46630" />
      
    </item>
    
    <item>
      <title>GRPO</title>
      
      
      
      
      <description>Critic-free reinforcement learning for LLMs: sampling a group of completions per prompt, scoring each against an (often verifiable) reward, and updating…</description>
      <link>https://ai-infrastructure.net/rl-grpo/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rl-grpo/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rl-grpo.png" type="image/png" length="47795" />
      
    </item>
    
    <item>
      <title>RL Libraries Overview</title>
      
      
      
      
      <description>Comparison/selection overview and index for the open-source RL post-training libraries: how the systems are structured, which inference and training…</description>
      <link>https://ai-infrastructure.net/rl-libraries-llms/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rl-libraries-llms/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rl-libraries-llms.png" type="image/png" length="58545" />
      
    </item>
    
    <item>
      <title>NeMo-RL</title>
      
      
      
      
      <description>NVIDIA&#39;s open-source RL post-training library in the NeMo stack: Ray-orchestrated, with FSDP2/Megatron training and vLLM rollouts, built for scalable…</description>
      <link>https://ai-infrastructure.net/rllib-nemo-rl/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rllib-nemo-rl/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rllib-nemo-rl.png" type="image/png" length="55200" />
      
    </item>
    
    <item>
      <title>OpenRLHF</title>
      
      
      
      
      <description>OpenRLHF, a community Ray + vLLM RLHF/agentic-RL framework on DeepSpeed, an early pioneer of asynchronous RL execution with strong reward-model support…</description>
      <link>https://ai-infrastructure.net/rllib-openrlhf/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rllib-openrlhf/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rllib-openrlhf.png" type="image/png" length="50108" />
      
    </item>
    
    <item>
      <title>SkyRL</title>
      
      
      
      
      <description>SkyRL, a modular full-stack RL library for LLMs from the Berkeley Sky Computing Lab (NovaSky), built for multi-turn agentic RL with flexible…</description>
      <link>https://ai-infrastructure.net/rllib-skyrl/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rllib-skyrl/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rllib-skyrl.png" type="image/png" length="50835" />
      
    </item>
    
    <item>
      <title>slime</title>
      
      
      
      
      <description>Tsinghua / Z.ai&#39;s decoupled, async-first RL post-training framework, the RL stack behind the GLM model family.</description>
      <link>https://ai-infrastructure.net/rllib-slime/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rllib-slime/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rllib-slime.png" type="image/png" length="50892" />
      
    </item>
    
    <item>
      <title>TRL</title>
      
      
      
      
      <description>Hugging Face&#39;s Transformer Reinforcement Learning library, the HF-native post-training stack (SFT/DPO/GRPO and more) built on the Trainer + Accelerate…</description>
      <link>https://ai-infrastructure.net/rllib-trl/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rllib-trl/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rllib-trl.png" type="image/png" length="46771" />
      
    </item>
    
    <item>
      <title>verl</title>
      
      
      
      
      <description>ByteDance&#39;s flexible, high-performance RL post-training library for LLMs, built on the HybridFlow programming model.</description>
      <link>https://ai-infrastructure.net/rllib-verl/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/rllib-verl/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/rllib-verl.png" type="image/png" length="48197" />
      
    </item>
    
    <item>
      <title>Add GPU Capacity</title>
      
      
      
      
      <description>Safely add GPU capacity (new nodes or scale-up) to a running cluster: burn-in, fabric and health validation, then admit to scheduling, with a rollback…</description>
      <link>https://ai-infrastructure.net/runbook-capacity-add/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-capacity-add/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-capacity-add.png" type="image/png" length="59280" />
      
    </item>
    
    <item>
      <title>Checkpoint Recovery / Resume</title>
      
      
      
      
      <description>Recover and resume a distributed training job from its last good checkpoint after a crash, preemption, or hardware fault.</description>
      <link>https://ai-infrastructure.net/runbook-checkpoint-recovery/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-checkpoint-recovery/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-checkpoint-recovery.png" type="image/png" length="69448" />
      
    </item>
    
    <item>
      <title>Rolling Driver / CUDA Upgrade</title>
      
      
      
      
      <description>The long-form procedure for rolling a GPU fleet to a new NVIDIA driver branch (with Fabric Manager and CUDA moved in step), one node at a time behind…</description>
      <link>https://ai-infrastructure.net/runbook-driver-upgrade/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-driver-upgrade/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-driver-upgrade.png" type="image/png" length="70424" />
      
    </item>
    
    <item>
      <title>GPU Fault — Drain, Reset, RMA</title>
      
      
      
      
      <description>Drain, reset, and (if needed) RMA a faulted GPU after an XID/ECC alert, returning the node to service safely.</description>
      <link>https://ai-infrastructure.net/runbook-gpu-fault-rma/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-gpu-fault-rma/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-gpu-fault-rma.png" type="image/png" length="63133" />
      
    </item>
    
    <item>
      <title>Inference SLO Breach</title>
      
      
      
      
      <description>Diagnose and remediate an inference SLO breach (TTFT/TPOT burn-rate) without taking the service down.</description>
      <link>https://ai-infrastructure.net/runbook-inference-slo-breach/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-inference-slo-breach/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-inference-slo-breach.png" type="image/png" length="60709" />
      
    </item>
    
    <item>
      <title>Training MFU Regression</title>
      
      
      
      
      <description>Localize and fix a training throughput (MFU) regression, where throughput has fallen below baseline.</description>
      <link>https://ai-infrastructure.net/runbook-mfu-regression/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-mfu-regression/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-mfu-regression.png" type="image/png" length="56669" />
      
    </item>
    
    <item>
      <title>NCCL Hang / Collective Stall</title>
      
      
      
      
      <description>Diagnose and clear an NCCL hang or collective stall: when step time goes to infinity with no XID and the whole world-size blocks on a collective.</description>
      <link>https://ai-infrastructure.net/runbook-nccl-hang/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-nccl-hang/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-nccl-hang.png" type="image/png" length="76964" />
      
    </item>
    
    <item>
      <title>Thermal / Cooling Emergency</title>
      
      
      
      
      <description>Respond to a thermal or cooling emergency (GPU throttling or a CDU alarm) to protect hardware and restore service.</description>
      <link>https://ai-infrastructure.net/runbook-thermal-emergency/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/runbook-thermal-emergency/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/runbook-thermal-emergency.png" type="image/png" length="66338" />
      
    </item>
    
    <item>
      <title>Security, Isolation &amp; Multi-tenancy</title>
      
      
      
      
      <description>Securing GPU infrastructure and isolating tenants: hardware isolation (MIG, vGPU), Blackwell confidential computing, the out-of-band/firmware attack…</description>
      <link>https://ai-infrastructure.net/security-multitenancy/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/security-multitenancy/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/security-multitenancy.png" type="image/png" length="75742" />
      
    </item>
    
    <item>
      <title>Overview</title>
      
      
      
      
      <description>Pick and serve open-weight LLMs with vLLM: DeepSeek, Kimi K2, GLM, Qwen, and Llama, with model selection and links to per-model deployment recipes.</description>
      <link>https://ai-infrastructure.net/serving-oss-models/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/serving-oss-models/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/serving-oss-models.png" type="image/png" length="80875" />
      
    </item>
    
    <item>
      <title>SLO / SLI Catalog</title>
      
      
      
      
      <description>Index and decision page for service-level indicators and objectives across GPU services. It frames the SLI → SLO → error-budget → burn-rate paradigm…</description>
      <link>https://ai-infrastructure.net/slo-sli-catalog/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/slo-sli-catalog/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/slo-sli-catalog.png" type="image/png" length="57566" />
      
    </item>
    
    <item>
      <title>SRE, Platform &amp; MLOps Practices</title>
      
      
      
      
      <description>The operational-excellence layer over the whole stack: SLOs and error budgets, GitOps and policy-as-code, IaC, and the MLOps lifecycle. The practices that…</description>
      <link>https://ai-infrastructure.net/sre-platform-mlops-practices/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/sre-platform-mlops-practices/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/sre-platform-mlops-practices.png" type="image/png" length="75188" />
      
    </item>
    
    <item>
      <title>Storage &amp; Data Platform</title>
      
      
      
      
      <description>Feed the GPUs: parallel filesystems, object storage, local scratch, checkpoint strategy, GPUDirect Storage, and a data-loading path that keeps them busy.</description>
      <link>https://ai-infrastructure.net/storage-data-platform/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/storage-data-platform/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/storage-data-platform.png" type="image/png" length="76473" />
      
    </item>
    
    <item>
      <title>Telemetry, Monitoring &amp; Alerting</title>
      
      
      
      
      <description>Build a GPU monitoring stack: DCGM exporter to Prometheus to Grafana and Alertmanager, with scrape config, dashboards, and PromQL alert rules.</description>
      <link>https://ai-infrastructure.net/telemetry-monitoring-alerting/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/telemetry-monitoring-alerting/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/telemetry-monitoring-alerting.png" type="image/png" length="74452" />
      
    </item>
    
    <item>
      <title>DDP (Distributed Data Parallel)</title>
      
      
      
      
      <description>PyTorch DistributedDataParallel, which replicates the model on every GPU, shards the data, and all-reduces gradients each step. The simplest, fastest…</description>
      <link>https://ai-infrastructure.net/train-ddp/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/train-ddp/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/train-ddp.png" type="image/png" length="64297" />
      
    </item>
    
    <item>
      <title>DeepSpeed &amp; ZeRO</title>
      
      
      
      
      <description>The ZeRO family of sharded-data-parallel optimisations (stages 1/2/3 + CPU/NVMe offload) and the deepspeed launcher, as one of the memory-scaling…</description>
      <link>https://ai-infrastructure.net/train-deepspeed-zero/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/train-deepspeed-zero/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/train-deepspeed-zero.png" type="image/png" length="62286" />
      
    </item>
    
    <item>
      <title>DiLoCo (Low-Communication)</title>
      
      
      
      
      <description>A two-level optimisation method that lets workers train mostly independently and synchronise rarely. It is the low-bandwidth alternative to per-step data…</description>
      <link>https://ai-infrastructure.net/train-diloco/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/train-diloco/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/train-diloco.png" type="image/png" length="62295" />
      
    </item>
    
    <item>
      <title>FSDP (Fully Sharded Data Parallel)</title>
      
      
      
      
      <description>PyTorch FSDP2 (fully_shard), sharding parameters, gradients, and optimizer state across ranks to train models too large for DDP, and how that scales…</description>
      <link>https://ai-infrastructure.net/train-fsdp/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/train-fsdp/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/train-fsdp.png" type="image/png" length="65164" />
      
    </item>
    
    <item>
      <title>Pipeline Parallelism</title>
      
      
      
      
      <description>Splitting a model&#39;s layer stack into stages across devices/nodes and streaming micro-batches through them. The third axis of distributed training&#39;s 3D…</description>
      <link>https://ai-infrastructure.net/train-pipeline-parallel/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/train-pipeline-parallel/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/train-pipeline-parallel.png" type="image/png" length="52379" />
      
    </item>
    
    <item>
      <title>Tensor Parallelism</title>
      
      
      
      
      <description>Megatron-style intra-layer model parallelism (splitting a single layer&#39;s matmuls across GPUs) as one axis of the parallelism stack in distributed training.</description>
      <link>https://ai-infrastructure.net/train-tensor-parallel/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/train-tensor-parallel/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/train-tensor-parallel.png" type="image/png" length="56249" />
      
    </item>
    
    <item>
      <title>Troubleshooting (symptom → fix)</title>
      
      
      
      
      <description>A triage index. When something breaks, match the symptom to the runbook that fixes it. This page is the dispatcher: the detailed step-by-step procedure lives…</description>
      <link>https://ai-infrastructure.net/troubleshooting-runbook/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/troubleshooting-runbook/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/troubleshooting-runbook.png" type="image/png" length="71417" />
      
    </item>
    
    <item>
      <title>Workload &amp; Bring-Up Recipes</title>
      
      
      
      
      <description>Index and decision overview for runnable GPU-cluster workloads (fabric validation, distributed training, inference serving) and the order they run in…</description>
      <link>https://ai-infrastructure.net/workload-bringup-recipes/</link>
      <pubDate>Wed, 24 Jun 2026 10:27:21 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/workload-bringup-recipes/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/workload-bringup-recipes.png" type="image/png" length="69492" />
      
    </item>
    
    <item>
      <title>Home</title>
      
      
      
      
      <description>Deploy and operate GPU clusters: NVIDIA GPUs, DGX/HGX, InfiniBand/RoCE, Kubernetes, Slurm, distributed training, and LLM inference serving.</description>
      <link>https://ai-infrastructure.net/</link>
      <pubDate>Wed, 24 Jun 2026 09:43:35 +0000</pubDate>
      <source url="https://ai-infrastructure.net/feed_rss_created.xml">AI Infrastructure Knowledge Base</source>
      
      <guid isPermaLink="true">https://ai-infrastructure.net/</guid>
      
      <enclosure url="https://ai-infrastructure.net/assets/images/social/index.png" type="image/png" length="86688" />
      
    </item>
    
  </channel>
</rss>