Use this knowledge base as an agent skill
Scope: how to install this knowledge base as a reusable agent skill, the machine-readable endpoints an agent fetches, and the operating discipline the skill encodes.
This knowledge base is built to be read by agents, not just people. You can hand your AI agent a small, self-contained skill that teaches it to use ai-infrastructure.net as a source of truth for GPU cluster and AI-infrastructure operations: incident runbooks, build-and-operate recipes, diagnostics, and reference on CUDA, networking, training, and inference.
The skill lives at a stable URL and is reusable. Anyone can install it into their own agent; there is no API key and nothing to authenticate.
What the skill does
It tells an agent to fetch and cite this knowledge base instead of answering GPU-ops questions from memory, because the load-bearing details (Xid classes, driver/CUDA pairings, NCCL flags, MIG profiles, DCGM levels, vendor specifics) drift across driver branches and hardware generations. The skill points the agent at the machine-readable endpoints below and at the operating discipline the runbooks enforce: cordon/drain before mutating a node, verify with a real proof, and keep a one-step rollback.
Install it
For Claude Code or the Claude Agent SDK, drop the skill into your skills directory:
```bash
mkdir -p ~/.claude/skills/ai-infrastructure
curl -fsSL https://ai-infrastructure.net/SKILL.md \
  -o ~/.claude/skills/ai-infrastructure/SKILL.md
# restart your agent; the skill appears as "ai-infrastructure"
```
For any other agent or framework, fetch `/SKILL.md` and load its body as a system or tool instruction. Re-fetch it periodically, since the knowledge base is updated continuously.
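One way to follow the re-fetch advice is to keep a local copy and refresh it only once it has gone stale. The sketch below is illustrative: the function names `skill_is_stale` and `refresh_skill`, the cache path in the usage comment, and the 24-hour default are assumptions, not part of the skill itself.

```bash
#!/usr/bin/env bash
# Sketch: refresh a cached SKILL.md when it is older than a chosen age.
# SKILL_URL comes from this page; everything else here is an assumption.
SKILL_URL="https://ai-infrastructure.net/SKILL.md"

# Succeeds (exit 0) when the file is missing or older than max_age_minutes.
skill_is_stale() {
  local path="$1" max_age_minutes="${2:-1440}"
  [ ! -f "$path" ] && return 0
  # find prints the path only if it was modified more than max_age_minutes ago
  [ -n "$(find "$path" -mmin +"$max_age_minutes" -print)" ]
}

# Downloads to a temp file first so a failed fetch never clobbers the cached copy.
refresh_skill() {
  local path="$1" tmp
  skill_is_stale "$path" "${2:-1440}" || return 0
  mkdir -p "$(dirname "$path")"
  tmp="$(mktemp)" || return 1
  if curl -fsSL "$SKILL_URL" -o "$tmp"; then
    mv "$tmp" "$path"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Usage (hypothetical cache location):
#   refresh_skill "$HOME/.cache/my-agent/SKILL.md" 1440
```

Writing to a temporary file and moving it into place means an agent that starts while the network is down still sees the last good copy.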
The endpoints an agent uses
Everything is a plain HTTP GET:
| Endpoint | Purpose |
|---|---|
| `/llms.txt` | Machine-readable site map: every page as a titled link with a one-line description, grouped by section. Fetch this first to find the right page. |
| `https://ai-infrastructure.net/<slug>/index.md` | Raw Markdown for any page, with no navigation chrome. Clean source for following a procedure and quoting exact commands. |
| `/llms-full.txt` | The entire knowledge base as one file. Large; use only when broad cross-page context is required. |
| `/SKILL.md` | The reusable skill itself. |
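As a small illustration of the `<slug>/index.md` pattern, the helper below turns a slug or a canonical page URL into the raw-Markdown URL. The function name `page_md_url` and the slug `some-page` in the usage comment are hypothetical; only the base URL and the `/index.md` suffix come from the table.

```bash
#!/usr/bin/env bash
# Sketch: derive the raw-Markdown URL for a page from its slug or canonical URL.
BASE_URL="https://ai-infrastructure.net"

page_md_url() {
  local page="$1"
  page="${page#"$BASE_URL"}"   # drop the site prefix if a full URL was given
  page="${page#/}"             # drop a leading slash
  page="${page%/}"             # drop a trailing slash
  page="${page%/index.md}"     # tolerate input that already ends in index.md
  printf '%s/%s/index.md\n' "$BASE_URL" "$page"
}

# Usage (hypothetical slug):
#   curl -fsSL "$(page_md_url some-page)"
```

Normalizing both forms lets an agent pass along whatever link it found in the site map without worrying about trailing slashes.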
The workflow it encodes
- Fetch `/llms.txt` and scan for the page that matches the problem: runbooks for incidents, recipes for build and deploy, diagnostics for "what does this tool or number mean".
- Read that page as `…/index.md` and follow it. Runbooks are ordered: pre-checks, numbered procedure, then an explicit verify step and a rollback.
- Verify with a real proof (`dcgmi diag`, `nccl-tests`, a smoke request, loss continuity), and cite the canonical page URL.
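The first step, scanning the site map for a matching page, might be sketched as follows. This assumes `/llms.txt` lists pages in the `- [Title](URL): description` form described by the llms.txt convention; the function name `find_pages` and the search keyword in the usage comment are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: read llms.txt on stdin and print "URL Title" for every link line
# whose title, URL, or description contains the keyword (case-insensitive).
# Assumes link lines look like: - [Title](URL): description
find_pages() {
  local keyword="$1"
  grep -i -F -- "$keyword" \
    | sed -n -E 's/^[[:space:]]*[-*][[:space:]]*\[([^]]+)\]\(([^)]+)\).*$/\2 \1/p'
}

# Usage (hypothetical keyword):
#   curl -fsSL https://ai-infrastructure.net/llms.txt | find_pages "nccl"
```

Lines that match the keyword but are not links, such as section headings, are dropped by the `sed` filter, so the output contains only pages the agent can go on to read and cite.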
References
- The skill file this page describes: https://ai-infrastructure.net/SKILL.md
- Machine-readable site map for agents: https://ai-infrastructure.net/llms.txt
- The llms.txt convention: https://llmstxt.org/
Related: Operational runbooks · Recipes and manifests · GPU diagnostics and validation · Agentic AIOps · Glossary