Agentic AIOps: autonomous incident operations

Scope: using software (increasingly LLM agents) to run the incident lifecycle on a GPU cluster: detect an anomaly, localize the faulty component, perform root-cause analysis (RCA), and mitigate, all from telemetry rather than a human paging through dashboards. Covers the task taxonomy, how raw metrics/logs/traces are compacted into something an agent can reason over, the closed-loop discipline that separates a real fix from a hopeful one, guardrails, and how to evaluate an ops agent honestly. This is operations of the cluster; for AI that writes kernels or tunes performance knobs see AI-assisted performance optimization.

What it is

AIOps applies machine learning and now agentic reasoning to operations data to shorten the path from "something is wrong" to "it is fixed." The field's canonical decomposition, used by Microsoft Research's AIOpsLab benchmark, is four task types: [AIOpsLab]

  • Detection: is there an incident right now? (anomaly on metrics/logs/traces.)
  • Localization: which service/component/node is responsible?
  • Analysis (RCA): what is wrong, meaning both the failing system and the fault type (config, resource exhaustion, code/image, network, hardware). Scoring splits this axis into RCA-system and RCA-fault-type, graded separately.
  • Mitigation: take the action that restores service, then confirm recovery.

An agentic AIOps system wires an LLM (or a small multi-agent team) into a loop with the cluster through a standardized Agent–Cloud Interface (ACI): the agent reads telemetry and issues actions (exec_shell, kubectl, read a log) through one interface, and an orchestrator grades the outcome. [AIOpsLab] The agent is not a chatbot over dashboards. It observes, hypothesizes, acts, and re-observes, the same OODA loop an SRE runs, but driven by a model.
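The observe, hypothesize, act, re-observe loop described above can be sketched in a few lines. This is an illustration only: `AgentCloudInterface`, `FakeCluster`, `run_loop`, and `scripted_propose` are hypothetical names, not AIOpsLab's API, and the `propose` callable stands in for where a model would choose the next action.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol


class AgentCloudInterface(Protocol):
    """Hypothetical single interface between the agent and the cluster."""

    def read_telemetry(self) -> dict: ...
    def run_action(self, action: str) -> str: ...


@dataclass
class FakeCluster:
    """Toy cluster for the sketch: unhealthy until 'restart' is issued."""

    healthy: bool = False
    history: List[str] = field(default_factory=list)

    def read_telemetry(self) -> dict:
        return {"error_rate": 0.0 if self.healthy else 0.4}

    def run_action(self, action: str) -> str:
        self.history.append(action)
        if action == "restart":
            self.healthy = True
        return "ok"


def run_loop(
    aci: AgentCloudInterface,
    propose: Callable[[dict, int], str],
    max_steps: int = 5,
    slo_error_rate: float = 0.01,
) -> bool:
    """Observe, pick an action, act, then observe again until recovered."""
    for step in range(max_steps):
        observation = aci.read_telemetry()
        if observation["error_rate"] <= slo_error_rate:
            return True  # recovery is read from telemetry, not assumed
        aci.run_action(propose(observation, step))
    return aci.read_telemetry()["error_rate"] <= slo_error_rate


def scripted_propose(observation: dict, step: int) -> str:
    """Stand-in for the model: inspect a log first, then restart."""
    return "read_log" if step == 0 else "restart"
```

Calling `run_loop(FakeCluster(), scripted_propose)` walks the toy cluster through one diagnostic read and one remediation before the loop sees a healthy reading and stops.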

flowchart LR
  TEL["Telemetry: metrics + logs + traces"] --> COMPACT["Compaction: anomalies, log templates, slow spans"]
  COMPACT --> DETECT["Detect"]
  DETECT --> LOCAL["Localize"]
  LOCAL --> RCA["RCA: system + fault-type"]
  RCA --> MIT["Mitigate (action)"]
  MIT --> CONFIRM{"Re-read telemetry:<br/>recovered?"}
  CONFIRM -->|"no"| RCA
  CONFIRM -->|"yes"| DONE["Resolved"]
  GUARD["Guardrails: human sign-off,<br/>blast-radius, authz"] -.-> MIT

Why it matters

At cluster scale the bottleneck is rarely a single missing metric. It is correlation under time pressure across thousands of nodes and three telemetry modalities. Manual RCA on a microservice or training incident "can take hours even for experienced Site Reliability Engineers." [AIOpsLab] That time is not free: failure detection and recovery are exactly the overheads that erode realized throughput. A cluster at 80%+ headline utilization still loses goodput to the mean time to recovery (MTTR) of every fault, and unresolved or slowly resolved incidents drag the effective training time ratio (ETTR) down. [Meta reliability study] Shrinking MTTR is a direct goodput lever.

The payoff is concrete when it works. OpsAgent, a 14B-model multi-agent system, reports a production deployment that cut incident resolution from ~2.5 hours (three engineers) to 126 seconds at 84.09% diagnostic accuracy, and state-of-the-art results on the OpenRCA benchmark of 335 real incidents, using local sub-20B models that avoid the cost and privacy exposure of a frontier API. [OpsAgent] For a GPU fleet, the recurring incident classes (NCCL hangs, ECC storms, thermal throttling, fabric-manager failures, scheduler starvation; see the runbooks) are precisely the structured, telemetry-rich problems agents are good at triaging.

When to use it (and when not)

Use agentic AIOps when:

  • The fleet is large and well-instrumented (rich DCGM metrics, NCCL/driver logs, and traces; see observability & monitoring, telemetry, monitoring & alerting) and incidents recur often enough that automation pays back.
  • MTTR dominates your reliability budget and human RCA does not scale to the incident rate.
  • You want a triage assistant that localizes and explains while a human approves the fix. This is the highest-value, lowest-risk entry point.

Do not reach for it when:

  • A deterministic runbook already resolves the class of incident (troubleshooting runbook, operational runbooks). A scripted check is cheaper, faster, and auditable; reserve the agent for the open-ended cases.
  • The action is destructive and unattended (draining a rack, deleting a PVC, power-cycling a node) without human sign-off or a tested rollback. Auto-mitigation amplifies a wrong diagnosis into an outage.
  • The fault is observability-dark: the telemetry needed to diagnose it does not exist. No agent localizes a signal the substrate never emitted; fix the instrumentation first.

How to build and operate it

The patterns below are architectural references, not a turnkey system. Validate every threshold and tool call against your own telemetry before trusting it.

1. Compact telemetry into something a model can reason over

Raw metrics/logs/traces overflow any context window. The load-bearing preprocessing step turns each modality into a short, ranked, structured description. OpsAgent's training-free recipe generalizes across systems because it is statistics, not a learned model: [OpsAgent]

  • Metrics: flag anomalies with a sliding-window 3-sigma rule (keep samples where |x − μ|/σ > 3, retaining the deviation in units of σ), classify each anomaly's shape (spike, level-shift, steady increase), then aggregate to the top-k pods and top-k metrics by deviation.
  • Logs: keep only lines that match fault keywords (fatal, error, crash, fail) and drop the routine rest, parse the survivors into templates with Drain3, rank templates by TF-IDF, keep those above an adaptive percentile (≈80th), and dedup same-template entries within a 1-minute window.
  • Traces: find high-latency spans with a per-call-type threshold (≈p95; a single global threshold mislabels), then aggregate by callee and by three-hop call path to surface systemic bottlenecks rather than one slow leaf.
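The metrics step of the recipe above can be sketched with the standard library alone. This is a minimal illustration, not OpsAgent's implementation: the function names, the trailing-window choice, and the default window of 30 samples are assumptions, and the anomaly-shape classification (spike, level-shift, steady increase) is left out.

```python
from statistics import mean, pstdev


def three_sigma_anomalies(series, window=30, threshold=3.0):
    """Flag samples more than `threshold` sigma from the trailing window.

    Returns (index, value, deviation_in_sigma) tuples, keeping the signed
    deviation so later ranking can use it.
    """
    anomalies = []
    for i in range(window, len(series)):
        reference = series[i - window:i]
        mu, sigma = mean(reference), pstdev(reference)
        if sigma == 0:
            # Assumption: a perfectly flat window has no defined z-score,
            # so it is skipped here; a real system needs an explicit rule.
            continue
        z = (series[i] - mu) / sigma
        if abs(z) > threshold:
            anomalies.append((i, series[i], z))
    return anomalies


def top_k_by_deviation(per_entity, k=3):
    """Rank entities (pods or metrics) by their largest absolute deviation.

    `per_entity` maps a name to its time series; entities with no anomaly
    are dropped, which is what keeps the agent's input short.
    """
    scored = []
    for name, series in per_entity.items():
        hits = three_sigma_anomalies(series)
        if hits:
            scored.append((name, max(abs(z) for _, _, z in hits)))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

Ranking by the retained deviation, rather than by raw value, lets metrics with different units share one top-k list.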

For a GPU cluster the same shape applies to DCGM time-series (thermal, ECC, XID, NVLink, power), NCCL/driver logs, and job traces.
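A sketch of the log step, under loud assumptions: `to_template` is a crude regex mask standing in for Drain3 (it is not Drain3's algorithm or API), the TF-IDF ranking and percentile cut are omitted, and the sample messages in the usage note are invented. It shows only the keyword filter and the one-minute same-template dedup.

```python
import re
from datetime import datetime, timedelta

FAULT_KEYWORDS = re.compile(r"fatal|error|crash|fail", re.IGNORECASE)
# Mask hex ids, integers, and dotted numbers such as IP addresses.
VARIABLE_TOKEN = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def to_template(message):
    """Crude stand-in for a log-template parser: replace variables by <*>."""
    return VARIABLE_TOKEN.sub("<*>", message)


def compact_logs(entries, window=timedelta(minutes=1)):
    """Keep fault-keyword lines and collapse same-template repeats.

    `entries` is an iterable of (timestamp, message) sorted by timestamp.
    A line is dropped when the same template was already kept less than
    `window` earlier.
    """
    last_kept = {}
    kept = []
    for timestamp, message in entries:
        if not FAULT_KEYWORDS.search(message):
            continue  # routine line, no fault keyword
        template = to_template(message)
        previous = last_kept.get(template)
        if previous is not None and timestamp - previous < window:
            continue  # duplicate of a recently kept template
        last_kept[template] = timestamp
        kept.append((timestamp, template))
    return kept
```

For example, three `"NCCL error on rank N"` lines at 0 s, 20 s, and 90 s collapse to two entries with the template `"NCCL error on rank <*>"`, because the second falls inside the one-minute window of the first.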

2. Split agents by diagnostic task, not by data modality

A deliberate OpsAgent finding: assigning one agent to metrics, one to logs, one to traces starves each of complementary signal and yields divergent, irreconcilable conclusions. Instead, give every expert agent all modalities and split by task (detection, triage/fault-type, localization), then let them cross-review each other's reasoning. The cross-review is the verification step that catches a confident-but-wrong hypothesis, and it produces an auditable trail. [OpsAgent]

3. Close the loop: confirm recovery from telemetry, never self-report

The single most important discipline: a mitigation counts as successful only when re-read cluster telemetry confirms the system recovered, never the agent's own claim that it fixed it. An agent will report success because its action returned exit 0; whether the SLO actually recovered is an independent measurement. Wire the post-action check to the same signals that detected the incident, and treat a still-anomalous reading as "not mitigated, re-diagnose." This closed loop is what separates an ops agent from a plausible-sounding narrator.
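A minimal sketch of that discipline, assuming two caller-supplied callables: `apply_fix` runs the remediation and returns its exit code, and `read_signal` re-reads the same signal that detected the incident. The function name, the three-reading requirement, and the 30-second default interval are illustrative choices, not values from any cited system.

```python
import time
from typing import Callable


def mitigate_and_confirm(
    apply_fix: Callable[[], int],
    read_signal: Callable[[], float],
    threshold: float,
    checks: int = 3,
    interval_s: float = 30.0,
) -> str:
    """Apply a fix, then decide the outcome from telemetry alone.

    Returns 'action_failed', 'not_mitigated', or 'recovered'. Exit code 0
    only earns the right to be checked; it never counts as recovery.
    """
    if apply_fix() != 0:
        return "action_failed"
    for _ in range(checks):
        time.sleep(interval_s)
        if read_signal() > threshold:
            # Still anomalous: hand back to diagnosis instead of closing.
            return "not_mitigated"
    return "recovered"
```

Requiring several consecutive healthy readings guards against a signal that dips briefly after a restart and then breaches again.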

4. Guardrail the actions

  • Human-in-the-loop for anything destructive or wide-blast-radius; start the agent in suggest mode (localize + propose) and graduate specific, well-tested remediations to auto only after they prove out.
  • Blast-radius limits: cap how many nodes/pods one action can touch; require a tested rollback.
  • Authorization and provenance: every remediation must answer the question "who or what authorized this action?" Agent identity and intent tend to collapse into a generic service account at the tool-call hop; bind actions to an attested, least-privilege identity so an agent cannot exceed its mandate. [zero-trust for agents]
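The three guardrails above can be combined into one gate in front of every action. The sketch below is hypothetical throughout: `ProposedAction`, `gate`, the field names, and the default cap of two targets are placeholders to be replaced by your own policy, and a non-empty `authorized_by` string stands in for a real attested identity check.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProposedAction:
    name: str
    targets: Tuple[str, ...]
    destructive: bool
    has_tested_rollback: bool
    authorized_by: str  # identity the action is bound to; "" if unknown


def gate(action, auto_allowlist, max_targets=2):
    """Return 'auto', 'needs_human', or 'reject' for a proposed action."""
    if not action.authorized_by:
        return "reject"  # no provenance, no action
    if len(action.targets) > max_targets:
        return "needs_human"  # blast radius above the cap
    if action.destructive or not action.has_tested_rollback:
        return "needs_human"
    if action.name not in auto_allowlist:
        return "needs_human"  # suggest mode until promoted
    return "auto"
```

Starting with an empty `auto_allowlist` gives pure suggest mode; promoting a remediation means adding its name to the set once it has proved out.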

5. Evaluate honestly: uniform config, leak-free

Benchmark the agent against a realistic suite (AIOpsLab injects live faults into microservice clusters and grades the four tasks; OpenRCA scores RCA on real incidents). [AIOpsLab; OpsAgent] Two integrity rules that decide whether a score means anything:

  • One declared config across all problems and all repetitions. Per-problem flag-toggling or cherry-picking is gaming; declare every non-default setting and run it uniformly. Report each axis separately (detection/localization/RCA-system/RCA-fault-type/mitigation), never a blended number.
  • Beware the oracle leak. Some fault injectors leave an artifact whose name is the answer: a config-map flag named for the fault, an annotation naming the target service. An agent that reads the answer key "localizes" without diagnosing, inflating the score. Grade provenance-blind for the leaderboard number if you must, but also report a leak-free number that discounts wins decided by reading a fault-naming artifact rather than the real symptom. The honest question is "did it diagnose, or did it cheat?"
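Both rules can be enforced at tally time. The sketch below assumes each graded run is a record with hypothetical keys `task`, `passed`, and `used_leak`, where `used_leak` has been set by whatever review or trajectory scan you trust to spot a fault-naming artifact. It reports every axis separately and never blends them.

```python
from collections import defaultdict


def per_axis_scores(results):
    """Tally raw and leak-free pass rates per task axis.

    `results` is a list of dicts with keys 'task', 'passed', 'used_leak'.
    A pass that relied on a leaked artifact counts toward 'raw' only.
    """
    totals = defaultdict(lambda: {"n": 0, "raw": 0, "leak_free": 0})
    for run in results:
        row = totals[run["task"]]
        row["n"] += 1
        if run["passed"]:
            row["raw"] += 1
            if not run["used_leak"]:
                row["leak_free"] += 1
    return {
        task: {
            "raw": row["raw"] / row["n"],
            "leak_free": row["leak_free"] / row["n"],
        }
        for task, row in totals.items()
    }
```

A large gap between `raw` and `leak_free` on one axis is itself a finding: it says the substrate, not the agent, decided those wins.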

Failure modes

  • Trusting self-report. The agent says "fixed"; the SLO is still breached. Always confirm recovery from telemetry (§3).
  • Hallucinated RCA stated as fact. A fluent but wrong root cause sends responders down a blind alley. Require cross-review and evidence citations; an unevidenced cause is a hypothesis, not a finding.
  • Oracle-leak-inflated scores. A benchmark number that looks strong because the substrate leaked the answer. Re-tally leak-free (§5).
  • Auto-mitigation amplifies the incident. A wrong fix applied at scale turns a degraded service into an outage. Guardrails and blast-radius limits (§4) exist for this.
  • Detection on noisy thresholds drives alert fatigue. Overly sensitive thresholds bury the real signal in false alarms; tune against historical incidents and prefer burn-rate logic (alerting & burn-rate rules).
  • Substrate blindness. The agent cannot localize a fault that emits no observable signal; the gap is instrumentation, not intelligence.

Open questions & validation

  • Capability vs. substrate: does localization work because the agent is good, or because the fault was trivially visible? Characterize each fault's gold signal (oracle-only / trivial-direct-break / genuine-diagnosable / observability-dark) and test on the genuine-diagnosable set, the only class that measures real capability.
  • A leak-free measurement protocol whose every number is re-derivable from raw trajectories by an independent reviewer.
  • Which remediations are safe to fully automate vs. always human-gated, and the tested-rollback bar for promotion.
  • The standing cost of running agents continuously against fleet telemetry, versus the MTTR/goodput it returns.
  • Mapping the four AIOpsLab tasks onto GPU-specific incident classes (NCCL hang, XID/ECC, thermal, fabric-manager, scheduler starvation) and their telemetry.

References

  • AIOpsLab — A Holistic Framework to Design, Develop, and Evaluate AI Agents for Enabling Autonomous Clouds (Microsoft Research). Repo: https://github.com/microsoft/AIOpsLab · Paper: https://arxiv.org/abs/2501.06706
  • OpsAgent — An Evolving Multi-agent System for Incident Management in Microservices, arXiv:2510.24145 (telemetry compaction recipe, task-not-modality agents, Lenovo 2.5 h → 126 s / 84.09%, OpenRCA SOTA): https://arxiv.org/abs/2510.24145
  • AIOps Solutions for Incident Management: Technical Guidelines and a Comprehensive Literature Review, arXiv:2404.01363: https://arxiv.org/abs/2404.01363
  • Drain3 — streaming log-template parsing (logpai): https://github.com/logpai/Drain3
  • C. Duan et al. (Meta), Revisiting Reliability in Large-Scale ML Research Clusters, arXiv:2410.21680 (MTTR/ETTR and the goodput cost of recovery): https://arxiv.org/abs/2410.21680
  • Chris Fregly, AI Systems Performance Engineering (O'Reilly), Ch. 20 — AI-assisted cluster operations (anomaly detection, automated failure analysis, RL-based control).

Related: Observability & Monitoring · Telemetry, Monitoring & Alerting · Operational Runbooks Index · Troubleshooting (symptom → fix) · Reliability, RAS & Failure Modes · Goodput · AI-Assisted Performance Optimization · Glossary