AIOpsLab: evaluating AIOps agents end to end

Scope: AIOpsLab, the Microsoft Research framework (MLSys 2025) for designing, developing, and evaluating LLM-based AIOps agents against live cloud environments, and the benchmark built with it: the problem formalization, the Agent-Cloud Interface (ACI), the fault library, the 48-problem benchmark, the metrics, and what the four evaluated agents actually scored. This page is the benchmark deep dive. The field view of running agentic operations on a GPU cluster (task taxonomy, telemetry compaction, guardrails) is covered under agentic AIOps, general agent scoring under evaluating agents, and the security-domain analogue of this page under cybersecurity agent evaluation.

Numbers and code fragments are from the MLSys 2025 paper and describe the published evaluation (GPT-3.5/GPT-4-era agents); the framework has evolved since, so verify current tasks and APIs on the repo (as of 2026-07). The aiopslab snippets are reference templates quoted from the paper, unexecuted. The Python protocol model is executed and asserted.

flowchart TB
  POOL["Problem pool: P = (task T, context C, oracle S)<br/>48 problems over 4 task levels (all evaluated)"] --> ORCH["Orchestrator: one session per agent-problem pair"]
  ORCH --> DEPLOY["Deploy service via Helm/K8s APIs"]
  DEPLOY --> SVC["Service under test:<br/>SocialNetwork, HotelReservation, Astronomy Shop, TiDB Operator"]
  WL["Workload generator (wrk2, Locust)"] --> SVC
  FAULT["Fault generator: symptomatic (ChaosMesh)<br/>+ functional (application/virtualization injectors)"] --> SVC
  SVC --> TEL["Telemetry: Prometheus metrics, Jaeger traces,<br/>Filebeat/Logstash logs"]
  TEL --> ORCH
  AGENT["Agent: async get_action(state)"] -->|"ACI actions: get_logs, get_metrics,<br/>get_traces, exec_shell, submit"| ORCH
  ORCH -->|"observations, API docs, feedback"| AGENT
  ORCH --> EVAL["Evaluator: success, TTD/TTM, steps,<br/>tokens, optional LLM-as-Judge"]

What it is

AIOpsLab is a framework that automates the whole evaluation loop for AIOps agents: it deploys a real microservice application on Kubernetes, drives it with a workload generator, injects a fault, exposes telemetry, mediates the agent's interaction through a typed interface, and scores the outcome against a per-problem oracle. The paper frames the goal as AgentOps: agents that manage the full incident lifecycle (detect, localize, diagnose, mitigate) toward self-healing clouds, and argues that progress is gated on exactly this kind of interactive benchmark, because static datasets and fixed question-answer formats cannot capture dynamic, evolving cloud incidents.¹

Every evaluation scenario is a problem formalized as P = (T, C, S): a task T (detection, localization, root-cause analysis, or mitigation), a context C, and an expected solution S (the oracle). The context splits into an operational environment E (the service, fault model, and workload model, hidden from the agent) and problem information I (service and task descriptions plus API documentation, shared with the agent, and telemetry queryable at runtime). Mitigation oracles deliberately check the general state of the entire system, for example that all services are up and running, rather than only the injected resource, because a mitigation can fix the target while breaking a neighbor.¹
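
The split between what the orchestrator holds and what the agent sees can be sketched as plain data. This is a minimal illustration of the formalization, not AIOpsLab's own classes: `Environment`, `ProblemInfo`, `Problem`, and `agent_view` are hypothetical names, and the field values are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Environment:
    """E: service, fault model, workload model. Hidden from the agent."""
    service: str
    fault_model: str
    workload_model: str


@dataclass(frozen=True)
class ProblemInfo:
    """I: descriptions and API documentation. Shared with the agent."""
    service_description: str
    task_description: str
    api_docs: tuple[str, ...]


@dataclass(frozen=True)
class Problem:
    """P = (T, C, S), with the context C split into E and I."""
    task: str            # T: detection / localization / RCA / mitigation
    env: Environment     # E
    info: ProblemInfo    # I
    oracle: Any          # S: expected solution, used only for scoring

    def agent_view(self) -> dict:
        """What the agent may see: the task and I, never E or S."""
        return {
            "task": self.task,
            "service_description": self.info.service_description,
            "task_description": self.info.task_description,
            "api_docs": list(self.info.api_docs),
        }


problem = Problem(
    task="localization",
    env=Environment("SocialNetwork", "misconfig_k8s on user-service", "wrk2, rate=100"),
    info=ProblemInfo(
        "A social network built from microservices.",
        "Name the faulty service.",
        ("get_logs", "get_metrics", "get_traces", "exec_shell", "submit"),
    ),
    oracle="user-service",
)
view = problem.agent_view()
print(sorted(view))
```

Keeping the oracle and the environment out of `agent_view` is the point of the sketch: anything the agent learns about the fault has to come from telemetry it queries at runtime.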

The four task levels form a difficulty ladder: detection asks for a binary fault-present answer, localization for the faulty service name, RCA for two sub-answers (the affected system layer and the fault type), and mitigation for a sequence of actions that restores the environment.²
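
A small sketch of the ladder and of an RCA check with two sub-answers. The enum, the `eval_rca` function, the dictionary keys, and the label strings are all assumptions made for illustration. Requiring both sub-answers for `success` is a choice of this sketch; this page does not state how the paper aggregates them.

```python
from enum import IntEnum


class TaskLevel(IntEnum):
    DETECTION = 1      # binary: is a fault present?
    LOCALIZATION = 2   # which service is faulty?
    RCA = 3            # affected system layer plus fault type
    MITIGATION = 4     # actions that restore the environment


def eval_rca(solution: dict, oracle: dict) -> dict:
    """Score the two RCA sub-answers independently, then jointly."""
    layer_ok = solution.get("system_level") == oracle["system_level"]
    type_ok = solution.get("fault_type") == oracle["fault_type"]
    return {"system_level": layer_ok, "fault_type": type_ok,
            "success": layer_ok and type_ok}


# Illustrative labels only; real problems define their own answer vocabulary.
oracle = {"system_level": "virtualization", "fault_type": "misconfiguration"}
partial = eval_rca({"system_level": "virtualization", "fault_type": "network"}, oracle)
full = eval_rca(dict(oracle), oracle)
print(partial, full)
```

Reporting the sub-answers separately keeps a half-right diagnosis (correct layer, wrong fault type) visible instead of folding it into a single failure.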

Why use it

It closes the Dev/Ops benchmark gap. SWE-bench-class benchmarks pushed the Dev side of DevOps; the Ops side lacked realistic, interactive evaluation. AIOpsLab integrates what previously existed only as separate pieces (application suites, chaos tools, observability stacks) into one scored loop.¹
Agent onboarding is deliberately cheap. An agent registers by implementing one method, async def get_action(state: str) -> str; the four evaluated agents needed 41 to 60 lines of code each to integrate.⁵
It measured something real. All four LLM agents beat the traditional non-LLM baselines on detection and localization (MKSMC 15.38% detection; PDiagnose 15.38% and RMLAD 7.69% localization), and the ranking among agents is informative: FLASH 59.32% overall, ReAct 55.93%, GPT-4 with shell 49.15%, GPT-3.5 with shell 15.25%.⁵
The hard tasks are exposed, not averaged away. FLASH solved 100% of detection problems but only 54.55% of mitigation problems; GPT-3.5 mitigated none (0%). Per-task tables are the value: an agent that looks mid-pack overall can be the only one that never false-alarms.⁵
Cost is a first-class metric. Tokens in and out are recorded per task; ReAct spent 16,941 tokens on average against FLASH's 6,484 for about 3.4 fewer accuracy points overall, which is the kind of trade a platform team needs visible before deploying an ops agent.⁵
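
The cost trade in the last point can be checked with a few lines of arithmetic. The accuracy and token figures are the overall values cited on this page (paper Tables 3-4); "tokens per accuracy point" is a derived ratio introduced here for comparison only, not a metric the paper defines.

```python
# Overall accuracy (%) and average tokens per problem, as cited on this page.
RESULTS = {
    "FLASH": {"accuracy": 59.32, "tokens": 6484.25},
    "REACT": {"accuracy": 55.93, "tokens": 16941.46},
    "GPT-4-W-SHELL": {"accuracy": 49.15, "tokens": 6394.5},
    "GPT-3.5-W-SHELL": {"accuracy": 15.25, "tokens": 2557.95},
}


def tokens_per_point(name: str) -> float:
    """Derived ratio (an assumption of this sketch): tokens spent per accuracy point."""
    r = RESULTS[name]
    return r["tokens"] / r["accuracy"]


# Accuracy gap and token ratio between the two strongest agents.
gap = round(RESULTS["FLASH"]["accuracy"] - RESULTS["REACT"]["accuracy"], 2)
ratio = round(RESULTS["REACT"]["tokens"] / RESULTS["FLASH"]["tokens"], 2)

# Cheapest to most expensive per accuracy point.
ranking = sorted(RESULTS, key=tokens_per_point)

print("accuracy gap (points):", gap)
print("REACT/FLASH token ratio:", ratio)
print("cheapest per point first:", ranking)
```

On these figures ReAct spends roughly 2.6 times FLASH's tokens while trailing by 3.39 points, which is the "about 3.4 fewer accuracy points" comparison above.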

When to use it (and when not)

Use it to compare candidate ops agents (or agent upgrades) under identical faults, workloads, and step budgets before letting any of them near production telemetry, the same pre-deployment gating this KB recommends for agent evaluation generally.
Use it to build custom problems for your own failure classes: the fault library is parametric and extensible, and a new localization problem is a few lines extending a task interface (the paper's K8STargetPortMisconf example).
Use it for research on agent behavior: the orchestrator logs full trajectories (actions and resulting states), which is what made the paper's failure-pattern analysis possible.
Do not treat the benchmark scores as current capability. The published numbers are GPT-3.5/GPT-4-era with a simplified FLASH reimplementation (the original was not public); they rank approaches at a point in time.
Do not assume testbed transfer. DeathStarBench applications on a lab cluster are not a production GPU platform; the paper itself positions AIOpsLab as a framework for building scenario suites, not a universal score.

Architecture

The Orchestrator owns the evaluation lifecycle and enforces separation between agent and service. Its Agent-Cloud Interface (ACI) defines the valid action set and how service state flows back as observations: default APIs include get_logs, get_metrics, get_traces (Jaeger extraction over a time window), and exec_shell (behind security policy filters), plus the final submit. On problem initialization the Orchestrator auto-extracts each API's docstring into the agent's context, and after every action it returns high-quality feedback including outputs, error messages, and tracebacks. Sessions are created per agent-problem pair; the step budget is set at start_problem(max_steps=...).³
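
Two of the mechanisms in that paragraph, docstring extraction and error feedback, are easy to sketch with the standard library. `ToyACI`, `extract_api_docs`, and `perform` are hypothetical names, and the log line is invented; this illustrates the idea and is not the aiopslab implementation.

```python
from __future__ import annotations

import inspect
import traceback


class ToyACI:
    """Hypothetical stand-in for an action set; not the aiopslab API."""

    def get_logs(self, namespace: str, service: str) -> str:
        """get_logs(namespace, service): return recent log lines for a service."""
        return f"[{namespace}/{service}] connection refused on port 9090"

    def submit(self, solution: str) -> str:
        """submit(solution): end the session with a final answer."""
        return f"submitted: {solution}"


def extract_api_docs(aci: object) -> dict[str, str]:
    """Collect each public method's docstring to place in the agent's context."""
    return {
        name: inspect.getdoc(fn) or ""
        for name, fn in inspect.getmembers(aci, inspect.ismethod)
        if not name.startswith("_")
    }


def perform(aci: object, name: str, *args: str) -> str:
    """Run one action; on failure, return the traceback as the observation."""
    try:
        return str(getattr(aci, name)(*args))
    except Exception:
        return traceback.format_exc()


aci = ToyACI()
docs = extract_api_docs(aci)
ok = perform(aci, "get_logs", "test-ns", "user-service")
invented = perform(aci, "restart_everything")      # API that does not exist
malformed = perform(aci, "get_logs", "test-ns")    # missing argument
print(docs["get_logs"])
print(invented.splitlines()[-1])
```

Returning the traceback as an observation, rather than raising, is what lets an agent repair a bad call on its next step.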

Problem initializers deploy the required service with Helm and Kubernetes APIs and start the workload and fault generators. Services integrated in the paper: SocialNetwork (28 microservices with Memcached, MongoDB, Redis) and HotelReservation (Go and gRPC), both from DeathStarBench, with workloads from wrk2 (DeathStarBench). The current repo has since added the OpenTelemetry Astronomy Shop demo (multi-language, gRPC and HTTP), custom TiDB Operator applications, and the Locust workload generator.¹

The fault library splits into two categories. Symptomatic faults (network loss, pod failure, container kill) manifest as observable symptoms with no deeper root cause, injected via Chaos-Mesh; they can only instantiate detection and localization problems. Functional faults model fine-grained root causes (misconfigurations, revoked database credentials, buggy application images, wrong binaries, bad scaling operations) at the application and virtualization levels, and most support all four task levels including mitigation. The evaluation's 10 published faults instantiate problems such as a Kubernetes target-port misconfiguration (extensible to 12 problems across services), MongoDB authentication revocation, a pod scaled to zero, assignment to a non-existent node, a persistent volume that blocks redeployment, and a TiDB operator misconfigured to a very high replica count.⁴
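
The relationship between fault category, task levels, and injection targets can be sketched as a small generator. `Fault`, `instantiate`, and the problem-ID format are hypothetical. The sketch simplifies "most functional faults support all four levels" to "all", which is an assumption.

```python
from dataclasses import dataclass
from itertools import product

LEVELS = ("detection", "localization", "rca", "mitigation")


@dataclass(frozen=True)
class Fault:
    name: str
    category: str  # "symptomatic" or "functional"

    def levels(self) -> tuple:
        """Symptomatic faults stop at localization; functional ones are
        assumed here to reach all four levels (a simplification)."""
        return LEVELS[:2] if self.category == "symptomatic" else LEVELS


def instantiate(fault: Fault, targets: list) -> list:
    """One parametric fault times N targets times its levels gives the problem IDs."""
    return [f"{fault.name}-{t}-{lvl}" for t, lvl in product(targets, fault.levels())]


targets = ["user-service", "text-service"]
symptomatic = instantiate(Fault("network_loss", "symptomatic"), targets)
functional = instantiate(Fault("target_port_misconfig", "functional"), targets)
print(len(symptomatic), len(functional))
```

Changing only the target list grows the pool, which is why a single parametric fault can account for many benchmark problems.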

The observability layer stores Prometheus metrics, Jaeger traces, and Filebeat/Logstash logs on disk, supports offline export for evaluating non-agentic AIOps algorithms on the same incidents, and is designed to admit new dimensions such as syscall logs.¹

Evaluators combine common metrics (success criteria per task, Time-to-Detect, Time-to-Mitigate, step count, token cost) with problem-specific checks and an optional LLM-as-Judge pass over trajectories, which the paper motivates with a case where an agent answered a detection problem correctly while its stated reasoning referenced a normal workload rather than the injected fault.

How to use it

Defining a problem and onboarding an agent, as printed in the paper (reference templates, unexecuted; verify current APIs on the repo):

# Reference template (paper Example 2.1): a localization problem in a few lines.
from aiopslab import LocalizationTask, SocialNetwork
from aiopslab import Wrk, VirtFaultInjector

class K8STargetPortMisconf(LocalizationTask):
    def __init__(self):
        self.app = SocialNetwork()
        self.ans = "user-service"           # oracle: the injected service
    def start_workload(self):
        wrk = Wrk(rate=100, duration=10)
        wrk.start_workload(url=self.app.frontend_url)
    def inject_fault(self):
        inj = VirtFaultInjector(self.app.ns)
        inj.inject([self.ans], "misconfig_k8s")
    def eval(self, soln, trace, duration):
        res = {}                            # per-problem result dict
        res["TTL"] = duration               # time to localize
        res["success"] = is_exact_match(soln, self.ans)
        return res

# Reference template (paper Example 2.3): agent onboarding is one method.
import asyncio  # stdlib; needed for asyncio.run below
from aiopslab import Orchestrator

class Agent:
    def __init__(self, prob, instructs, apis):
        self.prompt = self.set_prompt(prob, instructs, apis)
        self.llm = GPT4()
    async def get_action(self, state: str) -> str:
        return self.llm.generate(self.prompt + state)

orch = Orchestrator()
pid = "misconfig_app_hotel_res-mitigation-1"
prob_desc, instructs, apis = orch.init_problem(pid)
orch.register_agent(Agent(prob_desc, instructs, apis), name="myAgent")
asyncio.run(orch.start_problem(max_steps=10))
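
The agent contract itself can be exercised without a cluster. The following is a stdlib-only sketch of the `async get_action(state) -> str` contract against a toy driver loop. `ScriptedAgent`, `drive`, the action-string format, and the observation text are hypothetical and stand in for the Orchestrator.

```python
import asyncio


class ScriptedAgent:
    """Replays a fixed list of actions; a placeholder for an LLM-backed agent."""

    def __init__(self, script):
        self._script = iter(script)

    async def get_action(self, state: str) -> str:
        # Fall back to a submit so the session always terminates.
        return next(self._script, 'submit("unknown")')


async def drive(agent, max_steps: int) -> list:
    """Toy driver: feed observations to the agent until it submits or the budget ends."""
    state = "session started"
    history = []
    for _ in range(max_steps):
        action = await agent.get_action(state)
        history.append(action)
        if action.startswith("submit("):
            break
        state = f"observation for {action}"
    return history


script = [
    'get_logs("test-ns", "user-service")',
    'get_metrics("test-ns", 5)',
    'submit("user-service")',
]
history = asyncio.run(drive(ScriptedAgent(script), max_steps=10))
print(history)
```

A scripted agent like this is also a cheap way to smoke-test a new problem definition before spending tokens on a real model.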

How to develop with it

The scoring protocol is the part worth internalizing before writing problems, and it is small enough to validate directly. This executed model implements the session lifecycle and the per-task oracles, and asserts the properties that matter when interpreting scores: top-k localization semantics (the paper's ReAct and FLASH lose accuracy from Acc@3 to Acc@1), hand-computed TTD and TTM, budget exhaustion as a scored failure rather than a hang, the Noop false-positive check, and the whole-system mitigation oracle's lack of an attribution requirement:

# aiopslab_protocol.py - validated: the AIOpsLab evaluation protocol in miniature:
# session lifecycle (init, agent loop via ACI, submit, evaluate), per-task metrics
# (success, TTD, TTM, steps), budget exhaustion, and the paper's whole-system
# mitigation oracle. A model of the protocol in pure stdlib; does not run AIOpsLab.
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass(frozen=True)
class Act:
    name: str            # ACI call: get_logs / get_metrics / get_traces / exec_shell
    t: float             # session clock, seconds


@dataclass(frozen=True)
class Submit:
    solution: object
    t: float


Step = Union[Act, Submit]


def run_session(agent: Callable[[int], Step], max_steps: int) -> tuple[list[Step], Optional[Submit]]:
    """Orchestrator loop: poll get_action until submit or budget exhaustion."""
    trace: list[Step] = []
    for i in range(max_steps):
        step = agent(i)
        trace.append(step)
        if isinstance(step, Submit):
            return trace, step
    return trace, None       # budget exhausted: scored failed, never hung


def eval_detection(sub: Optional[Submit], fault_present: bool, t_fault: float) -> dict:
    """Level 1: binary yes/no answer; TTD = submit time - fault time when a fault exists."""
    if sub is None:
        return {"success": False, "TTD": None}
    ok = sub.solution == ("yes" if fault_present else "no")
    ttd = round(sub.t - t_fault, 2) if (ok and fault_present) else None
    return {"success": ok, "TTD": ttd}


def eval_localization(sub: Optional[Submit], ans: str, k: int) -> dict:
    """Level 2: match against the faulty service name; ranked lists scored at top-k."""
    if sub is None:
        return {"success": False}
    ranked = list(sub.solution) if isinstance(sub.solution, (list, tuple)) else [sub.solution]
    return {"success": ans in ranked[:k]}


def eval_mitigation(sub: Optional[Submit], healthy: Callable[[float], bool],
                    t_detect: float) -> dict:
    """Level 4, the paper's oracle: after the session the WHOLE system must be
    healthy (all services up), not only the injected resource; TTM runs from
    detection to completed mitigation."""
    if sub is None:
        return {"success": False, "TTM": None}
    ok = healthy(sub.t)
    return {"success": ok, "TTM": round(sub.t - t_detect, 2) if ok else None}


def healthy_after_45(t: float) -> bool:
    """System state model: every service reports healthy from t=45 onward."""
    return t >= 45.0


T_FAULT = 10.0

# 1) Localization: exact match scores 1, a wrong service scores 0; a top-3 list
#    holding the answer at rank 3 passes Acc@3 but fails Acc@1 (the paper's
#    REACT and FLASH Acc@1 drops in miniature).
assert eval_localization(Submit("user-service", 25.0), "user-service", k=1)["success"] is True
assert eval_localization(Submit("geo", 25.0), "user-service", k=1)["success"] is False
ranked = Submit(["post-storage-service", "text-service", "user-service"], t=30.0)
assert eval_localization(ranked, "user-service", k=3)["success"] is True
assert eval_localization(ranked, "user-service", k=1)["success"] is False

# 2) Detection with hand-computed TTD: fault at t=10, correct submit at t=28.61,
#    so TTD = 18.61. A Noop problem (Fault 10: nothing injected) answered "yes"
#    is a false positive and scores 0.
det = eval_detection(Submit("yes", 28.61), fault_present=True, t_fault=T_FAULT)
assert det == {"success": True, "TTD": 18.61}
assert eval_detection(Submit("yes", 15.0), fault_present=False, t_fault=T_FAULT)["success"] is False
assert eval_detection(Submit("no", 15.0), fault_present=False, t_fault=T_FAULT)["success"] is True

# 3) Mitigation with hand-computed TTM: detected at t=20, system healthy from
#    t=45, agent submits at t=50, so TTM = 30.0 measured from detection.
mit = eval_mitigation(Submit("done", 50.0), healthy_after_45, t_detect=20.0)
assert mit == {"success": True, "TTM": 30.0}
assert eval_mitigation(Submit("done", 40.0), healthy_after_45, 20.0)["success"] is False

# 4) Budget exhaustion: an agent that never submits is scored failed after
#    exactly max_steps ACI interactions, not left hanging.
trace, none_sub = run_session(lambda i: Act("get_logs", float(i)), max_steps=10)
assert none_sub is None and len(trace) == 10
assert eval_mitigation(none_sub, healthy_after_45, 20.0) == {"success": False, "TTM": None}

# 5) Adversarial: the environment heals before the agent does anything useful.
#    The paper's stated oracle checks the general state of the entire system
#    after resolution, with no attribution requirement, so a no-op agent that
#    submits into a healed system passes; the oracle measures outcomes, not
#    causality (see Failure modes).
assert eval_mitigation(Submit("done", 60.0), healthy_after_45, 20.0)["success"] is True

print("localization@3:", eval_localization(ranked, "user-service", k=3),
      "@1:", eval_localization(ranked, "user-service", k=1))
print("detection:", det, "| mitigation:", mit)
print("budget-exhausted agent scored failed after", len(trace), "steps")
print("all protocol assertions passed")

Output:

localization@3: {'success': True} @1: {'success': False}
detection: {'success': True, 'TTD': 18.61} | mitigation: {'success': True, 'TTM': 30.0}
budget-exhausted agent scored failed after 10 steps
all protocol assertions passed

When designing new problems, follow the paper's fault-extensibility pattern: prefer parametric faults that inject into any target service (the target-port misconfiguration reaches 10 services by changing one argument, and different targets change the blast radius and difficulty), and expect extra setup for some functional faults (ConfigMap updates and trigger scripts at the application level, Helm TLS configuration for the virtualization-level authentication fault).

How to maintain it

Pin the environment stack. Scores depend on the application versions, Chaos-Mesh, Helm charts, and the Kubernetes version underneath; re-baseline the same agent when any of them moves, exactly as for any evaluation harness.
Keep step budgets explicit and swept. The paper's step-limit sweep (3 to 20) shows agent ranking is budget-dependent: FLASH peaks at 59.32% with 20 steps while GPT-3.5 stops improving past 5 and only burns tokens. A single-budget comparison can invert a ranking.⁶
Watch environment flakiness separately. A deploy that fails, a workload generator that stalls, or a fault that does not land scores as agent failure unless the harness distinguishes them; log injection success and service readiness as part of each episode's metadata.
Version the problem pool. The paper's benchmark is 48 problems, fully evaluated (288 cases = 48 x 6 agents); the current repo has grown to ~90+ problems. Adding problems shifts averages, so report scores against a named pool revision.
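
Why a single-budget comparison can invert a ranking is easiest to see on a toy sweep. The data below is synthetic and the agent names are invented; it models each agent only by how many steps it needs per problem, and none of it comes from the paper.

```python
# Synthetic data: steps each agent needs to solve each of six problems.
# None means the agent never solves that problem at any budget.
STEPS_NEEDED = {
    "fast_shallow": [2, 3, 3, None, None, None],
    "slow_deep": [6, 8, 9, 12, 15, None],
}


def accuracy(steps: list, budget: int) -> float:
    """Fraction of problems solved within the step budget."""
    solved = sum(1 for s in steps if s is not None and s <= budget)
    return solved / len(steps)


def sweep(budgets: list) -> dict:
    """Accuracy per agent at each step budget."""
    return {
        b: {agent: round(accuracy(steps, b), 2) for agent, steps in STEPS_NEEDED.items()}
        for b in budgets
    }


table = sweep([3, 5, 10, 20])
for budget, row in table.items():
    print(budget, row)
```

At a budget of 5 the shallow agent leads, at 10 the two tie, and at 20 the ranking has flipped, so a score reported without its step budget is not comparable to another.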

Running it in production

The production role of AIOpsLab is as a gate, not a demo: before an ops agent gets write access to a cluster, it should clear the mitigation-level problems that correspond to your real failure classes, at your step budget, with token cost recorded, and it should pass the Noop problems without raising false positives. The paper's behavioral findings give the checklist its content: agents waste steps repeating identical API calls and inventing scripts that do not exist; they dump raw metrics into context (cat on telemetry files) and drown their own reasoning, which the paper connects to token exhaustion; malformed API calls repeat up to 14 times in a 20-step session for the weakest agent; and only one of four agents correctly said "nothing is wrong" on both fault-free problems.⁷ Map those to deployment policy: cap repeated identical actions, force telemetry through compaction before it reaches the model (the approach detailed in agentic AIOps), and require a clean false-positive record before paging anyone on an agent's word. Re-run the suite on every model or prompt upgrade, treating agent regressions like performance regressions: silent until measured.
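
Two of those policies, capping repeated identical actions and gating on a clean false-positive record, can be sketched in a few lines. `RepeatGuard`, `clears_gate`, the cap of 3, and the 0.75 threshold are assumptions chosen for illustration, not values from the paper.

```python
from collections import Counter


class RepeatGuard:
    """Blocks an action once the same string has been issued too many times."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.counts = Counter()

    def check(self, action: str) -> bool:
        """Return True if the action may run, False once it exceeds the cap."""
        self.counts[action] += 1
        return self.counts[action] <= self.max_repeats


def clears_gate(mitigation_passed: list, noop_answers: list,
                min_mitigation_rate: float) -> bool:
    """Gate: enough mitigations succeed AND every no-fault problem is answered 'no'."""
    rate = sum(mitigation_passed) / len(mitigation_passed)
    no_false_alarms = all(answer == "no" for answer in noop_answers)
    return rate >= min_mitigation_rate and no_false_alarms


# The same malformed call issued 14 times, as in the weakest agent's worst case.
guard = RepeatGuard(max_repeats=3)
malformed = 'get_logs("user-service"'
allowed = [guard.check(malformed) for _ in range(14)]
print("allowed:", sum(allowed), "blocked:", allowed.count(False))

print(clears_gate([True, True, False, True], ["no", "no"], 0.75))
print(clears_gate([True, True, True, True], ["no", "yes"], 0.75))
```

The second gate call fails despite perfect mitigation because of one false alarm, which matches the ordering of concerns above: an agent that pages on healthy systems does not get write access.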

Failure modes

Testbed-to-production gap. Social networks and hotel reservation demos exercise generic microservice failures; a GPU platform adds NCCL hangs, XID errors, and fabric faults the pool does not contain. Extend the fault library toward your incident history before trusting a score (the open direction the field page flags for GPU-specific incident classes).
Fault-catalog overfitting. Ten fault types with parametric targets is a benchmark, not the world; an agent tuned to the catalog can memorize symptom-to-fix mappings that fail on the first novel incident. Rotate held-out faults, the same discipline as evaluation integrity.
Flaky environments misread as agent failure. Kubernetes deployments, chaos injectors, and workload generators fail on their own schedule; without per-episode environment health checks, infrastructure noise lands in the agent's accuracy column.
Cost blowups from long sessions. Mitigation runs averaged up to 216 seconds (FLASH) and about 29K input tokens (ReAct) per problem in the paper's tables; a large pool times several agents times repeated trials multiplies quickly. Budget runs like any eval campaign and prune the pool to representative problems.
Outcome oracles without attribution. The mitigation oracle checks final system health, so self-healing effects (a pod restart loop that clears the fault) can credit an idle agent, and the validated model above demonstrates the pass. Pair outcome checks with trajectory review (the optional LLM-as-Judge) when scores gate real access.
Leaderboard gaming via task memorization. Problem descriptions, service names, and fault labels are public; an agent whose training data includes the repo can pattern-match rather than diagnose. Test with renamed services and fresh faults before believing a headline number.
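
The flaky-environment failure mode has a mechanical fix: record environment health per episode and keep invalid episodes out of the accuracy denominator. A minimal sketch, assuming hypothetical `Episode` fields that a harness would populate from its own readiness and injection checks:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Episode:
    agent_success: bool
    service_ready: bool   # deployment reached readiness before the session
    fault_landed: bool    # environment matched the intended fault state
                          # (for a no-fault problem: nothing was injected)


def summarize(episodes: list) -> dict:
    """Separate environment failures from agent failures before scoring."""
    valid = [e for e in episodes if e.service_ready and e.fault_landed]
    accuracy: Optional[float] = (
        sum(e.agent_success for e in valid) / len(valid) if valid else None
    )
    naive = sum(e.agent_success for e in episodes) / len(episodes)
    return {
        "valid": len(valid),
        "invalid": len(episodes) - len(valid),
        "accuracy": accuracy,          # over valid episodes only
        "naive_accuracy": naive,       # what you get if flakes count as failures
    }


episodes = [
    Episode(agent_success=True, service_ready=True, fault_landed=True),
    Episode(agent_success=False, service_ready=True, fault_landed=True),
    Episode(agent_success=False, service_ready=False, fault_landed=True),  # deploy failed
    Episode(agent_success=False, service_ready=True, fault_landed=False),  # injector failed
]
summary = summarize(episodes)
print(summary)
```

Here half the episodes never tested the agent at all; counting them halves its apparent accuracy. Invalid episodes should be re-run rather than dropped silently, so the report still covers the whole pool.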

References

Chen, Shetty, Somashekar, Ma, Simmhan, Mace, Bansal, Wang, Rajmohan, AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds (MLSys 2025): https://arxiv.org/abs/2501.06706
Paper PDF (Microsoft Research): https://www.microsoft.com/en-us/research/wp-content/uploads/2024/10/AIOpsLab-1.pdf
AIOpsLab repository (Microsoft): https://github.com/microsoft/AIOpsLab
DeathStarBench microservice suite (SocialNetwork, HotelReservation): https://github.com/delimitrou/DeathStarBench
OpenTelemetry Astronomy Shop demo: https://opentelemetry.io/docs/demo/
Chaos Mesh (symptomatic fault injection): https://chaos-mesh.org/
ReAct (evaluated agent pattern, arXiv 2210.03629): https://arxiv.org/abs/2210.03629

¹ AIOpsLab (MLSys 2025): AgentOps vision; problem P = (T, C, S) with context C = (E, I) (environment hidden, problem information shared); Orchestrator + ACI over deployed services (SocialNetwork 28 microservices, HotelReservation) with wrk2 workloads; Helm/Kubernetes deployment; telemetry via Prometheus, Jaeger, Filebeat/Logstash with on-disk storage and offline export; mitigation oracles check whole-system state; source at https://aka.ms/aiopslab-repo. The OpenTelemetry Astronomy Shop demo, TiDB Operator apps, and the Locust workload generator are later additions in the current repo, not part of the paper.
² Task taxonomy (paper Table 1): Level 1 detection (binary), Level 2 localization (faulty service), Level 3 RCA with two sub-tasks (system level and fault type), Level 4 mitigation (restore the environment); higher level, later lifecycle stage, harder.
³ ACI: default APIs get_logs, get_metrics, get_traces, exec_shell (security-filtered); API docstrings auto-extracted into agent context; feedback includes outputs, error messages, tracebacks; agent contract is async def get_action(state: str) -> str; sessions with start_problem(max_steps=...).
⁴ Fault library: symptomatic (performance degradation, crash failures; Chaos-Mesh; levels 1-2 only, e.g. NetworkLoss, PodFailure, ContainerKill) vs functional (fine-grained root causes at application/virtualization level; most reach levels 1-4, e.g. AuthenticationMissing, TargetPortMisconfig with 12 problems, RevokeAuth, UserUnregistered, BuggyAppImage, ScalePod, AssignNonExistentNode, OperatorOverload, PVRedeployment, WrongBinUsage); Noop problems test false positives; 48 problems, fully evaluated (288 cases = 48 x 6 agents).
⁵ Paper Tables 3-4. Overall: FLASH 59.32% (60 LoC, 99.64 s, 8.48 steps, 6,484.25 tokens), REACT 55.93% (49, 43.79 s, 11.50, 16,941.46), GPT-4-W-SHELL 49.15% (41, 28.61 s, 6.44, 6,394.5), GPT-3.5-W-SHELL 15.25% (41, 12.44 s, 14.70, 2,557.95). Detection: FLASH 100%, REACT 76.92%, GPT-4 69.23%, GPT-3.5 23.07%, MKSMC 15.38%. Localization Acc@3/Acc@1: REACT 69.23/53.85, GPT-4 61.54/61.54, FLASH 61.54/46.15, GPT-3.5 30.77/30.77, PDiagnose 15.38, RMLAD 7.69. RCA: REACT 45.45%, GPT-4 40.90%, FLASH 36.36%, GPT-3.5 9.09%. Mitigation: FLASH 54.55% (216.41 s, 16.09 steps), REACT 36.36%, GPT-4 27.27%, GPT-3.5 0%. FLASH is the authors' simplified reimplementation (original not public).
⁶ Step-limit sweep (paper Figure 5): FLASH reaches 59.32% at a 20-step limit; GPT-3.5 gains nothing past 5 steps; accuracy plateaus indicate self-repair from environment feedback saturates quickly on AIOps tasks, unlike Dev tasks with linters and tests; the authors call for better task decomposition, intermediate feedback, and approaches beyond self-repair.
⁷ Behavior analysis (paper Section 3.6): get_logs is the most-used API; FLASH never calls get_traces; successful runs use get_metrics/get_traces sparingly (2.6% each) vs failures (8.2%/5.8%); wasted steps on repeated calls and generating non-existent APIs; raw telemetry consumed via cat overloads context; GPT-3.5 repeated a malformed API call up to 14 times in one 20-step case; only GPT-4-W-SHELL passed both Noop (no-fault) detection problems, all others reported false positives.