
Agentic incident management: OpsAgent

Scope: OpsAgent (arXiv 2510.24145), a lightweight self-evolving multi-agent system for incident management in microservices, as the worked example of a production-shaped diagnosis pipeline. The page covers the training-free observability-data processor, the task-aligned expert agents with cross-review, the dual self-evolution mechanism (PPO fine-tuning plus reflection into a validated experience store), and what all of that scores on the OpenRCA benchmark and in a 53-day production deployment. The field view of agentic operations (task taxonomy, guardrails, evaluation discipline) is covered under agentic AIOps; the team-of-agents design pattern in general is covered under multi-agent collaboration.

All numbers on this page are the paper's (OpenRCA benchmark runs and the Lenovo deployment) and were not reproduced here. The Python example in "How to develop with it" is executed and asserted; it models the paper's evaluation metrics and the value of its validation-gated experience reuse, but it does not run OpsAgent. The paper's code link points to an anonymized repository that requires authorization (as of 2026-07), so no code link is cited.

flowchart TB
  Q["Query (incident ticket)"] --> II["Intent Interpreter: time window + requested tasks"]
  II --> ORCH["Orchestrator: fetch metrics/logs/traces, invoke processor"]
  ORCH --> DP["Training-free data processor: unified textual descriptions"]
  DP --> A1["Anomaly Sentinel (AD)"]
  DP --> A2["Failure Diagnoser (FT)"]
  DP --> A3["Root Detective (RCL)"]
  A1 <-->|"cross-review"| A2
  A2 <-->|"cross-review"| A3
  A1 <-->|"cross-review"| A3
  A1 --> RPT["Root Cause Report: compiled by the Orchestrator\nfrom all agents' results + reasoning trail"]
  A2 --> RPT
  A3 --> RPT
  RPT --> OCE{"OCE mitigation succeeds?"}
  OCE -->|"no: feedback appended"| ORCH
  OCE -->|"yes"| EVO["Self-evolution: PPO updates + reflection into per-agent knowledge bases"]
  EVO -.->|"RAG at inference (per-agent stores, edge shown once)"| A1

What it is

OpsAgent is a multi-agent incident-management (IM) system built around a deliberately small reasoning core (a 14B-parameter open-source model, Qwen2.5-14B-Instruct-1M in the reference configuration) rather than a closed frontier API. It decomposes IM into the three tasks on-call engineers actually run: anomaly detection (AD), failure triage (FT), and root cause localization (RCL).¹ Three design decisions target the three reasons the authors argue automated IM has not been adopted: deep-learning pipelines overfit one system's data and explain nothing; closed-model LLM pipelines cost too much and skew toward shallow similarity matching; and neither accumulates expertise the way an on-call rotation does.

The system answers each with a module. A training-free data processor converts heterogeneous metrics, logs, and traces into unified textual descriptions using statistics and heuristics, with no learned feature extractors to retrain per system. A multi-agent collaboration framework assigns roles by diagnostic task, not by data modality, and makes the agents critique one another before a conclusion is issued, producing an auditable trail. A dual self-evolution mechanism improves the agents after deployment: internally by PPO fine-tuning against a reward that scores both accuracy and reasoning quality, and externally by distilling validated diagnoses into per-agent knowledge bases retrieved at inference time.¹

Why use it

The pipeline shape survives contact with production. Deployed in Lenovo's environment (on the order of 25,000 infrastructure instances, VictoriaMetrics metrics, OpenTelemetry logs and traces), OpsAgent processed 10,492 incidents over 53 days at 84.09% diagnostic accuracy, where accuracy means the validated report named the right root-cause reason and component, confirmed by mitigation.⁶
It beats bigger models on the benchmark. On OpenRCA (335 incidents, three real microservice systems, over 68 GB of telemetry), OpsAgent on the 14B seed averages 16.54% Correct and 41.35% Partial, versus 11.28%/32.33% for RCA-Agent driven by Claude 3.5 Sonnet, a relative gain the paper states as 46.63% Correct and 27.90% Partial over the prior state of the art.³
Cost and latency fit an on-call loop. Roughly one minute per case (71.18 s/case on the 14B seed) against nearly five minutes for RCA-Agent's generate-run-repair code loop; in production, routine incidents (about 70% of volume, disk-space exhaustion and proxy misconfiguration types) were diagnosed in about 30 seconds at 97% accuracy. The paper contrasts this with the traditional cost of 3 engineers for about 2.5 hours per incident.⁶
The audit trail is the product. Every expert answer, peer critique, and refinement lands in the Root Cause Report; frontline engineers rating 150 sampled production reports scored consistency 4.6, clarity 4.3, relevance 4.7, and rationality 4.0 on a 1-to-5 scale. The authors' first lesson learned: interpretability is a prerequisite for adoption, not a nicety, and a wrong-but-evidenced report still shortens manual investigation.⁶
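
The relative gains and the latency gap quoted above follow directly from the Table 1 averages. A quick check, using only figures already on this page (the variable and function names are this sketch's own):

```python
# Headline arithmetic from the paper's Table 1 averages (values copied from this page).
opsagent = {"correct": 16.54, "partial": 41.35, "sec_per_case": 71.18}
rca_agent = {"correct": 11.28, "partial": 32.33, "sec_per_case": 287.71}


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


gain_correct = relative_gain(opsagent["correct"], rca_agent["correct"])
gain_partial = relative_gain(opsagent["partial"], rca_agent["partial"])
latency_ratio = rca_agent["sec_per_case"] / opsagent["sec_per_case"]
absolute_gain = opsagent["correct"] - rca_agent["correct"]  # percentage points

print(f"relative gain: {gain_correct:.2f}% Correct, {gain_partial:.2f}% Partial")
print(f"absolute gain: {absolute_gain:.2f} points Correct")
print(f"latency ratio: {latency_ratio:.1f}x")
```

Keeping the absolute gain (5.26 percentage points of Correct) next to the relative figure is useful when comparing against a local baseline, since a relative gain over a low base can look larger than it feels operationally.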

When to use it (and when not)

Use the pattern where incidents arrive with heterogeneous telemetry and the diagnosis must be auditable before anyone acts on it; the task-aligned roles plus cross-review produce exactly the evidence trail a reviewer needs.
Use a small local model when observability data cannot leave the estate; the entire design exists to make a 14B model sufficient, and the paper shows general-purpose scaffolds (CoT, ReAct, Reflexion) collapse at this scale on IM (averages of 0.00 to 2.26% Correct across seeds).³
Do not expect benchmark numbers in production, in either direction. Lenovo accuracy (84.09%) is far above the OpenRCA average (16.54% Correct) because production queries carry alert context that narrows the search space; conversely, the hard remaining share of incidents, roughly 30% (cross-component, ambiguous, incomplete evidence), reached only 54% even with up to three re-analysis rounds.⁶
Do not skip the mitigation-validated feedback loop. The evolution mechanism only ingests diagnoses confirmed by successful mitigation; without that closure signal the system cannot safely learn from its own outputs.
Budget for the judge at training time. The PPO reward uses Qwen3-235B-A22B to score reasoning quality along four dimensions; that cost is training-time only, but it is real.
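
The production figures above are mutually consistent, which is worth confirming before using them for staffing estimates. The sketch below treats the approximate shares ("about 70%", "roughly 30%") as exact, which is an assumption of this check and not something the paper states:

```python
# Consistency check on the production numbers quoted on this page.
# The 70/30 split is approximate in the source, so this is a sanity check only.
routine_share, routine_acc = 0.70, 0.97
hard_share, hard_acc = 0.30, 0.54
reported_overall = 0.8409

blended = routine_share * routine_acc + hard_share * hard_acc
print(f"blended accuracy: {blended:.4f} (reported {reported_overall:.4f})")

# Where the misses come from: share of all misdiagnoses that are hard cases.
routine_miss = routine_share * (1 - routine_acc)
hard_miss = hard_share * (1 - hard_acc)
tail_miss_fraction = hard_miss / (routine_miss + hard_miss)
print(f"hard cases account for {tail_miss_fraction:.0%} of misses")
```

Under these assumptions the blend lands within a tenth of a point of the reported 84.09%, and the hard 30% of incidents produce roughly 87% of the misdiagnoses, which is the quantitative form of "humans keep the tail".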

Architecture

Training-free data processor. Metrics pass through 3-sigma sliding-window anomaly detection (keeping deviation scores in sigma units), a small pre-trained CNN assigns each anomaly a shape label (spike, steady increase, level shift; 20 steps of context before, 10 after), and anomalies aggregate into records of the form service_instance, metric_name, anomaly_pattern, timestamp, deviation_score, keeping the top-5 pods and top-5 metrics by deviation. Logs are filtered by an incident lexicon (fatal, error, crash, fail), parsed into templates with Drain3, ranked by TF-IDF with an adaptive threshold (default 80th percentile), and deduplicated per template within one-minute windows. Traces mark spans above a per-call-type latency threshold (default 95th percentile), aggregate them by 60-second window and callee (count, max latency, caller distribution), and tally three-hop call paths (grandparent to caller to callee) to expose recurrent bottlenecks.² The ablation shows why this layer exists: feeding filtered but raw data instead of descriptions drops average Correct from 16.54% to 2.26%.⁴
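
The metrics branch described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a trailing window of 30 points, uses a single top-k cap where the paper caps pods and metrics separately, leaves the CNN shape label as a placeholder, and the pod and metric names are made up:

```python
import numpy as np


def sigma_scores(series, window: int) -> np.ndarray:
    """Deviation of each point from its trailing window, in sigma units.
    Points without a full window, or with a flat window, score 0."""
    x = np.asarray(series, dtype=float)
    scores = np.zeros(len(x))
    for t in range(window, len(x)):
        hist = x[t - window:t]
        sd = hist.std()
        if sd > 0:
            scores[t] = (x[t] - hist.mean()) / sd
    return scores


def top_anomaly_records(metrics: dict, window: int = 30, sigma: float = 3.0,
                        k: int = 5) -> list[dict]:
    """One record per (instance, metric) whose worst point exceeds `sigma`,
    keeping the k largest by absolute deviation."""
    records = []
    for (instance, metric), series in metrics.items():
        scores = sigma_scores(series, window)
        t = int(np.argmax(np.abs(scores)))
        if abs(scores[t]) > sigma:
            records.append({
                "service_instance": instance,
                "metric_name": metric,
                "anomaly_pattern": "unclassified",  # placeholder for the CNN shape label
                "timestamp": t,                     # sample index stands in for a timestamp
                "deviation_score": round(float(scores[t]), 2),
            })
    records.sort(key=lambda r: abs(r["deviation_score"]), reverse=True)
    return records[:k]


# Synthetic telemetry: one pod with a spike at index 40, one quiet pod.
metrics = {
    ("cart-0", "cpu_usage"): [10, 11] * 20 + [30, 10],
    ("cart-1", "cpu_usage"): [5, 6] * 21,
}
records = top_anomaly_records(metrics)
for r in records:
    print(", ".join(f"{key}={value}" for key, value in r.items()))
```

Only the spiking pod produces a record, and the record is already the compact text the agents read; the quiet pod's ordinary one-sigma wobble never reaches the model.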

Task-aligned agents and cross-review. An Intent Interpreter extracts the analysis window and requested tasks from the query; the Orchestrator fetches telemetry, invokes the processor, and dispatches the shared descriptions to three expert agents defined by structured profiles: Anomaly Sentinel (AD), Failure Diagnoser (FT), Root Detective (RCL), each reasoning with chain-of-thought in parallel. Roles map to tasks rather than modalities deliberately: splitting by metrics/logs/traces would give each agent a partial view and lead to irreconcilable conclusions. Cross-review then has each agent critique the other two's answers and rationales (overlooked evidence, unclear reasoning, alternative hypotheses), refinements follow, and everything is recorded into the Root Cause Report.¹ Removing cross-review alone drops average Correct from 16.54% to 6.77%.⁴
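
The answer, critique, and refine sequence can be modelled without any LLM to make the shape of the audit trail concrete. The sketch below is an assumption-laden illustration: the paper's actual control flow and prompts are not reproduced here, a single critique round is assumed, and `Finding`, `Expert`, and `stub` are hypothetical names with deterministic stand-ins for model calls:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    agent: str
    answer: str
    rationale: str


@dataclass
class Expert:
    name: str
    answer: Callable[[str], Finding]                 # descriptions -> first finding
    critique: Callable[[Finding], str]               # peer finding -> note ("" = no objection)
    refine: Callable[[Finding, list[str]], Finding]  # own finding + notes -> final finding


def cross_review(experts: list[Expert], descriptions: str):
    """One answer round, one all-pairs critique round, one refinement round.
    Returns the refined findings and the ordered trail a report would attach."""
    trail: list[tuple[str, str, str]] = []
    findings = {e.name: e.answer(descriptions) for e in experts}
    for f in findings.values():
        trail.append(("answer", f.agent, f.answer))
    notes: dict[str, list[str]] = {name: [] for name in findings}
    for reviewer in experts:
        for name, finding in findings.items():
            if name == reviewer.name:
                continue  # agents do not review themselves
            note = reviewer.critique(finding)
            if note:
                notes[name].append(note)
                trail.append(("critique", f"{reviewer.name} -> {name}", note))
    refined = {}
    for e in experts:
        refined[e.name] = e.refine(findings[e.name], notes[e.name])
        trail.append(("refined", e.name, refined[e.name].answer))
    return refined, trail


def stub(name: str, first: str, objection_to: str = "", fix: str = "") -> Expert:
    """Deterministic stand-in for an LLM-backed expert (hypothetical helper)."""
    def answer(_descriptions: str) -> Finding:
        return Finding(name, first, "stub rationale")

    def critique(peer: Finding) -> str:
        return f"evidence contradicts '{peer.answer}'" if peer.agent == objection_to else ""

    def refine(own: Finding, received: list[str]) -> Finding:
        return Finding(name, fix, "revised after review") if received and fix else own

    return Expert(name, answer, critique, refine)


experts = [
    stub("Anomaly Sentinel", "onset 12:03"),
    stub("Failure Diagnoser", "network latency", fix="disk space exhaustion"),
    stub("Root Detective", "node-3", objection_to="Failure Diagnoser"),
]
refined, trail = cross_review(experts, "node-3 disk_used: level shift, +6.1 sigma")
for step in trail:
    print(step)
```

The trail keeps the superseded first answer alongside the critique that displaced it, which is the property that lets a reviewer see why the final diagnosis changed.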

Dual self-evolution. Internally, the three expert agents are fine-tuned with PPO; the reward combines binary accuracy (5 for a correct diagnosis, 0 otherwise) with a reasoning-quality score from an external judge (Qwen3-235B-A22B rating consistency, clarity, relevance, rationality on 0-5), blended by a tunable coefficient. Externally, after a correctly resolved case, the responsible agent reflects and distills a symptoms-to-experience entry; entries are embedded (Sentence-Transformer) into a per-agent faiss store and retrieved by symptom similarity at inference (RAG), with conflicting entries superseded by the newer and complementary ones merged. Only mitigation-validated trajectories enter the store, and failed retrieval falls back to plain chain-of-thought.⁵ The two halves are complementary in the ablation (without reflection: 10.53% Correct; without PPO: 12.78%), and capability grows with deployment data: 8.27% Correct with no evolution, 12.03% at 20% of cases, 14.29% at 40%, 16.54% at 60%.⁴
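
The reward described above has two ingredients the page states (a 5-or-0 accuracy term and a 0-5 average over four judge dimensions) and one it does not: the exact blending formula. The sketch below assumes a convex combination with coefficient `alpha`; both that form and the default of 0.5 are assumptions of this example, not the paper's stated values:

```python
JUDGE_DIMENSIONS = ("consistency", "clarity", "relevance", "rationality")


def blended_reward(correct: bool, judge_scores: dict[str, float],
                   alpha: float = 0.5) -> float:
    """Reward in [0, 5]: alpha weights accuracy, (1 - alpha) weights reasoning quality.
    The convex-combination form and the default alpha are assumptions of this sketch."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if set(judge_scores) != set(JUDGE_DIMENSIONS):
        raise ValueError(f"expected scores for {JUDGE_DIMENSIONS}")
    if any(not 0.0 <= s <= 5.0 for s in judge_scores.values()):
        raise ValueError("judge scores must lie in [0, 5]")
    accuracy = 5.0 if correct else 0.0
    quality = sum(judge_scores.values()) / len(judge_scores)
    return alpha * accuracy + (1.0 - alpha) * quality


well_argued = {"consistency": 4, "clarity": 4, "relevance": 5, "rationality": 3}
reward_right = blended_reward(True, well_argued)   # correct and well argued
reward_wrong = blended_reward(False, well_argued)  # wrong but well argued
print(reward_right, reward_wrong)
```

With any `alpha` below 1, a wrong but well-reasoned rollout still earns partial reward, so the coefficient decides how strongly the policy is pushed toward legible reasoning versus raw accuracy.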

How to use it

The paper's evaluation setup is the reference deployment recipe: a sub-20B open-weight seed model served locally (Qwen2.5-14B-Instruct-1M was the strongest of the three tested; gpt-oss-20b and Phi-3-medium-128k-instruct also run the pipeline), the data processor pointed at the estate's metrics, log, and trace stores, and the agent profiles (name, task description, operational instructions, examples) adapted to local component vocabularies. The implementation stack is ordinary Python/PyTorch (Python 3.10.16, PyTorch 2.6.0, Transformers 4.51.1), and experiments ran on a single 8-GPU server with 48 GB per GPU, which is the scale of a modest inference node rather than a training cluster.¹ Queries are natural-language tickets; each response is a Root Cause Report that names the anomaly onset time, failure type, and culprit component with the full reasoning and review trail attached. On failed mitigation, the Orchestrator re-runs the analysis with feedback appended to the query; production allowed up to three rounds.
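
The feedback loop at the end of that recipe is simple enough to sketch. This is an illustration under assumptions: it reads "up to three rounds" as one initial pass plus at most three re-analysis rounds, and `diagnose_with_feedback`, `fake_diagnose`, and `fake_mitigate` are hypothetical names standing in for the agent pipeline and the on-call engineer:

```python
from typing import Callable, Optional


def diagnose_with_feedback(query: str,
                           diagnose: Callable[[str], str],
                           mitigate: Callable[[str], tuple[bool, str]],
                           max_reanalysis: int = 3) -> tuple[Optional[str], int]:
    """Diagnose, let the engineer attempt mitigation, and on failure append the
    feedback to the query and re-run. Returns (validated report or None,
    number of re-analysis rounds used)."""
    current = query
    for reanalysis in range(max_reanalysis + 1):  # initial pass + re-analysis rounds
        report = diagnose(current)
        succeeded, feedback = mitigate(report)
        if succeeded:
            return report, reanalysis  # only validated reports may feed self-evolution
        current = f"{current}\n[feedback {reanalysis + 1}] {feedback}"
    return None, max_reanalysis  # unresolved: escalate to manual investigation


# Stubs standing in for the agent pipeline and the on-call engineer.
def fake_diagnose(query: str) -> str:
    return "proxy misconfiguration" if "[feedback" in query else "disk space exhaustion"


def fake_mitigate(report: str) -> tuple[bool, str]:
    if report == "proxy misconfiguration":
        return True, ""
    return False, "disk cleanup did not restore service"


report, rounds = diagnose_with_feedback("checkout latency alert at 12:03",
                                        fake_diagnose, fake_mitigate)
print(report, rounds)
```

Returning `None` after the cap, instead of the last unvalidated report, keeps the write path to the experience store closed for anything mitigation did not confirm.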

How to develop with it

The evaluation arithmetic and the experience-store discipline are the two parts worth internalizing before extending any of this, and both are checkable without a model. The executed block below implements the paper's Correct/Partial query metrics (a query bundles AD/FT/RCL subtasks; Correct requires all requested subtasks solved, Partial at least one), top-k localization accuracy, and a gated-versus-ungated experience reuse comparison: a rare recurring incident family has a validated store entry, a poisoned entry (a wrong past diagnosis) sits in the same store, and the similarity gate with pure-reasoning fallback is what separates net gain from net damage:

# opsagent_model.py - validated: the paper's Correct/Partial query metrics, plus
# top-k localization arithmetic (this page's extension; the paper scores top-1) and
# why validation-gated experience reuse (the reflection store's discipline) matters. A model of the mechanisms, not a rerun of OpsAgent.
import numpy as np

NOT_REQUESTED = -1  # sentinel for a subtask absent from a query


def correct_partial(results: np.ndarray) -> tuple[float, float]:
    """Paper Sec. 4.1.3: results[q, s] is 1 solved / 0 failed / -1 not requested.
    Correct counts queries with every requested subtask solved; Partial at least one."""
    assert results.ndim == 2 and np.isin(results, (-1, 0, 1)).all()
    requested = results != NOT_REQUESTED
    assert requested.any(axis=1).all(), "every query must request at least one subtask"
    solved = results == 1
    all_ok = (solved | ~requested).all(axis=1)
    any_ok = (solved & requested).any(axis=1)
    return float(all_ok.mean()), float(any_ok.mean())


def ac_at_k(rankings: np.ndarray, truth: np.ndarray, k: int) -> float:
    """Top-k localization accuracy: culprit within the first k ranked candidates."""
    assert rankings.ndim == 2 and 1 <= k <= rankings.shape[1]
    return float((rankings[:, :k] == truth[:, None]).any(axis=1).mean())


def rank_candidates(base_scores: np.ndarray, hint: int | None) -> np.ndarray:
    """Rank component ids by score, descending; a retrieved experience hint is
    trusted first (injected context steers the diagnosis), the rest follow score."""
    order = np.argsort(-base_scores, kind="stable")
    if hint is None:
        return order
    return np.concatenate(([hint], order[order != hint]))


def retrieve(store: list[tuple[np.ndarray, int]], symptom: np.ndarray,
             threshold: float | None) -> int | None:
    """Reflection-store retrieval: nearest <symptoms, experience> entry by cosine
    similarity. With a threshold (the gate), a weak match returns None and the
    agent falls back to ungrounded ranking, the paper's pure-CoT fallback."""
    sims = [float(s @ symptom / (np.linalg.norm(s) * np.linalg.norm(symptom)))
            for s, _ in store]
    best = int(np.argmax(sims))
    if threshold is not None and sims[best] < threshold:
        return None
    return store[best][1]


# 1) Correct/Partial on a hand-computed 4-query case (AD/FT/RCL columns):
# q0 all three solved; q1 AD solved, FT failed; q2 AD failed; q3 AD solved.
hand = np.array([[1, 1, 1], [1, 0, NOT_REQUESTED],
                 [0, NOT_REQUESTED, NOT_REQUESTED], [1, NOT_REQUESTED, NOT_REQUESTED]])
c, p = correct_partial(hand)
assert (c, p) == (0.5, 0.75), (c, p)
rng = np.random.default_rng(0)
batch = rng.choice([-1, 0, 1], size=(500, 3), p=[0.3, 0.35, 0.35])
batch[(batch == NOT_REQUESTED).all(axis=1), 0] = 0  # every query requests something
cb, pb = correct_partial(batch)
assert cb <= pb, "Partial can never be below Correct"

# 2) Localization: 60 incidents over 14 components (the Bank system's size).
n_inc, n_comp = 60, 14
truth = rng.integers(0, n_comp, n_inc)
base_scores = rng.normal(0.0, 1.0, (n_inc, n_comp))
base_scores[np.arange(n_inc), truth] += 1.0          # weak signal toward the culprit
base_rank = np.stack([rank_candidates(s, None) for s in base_scores])
a1, a3, a5 = (ac_at_k(base_rank, truth, k) for k in (1, 3, 5))
assert a1 <= a3 <= a5, "AC@k must be monotone in k"

# 3) Gated vs ungated experience reuse. Incident families live in symptom space;
# the store holds one validated entry (recurring family A, whose culprit really is
# component 3) and one poisoned entry (a wrong past diagnosis that slipped in),
# whose symptom key matches neither family. Family A is the rare recurring case
# (20%); the majority family B is well served by the base evidence alone.
n_ev = 400
fam_a, fam_b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
store = [(fam_a, 3), (np.array([0.0, 0.0, 1.0]), 9)]  # A: culprit 3; poison: 9
is_a = rng.random(n_ev) < 0.2
truth_ev = np.where(is_a, 3, rng.integers(0, n_comp, n_ev))
symptoms = np.where(is_a[:, None], fam_a, fam_b) + rng.normal(0, 0.05, (n_ev, 3))
ev_scores = rng.normal(0.0, 1.0, (n_ev, n_comp))
ev_scores[np.arange(n_ev), truth_ev] += np.where(is_a, 0.5, 2.0)  # A is the hard family

def run(threshold: float | None) -> float:
    ranks = []
    for i in range(n_ev):
        hint = retrieve(store, symptoms[i], threshold)
        ranks.append(rank_candidates(ev_scores[i], hint))
    return ac_at_k(np.stack(ranks), truth_ev, 1)

base_ac1 = ac_at_k(np.stack([rank_candidates(s, None) for s in ev_scores]),
                   truth_ev, 1)
gated_ac1 = run(threshold=0.8)     # weak matches fall back to pure ranking
ungated_ac1 = run(threshold=None)  # nearest entry always applied
assert gated_ac1 > base_ac1, (gated_ac1, base_ac1)
assert ungated_ac1 < base_ac1, (ungated_ac1, base_ac1)

print(f"hand case Correct/Partial: {c:.2f}/{p:.2f}; random batch {cb:.3f}/{pb:.3f}")
print(f"AC@1/3/5 baseline: {a1:.3f}/{a3:.3f}/{a5:.3f}")
print(f"AC@1 baseline {base_ac1:.3f} | gated reuse {gated_ac1:.3f} | ungated {ungated_ac1:.3f}")
print("all assertions passed")

Output of the run: hand case Correct/Partial: 0.50/0.75; random batch 0.266/0.762, AC@1/3/5 baseline: 0.200/0.417/0.667, AC@1 baseline 0.537 | gated reuse 0.725 | ungated 0.260, all assertions passed. Gated retrieval lifts top-1 localization from 0.537 to 0.725 by helping only the incidents it genuinely matches; the same store applied without the gate drops it to 0.260, because every unmatched incident inherits someone else's diagnosis. That is the paper's reflection discipline in miniature: only validated experience enters, and only confident matches are injected.

How to maintain it

Keep mitigation as the only teacher. The store admits entries from confirmed resolutions and the PPO loop trains on rewarded rollouts; wiring either to unvalidated model output converts self-evolution into self-contamination.
Curate the knowledge bases like code. The paper's reconciliation rule (newer supersedes conflicting, complementary entries merge) handles growth, but topology changes (a service renamed, a database migrated) silently strand symptom keys; review entries after platform changes, as any agent memory requires.
Re-tune the processor's thresholds per estate. The 3-sigma rule, the 80th-percentile log threshold, and the per-call-type 95th-percentile latency cut are defaults tuned on the benchmark systems; a noisier estate needs different cuts, and the top-5 aggregation caps assume incident-scale blast radii.
Re-run the ablation locally after any change. The module contributions are large and separable (2.26/6.77/10.53/12.78 versus 16.54 average Correct); a local regression harness over a replayed incident set catches a broken module before on-call does.

Running it in production

The Lenovo deployment is the production template: a local 14B seed, the observability stack it already had (VictoriaMetrics, OpenTelemetry), reports delivered to on-call engineers who validate by mitigating, and confirmed cases fed back into offline evolution. Two operational facts deserve emphasis. First, the difficulty distribution is bimodal: about 70% of incidents are routine and recurring (97% accuracy, about 30 seconds), and the remaining cross-component or ambiguous cases needed up to three re-analysis rounds for 54% accuracy, so staffing and escalation policy should assume the agent clears the floor while humans keep the tail. Second, self-evolution showed up operationally: incidents that first required multiple rounds later resolved in one attempt, and initially unresolvable failure types became diagnosable on recurrence.⁶

Applied to a GPU cluster (an application of the pattern, not a paper claim): the same three modules map directly onto this KB's operational stack. The data processor's role is what telemetry pipelines already half-do (DCGM metric anomalies, Xid log events, fabric-counter deviations rendered as compact text); the expert-agent split mirrors detect/triage/localize over runbook-shaped incidents such as NCCL hangs or thermal events; and the reflection store is a machine-curated sibling of the troubleshooting symptom-to-fix table, with mitigation-confirmed diagnoses as the only write path and the agentic AIOps guardrails unchanged: read-only telemetry access, human sign-off on mutating actions, and closed-loop verification that the fix actually restored service.

Failure modes

Experience-store poisoning. A wrong diagnosis that slips past validation becomes retrievable precedent and, as the executed example shows, indiscriminate reuse can cost more than the store contributes. Keep the mitigation-confirmed write path, similarity-gated reads, and periodic audits of high-traffic entries.
Benchmark-to-production gap. OpenRCA queries are context-poor and score low; production tickets carry alerts and score high. Neither number transfers to a new estate; replay a sample of local incidents before trusting either.
Alert storms multiply cost. Each query fans out to three agents plus cross-review rounds; an alert storm that opens hundreds of tickets multiplies token spend and can queue the diagnosis pipeline behind its own backlog. Deduplicate upstream and cap re-analysis rounds (production used three).
Non-deterministic diagnoses. Same telemetry, different runs, different rationales; the paper reports five-run averages for every benchmark number. Treat single-run production evaluations as anecdotes and track accuracy as a rolling, mitigation-confirmed rate.
Stale experience after topology change. Symptom keys embed component names and shapes; re-architecting the estate leaves entries that match nothing (harmless) or match the wrong thing (harmful). Expire or re-validate entries on service-catalog changes.
Judge-model drift at training time. The reasoning-quality half of the PPO reward comes from an external judge; swapping or upgrading it changes the reward surface, so pin the judge version per training campaign the way any reward design is pinned.

References

Luo et al., OpsAgent: An Evolving Multi-agent System for Incident Management in Microservices (arXiv 2510.24145): https://arxiv.org/abs/2510.24145
OpenRCA benchmark (ICLR '25, the evaluation dataset and the RCA-Agent baseline): https://github.com/microsoft/OpenRCA
Drain3 (log template mining used by the data processor): https://github.com/logpai/Drain3
Schulman et al., Proximal Policy Optimization Algorithms (arXiv 1707.06347): https://arxiv.org/abs/1707.06347
Google SRE Workbook, Incident Response (the manual practice the pipeline mirrors): https://sre.google/workbook/incident-response/

¹ OpsAgent (arXiv 2510.24145v3): 14B open-source reasoning core (Qwen2.5-14B-Instruct-1M reference; gpt-oss-20b and Phi-3-medium-128k-instruct also evaluated); agent profiles specify name, task description, instructions, examples; Intent Interpreter and Orchestrator coordinate three task-aligned experts (Anomaly Sentinel, Failure Diagnoser, Root Detective) with chain-of-thought, cross-review, and a compiled Root Cause Report; implementation Python 3.10.16, PyTorch 2.6.0, Transformers 4.51.1, accelerate 1.7.0; experiments on one server with 8x 48GB GPUs, five repeats averaged; 60/40 train/test split.
² Section 3.2: metrics via 3-sigma sliding window, CNN shape classification (PatternMatcher-style, 20-before/10-after context), aggregation to top-5 pods and top-5 metrics by deviation score; logs via incident lexicon, Drain3 templates, TF-IDF ranking (80th-percentile adaptive threshold), one-minute per-template dedup; traces via per-call-type 95th-percentile latency thresholds, 60 s window-by-callee aggregation, and three-hop call-path tallies.
³ Table 1 (OpenRCA: 335 cases; Telecom 5 failure types/43 components, Bank 8/14, Market 15/56; metrics: Correct = all requested subtasks solved, Partial = at least one; AD correct within ±60 s, FT/RCL exact top-1): OpsAgent on Qwen2.5-14B averages 16.54 Correct / 41.35 Partial at 71.18 s/case (Telecom 30.00/45.00, Bank 18.52/40.74, Market 10.17/40.68); RCA-Agent with Claude 3.5 Sonnet 11.28/32.33 at 287.71 s/case; ART 0.75/17.29 at 2.27 s/case; CoT/ReAct/Reflexion average 1.50/9.77, 0.75/6.01, 2.26/8.27 on the same seed; headline relative gains 46.63% Correct, 27.90% Partial over SOTA.
⁴ Tables 2-3 (Qwen2.5-14B seed, dataset-averaged Correct/Partial %): without data processor 2.26/6.02; without cross-review 6.77/21.80; without reflection 10.53/26.32; without PPO 12.78/33.08; OpsAgent with no self-evolution 8.27/27.07 and with 60% of cases 16.54/41.35; capability grows monotonically with training budget (0/20/40/60% gives 8.27, 12.03, 14.29, 16.54 Correct).
⁵ Section 3.4: PPO reward blends binary accuracy (5 or 0) with a 0-5 reasoning-quality average (consistency, clarity, relevance, rationality) scored by Qwen3-235B-A22B, weighted by a tunable coefficient; reflection distills symptoms-to-experience pairs from successfully resolved cases only, embedded with Sentence-Transformer into per-agent faiss stores, retrieved via RAG; conflicting entries superseded by newer, complementary merged; retrieval failure falls back to pure chain-of-thought.
⁶ Section 5 (Lenovo production, 53 days, Qwen2.5-14B seed): about 2.5e4 infrastructure instances, VictoriaMetrics and OpenTelemetry stack; 10,492 incidents at 84.09% accuracy (mitigation-confirmed root-cause reason and component); routine incidents about 70% of volume at 97% accuracy and about 30 s; hard cases up to three re-analysis rounds at 54%; versus 3 OCEs for about 2.5 h traditionally, 126 s average; interpretability rated by three OCEs over 150 sampled reports at 4.6/4.3/4.7/4.0 (consistency/clarity/relevance/rationality).