
CTI-REALM: benchmarking detection-rule generation agents

Scope: the CTI-REALM benchmark (arXiv 2603.13517, Microsoft Security AI) in depth. This page covers how the emulated-attack environment is built, how the five-checkpoint reward decomposes task scoring, what the 16-model evaluation, ablation, variance, and memory studies found, and what it takes to construct a benchmark of this shape. The survey view of security-agent evaluation, where CTI-REALM sits beside CyberGym, Inspect Evals, and AgentGym, is on the cybersecurity agent evaluation page; this page is the deep dive on this one benchmark. General benchmark anatomy is covered in LLM benchmarks, and eval-gating discipline in evaluating agents.

All numbers are from arXiv 2603.13517v2 (March 2026) and were not reproduced here. The paper itself names no release URL, but the benchmark is publicly released as an Inspect Evals task (inspect_evals/cti_realm, with cti_realm_25 and cti_realm_50 variants) with data on Hugging Face; the links are listed under References (as of 2026-07). The Python example is executed and asserted; it implements the paper's Appendix C scoring mechanics, not the benchmark itself.

flowchart TB
  SIM["Attack emulations on sandboxed Azure tenant<br/>(Linux endpoints, AKS clusters, Azure cloud)"] --> COLLECT["Telemetry via Azure Monitor Agent + MDE"]
  COLLECT --> CLEAN["Cleaning, PII anonymization,<br/>contamination scrubbing"]
  CLEAN --> ENV["Containerized environment (Docker + Inspect AI):<br/>37 CTI reports, Kusto cluster, 12 log sources,<br/>MITRE ATT&CK DB, Sigma rules DB"]
  ENV --> AGENT["ReAct agent, 8-tool API,<br/>max 70 messages"]
  AGENT --> OUT["Sigma rule + KQL query + query results"]
  OUT --> SCORE["Checkpoint scoring C0..C4<br/>(deterministic + GPT-5-Mini judge)"]

What it is

CTI-REALM (Cyber Threat Real World Evaluation and LLM Benchmarking) evaluates AI agents on the complete detection-engineering workflow: read cyber-threat-intelligence reports, explore heterogeneous telemetry, iteratively construct and test queries, and produce validated detection rules in both Sigma and KQL. Prior benchmarks tested parametric knowledge or isolated subtasks (rule synthesis, TTP classification, threat-actor attribution); CTI-REALM measures the end-to-end analytical pipeline a SOC analyst runs, with tools, iteration, and ground truth.¹

The dataset derives from attack simulations executed on real infrastructure in an isolated, sandboxed Azure tenant, drawn from 37 public CTI reports and detection references (Microsoft Security, Datadog Security Labs, Palo Alto Networks, Splunk Security Content). Tasks span three difficulty tiers: easy (atomic single-step attacks), medium (multi-step sequences), and hard (complex chains requiring cross-source correlation); all cloud simulations are hard by construction. Two stratified evaluation sets exist: CTI-REALM-25 (12 Linux, 9 AKS, 4 Cloud) for cheap iteration and CTI-REALM-50 (25 Linux, 17 AKS, 8 Cloud) for robust ranking. CTI-REALM-25 is a strict subset of CTI-REALM-50.²

The benchmark's second identity is an RL environment: each task is modeled as a Markov Decision Process whose reward function pays out at five checkpoints along the trajectory, so the same scoring that ranks frontier models can serve as a training signal for hill climbing or RL over detection-engineering policies.³

Why use it

It measures workflow, not recall. Agents must find the relevant CTI report, map MITRE techniques, discover the right telemetry tables by schema exploration, refine queries against a live Kusto engine, and land a rule that fires on the attack rows; each stage is scored separately, so the benchmark localizes where a model fails, not only that it fails.³
Ground truth is executable. Every task carries expected MITRE technique IDs, expected data sources, and per-field regex patterns over telemetry rows; detection quality is an F1 computed by matching the agent's actual query results, not a judge's opinion of the rule text alone.⁴
The telemetry is real. Logs come from attacks executed on real Linux, AKS, and Azure infrastructure, collected through Azure Monitor Agent and Microsoft Defender for Endpoint, then cleaned, PII-anonymized, and scrubbed of identifying resource names to limit contamination.²
It discriminates. Across 16 frontier configurations the normalized reward spans 0.360 to 0.637, and single checkpoints separate model families sharply: on C3 (query execution) Claude models score 0.86 to 0.92 while most OpenAI models fall below 0.50, with GPT-4.1 at 0.02.⁶

When to use it (and when not)

Use it for model selection in security tooling: the paper's own reading is Claude Opus for quality and GPT-5 (Low) for cost efficiency (best Pareto point at roughly 179K tokens per sample for 0.541 reward).⁷
Use the checkpoints for capability-gap analysis: a model strong on C2 (data exploration) but weak on C3 (query execution) needs query-construction help, not more CTI context; the memory study shows exactly this split (seeded guidance doubled C1 but left C3 unchanged).¹⁰
Use it to justify tools. The ablation quantifies what CTI-specific tools buy (reward drops of 0.077 to 0.150 when they are removed, Cohen's d 0.30 to 0.64), which gives a template for measuring any tool's contribution before shipping it (see tools and function calling).⁹
Do not read it as SOC-scale evidence. The paper's own limitations: no production-scale telemetry volumes, no long-baseline anomaly detection, and an Azure/KQL-centric design whose results may not transfer to AWS/GCP stacks or other SIEM query languages.¹
Do not treat the trajectory reward as the deliverable. 65% of the weight sits on final detection quality; a pipeline that games intermediate checkpoints still fails the benchmark, by design.
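The ablation template mentioned above (reward drop plus an effect size) can be sketched in a few lines. This is a minimal sketch using the pooled-standard-deviation form of Cohen's d; whether the paper uses this exact variant is an assumption, and the per-task rewards are made-up illustration data, not paper results.

```python
# Sketch: effect size of removing a tool, using pooled-SD Cohen's d.
# The reward lists are invented for illustration only.
from math import sqrt
from statistics import mean, variance


def cohens_d(with_tool: list[float], without_tool: list[float]) -> float:
    """Difference in means divided by the pooled sample standard deviation."""
    n1, n2 = len(with_tool), len(without_tool)
    pooled_var = ((n1 - 1) * variance(with_tool)
                  + (n2 - 1) * variance(without_tool)) / (n1 + n2 - 2)
    return (mean(with_tool) - mean(without_tool)) / sqrt(pooled_var)


with_tool = [0.9, 0.7, 0.6, 0.4, 0.8, 0.2]       # hypothetical per-task rewards
without_tool = [0.8, 0.5, 0.5, 0.3, 0.6, 0.1]    # same tasks, tool removed
drop = mean(with_tool) - mean(without_tool)
d = cohens_d(with_tool, without_tool)
print(f"reward drop={drop:.3f}  Cohen's d={d:.2f}")
```

Reporting the drop alongside d matters because per-task rewards vary widely: the same absolute drop is a larger effect when tasks score consistently and a smaller one when they do not.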

Architecture

Environment. A Docker container integrated with the Inspect AI evaluation framework hosts the analyst workspace: the 37-report CTI repository, a Kusto cluster executing real KQL, a MITRE ATT&CK database, a Sigma-rule reference database, and telemetry from 12 log sources. The sources are endpoint (deviceprocessevents, devicefileevents), AKS (aksaudit, aksauditadmin), cloud (azureactivity, azurediagnostics), identity (signinlogs, auditlogs, aadserviceprincipalsigninlogs), and application-layer (officeactivity, microsoftgraphactivitylogs, storagebloblogs); each holds one to three days of telemetry mixing attack events with benign background activity. Attack execution and agent evaluation are deliberately separated: attacks run on real infrastructure, and agents only ever see the cleaned logs.²

Tool API. Agents act through eight functions: list_cti_report_tags, get_cti_reports_by_tag, list_kusto_tables, get_table_schema, execute_kql_query, get_mitre_techniques, search_sigma_rules, and submit_detection. The ReAct agent interleaves reasoning with these calls under a 70-message budget.⁵
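To make the budget concrete, here is a toy dispatch loop. The eight tool names come from the text; everything else is an assumption for illustration: the stub bodies, the scripted sequence standing in for a policy, and the accounting rule that each tool call plus its result costs two messages. It does not use or mirror the Inspect AI API.

```python
# Toy sketch of tool dispatch under a message budget (not the real harness).
from typing import Callable

TOOL_NAMES = [
    "list_cti_report_tags", "get_cti_reports_by_tag", "list_kusto_tables",
    "get_table_schema", "execute_kql_query", "get_mitre_techniques",
    "search_sigma_rules", "submit_detection",
]
MAX_MESSAGES = 70


def make_stub(name: str) -> Callable[..., str]:
    """Return a placeholder tool that only echoes its own invocation."""
    def stub(**kwargs: str) -> str:
        return f"{name}({kwargs})"
    return stub


TOOLS = {name: make_stub(name) for name in TOOL_NAMES}


def run(script: list[tuple[str, dict[str, str]]]) -> tuple[int, bool]:
    """Replay scripted calls; stop at submission or when the budget is spent."""
    messages = 0
    for name, args in script:
        if messages + 2 > MAX_MESSAGES:
            return messages, False          # truncated before submitting
        TOOLS[name](**args)
        messages += 2                       # assumed: assistant call + tool result
        if name == "submit_detection":
            return messages, True
    return messages, False


query = ("execute_kql_query", {"query": "deviceprocessevents | take 5"})
short_run = run([("list_kusto_tables", {}), query, ("submit_detection", {})])
long_run = run([query] * 40 + [("submit_detection", {})])
print(short_run, long_run)
```

Under this assumed accounting, a trajectory that keeps refining queries runs out of budget before it reaches submit_detection, which is the truncation risk the cap introduces.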

Scoring. The total reward is R = sum(w_i * r_i) over five checkpoints, with weights summing to 1 and each r_i in [0, 1]:

C0, CTI report analysis: w=0.125, scored by a GPT-5-Mini judge.
C1, threat context via MITRE technique mapping: w=0.075, Jaccard similarity against ground truth.
C2, data exploration: w=0.10, Jaccard over telemetry sources.
C3, query execution: w=0.05, binary (at least two successful queries).
C4, detection quality: w=0.65, KQL correctness as F1 over regex-matched result rows plus a Sigma-quality judge weighted 0.25 syntax and 0.75 specificity.

C0 through C3 form the trajectory reward (35% total); C4 is the ground-truth reward (65%). Judge outputs were spot-validated by security researchers against expert assessment.³ ⁴
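The weighted sum is easy to check by hand. The sketch below uses the weights quoted above; the per-checkpoint scores are illustrative, and C4 is taken as a single input because the text does not say how the KQL F1 and the Sigma judge score are combined inside it.

```python
# Sketch of the weighted checkpoint sum; input scores are illustrative.
WEIGHTS = {"C0": 0.125, "C1": 0.075, "C2": 0.10, "C3": 0.05, "C4": 0.65}
TRAJECTORY = ("C0", "C1", "C2", "C3")


def c3_score(successful_queries: int) -> float:
    """Binary checkpoint: 1.0 once at least two queries have succeeded."""
    return 1.0 if successful_queries >= 2 else 0.0


def sigma_quality(syntax: float, specificity: float) -> float:
    """Judge sub-score for the Sigma rule: 0.25 syntax + 0.75 specificity."""
    return 0.25 * syntax + 0.75 * specificity


def total_reward(scores: dict[str, float]) -> float:
    """R = sum(w_i * r_i); every r_i must lie in [0, 1]."""
    assert set(scores) == set(WEIGHTS), "need exactly C0..C4"
    assert all(0.0 <= r <= 1.0 for r in scores.values())
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)


trajectory_weight = sum(WEIGHTS[c] for c in TRAJECTORY)

# An agent that aces every trajectory checkpoint but lands no detection.
gamed = total_reward({"C0": 1.0, "C1": 1.0, "C2": 1.0,
                      "C3": c3_score(5), "C4": 0.0})
# An agent with a middling trajectory and a perfect detection.
landed = total_reward({"C0": 0.5, "C1": 0.5, "C2": 0.5,
                       "C3": c3_score(1), "C4": 1.0})
print(f"trajectory weight={trajectory_weight:.2f} "
      f"gamed={gamed:.3f} landed={landed:.3f}")
```

The two cases show the ceiling the weighting imposes: a flawless trajectory with no working detection cannot exceed the combined trajectory weight.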

How to use it

A run follows the paper's protocol. Each task hands the agent a detection objective derived from a CTI report; the agent works inside the container (no external access, telemetry immutable) and must return JSON with a Sigma rule (YAML), a working KQL query, and the actual execute_kql_query results. Evaluation proceeds in phases: all candidate models on CTI-REALM-50 for baseline ranking; the top five re-run three epochs on CTI-REALM-25 for variance and confidence intervals; the same five with CTI tools removed for the ablation; and a memory-augmentation arm comparing a small model with and without seeded guidance against its larger sibling. Reasoning-effort settings are part of the model configuration matrix (the study covers GPT-5/5.1/5.2 at low, medium, and high), and one of its findings is that medium reasoning effort consistently outperforms both high and low on this workload.⁶
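A pre-submission shape check is one cheap guard against losing a run to malformed output. The sketch below assumes hypothetical field names (sigma_rule, kql_query, query_results); the text only says the JSON carries a Sigma rule, a KQL query, and the query results, so the real keys may differ.

```python
# Sketch: shape check for a submission payload. Key names are hypothetical.
import json

REQUIRED = {"sigma_rule": str, "kql_query": str, "query_results": list}


def check_submission(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top level must be a JSON object"]
    problems = []
    for key, expected in REQUIRED.items():
        if key not in payload:
            problems.append(f"missing field: {key}")
        elif not isinstance(payload[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
        elif not payload[key]:
            problems.append(f"{key} is empty")
    return problems


good = json.dumps({
    "sigma_rule": "title: Clipboard read via shell\nlogsource:\n  product: linux\n",
    "kql_query": "deviceprocessevents | where filename in ('xclip', 'xsel')",
    "query_results": [{"filename": "xclip", "initiatingprocessfilename": "bash"}],
})
bad = '{"kql_query": ""}'
print(check_submission(good), check_submission(bad))
```

Flagging an empty result list is a design choice in this sketch: a query that returns no rows cannot produce true positives, so it is worth catching before submission.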

How to develop with it

The deterministic half of the scoring is small enough to validate exactly, and doing so clarifies what each signal measures. This block is executed and asserted: the Appendix C outcome scoring (a result row is a true positive only if all ground-truth field regexes match, F1 computed with explicit zero-division guards), the C1 Jaccard scoring, the mathematical fact that F1 and Jaccard can never disagree when computed on the same sets (F1 = 2J/(1+J)), and the constructed divergence that justifies scoring trajectory and outcome on different objects:

# ctirealm_scoring.py - validated: the deterministic scoring mechanics of
# CTI-REALM's C4 outcome (regex row-matching F1, Appendix C) and C1 trajectory
# (MITRE technique Jaccard), plus the reason they are applied to different
# objects. Mechanics validation in pure Python; does not run the benchmark.
import re


Row = dict[str, str]


def row_matches(row: Row, patterns: dict[str, str]) -> bool:
    """Appendix C: a row is a true positive iff ALL field patterns match."""
    return all(re.search(p, row.get(f, "")) for f, p in patterns.items())


def rule_f1(returned: list[Row], telemetry: list[Row],
            patterns: dict[str, str]) -> tuple[float, float, float]:
    """Precision over returned rows, recall against all attack rows in the
    telemetry, F1 with an explicit zero-division guard."""
    tp = sum(row_matches(r, patterns) for r in returned)
    attack_total = sum(row_matches(r, patterns) for r in telemetry)
    assert attack_total > 0, "task ground truth must contain attack rows"
    precision = tp / len(returned) if returned else 0.0
    recall = tp / attack_total
    if precision + recall == 0.0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def jaccard(a: set[str], b: set[str]) -> float:
    """C1/C2 trajectory scoring: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Telemetry: 5 attack rows (clipboard exfil per the paper's example ground
# truth) among 95 benign rows.
ATTACK = {"filename": "wl-paste|xclip|xsel", "initiatingprocessfilename": "bash|sh|dash"}
telemetry: list[Row] = (
    [{"filename": "xclip", "initiatingprocessfilename": "bash"} for _ in range(5)]
    + [{"filename": f"proc{i}", "initiatingprocessfilename": "systemd"} for i in range(95)]
)

# 1) A perfect rule returns exactly the attack rows: F1 = 1.
p, r, f1 = rule_f1(telemetry[:5], telemetry, ATTACK)
assert (p, r, f1) == (1.0, 1.0, 1.0)

# 2) A rule that fires on nothing relevant scores 0 (guarded, no ZeroDivision).
p, r, f1 = rule_f1(telemetry[10:12], telemetry, ATTACK)
assert (p, r, f1) == (0.0, 0.0, 0.0)
assert rule_f1([], telemetry, ATTACK) == (0.0, 0.0, 0.0)

# 3) An over-broad rule returning all 100 rows: recall 1, precision 5/100.
p, r, f1 = rule_f1(telemetry, telemetry, ATTACK)
assert r == 1.0 and p == 0.05
assert abs(f1 - 2 * 0.05 / 1.05) < 1e-12          # 0.0952..., low despite full recall

# 4) Jaccard boundaries and a hand-computed partial overlap.
assert jaccard({"T1053"}, {"T1053"}) == 1.0
assert jaccard({"T1053"}, {"T1078"}) == 0.0
assert jaccard({"T1078", "T1069", "T1098"}, {"T1078", "T1484"}) == 0.25  # 1/4

# 5) On the SAME pair of sets, F1 is a monotone transform of Jaccard
#    (F1 = 2J/(1+J)), so the two can never rank two rules differently there.
for a, b in ((set("abc"), set("ab")), (set("abcd"), set("cdef"))):
    tp, fp, fn = len(a & b), len(b - a), len(a - b)
    f1_sets = 2 * tp / (2 * tp + fp + fn)
    j = jaccard(a, b)
    assert abs(f1_sets - 2 * j / (1 + j)) < 1e-12

# 6) The benchmark instead applies them to DIFFERENT objects, which do diverge:
#    agent A maps techniques perfectly but writes a bad rule; agent B misses
#    half the techniques but lands a perfect rule. The signals rank them
#    oppositely, so C1 and C4 measure different capabilities.
truth_ttp = {"T1053", "T1115"}
a_c1 = jaccard({"T1053", "T1115"}, truth_ttp)              # 1.0
a_c4 = rule_f1(telemetry, telemetry, ATTACK)[2]            # over-broad rule
b_c1 = jaccard({"T1053"}, truth_ttp)                       # 0.5
b_c4 = rule_f1(telemetry[:5], telemetry, ATTACK)[2]        # perfect rule
assert a_c1 > b_c1 and a_c4 < b_c4, "trajectory and outcome must diverge"

print(f"over-broad rule: precision={p:.2f} recall={r:.0f} F1={f1:.4f}")
print(f"divergence: A(C1={a_c1:.2f}, C4={a_c4:.4f}) vs B(C1={b_c1:.2f}, C4={b_c4:.4f})")
print("all scoring assertions passed")

Output: over-broad rule: precision=0.05 recall=1 F1=0.0952, divergence: A(C1=1.00, C4=0.0952) vs B(C1=0.50, C4=1.0000), all scoring assertions passed. The divergence case is the paper's checkpoint analysis in miniature: strong MITRE mapping (C1) does not guarantee detection quality (C4), which is why the composite keeps them as separate signals rather than collapsing them.⁶
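The identity asserted in step 5 of the script follows directly from the set definitions. Writing TP, FP, and FN for the sizes of the intersection and the two set differences:

```latex
\begin{aligned}
J &= \frac{TP}{TP + FP + FN}, \qquad
F_1 = \frac{2\,TP}{2\,TP + FP + FN},\\[4pt]
\frac{2J}{1 + J}
  &= \frac{2\,TP / (TP + FP + FN)}{(2\,TP + FP + FN) / (TP + FP + FN)}
   = \frac{2\,TP}{2\,TP + FP + FN} = F_1 .
\end{aligned}
```

Because the map from J to 2J/(1+J) is strictly increasing on [0, 1] (its derivative is 2/(1+J)^2), the two metrics always order a pair of candidates the same way when computed on the same sets.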

Building a benchmark of this shape means paying for four things the paper documents: real attack execution (adapting Atomic Red Team test cases and CTI-derived scenarios onto disposable infrastructure), telemetry hygiene (PII anonymization plus scrubbing resource identifiers so the tasks do not leak), executable ground truth (technique IDs, expected sources, and per-field regexes per task), and judge calibration (few-shot score anchors for the C0 and C4 judges, spot-checked by human experts).² ⁴
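For the executable-ground-truth item, a per-task record might look like the sketch below. The three field names follow the Appendix C description cited in this page; the dataclass, the validate helper, and the example task values are illustrative assumptions rather than the benchmark's actual schema or data.

```python
# Sketch of a per-task ground-truth record with fail-fast validation.
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundTruth:
    mitre_techniques: list[str]        # expected ATT&CK technique IDs
    data_sources: list[str]            # expected telemetry tables
    regex_patterns: dict[str, str]     # field name -> regex over that field

    def validate(self) -> None:
        """Reject empty fields, malformed technique IDs, and bad regexes."""
        assert self.mitre_techniques and self.data_sources and self.regex_patterns
        for tid in self.mitre_techniques:
            assert re.fullmatch(r"T\d{4}(\.\d{3})?", tid), f"bad technique ID: {tid}"
        for pattern in self.regex_patterns.values():
            re.compile(pattern)        # raises re.error if the pattern is invalid


# Illustrative task, reusing the clipboard patterns from the script above.
task = GroundTruth(
    mitre_techniques=["T1115"],
    data_sources=["deviceprocessevents"],
    regex_patterns={"filename": "wl-paste|xclip|xsel",
                    "initiatingprocessfilename": "bash|sh|dash"},
)
task.validate()
print("ground truth record is well-formed")
```

Validating records when tasks are authored keeps a malformed regex from surfacing later as a silent zero in someone's F1.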

How to maintain it

Contamination is a standing threat. The 37 source reports are public; a model trained on them may recall detections rather than derive them. The environment mitigates this by sanitizing identifiers and grounding scoring in emulated telemetry the model has never seen, but maintainers must rotate scenarios as reports age into training corpora (see evaluation integrity).
Pin the judge. C0 and the Sigma-quality share of C4 use GPT-5-Mini as judge; a judge-model upgrade silently rescales C0 (12.5%) plus that share of the 65% outcome weight (the KQL F1 inside C4 is deterministic). Re-anchor with the few-shot calibration examples and re-validate against expert review after any judge change.
Re-run variance before believing deltas. The paper's three-epoch protocol on CTI-REALM-25 shows within-model standard deviations of 0.196 to 0.263 driven by task difficulty; single-run differences smaller than that are noise. Claude models occasionally score zero on the hardest tasks while GPT-5 (Med) never did (minimum 0.241), a risk-profile difference invisible in means.⁸
Budget tokens per model family. Per-sample totals range from about 120K (GPT-4.1) to 539K (GPT-5.2 High); top rewards cost 442K to 524K tokens per sample on Anthropic models. Multiply by 50 tasks and epochs before scheduling a sweep.⁷
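The budgeting and variance advice above reduces to two small calculations. The sketch uses the per-sample token counts and the upper standard deviation quoted in this section; treating samples as independent when computing a standard error is a simplifying assumption, and the function names are illustrative.

```python
# Sketch: sweep token budget and a rough standard error of the mean reward.
from math import sqrt

TASKS = 50  # CTI-REALM-50


def sweep_tokens(tokens_per_sample: int, epochs: int = 1, tasks: int = TASKS) -> int:
    """Total tokens for one model across all tasks and epochs."""
    return tokens_per_sample * tasks * epochs


def standard_error(std: float, n: int) -> float:
    """Standard error of a mean over n samples, assuming independence."""
    return std / sqrt(n)


cheap = sweep_tokens(120_000)                 # GPT-4.1, one epoch
costly = sweep_tokens(539_000, epochs=3)      # GPT-5.2 (High), three epochs
se_25 = standard_error(0.263, 25)             # one epoch on CTI-REALM-25
print(f"cheap={cheap:,} costly={costly:,} se_25={se_25:.4f}")
```

A single-epoch difference between two models that is not clearly larger than this standard error is a reason to add epochs before drawing a conclusion.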

How to read the results in production

The headline table (CTI-REALM-50, all 16 configurations): Claude Opus 4.6 (High) 0.637, Claude Opus 4.5 0.624, Claude Sonnet 4.5 0.587, then GPT-5 (Med) and GPT-5.2 (Med) both at 0.572; further down sit GPT-5-Mini at 0.451 and GPT-4.1 at 0.419, with O4-Mini last at 0.360. Checkpoint columns expose the mechanics: C2 (data exploration) is near-uniformly high (0.78 to 0.88), so schema discovery is solved; C3 (query execution) is the discriminator; C1 (MITRE mapping) has the lowest cross-model mean of the trajectory checkpoints (Table 1 range 0.21 to 0.56) and is the weakest signal for every Claude model, though several OpenAI models score C3 lower still.⁶

By platform, rewards fall from Linux 0.585 to AKS 0.517 to Cloud 0.282 averaged across models: multi-step cloud attack chains needing correlation across sign-in, Graph activity, and audit logs are where every frontier model degrades most.⁶

Two operational conclusions follow from the studies. First, tools beat prompting: the ablation's reward drops without CTI tools exceed most model-swap gaps. Second, expert-seeded memory is a cheap upgrade for small models: GPT-5-Mini rises from 0.371 to 0.432, closing 33% of its 0.184 gap to GPT-5 (Med), with C1 doubling from 0.22 to 0.44 while C3 stayed at 0.12, so knowledge transfers through context but query skill does not.⁹ ¹⁰ For a platform team, that maps to: spend on tool integration and curated context before spending on the bigger model, and validate on the platform tier (cloud) that matches your telemetry, not the benchmark mean.
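The gap-closure figure for the memory study can be recomputed from the rounded rewards quoted on this page. With rounded inputs the gap comes out at 0.185 rather than the reported 0.184, presumably because the paper works from unrounded values; the function name is illustrative.

```python
# Sketch: fraction of the small-to-large model gap closed by seeded memory.
def gap_closed(small_base: float, small_augmented: float, large: float) -> float:
    """(augmented - base) / (large - base), using rewards on the same task set."""
    return (small_augmented - small_base) / (large - small_base)


closed = gap_closed(small_base=0.371, small_augmented=0.432, large=0.556)
c1_ratio = 0.44 / 0.22     # C1 with memory vs. without
c3_delta = 0.12 - 0.12     # C3 with memory vs. without
print(f"gap closed={closed:.0%}  C1 ratio={c1_ratio:.1f}x  C3 delta={c3_delta:.2f}")
```

The same three-number calculation applies to any augmentation you test locally: measure the small model bare, the small model augmented, and the larger sibling on the same task set.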

Failure modes

Public-report contamination. Tasks derive from public CTI; a model may have memorized both report and canonical detection. The emulated-telemetry F1 resists this (memorized rule text still has to fire on unseen rows), but trajectory scores are softer; treat suspiciously perfect C0/C1 with an unusually weak C4 as a memorization signature.
Emulation fidelity. Simulations adapt Atomic Red Team-style test cases; real adversaries are noisier, slower, and blended with more benign traffic. A rule with F1 = 1 against one to three days of emulated telemetry is a candidate, not a production detection.
F1 against emulated-only telemetry. Precision is computed over the benchmark's background activity; production false-positive rates depend on your environment's baseline, which the paper explicitly does not model at SOC scale.
Single-engine bias. Scoring executes KQL against a Kusto cluster; Sigma quality is judged, not executed. Teams on Splunk SPL or Elastic EQL inherit an unmeasured translation gap.
Message-budget truncation. The 70-message cap ends trajectories mid-investigation; top models average 30 to 37 steps, so the ceiling is real for iterative strategies, and a model penalized by truncation looks worse than its untruncated capability.
Reading the mean, deploying the tail. Cloud-tier performance (0.282 average) is less than half the Linux tier; selecting a model on the aggregate reward for a cloud-heavy estate overstates what it will do on your hardest tasks.
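The last failure mode can be made concrete by reweighting the per-platform means to match an estate's own mix. The tier means and the CTI-REALM-50 task counts are from this page; the cloud-heavy mix is hypothetical, and because the tier means are averaged across models the result is a rough planning figure, not a prediction for any single model.

```python
# Sketch: reweight per-platform mean rewards by an estate's telemetry mix.
TIER_MEAN = {"linux": 0.585, "aks": 0.517, "cloud": 0.282}


def expected_reward(mix: dict[str, float]) -> float:
    """Mix-weighted mean of the tier rewards; mix values need not sum to 1."""
    total = sum(mix.values())
    return sum(TIER_MEAN[tier] * share / total for tier, share in mix.items())


benchmark_mix = {"linux": 25, "aks": 17, "cloud": 8}    # CTI-REALM-50 task counts
cloud_heavy = {"linux": 20, "aks": 20, "cloud": 60}     # hypothetical estate

at_benchmark = expected_reward(benchmark_mix)
at_estate = expected_reward(cloud_heavy)
print(f"benchmark mix={at_benchmark:.3f}  cloud-heavy mix={at_estate:.3f}")
```

For a cloud-heavy estate the reweighted figure sits well below the benchmark-mix figure, which is the gap an aggregate score hides.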

References

Chakraborty, Ho, Cook, Melendez (Microsoft Security AI), CTI-REALM: Benchmark to Evaluate Agent Performance on Security Detection Rule Generation Capabilities (arXiv 2603.13517): https://arxiv.org/abs/2603.13517
CTI-REALM in Inspect Evals (public task and code): https://ukgovernmentbeis.github.io/inspect_evals/evals/cti_realm/
CTI-REALM dataset (Hugging Face): https://huggingface.co/datasets/arjun180-new/cti_realm
MITRE ATT&CK: https://attack.mitre.org/
Inspect AI, the evaluation framework the environment integrates: https://github.com/UKGovernmentBEIS/inspect_ai
Inspect Evals (community evaluation suite on Inspect): https://github.com/UKGovernmentBEIS/inspect_evals
Kusto Query Language reference: https://learn.microsoft.com/en-us/kusto/query/
Atomic Red Team (adversary emulation library the simulations adapt): https://github.com/redcanaryco/atomic-red-team
Sigma (generic detection-rule format and public rule corpus): https://github.com/SigmaHQ/sigma

¹ CTI-REALM (arXiv 2603.13517v2, March 2026): end-to-end detection engineering benchmark; 16 frontier model configurations; Claude Opus 4.6 (High) best at 0.637; limitations: no production-scale SOC replication, Azure/KQL-centric.
² Section 3.2-3.3: 37 public CTI reports (Microsoft Security, Datadog Security Labs, Palo Alto Networks, Splunk Security Content); easy/medium/hard tiers, cloud exclusively hard; CTI-REALM-25 (12 Linux, 9 AKS, 4 Cloud) subset of CTI-REALM-50 (25 Linux, 17 AKS, 8 Cloud); attacks executed on a sandboxed Azure tenant, telemetry via Azure Monitor Agent and MDE, cleaned, PII-anonymized, resource identifiers scrubbed; Docker environment integrated with Inspect AI; 12 log sources with 1-3 days of telemetry each.
³ Section 3.4: MDP formulation; R_total = sum(w_i * r_i), weights sum to 1; C0 0.125 (LLM judge), C1 0.075 (Jaccard over MITRE techniques), C2 0.10 (Jaccard over data sources), C3 0.05 (binary, at least 2 successful queries), C4 0.65 (KQL F1 plus Sigma judge); checkpoint reward 35%, ground-truth reward 65%; judge outputs spot-validated by security researchers.
⁴ Appendix C: per-task ground truth is mitre_techniques (list of IDs), data_sources (list of tables), and regex_patterns (field-to-regex dict); a KQL result row is a true positive iff all field patterns match; precision, recall, F1 computed over returned rows; Sigma quality judged by GPT-5-Mini at 0.25 syntax + 0.75 specificity with few-shot calibration anchors.
⁵ Appendix B, Table 5: eight tools (list_cti_report_tags, get_cti_reports_by_tag, list_kusto_tables, get_table_schema, execute_kql_query, get_mitre_techniques, search_sigma_rules, submit_detection); ReAct agent, 70-message cap.
⁶ Section 5.1-5.3, Table 1: rewards 0.360 (O4-Mini) to 0.637 (Claude Opus 4.6 High, SE 0.037), Opus 4.5 0.624, Sonnet 4.5 0.587, GPT-5 (Med) 0.572, GPT-5.2 (Med) 0.572, GPT-5-Mini 0.451, GPT-4.1 0.419; category means Linux 0.585, AKS 0.517, Cloud 0.282; C3 most discriminating (Claude 0.86-0.92, most OpenAI below 0.50, GPT-4.1 0.02); C1 range 0.21-0.56; C0 separates Claude (0.83-0.91) from most OpenAI models (0.48-0.72, Table 1; the paper's Section 5.3 prose gives slightly different ranges); medium reasoning effort consistently beats high and low.
⁷ Section 5.4, Appendix F: GPT-4.1 most token-efficient (~120K/sample, 0.419); GPT-5 (Low) best Pareto (~179K, 0.541); Anthropic top rewards at ~442-524K; top models average 30-37 steps vs 20-27 for weaker ones.
⁸ Section 5.5, Table 2: top five models, three epochs on CTI-REALM-25; rankings stable; std 0.196-0.263; GPT-5 (Med) minimum 0.241 with no zero-reward samples, Claude models occasionally zero on hardest tasks.
⁹ Section 5.6, Table 3: removing CTI tools drops rewards by 0.077 (Opus 4.6 High) to 0.150 (Opus 4.5), Cohen's d 0.30-0.64; C1 delta negative for all five models (agents compensate with more aggressive MITRE search), C3 positive for three of five, C4 non-negative for all; C0 excluded as mechanically zeroed.
¹⁰ Section 5.7, Table 4: GPT-5-Mini 0.371 bare, 0.432 with expert-seeded memory (workflow guidance, tool tips, rule templates; not model-distilled), closing 33% of the 0.184 gap to GPT-5 (Med) at 0.556; C1 doubles 0.22 to 0.44, ground-truth reward +19%, C3 unchanged at 0.12.