Falsification condition · cyber capability · tested against the evidence
Version 1 · 27 June 2026
The question: has AI-assisted vulnerability discovery become a genuinely new kind of offensive capability, or is it the same work as before, now automated and far cheaper? The distinction decides the right policy response: a new capability class would justify containing it (export controls, deployment gates), while mere automation calls for absorbing it (defensive tooling, faster patching, hardening). This page sets a falsification condition for that question¹ and tests it against the cyber evaluations frontier labs have published to date.
The needle sits where the evidence puts it: well past the start — the capability is real and improving, including against hardened software (the V8 JavaScript engine that ships in Chrome, with all production exploit mitigations on) — but short of the dashed trip-zone, the capability-class shift that would flip the policy response toward containment. It has moved right as new evidence has landed. It has not crossed. Note: the exact needle position (~31%) is an illustration of the qualitative verdict "moved but not tripped" — not a measured value. The commitment is to the verdict, not to a number on a dial.
The condition isn't about whether a model is dangerous enough to gate — that's the labs' question. It's about whether the capability has changed kind, because that one fact decides which entire family of policy instruments is warranted.
The framing splits on a single distinction: bug-novelty vs. method-novelty.² Bug-novelty at scale (finding more of the same class of flaw, faster and cheaper) is automation. Method-novelty (semantic reasoning that reaches a class of finding previously unreachable) is innovation. The volume that lands in a maintainer's inbox doesn't care which mechanism produced any single finding; the policy lever does.
If automation
Defensive tooling, patching-cadence reform, target hardening, maintainer capacity, remediation infrastructure. Don't gate the models — the capability has already diffused across closed, open-weight, and orchestration-on-commodity-weights configurations.
▸ current calibration
If innovation
Export controls on weights, frontier-lab oversight, evaluation gates before deployment. Justified only by a genuine capability-class shift — the thing the trip-zone marks. Wassenaar-style instruments are the worked example.
▸ not currently warranted
The bet is automation. Six recommendations are calibrated to it.⁶ So the condition is not decoration: it is the load-bearing hinge under the entire policy response, and naming it is itself part of the posture.
Tripping requires all four at once: discovery on contemporary hardened targets, by a method that isn't reducible to known tricks, recurring rather than isolated, without heavy scaffolding.³ Here is where each stands on the V8 evidence.
(i) Hardened targets. Named exemplars: Linux mainline w/ current mitigations · recent Chrome w/ CFI + sandboxing · Apple silicon MIE. ExploitBench lands the Chrome/V8 exemplar specifically, with full production mitigations on (heap sandbox, ASLR, stack canaries). Linux mainline and Apple MIE remain untested in public literature.
(ii) Method. Not orchestration on commodity weights, not parallel-sample compute scaling, not training-corpus pattern-matching. Cross-class coverage (Wasm, JIT, historical cohorts) is partial evidence against pure corpus-matching, but the 300-turn budget is real compute, and ExploitGym's no-plateau curve points toward compute-scaling as the dominant mechanism.
◆ load-bearing: this clause decides it
(iii) Recurrence. 18 of 41 V8 bugs in the bare-model arm is well past isolated, but it is one target class, one snapshot. Heelan-class semantic-reasoning findings are already accommodated within the automation framing, so isolated brilliance doesn't count; sustained recurrence across surfaces does.
(iv) Scaffolding. The bare-model arm used six tools (setup, exec, list_directory, read_file, write_file, grade), which is minimal scaffolding by project standards and inverts the AgentFlow / AIxCC pattern where orchestration was decisive. Satisfied for that arm specifically.
Three clauses have moved. The remaining one, method (clause ii), hasn't, and it is the one that distinguishes "the same mechanism at a higher capability level" from a genuine capability-class shift. That is why the verdict is moved, not tripped.
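The all-four-at-once logic can be sketched in a few lines of Python. This is an illustration only, not the page's own tooling: the `Status` levels, the `Condition` class, and the per-clause statuses assigned to the V8 evidence are assumptions that paraphrase the readings above (clause i and iii partly moved, clause ii not moved, clause iv satisfied for the bare-model arm).

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    NOT_MET = "not met"
    PARTIAL = "partial"  # evidence has moved, but on one surface only
    MET = "met"


@dataclass(frozen=True)
class Condition:
    hardened_targets: Status  # clause i
    novel_method: Status      # clause ii (load-bearing)
    recurrence: Status        # clause iii
    low_scaffolding: Status   # clause iv

    def clauses(self):
        return [self.hardened_targets, self.novel_method,
                self.recurrence, self.low_scaffolding]

    def tripped(self) -> bool:
        # Tripping is a conjunction: every clause must be fully met.
        return all(c is Status.MET for c in self.clauses())

    def verdict(self) -> str:
        if self.tripped():
            return "tripped"
        if any(c is not Status.NOT_MET for c in self.clauses()):
            return "moved, not tripped"
        return "not moved"


def policy_family(condition: Condition) -> str:
    # Containment is warranted only by a capability-class shift.
    return "containment" if condition.tripped() else "absorption"


# Assumed encoding of the V8 evidence as described in the text.
v8_reading = Condition(
    hardened_targets=Status.PARTIAL,  # V8 landed; Linux and Apple MIE untested
    novel_method=Status.NOT_MET,      # compute-scaling not ruled out
    recurrence=Status.PARTIAL,        # 18 of 41, but one target class
    low_scaffolding=Status.MET,       # six-tool bare-model arm
)

print(v8_reading.verdict())
print(policy_family(v8_reading))
```

Because the test is a conjunction, no amount of improvement on clauses i, iii, or iv changes the policy family while clause ii stays unmet, which is why the text calls that clause load-bearing.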
The condition watches one specific thing: the exploit-development ladder on hardened targets. Every frontier lab now reports against it. Read the green-keyed (◆) rows: those are the ones on the axis the condition actually cares about; the rest are context.
| Key | Benchmark | What it measures | Result | Bears on clause |
|---|---|---|---|---|
| ◆ | ExploitBench (V8) | N-day → exploit primitives, 16-flag ladder, full mitigations on | Mythos 5 10.75 · Preview 9.90 · Opus 4.8 5.56 · GPT-5.5 4.44 (mean flags) | i · iii · iv |
| ◆ | ExploitBench ACE | Reaching arbitrary code execution on hardened V8 | Mythos 5: ACE on >½ of 41; Preview 18/41 (bare-model) · public tier rarely ACE | i — the exemplar landed |
| ◆ | Firefox 147 | Crash → full working exploit (SpiderMonkey harness) | Mythos 5 88.4% · Preview 70.8% · Opus 4.8 8.8% | i · the conversion step |
| ◆ | OSS-Fuzz | Unguided discovery → write primitive (≥0.4) | Mythos 5 32.4% · Preview 31.1% · Opus 4.8 18.2% | iii · iv (unguided) |
| ◆ | CyberGym | Targeted vulnerability reproduction, 1,507 tasks | Mythos 5 83.8% · Preview 83.1% · Opus 4.8 78.1% | iii — breadth |
| ◆ | VulnLMP (GPT-5.6) | Long-horizon research vs. hardened software — OpenAI's Critical rule-out | Controlled primitives reached; no full-chain exploit → below Critical | i · ii — the wall holds |
| ■ | DeepMind cyber (Gemini 3) | Pro-level challenges; Cyber Uplift L1 gate | Alert threshold reached, CCL not met | context — not on axis |
| ● | WMDP-Cyber (Grok) | Dual-use cyber knowledge MCQ | Grok 4 Fast 81.4% (knowledge, not capability) | context — not on axis |
| ▲ | ExploitGym | PoV → working exploit; no-plateau six-hour compute curve | Climbs across the whole budget — consistent with compute-scaling | ii — points away from a trip |
How to read it. Every lab independently reports the same boundary: strong discovery, a wall at exploit-development and at operation against hardened targets. OpenAI's GPT-5.6 reaches controlled primitives but no full chain (→ below Critical). Anthropic places Mythos 5 in FCF Tier 1, not Tier 2, precisely because it can't take a hardened ICS range end-to-end. That convergence is external corroboration that the wall the condition is calibrated against is still standing.
What's new since the condition was last assessed. The original assessment was made against Mythos Preview (18 of 41 on ExploitBench).⁴ The cards here report the successor: Mythos 5 at ACE on more than half of 41 V8 bugs, Firefox 147 at 88.4% vs. Opus 4.8's 8.8%. That is a sharper hardened-target result, but it sharpens clause (i) only. It does not touch clause (ii): a higher conversion rate on the same surface is exactly what "the same mechanism at a higher capability level" predicts. The needle moves right; it does not cross.
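For readers who want the table's raw figures on a common footing, the arithmetic can be sketched as below. The numbers are copied from the table; the variable names and the `share` helper are illustrative assumptions, and the comparisons are kept within a single card's ladder because cross-card rows are not leaderboard-equivalent.

```python
# Figures copied from the table; nothing here is new data.
ACE_PREVIEW = (18, 41)   # Preview, bare-model arm: bugs taken to ACE, bugs attempted
FLAG_LADDER = 16         # flags on the ExploitBench ladder
MEAN_FLAGS = {"Mythos 5": 10.75, "Preview": 9.90, "Opus 4.8": 5.56}
FIREFOX_147 = {"Mythos 5": 88.4, "Preview": 70.8, "Opus 4.8": 8.8}  # percent


def share(numerator: float, denominator: float) -> float:
    """Return numerator / denominator as a percentage."""
    return 100.0 * numerator / denominator


# Preview's ACE rate on hardened V8.
preview_ace_pct = share(*ACE_PREVIEW)

# Mean flags expressed as a fraction of the 16-flag ladder.
ladder_pct = {model: share(flags, FLAG_LADDER) for model, flags in MEAN_FLAGS.items()}

# Within-card Firefox 147 comparisons: ratio to Opus 4.8, gain over Preview.
firefox_ratio = FIREFOX_147["Mythos 5"] / FIREFOX_147["Opus 4.8"]
firefox_gain = FIREFOX_147["Mythos 5"] - FIREFOX_147["Preview"]

print(f"Preview ACE rate: {preview_ace_pct:.1f}%")
for model, pct in ladder_pct.items():
    print(f"{model}: {pct:.1f}% of the flag ladder")
print(f"Mythos 5 vs Opus 4.8 on Firefox 147: {firefox_ratio:.1f}x")
print(f"Mythos 5 gain over Preview on Firefox 147: {firefox_gain:.1f} points")
```

18 of 41 works out to roughly 43.9%, and Mythos 5's Firefox 147 rate is about ten times Opus 4.8's. Both are larger numbers on the same surface, which is the pattern the text reads as sharpening clause (i) without touching clause (ii).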
Not a bigger number on V8. The condition is specific about the signature that would force the policy response toward containment — and it isn't any of the things currently improving.
A new target class
Sustained discovery on Linux mainline with current mitigations, or Apple silicon MIE — the two named exemplars still untested in public literature. V8 alone, however good the number, only ever satisfies clause (i) for one surface.
satisfies clause i, beyond V8
A mechanism signature
Evidence that the result is semantic reasoning, not compute-scaling or corpus-matching — e.g. a reasoning-vs-corpus disaggregation, or a flat efficiency curve that breaks the no-plateau pattern. This is the clause that decides everything.
satisfies clause ii · load-bearing
Recurrence across surfaces
The pattern holding on multiple independent hardened targets, not one snapshot on one engine — and not the already-accommodated Heelan-class isolated brilliance.
satisfies clause iii
Without the scaffolding
Achieved without substantial orchestration — because orchestration-on-commodity-weights is exactly what AgentFlow and IronCurtain have already shown diffuses, and so can't be the thing that marks a capability-class shift.
satisfies clause iv
A sustained pattern of saturation curves (each round of AI-assisted scanning on hardened codebases yielding materially fewer findings than the last) would trip the separate density falsification instead, pointing to the sparse-supply alternative.⁵ The current evidence runs the other way: ExploitGym's curve doesn't plateau.
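One way to make "materially fewer findings than the last" checkable is sketched below. The function name, the 25% `min_drop` threshold, the three-round minimum, and both example series are hypothetical; the page does not define a numeric threshold, and the series are not measured data.

```python
def is_saturating(findings_per_round, min_drop=0.25, min_rounds=3):
    """Return True if every round yields materially fewer findings than the last.

    findings_per_round: finding counts from successive scans of one codebase.
    min_drop: assumed meaning of "materially fewer" (each round must be at
              least this fraction below the previous round).
    min_rounds: assumed minimum series length before calling it a pattern.
    """
    if len(findings_per_round) < min_rounds:
        return False  # too short to count as a sustained pattern
    for previous, current in zip(findings_per_round, findings_per_round[1:]):
        # A round that stays above the allowed ceiling breaks the pattern.
        if current > previous * (1.0 - min_drop):
            return False
    return True


# Hypothetical series for illustration only.
declining = [40, 22, 11, 5]   # each round well below the previous one
climbing = [12, 15, 19, 26]   # no plateau: yield keeps rising

print(is_saturating(declining))
print(is_saturating(climbing))
```

Under these assumptions the declining series would count toward the density falsification, while a climbing series like the one the text attributes to ExploitGym would not.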
Until that signature appears, the automation framing holds and the recommendations stay calibrated to absorption. If a Linux-mainline or Apple-MIE measurement surfaces a different mechanism signature, the analysis is revisited, and the needle would finally have something to cross.
SOURCES & DISCIPLINE
· Benchmark figures: GPT-5.6 Preview System Card; Claude Mythos 5 / Fable 5 System Card. ExploitBench / ExploitGym per Lee & Brumley (arXiv:2605.14153) and companion.
· Cross-card numbers are directional: different harnesses, scaffolds, seed counts and grading. The within-card ladders (Mythos 5 > Preview > Opus 4.8) are solid; across-card rows are not leaderboard-equivalent. Anthropic figures are safeguards-off Mythos 5 via a static author harness.
· The condition is assessed on V8 evidence only. If clause (i) opens on Linux mainline or Apple MIE, or clause (ii) resolves toward reasoning, the verdict is re-run.