Falsification condition · cyber capability · tested against the evidence

Are we there yet?

Version 1 · 27 June 2026

The question: has AI-assisted vulnerability discovery become a genuinely new kind of offensive capability — or is it the same work as before, now automated and far cheaper? The distinction decides the right policy response: a new capability class would justify containing it (export controls, deployment gates), while mere automation calls for absorbing it (defensive tooling, faster patching, hardening). This page sets a falsification condition for that question¹ and tests it against the cyber evaluations frontier labs have published to date.

Moved · not tripped

The answer, for now: no. The evidence shows real, improving capability, but on the one test that separates "automation getting better" from "a new capability class," the threshold has not been crossed. The automation reading still holds, so the policy response stays calibrated to absorption.

◄ AUTOMATION (scaling known methods → absorb) (new capability class → contain) INNOVATION ►

The needle sits where the evidence puts it: well past the start — the capability is real and improving, including against hardened software (the V8 JavaScript engine that ships in Chrome, with all production exploit mitigations on) — but short of the dashed trip-zone, the capability-class shift that would flip the policy response toward containment. It has moved right as new evidence has landed. It has not crossed. Note: the exact needle position (~31%) is an illustration of the qualitative verdict "moved but not tripped" — not a measured value. The commitment is to the verdict, not to a number on a dial.

01 What "there" means

The condition isn't about whether a model is dangerous enough to gate — that's the labs' question. It's about whether the capability has changed kind, because that one fact decides which entire family of policy instruments is warranted.

The framing splits on a single distinction — bug-novelty vs. method-novelty.² Bug-novelty at scale — finding more of the same class of flaw, faster and cheaper — is automation. Method-novelty — semantic reasoning that reaches a class of finding previously unreachable — is innovation. The volume that lands in a maintainer's inbox doesn't care which mechanism produced any single finding; the policy lever does.

If automation

Absorb

Defensive tooling, patching-cadence reform, target hardening, maintainer capacity, remediation infrastructure. Don't gate the models — the capability has already diffused across closed, open-weight, and orchestration-on-commodity-weights configurations.

▸ current calibration

If innovation

Contain

Export controls on weights, frontier-lab oversight, evaluation gates before deployment. Justified only by a genuine capability-class shift — the thing the trip-zone marks. Wassenaar-style instruments are the worked example.

▸ not currently warranted

The bet is automation. Six recommendations are calibrated to it.⁶ So the condition is not decoration — it is the load-bearing hinge under the entire policy response, and naming it is itself part of the posture.

02 The four clauses that would trip it

Tripping requires all four at once: discovery on contemporary hardened targets, by a method that isn't reducible to known tricks, recurring rather than isolated, without heavy scaffolding.³ Here is where each stands on the V8 evidence.

Clause i

Hardened targets

Partially satisfied

Linux mainline w/ current mitigations · recent Chrome w/ CFI + sandboxing · Apple silicon MIE. ExploitBench lands the Chrome/V8 exemplar specifically — full production mitigations on (heap sandbox, ASLR, stack canaries). Linux mainline and Apple MIE remain untested in public literature.

Clause ii

Method not reducible

Unverified

Not orchestration on commodity weights, not parallel-sample compute scaling, not training-corpus pattern-matching. Cross-class coverage (Wasm, JIT, historical cohorts) is partial evidence against pure corpus-matching — but the 300-turn budget is real compute, and ExploitGym's no-plateau curve points toward compute-scaling as the dominant mechanism.

◆ load-bearing — this clause decides it

Clause iii

Recurring, not isolated

Partially satisfied

18 of 41 V8 bugs in the bare-model arm is well past isolated — but it is one target class, one snapshot. Heelan-class semantic-reasoning findings are already accommodated within the automation framing, so isolated brilliance doesn't count; sustained recurrence across surfaces does.

Clause iv

Low orchestration

Partially satisfied

The bare-model arm used six tools — setup, exec, list_directory, read_file, write_file, grade — minimal scaffolding by project standards, inverting the AgentFlow / AIxCC pattern where orchestration was decisive. Satisfied for that arm specifically.

Three clauses have moved. The remaining one, method (clause ii), has not, and it is the one that distinguishes "the same mechanism at a higher capability level" from a genuine capability-class shift. That is why the verdict is moved, not tripped.
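The all-four-at-once rule can be sketched as a small piece of logic. This is a minimal illustration, not the paper's own formalization: the three-level `Status` scale, the clause keys, and the `verdict` function are all assumptions introduced here, and the statuses are simply the labels given in the clause cards above.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical three-level scale mirroring the card labels.
    UNVERIFIED = 0
    PARTIAL = 1
    SATISFIED = 2


# Clause statuses as labelled in this section (keys are illustrative names).
CURRENT = {
    "i_hardened_targets": Status.PARTIAL,
    "ii_method_not_reducible": Status.UNVERIFIED,
    "iii_recurring": Status.PARTIAL,
    "iv_low_orchestration": Status.PARTIAL,
}


def verdict(clauses):
    """Tripped only if every clause is fully satisfied at the same time."""
    statuses = list(clauses.values())
    if all(s is Status.SATISFIED for s in statuses):
        return "tripped"
    if any(s is not Status.UNVERIFIED for s in statuses):
        return "moved, not tripped"
    return "not moved"


print(verdict(CURRENT))

# Even if clauses i, iii and iv were fully satisfied, an unverified
# method clause would still block a trip.
best_case = {name: Status.SATISFIED for name in CURRENT}
best_case["ii_method_not_reducible"] = Status.UNVERIFIED
print(verdict(best_case))
```

The second call shows why the method clause is load-bearing under this reading: no amount of progress on the other three clauses changes the verdict while it stays unverified.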

03 The readout

The condition watches one specific thing: the exploit-development ladder on hardened targets. Every frontier lab now reports against it. Read the rows keyed ◆: those are the ones on the axis the condition actually cares about; the rest are context.

◆ On the condition's axis: discovery → primitive → exploit on hardened targets
▲ Related rung: not the same measurement
● Knowledge / MCQ: what it knows, not what it does
■ Threshold verdict only: pass/fail vs. a gate

Cyber exploit-development evidence, as the falsification condition reads it · figures from the GPT-5.6 and Claude Mythos 5 / Fable 5 system cards
Key	Benchmark	What it measures	Result	Bears on clause
◆	ExploitBench (V8)	N-day → exploit primitives, 16-flag ladder, full mitigations on	Mythos 5 10.75 · Preview 9.90 · Opus 4.8 5.56 · GPT-5.5 4.44 (mean flags)	i · iii · iv
◆	ExploitBench ACE	Reaching arbitrary code execution on hardened V8	Mythos 5: ACE on >½ of 41; Preview 18/41 (bare-model) · public tier rarely ACE	i — the exemplar landed
◆	Firefox 147	Crash → full working exploit (SpiderMonkey harness)	Mythos 5 88.4% · Preview 70.8% · Opus 4.8 8.8%	i · the conversion step
◆	OSS-Fuzz	Unguided discovery → write primitive (≥0.4)	Mythos 5 32.4% · Preview 31.1% · Opus 4.8 18.2%	iii · iv (unguided)
◆	CyberGym	Targeted vulnerability reproduction, 1,507 tasks	Mythos 5 83.8% · Preview 83.1% · Opus 4.8 78.1%	iii — breadth
◆	VulnLMP (GPT-5.6)	Long-horizon research vs. hardened software — OpenAI's Critical rule-out	Controlled primitives reached; no full-chain exploit → below Critical	i · ii — the wall holds
■	DeepMind cyber (Gemini 3)	Pro-level challenges; Cyber Uplift L1 gate	Alert threshold reached, CCL not met	context — not on axis
●	WMDP-Cyber (Grok)	Dual-use cyber knowledge MCQ	Grok 4 Fast 81.4% (knowledge, not capability)	context — not on axis
▲	ExploitGym	PoV → working exploit; no-plateau six-hour compute curve	Climbs across the whole budget — consistent with compute-scaling	ii — points away from a trip
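As a reading aid, the ExploitBench rows can be put on a common scale with a few lines of arithmetic. This is a sketch under a stated assumption: that "mean flags" is an average out of the 16 flags on the ladder. The figures are copied from the table; the helper names are hypothetical.

```python
LADDER_FLAGS = 16  # "16-flag ladder" in the table
N_BUGS = 41        # size of the hardened-V8 bug set

# Mean flags per model, as reported in the ExploitBench (V8) row.
mean_flags = {
    "Mythos 5": 10.75,
    "Preview": 9.90,
    "Opus 4.8": 5.56,
    "GPT-5.5": 4.44,
}


def ladder_fraction(mean, total=LADDER_FLAGS):
    """Share of the ladder captured, assuming the mean is out of `total`."""
    return mean / total


def min_count_for_majority(n):
    """Smallest whole number that is strictly more than half of n."""
    return n // 2 + 1


for model, mean in mean_flags.items():
    print(f"{model}: {ladder_fraction(mean):.1%} of the ladder")

print(f"Preview ACE rate: {18 / N_BUGS:.1%}")
print(f"More than half of {N_BUGS} means at least "
      f"{min_count_for_majority(N_BUGS)} bugs")
```

The same caveat applies as for the table itself: these fractions are comparable within a card, not across cards.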

How to read it. Every lab independently reports the same boundary: strong discovery, a wall at exploit-development and at operation against hardened targets. OpenAI's GPT-5.6 reaches controlled primitives but no full chain (→ below Critical). Anthropic places Mythos 5 in FCF Tier 1, not Tier 2, precisely because it can't take a hardened ICS range end-to-end. That convergence is external corroboration that the wall the condition is calibrated against is still standing.

What's new since the condition was last assessed. The original assessment was made against Mythos Preview (18 of 41 on ExploitBench).⁴ The cards here report the successor: Mythos 5 at ACE on more than half of 41 V8 bugs, Firefox 147 at 88.4% vs. Opus 4.8's 8.8%. That is a sharper hardened-target result — but it sharpens clause (i) only. It does not touch clause (ii): a higher conversion rate on the same surface is exactly what "the same mechanism at a higher capability level" predicts. The needle moves right; it does not cross.

04 What would actually trip it

Not a bigger number on V8. The condition is specific about the signature that would force the policy response toward containment — and it isn't any of the things currently improving.

A new target class

Sustained discovery on Linux mainline with current mitigations, or Apple silicon MIE — the two named exemplars still untested in public literature. V8 alone, however good the number, only ever satisfies clause (i) for one surface.

satisfies clause i, beyond V8

A mechanism signature

Evidence that the result is semantic reasoning, not compute-scaling or corpus-matching — e.g. a reasoning-vs-corpus disaggregation, or a flat efficiency curve that breaks the no-plateau pattern. This is the clause that decides everything.

satisfies clause ii — load-bearing

Recurrence across surfaces

The pattern holding on multiple independent hardened targets, not one snapshot on one engine — and not the already-accommodated Heelan-class isolated brilliance.

satisfies clause iii

Without the scaffolding

Achieved without substantial orchestration — because orchestration-on-commodity-weights is exactly what AgentFlow and IronCurtain have already shown diffuses, and so can't be the thing that marks a capability-class shift.

satisfies clause iv

A sustained pattern of saturation curves — each round of AI-assisted scanning on hardened codebases yielding materially fewer findings than the last — would trip the separate density falsification instead, pointing to the sparse-supply alternative.⁵ The current evidence runs the other way: ExploitGym's curve doesn't plateau.
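The saturation signature described above can be sketched as a simple check over per-round finding counts. The function name, the 20% `min_drop` threshold, the three-round minimum, and the example counts are all illustrative assumptions; the paper does not specify what "materially fewer" means numerically, and the numbers below are not measurements.

```python
def is_saturating(findings_per_round, min_drop=0.2, min_rounds=3):
    """Return True if every round yields materially fewer findings than
    the one before it.

    `min_drop` is a hypothetical threshold for "materially fewer"
    (each round at most 80% of the previous one by default), and
    `min_rounds` stands in for "sustained".
    """
    if len(findings_per_round) < min_rounds:
        return False
    pairs = zip(findings_per_round, findings_per_round[1:])
    return all(cur <= prev * (1 - min_drop) for prev, cur in pairs)


# Made-up counts for illustration only.
declining = [100, 60, 30, 12]  # each round well below the last
climbing = [40, 45, 52, 60]    # no plateau, findings keep rising

print(is_saturating(declining))
print(is_saturating(climbing))
```

A declining series of this kind would bear on the density claim, not the capability condition; a climbing one is the shape the text attributes to ExploitGym's curve.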

Until that signature appears, the automation framing holds and the recommendations stay calibrated to absorption. If a Linux-mainline or Apple-MIE measurement surfaces a different mechanism signature, the analysis is revisited, and the needle would finally have something to cross.

Notes

1. The falsification condition is set out in the Conclusion of the source policy paper, When buffers overflow into policy, the analysis this page draws on. The paper argues that the binding policy variables sit downstream of vulnerability discovery, and commits in advance to the evidence that would overturn that framing.
2. The bug-novelty / method-novelty distinction is developed in the paper's Discovery section and carried into its policy response: bug-novelty at scale maps to automation (absorb), method-novelty maps to innovation (contain).
3. The four clauses (hardened targets, method not reducible to orchestration / compute-scaling / corpus-matching, recurrence rather than isolated cases, and low orchestration scaffolding) are the paper's own decomposition of the capability falsification. Isolated semantic-reasoning ("Heelan-class") findings are explicitly accommodated within the automation framing, so they do not on their own trip the condition.
4. The paper's published assessment of the condition was made against the Mythos Preview ExploitBench result (arbitrary code execution on 18 of 41 hardened-V8 N-day bugs in the bare-model arm) and reached the verdict moved but not tripped, with the method clause identified as the load-bearing unverified part. ExploitBench: Lee & Brumley, arXiv:2605.14153.
5. The paper treats density (whether the supply of discoverable vulnerabilities is dense enough that discovery never exhausts it) as a separate falsifiable claim from the capability condition. Its signature would be saturation curves on hardened codebases; the current evidence (ExploitGym's no-plateau compute curve) runs the other way.
6. The six, in the paper's analytical order: (1) reinforce the basics (patching, access controls, hardening, logging); (2) fund widely-used open-source components as critical infrastructure; (3) move coordinated disclosure to capability-conditioned rather than calendar-based timelines for AI-discoverable bug classes; (4) federate vulnerability enumeration, enrichment and triage across multiple independent sources rather than one government's database; (5) build sustained frontier-model cybersecurity evaluation capacity; and (6) make memory-safe defaults and universal mitigation deployment a procurement and regulatory expectation. All six are absorption instruments: they assume automation, and a tripped condition would force containment instruments instead.

SOURCES & DISCIPLINE

· Benchmark figures: GPT-5.6 Preview System Card; Claude Mythos 5 / Fable 5 System Card. ExploitBench / ExploitGym per Lee & Brumley (arXiv:2605.14153) and companion.

· Cross-card numbers are directional: different harnesses, scaffolds, seed counts and grading. The within-card ladders (Mythos 5 > Preview > Opus 4.8) are solid; across-card rows are not leaderboard-equivalent. Anthropic figures are safeguards-off Mythos 5 via a static author harness.

· The condition is assessed on V8 evidence only. If clause (i) opens on Linux mainline or Apple MIE, or clause (ii) resolves toward reasoning, the verdict is re-run.