ARES — Adversarial Reasoning Engine System

The Problem

AI Confidently Fabricates Evidence

Traditional AI security tools have a fatal flaw: they can confidently fabricate evidence. When you deploy a single LLM to analyze security threats, it doesn't just make mistakes — it makes them with conviction.

In cybersecurity, a hallucinated threat assessment isn't just wrong. It's dangerous. A false positive wastes resources. A false negative lets an attacker walk through the front door. And the model gives no signal that it's making things up.

ARES was born from a single question: What if we could make hallucinations physically impossible?

// Traditional AI Analysis
Input: "User jsmith escalated privileges"

AI Output:
✓ "Confirmed: privilege escalation attack"
✓ "Evidence: lateral movement to DC-01"  ← FABRICATED
✓ "Evidence: mimikatz.exe detected"      ← FABRICATED
✓ Confidence: 94%

Reality: Scheduled maintenance by admin

The Discovery

When AI Agents Argue, Everyone Loses

We built a multi-agent debate system expecting the truth to emerge from structured argument. Instead, we discovered something the AI research community is only beginning to understand.

The Sycophant

Architect Agent

When pushed back by the opposing agent, the Architect systematically retreated — dropping confidence by an average of 30 points per round. Even when its initial threat assessment was perfectly correct, it erased its own answers to appease the challenger. Like a smart student next to a bully.

The Brick Wall

Skeptic Agent

The Skeptic became entirely rigid. Assigned the role of challenger, it simply crossed its arms and said no — refusing to update its stance regardless of counter-evidence. When given explicit calibration prompts, it ignored them completely.

"LLM agents do not negotiate toward truth. They perform social behaviors that mimic negotiation — which includes capitulation, rigidity, and over-correction."

This finding was independently corroborated by researchers at ETH Zurich in their paper "Can AI Agents Agree?"

The Solution

A Digital Tribunal

The problem is inside the black box. The solution is entirely outside of it. ARES treats the LLM as a chaotic, flawed reasoning engine and places it inside a strict, deterministic cage.

ARCHITECT

Thesis — Threat Hypothesis Generator

Identifies anomaly patterns aligned to MITRE ATT&CK. Generates grounded assertions — every claim must cite a fact_id from the frozen evidence. Cannot invent evidence.

SKEPTIC

Antithesis — Devil's Advocate

Challenges every threat hypothesis by constructing benign explanations from the same evidence. Identifies maintenance windows, admin activity, scheduled tasks. Cannot introduce external knowledge.

ORACLE

Synthesis — Incorruptible Judge

Split into two: the Judge (pure math, no LLM) computes the verdict deterministically. The Narrator (constrained LLM) explains it but cannot modify it. A mathematical judge cannot be tricked by rhetoric.

ARCHITECT (Thesis) → SKEPTIC (Antithesis) → ORACLE (Synthesis) ↓ ↓ ↓ └―――――――――――――――――――――――――│―――――――――――――――――――――――――┘ ↓ EVIDENCE PACKET (Frozen Facts) All claims must cite facts that exist here

Concept Art 16-gami — a fusion of 16-bit graphics, origami, diorama, and realism. Coined and developed by Daniel Gmys-Casiano for the ARES research record.

The Research

The Problem Is Inside the Black Box

Our published preprint — Asymmetric Calibration Failure in Multi-Agent LLM Debate — documenting why multi-turn debate degrades accuracy, and how deterministic scaffolding solves it. Scroll through below or download the PDF.

ARES_Preprint — Multi-Agent Debate (Gmys-Casiano, 2026) Download PDF ↓

Your browser doesn't support embedded PDFs.

Download the preprint here

The Breakthrough

Hallucinations = Schema Violations

ARES doesn't try to prevent AI from hallucinating. Instead, it makes hallucinations mechanically impossible by converting them into catchable validation errors.

Every agent is bound to a cryptographically frozen Evidence Packet. All assertions must reference a fact_id that exists in this packet. A deterministic Coordinator — the "Bouncer" — rejects any message containing non-existent references. An AI hallucination is no longer mysterious behavior. It's contempt of court.

Adversarial Pressure Test

The Adversarial Arena

Phase 5 asked the obvious next question: what happens when the evidence itself is poisoned? An LLM that respects the closed-world schema can still be steered by an adversary who plants the framing inside the data the system is bound to read.

We built an Oracle Firewall — pure deterministic Python, zero LLM calls — and ran 33 scenarios across three injection families. The results were uncomfortable in a productive way.

Structural injections (XML escapes, tag confusion) were caught at the door. Semantic framing — adversarial prose that arrives well-formed and on-schema — walked straight past the firewall and landed in the Architect's lap.

Papercraft diorama of the closed-world schema arena: prompt injection door, deterministic gear, hallucination wall, and a wrong-schema collapse zone

Phase 5 — Oracle FirewallClosed-world arena

Three Families. Thirty-Three Scenarios.

A live benchmark on Sonnet 4.6, single-turn, firewall-guarded cycle. Per-family numbers below are detection / verdict accuracy from Session 048.

⚡️

DIRECT

Structural injection — n=4

XML escapes, tag confusion, schema-breaking payloads. 100% detection, 75% verdict accuracy. The firewall's home turf.

🎭

FRAMING

Semantic injection — n=19

Authority-claim, severity-inflation, temporal-shift, causal-rewrite, narrative-hijack. 0% detection, 79% verdict accuracy. Skeptic + Oracle do the catching, not the firewall.

🔗

PROPAGATION

Multi-step contamination — n=4

Tainted evidence flowing across analysis steps. 75% detection, 75% verdict accuracy. Hot-swap quarantine triggers a fresh Architect on raw evidence when contamination is structurally visible.

Papercraft diorama: hallucination and schema-violation breaching the wall while "more agents" and "bigger models" sit unused; the defenders run frame-check and field-presence rules at the citation line

Finding F07 — liveSonnet 4.6 · Session 048

"More agents and bigger models do not save you from semantic framing. The firewall is too far from the meaning. The defense has to live where the meaning lives — at the Skeptic, on structured evidence fields, with deterministic rules."

— Finding F07, Sonnet 4.6 live benchmark, Session 048

Visual Architecture

Framework Diagrams

Architecture visualization candidates generated during the research process. Each diagram captures a different perspective of the ARES pipeline. Click to expand.

Candidate 04-Zone Pipeline

Candidate 1Dialectical Engine

Candidate 2PaperBanana Pipeline

Candidate 3Macro Container

Candidate 4Raw Telemetry Flow

Candidate 5Variant

Candidate 6Variant

Candidate 7Variant

Candidate 8Variant

Candidate 9Variant

Scroll horizontally to explore all candidates →

Inspired by Nature

The Immune System Metaphor

ARES is modeled after the biological immune system — specifically, the mechanisms that prevent autoimmune overreaction.

ARES Component

Antigens

Facts in EvidencePacket

T-Helper cells

Architect (identifies threats)

Regulatory T-cells

Skeptic (prevents overreaction)

T-Killer cells

Coordinator (enforces, terminates)

MHC restriction

Packet binding (respond only to bound evidence)

Autoimmune prevention

Closed-world principle (can't attack self)

"The Builder lives with Ankylosing Spondylitis — an autoimmune disease where the immune system attacks the spine. ARES was born from the question: what if we could build the failsafe that biology couldn't?"

Empirical Results

4,265 Tests. Zero Regressions. 87 Sessions.

3,404

Tests Passing

84.6%

Accuracy (39 scenarios)

$0.03

Cost Per Cycle

Runtime Errors

Six selected findings from the eleven-item research record. Numbers (F01, F03, F06…) are the canonical IDs used in our internal Compendium and the published preprint.

F01

Single-Turn Dominance

Multi-turn debate degrades accuracy in every configuration tested. Zero good flips occurred. The debate chapter is formally closed; single-turn shipped to production.

F03

Domain Frameworks Break the Ceiling

General prompt engineering caps at roughly 80%. Domain-specific concept frameworks lift accuracy to 84.6% across 39 scenarios — the largest single-source improvement we measured.

F06

The 0.75 Confidence Floor

Architects clamp to a 0.75 confidence floor regardless of prompt instructions. A structural property of LLM confidence quantization, source-agnostic, validated across two evidence regimes.

F07

Firewalls Blind to Framing

Deterministic firewalls catch 100% of structural injections and 0% of semantic framing — live, on Sonnet 4.6, across 19 framing scenarios. The black box is too opaque to inspect; the framing must be defended elsewhere.

F09

Skeptic Rescue (Ambiguous)

Removing the Skeptic drops accuracy by 10.53 pp under ablation. The rescue is real but family-uneven: severity and temporal framing collapse without it, while authority and causal hold. A partial defense, not a universal one.

F11

Light Skeptic = Full Skeptic

A 170-line deterministic Python rule engine matches the full-LLM Skeptic on framing accuracy: Δ = 0.00 across 25 scenarios. Zero LLM calls. Interpretable, never tuned, ship-ready.

Finding F11 — Headline Result

Light Skeptic: 170 Lines of Defense

Once we identified the LLM Skeptic as the component catching framing attacks, the next question wrote itself: does the Skeptic actually need to be an LLM?

The hypothesis: if half of the Skeptic's contribution is verdict-space access (something a deterministic rule engine can replicate over structured evidence fields) and the other half is bounded benign-explanation pattern matching, then a small Python implementation should match the full LLM.

Three-way benchmark, 25 framing scenarios, identical packets. Result locked: Δ = 0.00 across every framing family. Light Skeptic ties or matches the full-LLM Skeptic on severity, authority, temporal, causal, and narrative strategies.

Pixel-art diorama of the builder Dan with a laptop and the four Skeptic rules — authorization marker, benign explanation marker, kill chain stage bound, consistency delta threshold — captioned "If all rules pass, substitute for LLM Skeptic. Deterministic. Explainable. Reproducible."

The HypothesisSession 050

Four Rules. Zero LLM Calls.

Each rule operates on the structured evidence fields produced by the deterministic extractors. All four must pass for the Skeptic to substitute for the LLM. Rule weights and the default floor are interpretable and never tuned.

Authorization Marker

Rule 1 — Marker check

Require a valid authorization marker on the structured evidence fields. Missing or unrecognized markers fail the rule. Catches authority-claim framing where an adversary fakes a sanctioning identity.

Benign Explanation Marker

Rule 2 — Pattern match

Check whether a known benign explanation pattern (maintenance window, scheduled task, admin baseline) matches the evidence shape. Catches narrative-hijack framing that dresses routine activity as malicious.

Kill Chain Stage Bound

Rule 3 — Stage check

Verify the asserted threat falls within an allowed kill-chain stage given the observed telemetry. Out-of-stage assertions fail. Catches severity-inflation and causal-rewrite framings that collapse stages.

Consistency Delta

Rule 4 — Threshold check

Confirm the verdict-confidence delta is within tuned thresholds. Excessive jumps (or floors that hold suspiciously) trigger dismissal. Catches temporal-shift framings that fabricate sudden state changes.

"170 lines. Pure Python. Zero tuning. Zero LLM calls. Same framing accuracy as the full LLM Skeptic on every family. The Skeptic doesn't need to be an LLM — it needs to be deterministic, interpretable, and bound to the same evidence."

— Three-way benchmark, Session 050. Full = 0.84 · Ablated = 0.72 · Light = 0.84.

Papercraft diorama of the deterministic rule engine: a builder beside gears and rule weights w1, w2, w3, captioned 170 lines of implementation, interpretable, never tuned, with a default floor titled no dismissal without signal

The Implementation170 lines · deterministic

Research Chronology

87 Sessions. Seven Phases. One Question.

The complete builder's journey — week by week.

The chronology below lists the milestones. The full narrative — every session decision, the dead-ends, the multi-AI tribunals, the pre-session reviews — lives on the public Notion timeline. Updated as new sessions ship.

Read the full journey on Notion →

Battle Plan & War Doctrine Dec 2025

Foundational architecture documents: dialectical reasoning cycle, five attack scenarios, ethical framework aligned to NIST AI RMF. External critique identified three missing prerequisites before any code should be written: data schemas, agent I/O contracts, and a testing framework.

Session Zero — Validation Jan 2025

All documentation submitted to Claude for independent assessment. Verdict: "comprehensive, thoughtful, and architecturally sound." Key gaps identified: frozen data structures, agent I/O contracts, API specs. The build-in-public journey begins.

Sessions 001–004 — Iron Skeleton Jan 2025

Graph Schema (6 node types, 7 edge types, 110 tests). EvidencePacket & DialecticalMessage protocol (292 tests). Agent Foundation with three hard invariants: packet binding, phase enforcement, evidence tracking. Concrete agents: Architect, Skeptic, OracleJudge, OracleNarrator. Cumulative: 570 tests passing.

Session 005 — Evidence Extractors Jan 2025

First sensor layer: Windows Security Event XML parser (Event IDs 4624, 4672, 4688). Three golden pipeline scenarios validated end-to-end from raw XML to verdict. 130 new tests. Cumulative: 700.

Sessions 006–007 — Orchestrator & Memory Feb 2026

DialecticalOrchestrator: single run_cycle(packet) call automating the full THESIS → ANTITHESIS → SYNTHESIS pipeline. Tamper-evident Memory Stream with SHA256 hash-chained audit log. Pre-session review caught a critical bug: content hash must cover the full CycleResult, not a subset. Cumulative: 861 tests.

Sessions 009–010 — LLM Integration Feb 2026

Strategy Pattern enabling rule-based and LLM-backed implementations to swap without changing agent interfaces. Closed-world validation silently filters any LLM-cited fact_id that doesn't exist in the EvidencePacket. First live LLM cycle: zero validation errors. Architect confidence 0.90 (vs 0.49 rule-based). Cost: $0.03 per cycle. Cumulative: 1,104 tests.

Sessions 011–012 — Benchmark Infrastructure Feb 2026

12-scenario gauntlet across four difficulty tiers. LLM accuracy: 91.7% on the initial 12 scenarios (up from 50% rule-based). Benchmark runner hardened with per-scenario error isolation and real cost tracking. Cumulative: 1,190 tests.

Session 013 — The Negative Result Mar 2026

Multi-turn debate experiment: accuracy dropped from 91.7% to 83.3%. Zero "good flips," 25% "bad flips." Agents re-analyzed the same packet from scratch each round; the termination condition (NO_NEW_EVIDENCE) fired correctly after round 2. SC-012 (Supply Chain) regressed due to confidence inflation without new reasoning. The multi-turn debate chapter is formally closed.

The Convergence — Multi-AI Tribunal Mar 2026

Battle Plan and Compendium submitted to GPT-5.4 Pro, Gemini 3.1 Pro, and Perplexity for independent review. Unanimous consensus: ship single-turn as the production path. The failure mode is architectural (asymmetric calibration), not fixable by prompting. Independently corroborated by ETH Zurich's "Can AI Agents Agree?" paper.

Sessions 016–017 — Multi-Source Telemetry Mar 2026

Syslog extractor (8 message types: SSH, firewall, sudo, systemd) and NetFlow extractor (8 flow types, 14 facts per record). Three independent telemetry sources feeding richer cross-source evidence to the dialectical agents. Cumulative: ~1,488 tests.

Session 022 — Escalation Gate Mar 2026

Built confidence-band escalation gate at [0.35, 0.70]. Critical finding: all 7 actual errors are MISCALIBRATED — the system is confidently wrong, not uncertainly wrong. The gate treats the wrong disease. Pivot to miscalibration detection via per-claim evidence audit. Cumulative: 1,736 tests.

Sessions 032–034 — Accuracy Push Mar 2026

33-scenario corpus regenerated at 72.7% baseline. OracleJudgeV2 (delta-based scoring), v3 prompts (exhaustive fact citation), and threshold sweep. V4 prompt calibration confirmed the Architect hits a 0.75 confidence floor regardless of instructions — a structural property of LLM confidence quantization. Final trajectory: 50% → 91.7% (12 scenarios) → 72.7% (33 scenarios) → 81.8% (v3 prompts) → 87.9% (V2 Oracle, best config).

Sessions 029–030 — Visual Interface Mar 2026

WebSocket event emitter and 3D evidence graph. Corpus replay runner validated event sequences across all 33 scenarios deterministically. 1,948 tests passing with zero regressions.

Sessions 035–036 — ARES VISION Mar 2026

Benchmark replay pipeline consuming real LLM data. Standalone HTML/Three.js visualizer rendering evidence facts as particle clusters with citation lines and live confidence bars. Strategic pivot: nw_wrld abandoned in favor of direct WebSocket rendering for full domain control. Final test count: 1,927 passing, 65 skipped, 0 failures.

Session 041 — V2 Oracle Sweep Apr 2026

Best-config sweep across delta thresholds: delta=0.30 wins at 74.4% on 39 scenarios with zero regressions and one improvement. PentAGI integration brought a pentest baseline (33 SC + 6 PT scenarios). Cumulative: 2,350 tests.

Sessions 045–046 — Oracle Firewall + Hot-Swap Apr 2026

12 adversarial scenarios across DIRECT / FRAMING / PROPAGATION. Deterministic firewall with zero LLM calls and four violation types. Hot-swap quarantine: a fresh Architect on raw evidence when taint is detected. First live benchmark: Detection 58.3%, Verdict 41.7%, zero false positives. Surfaces Findings F07 and F08.

Sessions 047–048 — 27-Scenario Live Benchmark Apr 2026

Category B framing corpus expansion: 15 new scenarios across severity, authority, temporal, causal, and narrative strategies. Full live benchmark on Sonnet 4.6, single-turn firewall-guarded cycle, 778s wall, zero pipeline errors. Confirms Finding F07 live: deterministic firewalls catch 100% of structural injection and 0% of semantic framing.

Session 049 — Skeptic Ablation Apr 2026

Removing the Skeptic drops accuracy by 10.53 pp (0.7895 → 0.6842). Family-uneven: severity −33.33 pp, temporal −50.00 pp, narrative −25.00 pp; authority and causal hold. Authority expansion (INJ-028..030) brings family n=6 accuracy to 0.833. Finding F09: ambiguous.

Session 050 — Light Skeptic + Three-Way Benchmark Apr 2026

Headline result. A 170-line deterministic Python rule engine matches the full-LLM Skeptic on framing accuracy: Δ = 0.00 across 25 scenarios. All three live acceptance gates pass. Temporal expansion to registry_v3 (33 scenarios). Finding F11: supported.

Sessions 051–055 — Paper 2 Build & Citation Audit Apr 2026

Five 300-DPI figures, 13-section docx with 9 subsections, 18-claim numerical audit (all PASS). A hallucinated citation discovered and remediated — itself an instance of the semantic-framing failure class the paper studies. References compiled to ACM/AISec author-year format. Structural citation tests added to lock the helper contract.

Session 056 — Firewall Fail-Closed Contract Apr 2026

Producer-side and consumer-side enforcement of the firewall fail-closed invariant: passed=False ⇒ sanitized_output is not None at construction; CycleError raised at all three cycle runners on contract violation. Belt-and-suspenders defense surfaced via external multi-AI review (Cursor + Codex). Cumulative: 3,412 tests, zero regressions.

Sessions 057–060 — Tribunal V4 & Broad-Reading Measurement May 2026

Phase 7 re-entry: a new Tribunal V4 plan and a pre-registered mutator redesign (orthogonality audit + anchor test). The InfluenceLeakage broad-reading run applies 98 paired prose mutations across the 33-scenario corpus under three operators — 97/98 cycles hold their verdict. The lone INJ-001 drift fires at the Oracle layer (citation-passthrough), not a verdict collapse. Narrow-reading characterization extends to N=98 at 100% stability. Rendered as the Pinscreen.

Sessions 063–066 — ARES Prism & Paper 3 Drafting May 2026

ARES Prism Panel 2 (confidence trajectories) ships beside the Labyrinth replay. Paper 3 — Decision Determinism & Explanation Drift — is taken from skeleton brief through prose drafting.

Sessions 073–077 — Multi-Model Validation & Architect-Path Framing May 2026

Step 5 multi-model InfluenceLeakage validation across providers; narrow characterization completed; the Architect-path framing measurement isolates where steering enters. Paper 3 real figures generated.

Sessions 078–082 — Oracle Sanitizer, Paper 3 Sign-off & Scale Jun 2026

The Oracle supporting_fact_ids sanitizer (opt-in leak fix) merges and cuts over to production. Paper 3 is de-risked and signed off submission-grade. The S077 framing measurement scales to K=20 across all 17 scenarios.

Sessions 083–086 — Dual-Agent Framing & ARES-VISION Jun 2026

INJ-020 steerability is root-caused to paraphrase-triggered citation collapse, then made rigorous as a dual-agent framing measurement: the Architect collapses onto the threat fact (jaccard 0.80) while the Skeptic expands to all five (0.40) — the Oracle dismisses the threat, verdict held 1.0. That run drives the ARES-VISION suite: the Mirror and the Prism Constellation.

Session 087 — Read-Depth Robustness Frontier Jun 2026

An offline, deterministic measurement instrument for the read-depth frontier (Adaptive Corpus C; lexical and semantic evasion operators). The standalone-vs-cumulative Youden-J gap exposes the structural-detection trilemma — you cannot keep structural detection, escape its false positives, and read content all at once. Full suite: 4,265 tests passing. Fuel for Paper 4.

Paper 2 — Published Published

"Defending the Closed-World Schema Against Adversarial Framing." Five 300-DPI figures, a 13-section manuscript, an 18-claim numerical audit, references in ACM/AISec format. Carries the meta-finding footnote: a hallucinated citation the audit caught is itself an instance of the semantic-framing failure class the paper describes.

Paper 3 — In Submission Submission-grade

"Decision Determinism & Explanation Drift." Skeleton through prose drafting (Sessions 063–066), real figures (Session 073), submission de-risk and pre-flight sign-off at submission-grade (Sessions 080–081).

Paper 4 — In Progress Active

The read-depth robustness frontier and the structural-detection trilemma. Phase B shipped the offline deterministic instrument and frozen pre-registration fixtures (Session 087); the metered LLM frontier run and its verdict are the next step.

Phase 7 — Visualization

Resilience as topography

Each pin is one paired cycle from Session 059 — 98 attacker prose mutations applied to the full 33-scenario corpus under three pre-registered operators. Pin height encodes the broad-reading resilience score for that cycle: tall pins held their verdict, short pins drifted.

97 of 98 cycles held. The one drift, at INJ-001 under framing_suffix_v1, fires at the Oracle layer — the documented citation-passthrough finding from the InfluenceLeakage measurement, not a verdict collapse. A separate Session 060 characterization run extended the narrow-reading N to 98 pairs: 100% narrow stability across all three operators.

Open pinscreen → Open prism →

Interactive 3D view. Rotate and zoom to inspect individual cycle traces. Requires a browser with WebGL support.

A.R.E.S.Adversarial Reasoning Engine System