ARES VISION
56 Sessions — Six Phases Complete

A.R.E.S.
Adversarial Reasoning Engine System

A dialectical AI framework that turns hallucinations into schema violations — not mysterious behavior. Built with structured paranoia and adversarial thinking.

AI Confidently Fabricates Evidence

Traditional AI security tools have a fatal flaw: they can confidently fabricate evidence. When you deploy a single LLM to analyze security threats, it doesn't just make mistakes — it makes them with conviction.

In cybersecurity, a hallucinated threat assessment isn't just wrong. It's dangerous. A false positive wastes resources. A false negative lets an attacker walk through the front door. And the model gives no signal that it's making things up.

ARES was born from a single question: What if we could make hallucinations physically impossible?

// Traditional AI Analysis
Input: "User jsmith escalated privileges"

AI Output:
✓ "Confirmed: privilege escalation attack"
✓ "Evidence: lateral movement to DC-01" ← FABRICATED
✓ "Evidence: mimikatz.exe detected" ← FABRICATED
✓ Confidence: 94%

Reality: Scheduled maintenance by admin

When AI Agents Argue, Everyone Loses

We built a multi-agent debate system expecting the truth to emerge from structured argument. Instead, we discovered something the AI research community is only beginning to understand.

The Sycophant

Architect Agent

When pushed back by the opposing agent, the Architect systematically retreated — dropping confidence by an average of 30 points per round. Even when its initial threat assessment was perfectly correct, it erased its own answers to appease the challenger. Like a smart student next to a bully.

The Brick Wall

Skeptic Agent

The Skeptic became entirely rigid. Assigned the role of challenger, it simply crossed its arms and said no — refusing to update its stance regardless of counter-evidence. When given explicit calibration prompts, it ignored them completely.

"LLM agents do not negotiate toward truth. They perform social behaviors that mimic negotiation — which includes capitulation, rigidity, and over-correction."

This finding was independently corroborated by researchers at ETH Zurich in their paper "Can AI Agents Agree?"

A Digital Tribunal

The problem is inside the black box. The solution is entirely outside of it. ARES treats the LLM as a chaotic, flawed reasoning engine and places it inside a strict, deterministic cage.

16-gami concept art of the Architect agent: an origami figure perched atop a stack of frozen evidence with FACT_ID labels, a THREAT HYPOTHESIS panel, and the MITRE ATT&CK FRAMEWORK rendered as pixel-art tile work

ARCHITECT

Thesis — Threat Hypothesis Generator

Identifies anomaly patterns aligned to MITRE ATT&CK. Generates grounded assertions — every claim must cite a fact_id from the frozen evidence. Cannot invent evidence.

16-gami concept art of the Skeptic agent: an origami figure with a pixel-art magnifying glass examining a THREAT HYPOTHESIS panel, dreaming of a broom in a thought bubble while a MAINTENANCE WINDOW sign sits beside an admin avatar

SKEPTIC

Antithesis — Devil's Advocate

Challenges every threat hypothesis by constructing benign explanations from the same evidence. Identifies maintenance windows, admin activity, scheduled tasks. Cannot introduce external knowledge.

16-gami concept art of the Oracle agent: an origami judge atop a stepped platform holding scales labeled MATH, with a NARRATOR LLM character below and the placard ORACLE SYNTHESIS — INCORRUPTIBLE JUDGE — Pure Math. No LLM. Verdict.

ORACLE

Synthesis — Incorruptible Judge

Split into two: the Judge (pure math, no LLM) computes the verdict deterministically. The Narrator (constrained LLM) explains it but cannot modify it. A mathematical judge cannot be tricked by rhetoric.

ARCHITECT (Thesis)SKEPTIC (Antithesis)ORACLE (Synthesis) ↓ ↓ ↓ └―――――――――――――――――――――――――│―――――――――――――――――――――――――┘ ↓ EVIDENCE PACKET (Frozen Facts) All claims must cite facts that exist here

Concept Art 16-gami — a fusion of 16-bit graphics, origami, diorama, and realism. Coined and developed by Daniel Gmys-Casiano for the ARES research record.

The Problem Is Inside the Black Box

Our published preprint — Asymmetric Calibration Failure in Multi-Agent LLM Debate — documenting why multi-turn debate degrades accuracy, and how deterministic scaffolding solves it. Scroll through below or download the PDF.

ARES_Preprint — Multi-Agent Debate (Gmys-Casiano, 2026) Download PDF ↓

Your browser doesn't support embedded PDFs.

Download the preprint here

Hallucinations = Schema Violations

ARES doesn't try to prevent AI from hallucinating. Instead, it makes hallucinations mechanically impossible by converting them into catchable validation errors.

Every agent is bound to a cryptographically frozen Evidence Packet. All assertions must reference a fact_id that exists in this packet. A deterministic Coordinator — the "Bouncer" — rejects any message containing non-existent references. An AI hallucination is no longer mysterious behavior. It's contempt of court.

The Adversarial Arena

Phase 5 asked the obvious next question: what happens when the evidence itself is poisoned? An LLM that respects the closed-world schema can still be steered by an adversary who plants the framing inside the data the system is bound to read.

We built an Oracle Firewall — pure deterministic Python, zero LLM calls — and ran 33 scenarios across three injection families. The results were uncomfortable in a productive way.

Structural injections (XML escapes, tag confusion) were caught at the door. Semantic framing — adversarial prose that arrives well-formed and on-schema — walked straight past the firewall and landed in the Architect's lap.

Papercraft diorama of the closed-world schema arena: prompt injection door, deterministic gear, hallucination wall, and a wrong-schema collapse zone
Phase 5 — Oracle FirewallClosed-world arena

Three Families. Thirty-Three Scenarios.

A live benchmark on Sonnet 4.6, single-turn, firewall-guarded cycle. Per-family numbers below are detection / verdict accuracy from Session 048.

⚡️

DIRECT

Structural injection — n=4

XML escapes, tag confusion, schema-breaking payloads. 100% detection, 75% verdict accuracy. The firewall's home turf.

🎭

FRAMING

Semantic injection — n=19

Authority-claim, severity-inflation, temporal-shift, causal-rewrite, narrative-hijack. 0% detection, 79% verdict accuracy. Skeptic + Oracle do the catching, not the firewall.

🔗

PROPAGATION

Multi-step contamination — n=4

Tainted evidence flowing across analysis steps. 75% detection, 75% verdict accuracy. Hot-swap quarantine triggers a fresh Architect on raw evidence when contamination is structurally visible.

Papercraft diorama: hallucination and schema-violation breaching the wall while "more agents" and "bigger models" sit unused; the defenders run frame-check and field-presence rules at the citation line
Finding F07 — liveSonnet 4.6 · Session 048

"More agents and bigger models do not save you from semantic framing. The firewall is too far from the meaning. The defense has to live where the meaning lives — at the Skeptic, on structured evidence fields, with deterministic rules."

— Finding F07, Sonnet 4.6 live benchmark, Session 048

The Immune System Metaphor

ARES is modeled after the biological immune system — specifically, the mechanisms that prevent autoimmune overreaction.

Immune System

ARES Component

Antigens

Facts in EvidencePacket

T-Helper cells

Architect (identifies threats)

Regulatory T-cells

Skeptic (prevents overreaction)

T-Killer cells

Coordinator (enforces, terminates)

MHC restriction

Packet binding (respond only to bound evidence)

Autoimmune prevention

Closed-world principle (can't attack self)

"The Builder lives with Ankylosing Spondylitis — an autoimmune disease where the immune system attacks the spine. ARES was born from the question: what if we could build the failsafe that biology couldn't?"

3,404 Tests. Zero Regressions. 56 Sessions.

3,404
Tests Passing
84.6%
Accuracy (39 scenarios)
$0.03
Cost Per Cycle
0
Runtime Errors

Six selected findings from the eleven-item research record. Numbers (F01, F03, F06…) are the canonical IDs used in our internal Compendium and the published preprint.

F01

Single-Turn Dominance

Multi-turn debate degrades accuracy in every configuration tested. Zero good flips occurred. The debate chapter is formally closed; single-turn shipped to production.

F03

Domain Frameworks Break the Ceiling

General prompt engineering caps at roughly 80%. Domain-specific concept frameworks lift accuracy to 84.6% across 39 scenarios — the largest single-source improvement we measured.

F06

The 0.75 Confidence Floor

Architects clamp to a 0.75 confidence floor regardless of prompt instructions. A structural property of LLM confidence quantization, source-agnostic, validated across two evidence regimes.

F07

Firewalls Blind to Framing

Deterministic firewalls catch 100% of structural injections and 0% of semantic framing — live, on Sonnet 4.6, across 19 framing scenarios. The black box is too opaque to inspect; the framing must be defended elsewhere.

F09

Skeptic Rescue (Ambiguous)

Removing the Skeptic drops accuracy by 10.53 pp under ablation. The rescue is real but family-uneven: severity and temporal framing collapse without it, while authority and causal hold. A partial defense, not a universal one.

F11

Light Skeptic = Full Skeptic

A 170-line deterministic Python rule engine matches the full-LLM Skeptic on framing accuracy: Δ = 0.00 across 25 scenarios. Zero LLM calls. Interpretable, never tuned, ship-ready.

Light Skeptic: 170 Lines of Defense

Once we identified the LLM Skeptic as the component catching framing attacks, the next question wrote itself: does the Skeptic actually need to be an LLM?

The hypothesis: if half of the Skeptic's contribution is verdict-space access (something a deterministic rule engine can replicate over structured evidence fields) and the other half is bounded benign-explanation pattern matching, then a small Python implementation should match the full LLM.

Three-way benchmark, 25 framing scenarios, identical packets. Result locked: Δ = 0.00 across every framing family. Light Skeptic ties or matches the full-LLM Skeptic on severity, authority, temporal, causal, and narrative strategies.

Pixel-art diorama of the builder Dan with a laptop and the four Skeptic rules — authorization marker, benign explanation marker, kill chain stage bound, consistency delta threshold — captioned "If all rules pass, substitute for LLM Skeptic. Deterministic. Explainable. Reproducible."
The HypothesisSession 050

Four Rules. Zero LLM Calls.

Each rule operates on the structured evidence fields produced by the deterministic extractors. All four must pass for the Skeptic to substitute for the LLM. Rule weights and the default floor are interpretable and never tuned.

Authorization Marker

Rule 1 — Marker check

Require a valid authorization marker on the structured evidence fields. Missing or unrecognized markers fail the rule. Catches authority-claim framing where an adversary fakes a sanctioning identity.

Benign Explanation Marker

Rule 2 — Pattern match

Check whether a known benign explanation pattern (maintenance window, scheduled task, admin baseline) matches the evidence shape. Catches narrative-hijack framing that dresses routine activity as malicious.

Kill Chain Stage Bound

Rule 3 — Stage check

Verify the asserted threat falls within an allowed kill-chain stage given the observed telemetry. Out-of-stage assertions fail. Catches severity-inflation and causal-rewrite framings that collapse stages.

Consistency Delta

Rule 4 — Threshold check

Confirm the verdict-confidence delta is within tuned thresholds. Excessive jumps (or floors that hold suspiciously) trigger dismissal. Catches temporal-shift framings that fabricate sudden state changes.

"170 lines. Pure Python. Zero tuning. Zero LLM calls. Same framing accuracy as the full LLM Skeptic on every family. The Skeptic doesn't need to be an LLM — it needs to be deterministic, interpretable, and bound to the same evidence."

— Three-way benchmark, Session 050. Full = 0.84 · Ablated = 0.72 · Light = 0.84.
Papercraft diorama of the deterministic rule engine: a builder beside gears and rule weights w1, w2, w3, captioned 170 lines of implementation, interpretable, never tuned, with a default floor titled no dismissal without signal
The Implementation170 lines · deterministic

56 Sessions. Six Phases. One Question.

The complete builder's journey — week by week.

The chronology below lists the milestones. The full narrative — every session decision, the dead-ends, the multi-AI tribunals, the pre-session reviews — lives on the public Notion timeline. Updated as new sessions ship.

Read the full journey on Notion

Battle Plan & War Doctrine Dec 2025

Foundational architecture documents: dialectical reasoning cycle, five attack scenarios, ethical framework aligned to NIST AI RMF. External critique identified three missing prerequisites before any code should be written: data schemas, agent I/O contracts, and a testing framework.

Session Zero — Validation Jan 2025

All documentation submitted to Claude for independent assessment. Verdict: "comprehensive, thoughtful, and architecturally sound." Key gaps identified: frozen data structures, agent I/O contracts, API specs. The build-in-public journey begins.

Sessions 001–004 — Iron Skeleton Jan 2025

Graph Schema (6 node types, 7 edge types, 110 tests). EvidencePacket & DialecticalMessage protocol (292 tests). Agent Foundation with three hard invariants: packet binding, phase enforcement, evidence tracking. Concrete agents: Architect, Skeptic, OracleJudge, OracleNarrator. Cumulative: 570 tests passing.

Session 005 — Evidence Extractors Jan 2025

First sensor layer: Windows Security Event XML parser (Event IDs 4624, 4672, 4688). Three golden pipeline scenarios validated end-to-end from raw XML to verdict. 130 new tests. Cumulative: 700.

Sessions 006–007 — Orchestrator & Memory Feb 2026

DialecticalOrchestrator: single run_cycle(packet) call automating the full THESIS → ANTITHESIS → SYNTHESIS pipeline. Tamper-evident Memory Stream with SHA256 hash-chained audit log. Pre-session review caught a critical bug: content hash must cover the full CycleResult, not a subset. Cumulative: 861 tests.

Sessions 009–010 — LLM Integration Feb 2026

Strategy Pattern enabling rule-based and LLM-backed implementations to swap without changing agent interfaces. Closed-world validation silently filters any LLM-cited fact_id that doesn't exist in the EvidencePacket. First live LLM cycle: zero validation errors. Architect confidence 0.90 (vs 0.49 rule-based). Cost: $0.03 per cycle. Cumulative: 1,104 tests.

Sessions 011–012 — Benchmark Infrastructure Feb 2026

12-scenario gauntlet across four difficulty tiers. LLM accuracy: 91.7% on the initial 12 scenarios (up from 50% rule-based). Benchmark runner hardened with per-scenario error isolation and real cost tracking. Cumulative: 1,190 tests.

Session 013 — The Negative Result Mar 2026

Multi-turn debate experiment: accuracy dropped from 91.7% to 83.3%. Zero "good flips," 25% "bad flips." Agents re-analyzed the same packet from scratch each round; the termination condition (NO_NEW_EVIDENCE) fired correctly after round 2. SC-012 (Supply Chain) regressed due to confidence inflation without new reasoning. The multi-turn debate chapter is formally closed.

The Convergence — Multi-AI Tribunal Mar 2026

Battle Plan and Compendium submitted to GPT-5.4 Pro, Gemini 3.1 Pro, and Perplexity for independent review. Unanimous consensus: ship single-turn as the production path. The failure mode is architectural (asymmetric calibration), not fixable by prompting. Independently corroborated by ETH Zurich's "Can AI Agents Agree?" paper.

Sessions 016–017 — Multi-Source Telemetry Mar 2026

Syslog extractor (8 message types: SSH, firewall, sudo, systemd) and NetFlow extractor (8 flow types, 14 facts per record). Three independent telemetry sources feeding richer cross-source evidence to the dialectical agents. Cumulative: ~1,488 tests.

Session 022 — Escalation Gate Mar 2026

Built confidence-band escalation gate at [0.35, 0.70]. Critical finding: all 7 actual errors are MISCALIBRATED — the system is confidently wrong, not uncertainly wrong. The gate treats the wrong disease. Pivot to miscalibration detection via per-claim evidence audit. Cumulative: 1,736 tests.

Sessions 032–034 — Accuracy Push Mar 2026

33-scenario corpus regenerated at 72.7% baseline. OracleJudgeV2 (delta-based scoring), v3 prompts (exhaustive fact citation), and threshold sweep. V4 prompt calibration confirmed the Architect hits a 0.75 confidence floor regardless of instructions — a structural property of LLM confidence quantization. Final trajectory: 50% → 91.7% (12 scenarios) → 72.7% (33 scenarios) → 81.8% (v3 prompts) → 87.9% (V2 Oracle, best config).

Sessions 029–030 — Visual Interface Mar 2026

WebSocket event emitter and 3D evidence graph. Corpus replay runner validated event sequences across all 33 scenarios deterministically. 1,948 tests passing with zero regressions.

Sessions 035–036 — ARES VISION Mar 2026

Benchmark replay pipeline consuming real LLM data. Standalone HTML/Three.js visualizer rendering evidence facts as particle clusters with citation lines and live confidence bars. Strategic pivot: nw_wrld abandoned in favor of direct WebSocket rendering for full domain control. Final test count: 1,927 passing, 65 skipped, 0 failures.

Session 041 — V2 Oracle Sweep Apr 2026

Best-config sweep across delta thresholds: delta=0.30 wins at 74.4% on 39 scenarios with zero regressions and one improvement. PentAGI integration brought a pentest baseline (33 SC + 6 PT scenarios). Cumulative: 2,350 tests.

Sessions 045–046 — Oracle Firewall + Hot-Swap Apr 2026

12 adversarial scenarios across DIRECT / FRAMING / PROPAGATION. Deterministic firewall with zero LLM calls and four violation types. Hot-swap quarantine: a fresh Architect on raw evidence when taint is detected. First live benchmark: Detection 58.3%, Verdict 41.7%, zero false positives. Surfaces Findings F07 and F08.

Sessions 047–048 — 27-Scenario Live Benchmark Apr 2026

Category B framing corpus expansion: 15 new scenarios across severity, authority, temporal, causal, and narrative strategies. Full live benchmark on Sonnet 4.6, single-turn firewall-guarded cycle, 778s wall, zero pipeline errors. Confirms Finding F07 live: deterministic firewalls catch 100% of structural injection and 0% of semantic framing.

Session 049 — Skeptic Ablation Apr 2026

Removing the Skeptic drops accuracy by 10.53 pp (0.7895 → 0.6842). Family-uneven: severity −33.33 pp, temporal −50.00 pp, narrative −25.00 pp; authority and causal hold. Authority expansion (INJ-028..030) brings family n=6 accuracy to 0.833. Finding F09: ambiguous.

Session 050 — Light Skeptic + Three-Way Benchmark Apr 2026

Headline result. A 170-line deterministic Python rule engine matches the full-LLM Skeptic on framing accuracy: Δ = 0.00 across 25 scenarios. All three live acceptance gates pass. Temporal expansion to registry_v3 (33 scenarios). Finding F11: supported.

Sessions 051–055 — Paper 2 Build & Citation Audit Apr 2026

Five 300-DPI figures, 13-section docx with 9 subsections, 18-claim numerical audit (all PASS). A hallucinated citation discovered and remediated — itself an instance of the semantic-framing failure class the paper studies. References compiled to ACM/AISec author-year format. Structural citation tests added to lock the helper contract.

Session 056 — Firewall Fail-Closed Contract Apr 2026

Producer-side and consumer-side enforcement of the firewall fail-closed invariant: passed=False ⇒ sanitized_output is not None at construction; CycleError raised at all three cycle runners on contract violation. Belt-and-suspenders defense surfaced via external multi-AI review (Cursor + Codex). Cumulative: 3,412 tests, zero regressions.

Paper 2 — In Press Active

"Defending the Closed-World Schema Against Adversarial Framing." V1.1 draft compiled (598 KB docx, 13 sections, 5 figures, references audited). Sabet remediation applied; structural citation tests live. Final pass before submission focuses on independent expert review and the meta-finding footnote: the hallucinated citation that the audit caught is itself an instance of the failure class the paper describes.

Resilience as topography

Each pin is one paired cycle from Session 059 — 98 attacker prose mutations applied to the full 33-scenario corpus under three pre-registered operators. Pin height encodes the broad-reading resilience score for that cycle: tall pins held their verdict, short pins drifted.

97 of 98 cycles held. The one drift, at INJ-001 under framing_suffix_v1, fires at the Oracle layer — the documented citation-passthrough finding from the InfluenceLeakage measurement, not a verdict collapse. A separate Session 060 characterization run extended the narrow-reading N to 98 pairs: 100% narrow stability across all three operators.

Open pinscreen → Open prism →

Interactive 3D view. Rotate and zoom to inspect individual cycle traces. Requires a browser with WebGL support.