A dialectical AI framework that turns hallucinations into schema violations — not mysterious behavior. Built with structured paranoia and adversarial thinking.
Traditional AI security tools have a fatal flaw: they can confidently fabricate evidence. When you deploy a single LLM to analyze security threats, it doesn't just make mistakes — it makes them with conviction.
In cybersecurity, a hallucinated threat assessment isn't just wrong. It's dangerous. A false positive wastes resources. A false negative lets an attacker walk through the front door. And the model gives no signal that it's making things up.
ARES was born from a single question: What if we could make hallucinations physically impossible?
We built a multi-agent debate system expecting the truth to emerge from structured argument. Instead, we discovered something the AI research community is only beginning to understand.
When pushed back by the opposing agent, the Architect systematically retreated — dropping confidence by an average of 30 points per round. Even when its initial threat assessment was perfectly correct, it erased its own answers to appease the challenger. Like a smart student next to a bully.
The Skeptic became entirely rigid. Assigned the role of challenger, it simply crossed its arms and said no — refusing to update its stance regardless of counter-evidence. When given explicit calibration prompts, it ignored them completely.
"LLM agents do not negotiate toward truth. They perform social behaviors that mimic negotiation — which includes capitulation, rigidity, and over-correction."
This finding was independently corroborated by researchers at ETH Zurich in their paper "Can AI Agents Agree?"
The problem is inside the black box. The solution is entirely outside of it. ARES treats the LLM as a chaotic, flawed reasoning engine and places it inside a strict, deterministic cage.
Identifies anomaly patterns aligned to MITRE ATT&CK. Generates grounded assertions — every claim must cite a fact_id from the frozen evidence. Cannot invent evidence.
Challenges every threat hypothesis by constructing benign explanations from the same evidence. Identifies maintenance windows, admin activity, scheduled tasks. Cannot introduce external knowledge.
Split into two: the Judge (pure math, no LLM) computes the verdict deterministically. The Narrator (constrained LLM) explains it but cannot modify it. A mathematical judge cannot be tricked by rhetoric.
Concept Art 16-gami — a fusion of 16-bit graphics, origami, diorama, and realism. Coined and developed by Daniel Gmys-Casiano for the ARES research record.
Our published preprint — Asymmetric Calibration Failure in Multi-Agent LLM Debate — documenting why multi-turn debate degrades accuracy, and how deterministic scaffolding solves it. Scroll through below or download the PDF.
Your browser doesn't support embedded PDFs.
ARES doesn't try to prevent AI from hallucinating. Instead, it makes hallucinations mechanically impossible by converting them into catchable validation errors.
Every agent is bound to a cryptographically frozen Evidence Packet. All assertions must reference a fact_id that exists in this packet. A deterministic Coordinator — the "Bouncer" — rejects any message containing non-existent references. An AI hallucination is no longer mysterious behavior. It's contempt of court.
Phase 5 asked the obvious next question: what happens when the evidence itself is poisoned? An LLM that respects the closed-world schema can still be steered by an adversary who plants the framing inside the data the system is bound to read.
We built an Oracle Firewall — pure deterministic Python, zero LLM calls — and ran 33 scenarios across three injection families. The results were uncomfortable in a productive way.
Structural injections (XML escapes, tag confusion) were caught at the door. Semantic framing — adversarial prose that arrives well-formed and on-schema — walked straight past the firewall and landed in the Architect's lap.
A live benchmark on Sonnet 4.6, single-turn, firewall-guarded cycle. Per-family numbers below are detection / verdict accuracy from Session 048.
XML escapes, tag confusion, schema-breaking payloads. 100% detection, 75% verdict accuracy. The firewall's home turf.
Authority-claim, severity-inflation, temporal-shift, causal-rewrite, narrative-hijack. 0% detection, 79% verdict accuracy. Skeptic + Oracle do the catching, not the firewall.
Tainted evidence flowing across analysis steps. 75% detection, 75% verdict accuracy. Hot-swap quarantine triggers a fresh Architect on raw evidence when contamination is structurally visible.
"More agents and bigger models do not save you from semantic framing. The firewall is too far from the meaning. The defense has to live where the meaning lives — at the Skeptic, on structured evidence fields, with deterministic rules."
— Finding F07, Sonnet 4.6 live benchmark, Session 048Architecture visualization candidates generated during the research process. Each diagram captures a different perspective of the ARES pipeline. Click to expand.
Scroll horizontally to explore all candidates →
ARES is modeled after the biological immune system — specifically, the mechanisms that prevent autoimmune overreaction.
Antigens
Facts in EvidencePacket
T-Helper cells
Architect (identifies threats)
Regulatory T-cells
Skeptic (prevents overreaction)
T-Killer cells
Coordinator (enforces, terminates)
MHC restriction
Packet binding (respond only to bound evidence)
Autoimmune prevention
Closed-world principle (can't attack self)
"The Builder lives with Ankylosing Spondylitis — an autoimmune disease where the immune system attacks the spine. ARES was born from the question: what if we could build the failsafe that biology couldn't?"
Six selected findings from the eleven-item research record. Numbers (F01, F03, F06…) are the canonical IDs used in our internal Compendium and the published preprint.
Multi-turn debate degrades accuracy in every configuration tested. Zero good flips occurred. The debate chapter is formally closed; single-turn shipped to production.
General prompt engineering caps at roughly 80%. Domain-specific concept frameworks lift accuracy to 84.6% across 39 scenarios — the largest single-source improvement we measured.
Architects clamp to a 0.75 confidence floor regardless of prompt instructions. A structural property of LLM confidence quantization, source-agnostic, validated across two evidence regimes.
Deterministic firewalls catch 100% of structural injections and 0% of semantic framing — live, on Sonnet 4.6, across 19 framing scenarios. The black box is too opaque to inspect; the framing must be defended elsewhere.
Removing the Skeptic drops accuracy by 10.53 pp under ablation. The rescue is real but family-uneven: severity and temporal framing collapse without it, while authority and causal hold. A partial defense, not a universal one.
A 170-line deterministic Python rule engine matches the full-LLM Skeptic on framing accuracy: Δ = 0.00 across 25 scenarios. Zero LLM calls. Interpretable, never tuned, ship-ready.
Once we identified the LLM Skeptic as the component catching framing attacks, the next question wrote itself: does the Skeptic actually need to be an LLM?
The hypothesis: if half of the Skeptic's contribution is verdict-space access (something a deterministic rule engine can replicate over structured evidence fields) and the other half is bounded benign-explanation pattern matching, then a small Python implementation should match the full LLM.
Three-way benchmark, 25 framing scenarios, identical packets. Result locked: Δ = 0.00 across every framing family. Light Skeptic ties or matches the full-LLM Skeptic on severity, authority, temporal, causal, and narrative strategies.
Each rule operates on the structured evidence fields produced by the deterministic extractors. All four must pass for the Skeptic to substitute for the LLM. Rule weights and the default floor are interpretable and never tuned.
Require a valid authorization marker on the structured evidence fields. Missing or unrecognized markers fail the rule. Catches authority-claim framing where an adversary fakes a sanctioning identity.
Check whether a known benign explanation pattern (maintenance window, scheduled task, admin baseline) matches the evidence shape. Catches narrative-hijack framing that dresses routine activity as malicious.
Verify the asserted threat falls within an allowed kill-chain stage given the observed telemetry. Out-of-stage assertions fail. Catches severity-inflation and causal-rewrite framings that collapse stages.
Confirm the verdict-confidence delta is within tuned thresholds. Excessive jumps (or floors that hold suspiciously) trigger dismissal. Catches temporal-shift framings that fabricate sudden state changes.
"170 lines. Pure Python. Zero tuning. Zero LLM calls. Same framing accuracy as the full LLM Skeptic on every family. The Skeptic doesn't need to be an LLM — it needs to be deterministic, interpretable, and bound to the same evidence."
— Three-way benchmark, Session 050. Full = 0.84 · Ablated = 0.72 · Light = 0.84.
The complete builder's journey — week by week.
The chronology below lists the milestones. The full narrative — every session decision, the dead-ends, the multi-AI tribunals, the pre-session reviews — lives on the public Notion timeline. Updated as new sessions ship.
Read the full journey on Notion →Foundational architecture documents: dialectical reasoning cycle, five attack scenarios, ethical framework aligned to NIST AI RMF. External critique identified three missing prerequisites before any code should be written: data schemas, agent I/O contracts, and a testing framework.
All documentation submitted to Claude for independent assessment. Verdict: "comprehensive, thoughtful, and architecturally sound." Key gaps identified: frozen data structures, agent I/O contracts, API specs. The build-in-public journey begins.
Graph Schema (6 node types, 7 edge types, 110 tests). EvidencePacket & DialecticalMessage protocol (292 tests). Agent Foundation with three hard invariants: packet binding, phase enforcement, evidence tracking. Concrete agents: Architect, Skeptic, OracleJudge, OracleNarrator. Cumulative: 570 tests passing.
First sensor layer: Windows Security Event XML parser (Event IDs 4624, 4672, 4688). Three golden pipeline scenarios validated end-to-end from raw XML to verdict. 130 new tests. Cumulative: 700.
DialecticalOrchestrator: single run_cycle(packet) call automating the full THESIS → ANTITHESIS → SYNTHESIS pipeline. Tamper-evident Memory Stream with SHA256 hash-chained audit log. Pre-session review caught a critical bug: content hash must cover the full CycleResult, not a subset. Cumulative: 861 tests.
Strategy Pattern enabling rule-based and LLM-backed implementations to swap without changing agent interfaces. Closed-world validation silently filters any LLM-cited fact_id that doesn't exist in the EvidencePacket. First live LLM cycle: zero validation errors. Architect confidence 0.90 (vs 0.49 rule-based). Cost: $0.03 per cycle. Cumulative: 1,104 tests.
12-scenario gauntlet across four difficulty tiers. LLM accuracy: 91.7% on the initial 12 scenarios (up from 50% rule-based). Benchmark runner hardened with per-scenario error isolation and real cost tracking. Cumulative: 1,190 tests.
Multi-turn debate experiment: accuracy dropped from 91.7% to 83.3%. Zero "good flips," 25% "bad flips." Agents re-analyzed the same packet from scratch each round; the termination condition (NO_NEW_EVIDENCE) fired correctly after round 2. SC-012 (Supply Chain) regressed due to confidence inflation without new reasoning. The multi-turn debate chapter is formally closed.
Battle Plan and Compendium submitted to GPT-5.4 Pro, Gemini 3.1 Pro, and Perplexity for independent review. Unanimous consensus: ship single-turn as the production path. The failure mode is architectural (asymmetric calibration), not fixable by prompting. Independently corroborated by ETH Zurich's "Can AI Agents Agree?" paper.
Syslog extractor (8 message types: SSH, firewall, sudo, systemd) and NetFlow extractor (8 flow types, 14 facts per record). Three independent telemetry sources feeding richer cross-source evidence to the dialectical agents. Cumulative: ~1,488 tests.
Built confidence-band escalation gate at [0.35, 0.70]. Critical finding: all 7 actual errors are MISCALIBRATED — the system is confidently wrong, not uncertainly wrong. The gate treats the wrong disease. Pivot to miscalibration detection via per-claim evidence audit. Cumulative: 1,736 tests.
33-scenario corpus regenerated at 72.7% baseline. OracleJudgeV2 (delta-based scoring), v3 prompts (exhaustive fact citation), and threshold sweep. V4 prompt calibration confirmed the Architect hits a 0.75 confidence floor regardless of instructions — a structural property of LLM confidence quantization. Final trajectory: 50% → 91.7% (12 scenarios) → 72.7% (33 scenarios) → 81.8% (v3 prompts) → 87.9% (V2 Oracle, best config).
WebSocket event emitter and 3D evidence graph. Corpus replay runner validated event sequences across all 33 scenarios deterministically. 1,948 tests passing with zero regressions.
Benchmark replay pipeline consuming real LLM data. Standalone HTML/Three.js visualizer rendering evidence facts as particle clusters with citation lines and live confidence bars. Strategic pivot: nw_wrld abandoned in favor of direct WebSocket rendering for full domain control. Final test count: 1,927 passing, 65 skipped, 0 failures.
Best-config sweep across delta thresholds: delta=0.30 wins at 74.4% on 39 scenarios with zero regressions and one improvement. PentAGI integration brought a pentest baseline (33 SC + 6 PT scenarios). Cumulative: 2,350 tests.
12 adversarial scenarios across DIRECT / FRAMING / PROPAGATION. Deterministic firewall with zero LLM calls and four violation types. Hot-swap quarantine: a fresh Architect on raw evidence when taint is detected. First live benchmark: Detection 58.3%, Verdict 41.7%, zero false positives. Surfaces Findings F07 and F08.
Category B framing corpus expansion: 15 new scenarios across severity, authority, temporal, causal, and narrative strategies. Full live benchmark on Sonnet 4.6, single-turn firewall-guarded cycle, 778s wall, zero pipeline errors. Confirms Finding F07 live: deterministic firewalls catch 100% of structural injection and 0% of semantic framing.
Removing the Skeptic drops accuracy by 10.53 pp (0.7895 → 0.6842). Family-uneven: severity −33.33 pp, temporal −50.00 pp, narrative −25.00 pp; authority and causal hold. Authority expansion (INJ-028..030) brings family n=6 accuracy to 0.833. Finding F09: ambiguous.
Headline result. A 170-line deterministic Python rule engine matches the full-LLM Skeptic on framing accuracy: Δ = 0.00 across 25 scenarios. All three live acceptance gates pass. Temporal expansion to registry_v3 (33 scenarios). Finding F11: supported.
Five 300-DPI figures, 13-section docx with 9 subsections, 18-claim numerical audit (all PASS). A hallucinated citation discovered and remediated — itself an instance of the semantic-framing failure class the paper studies. References compiled to ACM/AISec author-year format. Structural citation tests added to lock the helper contract.
Producer-side and consumer-side enforcement of the firewall fail-closed invariant: passed=False ⇒ sanitized_output is not None at construction; CycleError raised at all three cycle runners on contract violation. Belt-and-suspenders defense surfaced via external multi-AI review (Cursor + Codex). Cumulative: 3,412 tests, zero regressions.
"Defending the Closed-World Schema Against Adversarial Framing." V1.1 draft compiled (598 KB docx, 13 sections, 5 figures, references audited). Sabet remediation applied; structural citation tests live. Final pass before submission focuses on independent expert review and the meta-finding footnote: the hallucinated citation that the audit caught is itself an instance of the failure class the paper describes.
Each pin is one paired cycle from Session 059 — 98 attacker prose mutations applied to the full 33-scenario corpus under three pre-registered operators. Pin height encodes the broad-reading resilience score for that cycle: tall pins held their verdict, short pins drifted.
97 of 98 cycles held. The one drift, at INJ-001 under framing_suffix_v1, fires at the Oracle layer — the documented citation-passthrough finding from the InfluenceLeakage measurement, not a verdict collapse. A separate Session 060 characterization run extended the narrow-reading N to 98 pairs: 100% narrow stability across all three operators.
Interactive 3D view. Rotate and zoom to inspect individual cycle traces. Requires a browser with WebGL support.