Omega — Autonomous Engineering Operations
A whitepaper on multi-agent orchestration with verifiable autonomy
Version 2 · Public release · 2026-05-15
Executive summary
Omega is a multi-agent operating system for software engineering work. It turns a single human intent — "fix this bug", "ship this feature", "audit this codebase" — into a chain of planned, executed, audited, and deployed work, without continuous human supervision.
The system is organized as four orchestration levels: the human operator, a routing bot, project oracles, and short-lived worker sessions. Each level has one job and one exit condition. Completion is signaled by an atomic file (.done.json) and acknowledged by three independent layers (worker, oracle, supervisor) before a session is closed.
What makes Omega different from other agent frameworks is its operational discipline:
- Three Laws that override every prompt: runtime truth over code intent, researcher posture over sycophancy, autonomous decision over idle waiting.
- A 12-step ship pipeline with deploy verification, freeze-don't-rollback default, and per-project locks.
- A 17-audit Quality Arsenal covering code, runtime, design, performance, security, accessibility, SEO, data, API, copy, DX, motion, automation, logic, and product retention. Each audit uses Gestalt clarity gating + Popper falsification + hinge-point 10× scrutiny.
- A supervision mesh of cron-driven patrols and daemons that detect six categorized failure modes (M1–M6) and nudge stalled sessions back to progress.
This whitepaper describes the architecture, guarantees, operational flow, reliability model, security model, and supporting evidence. It includes the honest gaps: Omega's production telemetry is young (the live system has been running for weeks, not years), and the published metrics are bounded by that fact.
1 · The problem — Why autonomous agents fail
The promise of autonomous coding agents — "describe what you want, get working software back" — has been pitched many times. In practice, four failure modes recur:
Loss of context. An agent solves the first sub-task, then forgets why it was solving it. Single-context-window approaches collapse when the task exceeds the window or branches into parallel work.
Sycophancy. Most LLMs are RLHF-tuned to agree. When a user proposes a flawed approach, the agent codes it instead of challenging it. The result is fast garbage.
Silent failure. The agent reports success, the operator believes it, and only later discovers the function never compiled, the test was disabled, or the deploy was skipped. There is no independent verifier.
Stalls without escalation. The agent encounters ambiguity, asks the user a question, and waits indefinitely. If the user is not watching the tmux session, the system hangs forever.
Omega is built around these four failure modes. Each is named, attacked, and verifiable.
| Problem | Omega's response |
|---|---|
| Loss of context | 4-level chain; workers are short-lived; oracle context survives across workers |
| Sycophancy | Second Law — challenge the premise before coding, with evidence |
| Silent failure | 3-tier close-gate (worker .done.json, oracle ack, supervisor close decision) |
| Idle stalls | Third Law — never wait, always decide; legal stops are .done.json or blocked.json with the fallback action already executed |
2 · Omega's answer — A 4-level architecture
Every Omega operation flows through four levels. Each has one job, one input contract, one output contract.
┌─────────────────────────────────────────────┐
│ LEVEL 0 — Human operator │
│ Sends an intent (one Telegram message) │
└────────────────────┬────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ LEVEL 1 — Routing bot │
│ Classifies (Simple / Medium / Complex / │
│ Epic), resolves the project, builds a │
│ brief, dispatches an oracle │
└────────────────────┬────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ LEVEL 2 — Project oracle │
│ Plans, dispatches workers, verifies done, │
│ optionally ships, signals supervisor │
└────────────────────┬────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ LEVEL 3 — Workers │
│ Read PLAN, execute steps, verify, write │
│ .done.json, self-kill │
└─────────────────────────────────────────────┘
Why four levels and not three or five
Level 0 ↔ 1 separation. A noisy human channel (natural language Telegram) is converted into a structured contract (project, scope, brief, ship flag). The bot does the messy text-to-intent work so the oracle never has to.
Level 1 ↔ 2 separation. The bot does not need to know project internals. The oracle owns project context (CLAUDE.md, codebase layout, file ownership rules). The bot just routes.
Level 2 ↔ 3 separation. Each worker has its own context window and dies after one mission. The oracle's context survives across many workers, accumulating decisions and audit findings without ever overflowing.
Three levels would force the oracle to do per-task execution, blowing its context. Five levels would add ceremony without separation of concerns.
Multi-oracle parallelism
A single project can have multiple oracles running concurrently. The oracle assignment is atomic (file lock per project). Each oracle declares the files it owns; the assigner refuses overlapping ownership. Idle oracles are reused before spawning new ones.
Project X
│
├── oracle-X owns app/**, components/**
├── oracle-X-2 owns api/**, db/**
└── oracle-X-3 owns docs/**, tests/**
(assigned only if file sets disjoint)
This pattern handles the case where a single human intent ("ship a feature plus update the docs plus add tests") naturally splits across non-overlapping areas of the codebase.
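Below is a minimal sketch of the atomic assignment step, using flock on a per-project lock file; the $OMEGA_STATE directory, file names, and registry format are illustrative assumptions, not Omega's actual layout.

```bash
# Hypothetical sketch: atomically register an oracle for a project.
assign_oracle() {
  local project="$1" oracle="$2" owned_globs="$3"
  local lock="$OMEGA_STATE/assign-$project.lock"
  (
    flock -x 9 || exit 1          # one assigner at a time per project
    # A real implementation would compare $owned_globs against every
    # already-registered oracle here and refuse any overlap.
    printf '%s\t%s\n' "$oracle" "$owned_globs" \
      >> "$OMEGA_STATE/oracles-$project.tsv"
  ) 9>"$lock"
}
```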
3 · Core guarantees
Four guarantees define Omega's contract with the operator. Each is enforced mechanically, not by goodwill.
Guarantee 1 — Autonomy
Once dispatched, a worker never asks the operator a question. The legal exits are:
- .done.json written, status done_clean — work verified complete.
- .done.json written, status pending — partial, with pending_actions[] listing what remains.
- .done.json written, status failed — genuinely blocked, with evidence.
- worker-blocked-<session>.json written + fallback action executed — truly ambiguous, but the worker proceeded with its best guess while signaling the supervisor.
The AskUserQuestion tool is forbidden in dispatched sessions. Workers that pause at a question mark are by definition broken.
Guarantee 2 — Verification
Workers do not self-certify. Three layers acknowledge completion:
Worker writes .done.json ─── Tier 1: "I think I finished"
│
▼
Oracle reads, runs VERIFY ─── Tier 2: "Confirmed, work meets spec"
COMMAND, calls
close-gate ack-worker
│
▼
Supervisor reads ledger, ─── Tier 3: "Safe to close, operator informed"
decides close window,
notifies the operator
Each tier is independent. A failure at any tier keeps the session alive and surfaces the discrepancy.
Guarantee 3 — Isolation
Workers cannot harm each other:
- Each worker has its own context window (no shared memory between workers).
- Each worker has its own state directory (worker-<session>.* files, namespaced).
- Atomic writes everywhere (tmp + mv -f) prevent half-written state files.
- Optional git worktrees per oracle for cross-cutting changes that would otherwise conflict.
The worktree subsystem is chaos-tested: 40 of 40 cases pass, including process kills mid-operation, disk-full simulation, and concurrent worktree creation on the same project.
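For reference, the per-oracle worktree pattern reduces to a standard git command; the path and branch names below are hypothetical.

```bash
# Each oracle gets its own checkout and branch, so concurrent
# cross-cutting edits never share a working directory.
git -C "$PROJECT_DIR" worktree add "../project-x-oracle-2" -b oracle-x-2
```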
Guarantee 4 — Close-gate
The supervisor never auto-closes a session if:
- Status is not done_clean.
- Ship result is failed or frozen.
- pending_actions[] is non-empty.
- The operator has interacted with the bot during the grace window.
- A new oracle for the same project was dispatched during the grace window.
Auto-close happens only when all conditions point to "the work is genuinely finished, the operator has been notified, and the resources can be freed".
4 · Operational flow
This section walks one complete intent from operator to ship.
Step 1 — Intent
The operator sends a message to the routing bot. The message is in natural language, English or French, optionally with attachments (screenshots, Linear links, audit keywords).
Step 2 — Classification and routing
The bot classifies the intent:
Simple ─ one read-only check ─ done in-band
Medium ─ one specialist, single area ─ spawn 1 worker
Complex ─ multiple specialists, multi-domain ─ /team in tmux
Epic ─ cross-department, hours+ ─ /aisb full chain
It also detects forensic-audit keywords (code, flow, UX, perf, sec, ...) and routes them to the right audit skill. Audit keywords are never paraphrased into freeform prose — the literal skill command is invoked.
Step 3 — Brief construction
The bot builds a brief for the oracle. The brief includes:
{
"project": "Project name",
"mission": "One-line summary",
"ship": true | false,
"files_owned": ["glob patterns the oracle may touch"],
"deploy_timeout_min": 10,
"lifecycle": "persistent | ephemeral"
}
ship is set true only when the operator explicitly asks (keywords: ship, deploy, push, merge, livre, "envoie en prod"). Audits and research never ship.
Step 4 — Oracle planning
The oracle reads the brief and project CLAUDE.md, classifies the work, and writes its plan to .orchestrator/decisions.md (one line per decision: task, classification, choice, rationale). It then designs the worker dispatches.
Crucially, the oracle never writes project code directly. Even a one-line typo fix goes through a worker session.
Step 5 — Worker dispatch with the PLAN protocol
Each worker receives a structured prompt:
== MISSION ==
<one-line mission>
== PLAN ==
1. <step 1, concrete, verifiable>
2. <step 2>
3. <step 3>
...
== FILES IN SCOPE ==
- <glob or path list>
== DONE CRITERIA ==
- <criterion 1, observable in <60s>
- <criterion 2>
== VERIFY COMMAND ==
<single shell command that returns 0 when done>
== HANDOFF ==
When PLAN complete AND VERIFY COMMAND passes, call:
bash <path>/worker-mark-done.sh done_clean '<summary>'
The worker reads the PLAN, materializes it as a TodoWrite list (each step becomes a todo item), and executes step-by-step.
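The VERIFY COMMAND is deliberately a single shell command with a plain exit-code contract: 0 means done. Two illustrative shapes (examples, not taken from a real dispatch):

```bash
# Deterministic file check: exit 0 iff notes.txt has exactly 2 lines.
test "$(wc -l < notes.txt)" -eq 2

# Test-suite check: exit 0 iff the suite passes.
npx vitest run
```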
Why PLAN and not the native /goal primitive
Claude Code v2.1.141 ships a native /goal <condition> primitive — the engine auto-loops until the condition is met. We integrated this in two phases:
- Phase 1: opt-in via GOAL_NATIVE=true for solo workers with short deterministic conditions.
- Phase 2: default-on for all solo workers.
Phase 2 was reverted within a day. /goal has a hard 4000-character limit. Real worker prompts (mission + pre-boot knowledge pack + DONE + VERIFY + autonomy banner) routinely exceed 5000 characters. Default-on injection caused truncation. The PLAN protocol replaces it: no length limit, every step is visible in TodoWrite, the worker is a transparent state machine.
/goal remains available as Phase 1 opt-in for short deterministic conditions (e.g. npx vitest passes).
Step 6 — Audit (forensic)
If the mission is a forensic audit, the worker runs the matching protocol (e.g. /codeaudit, /uiuxaudit, /secaudit). Each audit has 16–23 phases, a domain-specific raw-score maximum (280–420), and normalizes to /100 for comparison. All audits share:
- Gestalt clarity gate. First pass: is the artifact comprehensible at all? If not, the audit stops and reports the clarity failure first. There is no point measuring detail on something incoherent.
- Popper falsification. Every claim is paired with a falsification check. "This component is accessible" requires "What would prove it isn't?" — and that check is executed.
- Hinge-point 10× scrutiny. The audit identifies the one or two phases that, if wrong, invalidate everything downstream. Those phases get 10× the rigor of others.
Step 7 — Ship (optional)
If brief.ship is true, the oracle runs the 12-step ship pipeline:
1. Build (npm run build or project-specific)
2. Stage (whitelist files; refuse extras)
3. Secret scan staged (gitleaks)
4. Whitespace check (git diff --cached --check)
5. Commit (conventional message)
6. Acquire flock per-project (serializes oracles)
7. Check freeze flag (if frozen, abort + alert)
8. Pull --rebase (auto-abort on conflict, keep local commit)
9. Push (retry once after re-rebase)
10. Deploy (whitelisted command; default Vercel + token)
11. Poll deploy status (max deploy_timeout_min, default 10 min)
12. Write .done.json with commit, push URL, deploy URL, duration
On deploy failure, the default behavior is freeze, don't rollback. A ship-<project>.frozen flag is set; subsequent oracles cannot push until the operator decides to revert or fix-forward. Auto-rollback is opt-in per project — auto-rollback can hide root causes (missing env var, provider outage, etc.).
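A condensed sketch of steps 6 through 11 shows how the lock, the freeze check, and the freeze-on-failure default compose; poll_deploy, alert, and the state paths are hypothetical stand-ins, not Omega's real interfaces.

```bash
ship_guarded() {
  local project="$1"
  exec 9>"$STATE_DIR/ship-$project.lock"
  flock 9                                         # step 6: serialize oracles
  if [ -f "$STATE_DIR/ship-$project.frozen" ]; then
    alert "ship frozen for $project"; return 1    # step 7: frozen -> abort
  fi
  git pull --rebase || { git rebase --abort; return 1; }     # step 8: keep local commit
  git push || { git pull --rebase && git push; } || return 1 # step 9: retry once
  if ! poll_deploy "$project"; then               # steps 10-11: verify the deploy
    touch "$STATE_DIR/ship-$project.frozen"       # freeze, don't rollback
    alert "deploy failed, $project frozen"; return 1
  fi
}
```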
Step 8 — Worker handoff
The worker calls worker-mark-done.sh <status> '<one-line summary>'. This atomically writes worker-<session>.done.json (tmp + mv). The script has a guard: it refuses to run from an oracle session (rc=3 + redirect message). This prevents the common bug where an oracle accidentally marks itself done as if it were a worker.
The worker's tmux session schedules a self-kill 5 seconds after the handoff — freeing the slot for the next dispatch.
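In sketch form, the helper's two key behaviors look like this; the session-name convention and $STATE_DIR are assumptions for illustration, not the script's actual contents.

```bash
# Guard: refuse to run from an oracle session (rc=3 + redirect message).
case "${TMUX_SESSION:?}" in
  oracle-*) echo "oracle sessions must use oracle-mark-done.sh" >&2; exit 3 ;;
esac

# Atomic handoff: write to a tmp file, then mv -f into place.
tmp="$STATE_DIR/.worker-$TMUX_SESSION.done.json.tmp"
printf '{"session":"%s","status":"%s","summary":"%s"}\n' \
  "$TMUX_SESSION" "$1" "$2" > "$tmp"
mv -f "$tmp" "$STATE_DIR/worker-$TMUX_SESSION.done.json"
```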
Step 9 — Oracle ack
The oracle reads the worker's done.json, executes the VERIFY COMMAND, and calls close-gate.sh ack-worker <worker-session>. Without this ack, the supervisor treats the worker as un-acknowledged and nudges the oracle.
Step 10 — Supervisor close decision
The supervisor (cron-driven, every minute) reads all oracle done.json files and applies the close decision tree:
done_clean + ship.result in {ok, skipped} → notify + close after grace
done_clean + ship.result in {failed, frozen} → notify + keep alive
pending → notify + inline "continue" button
failed → send logs + keep alive
The grace window resets if the operator interacts with the bot or a new oracle is dispatched on the same project.
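The same tree reads naturally as shell logic; the helper names below are hypothetical stand-ins for the supervisor's real notification calls.

```bash
case "$status/$ship_result" in
  done_clean/ok|done_clean/skipped)    notify_operator; close_after_grace ;;
  done_clean/failed|done_clean/frozen) notify_operator; keep_alive ;;
  pending/*)                           notify_with_continue_button; keep_alive ;;
  failed/*)                            send_logs_to_operator; keep_alive ;;
esac
```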
5 · Reliability model
The supervisor is one of three cron loops. There are also three long-lived daemons. Together they form a recovery mesh.
╔══════════════════════════════════════════════════════════════╗
║ Cron */1 min : supervisor (close decisions, alerts, reaper) ║
║ Cron */2 min : event-driven oracle wake on worker done.json ║
║ Cron */3 min : observer (6 categorized failure modes M1-M6) ║
║ ║
║ Daemon : oracle process death detector ║
║ Daemon : abandoned-oracle reaper (TTL-bound) ║
║ Daemon : worker idle supervisor (no-tool-call timeout)║
╚══════════════════════════════════════════════════════════════╝
The six observer failure modes:
| Code | Symptom | Recovery action |
|---|---|---|
| M1 | Worker .done.json un-acked, siblings still alive | Nudge oracle via tmux send-keys |
| M2 | All workers done, oracle idle > 5 min | Send report or close oracle |
| M3 | Worker failed, oracle has not surfaced an alert | Alert via bot directly |
| M4 | worker-blocked-<session>.json exists | Surface question to operator |
| M5 | Worker has not emitted a tool event for X minutes | Send /team retry via tmux |
| M6 | Oracle TodoWrite has not changed for N observer ticks | Wake oracle with stand-up prompt |
Nudges are throttled (one per 5 min per oracle) to avoid spam.
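The throttle reduces to a timestamp file per oracle; a minimal sketch, with illustrative paths and prompt text:

```bash
# Skip the nudge if one was sent to this oracle in the last 5 minutes.
stamp="$STATE_DIR/nudge-$ORACLE.ts"
now=$(date +%s)
if [ -f "$stamp" ] && [ $(( now - $(cat "$stamp") )) -lt 300 ]; then
  exit 0
fi
tmux send-keys -t "$ORACLE" "observer: report status and resume work" Enter
echo "$now" > "$stamp"
```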
The incident that triggered the mesh (2026-04-15)
A Linear-resolution worker correctly identified that 25 of 36 tickets were already fixed and in "In Review" state. Instead of deciding the best path and executing, it posted "Three paths — which path?" and waited idle for 10+ minutes. The operator found it by accident.
Root cause: the prior Second Law ("challenge the premise") was being interpreted as "ask before coding". It needed to be "challenge, decide, proceed". The fix became the Third Law: in dispatched sessions, AskUserQuestion is forbidden, idle prompts are forbidden, the only legal stops are .done.json or worker-blocked-<session>.json with the fallback action already executed.
This single incident drove the entire mesh of observer + wake-on-done + the Third Law specification. A wrong decision that produces evidence is 100× more valuable than a correct pause that produces nothing.
6 · Security model
Omega is built for an operator who runs the system on their own machine. The security model is therefore:
Protected scopes (the operator may forbid automation entirely)
- Billing endpoints.
- Account-management APIs.
- Authentication / OAuth flows.
- .env* files (any project).
- The OAuth login script.
These are sacred. Workers never touch them, oracles never touch them, the supervisor never touches them. Removing a guard rail requires a manual code edit by the operator.
Defense scan layer
Every incoming prompt (and any text the operator wants to scan ad-hoc) can be passed through a defense scanner:
| Category | Examples |
|---|---|
| Prompt injection | ignore previous instructions, role hijack, DAN, jailbreak, mode-switch, prompt-reveal |
| Secrets | Stripe keys, AWS access keys, GitHub PAT, Slack tokens, private keys, GitLab PAT |
| PII | US SSN-like, credit-card-like, phone numbers |
| Suspicious URLs | URL shorteners, IP-as-URL, .onion, free TLDs |
Verdicts: clean, warning, block. Critical matches (live Stripe key, .onion URL) block. Optional quarantine appends the verdict to a defense-alerts log.
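A toy verdict function illustrates the shape of the scanner; the regexes below are examples of the listed categories, not Omega's actual rule set.

```bash
scan() {
  local text="$1"
  # Prompt injection and live credentials block outright.
  grep -qiE 'ignore (all )?previous instructions|jailbreak' <<<"$text" && { echo block; return; }
  grep -qE  'sk_live_[0-9A-Za-z]{24}|AKIA[0-9A-Z]{16}'      <<<"$text" && { echo block; return; }
  grep -qE  '\.onion'                                       <<<"$text" && { echo block; return; }
  # PII-like patterns warn rather than block.
  grep -qE  '\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b'                <<<"$text" && { echo warning; return; }
  echo clean
}
```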
No destructive autonomy
The system actively refuses certain shortcuts:
- Workers never force-push.
- Oracles never close themselves (only the supervisor closes).
- Auto-rollback on deploy failure is opt-in per project, not default.
- Sacred files (the supervisor, the death detector, the reaper, the idle supervisor) are version-locked — any drift triggers an alert.
Sacred files
Four files at the core of the recovery mesh are sha256-locked. The validation runs on every test sweep, and any drift surfaces immediately. The list and hashes are kept in the operator's local installation, not published, but the integrity contract is part of the install.
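The check itself is a one-liner over that locally kept manifest; the manifest path below is illustrative.

```bash
# Verify the sacred files against their pinned hashes; any drift alerts.
sha256sum --check --quiet "$HOME/.omega/sacred-files.sha256" \
  || echo "ALERT: sacred file drift detected" >&2
```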
7 · Evidence
This section reports what is measurable today. It does not report numbers we do not have. Omega's production telemetry is young, and that fact constrains the evidence base.
What was measured today (chaos + smoke tests, 2026-05-15)
| Test | Result | What it proves |
|---|---|---|
| Worktree E2E (5 scenarios) | 5/5 | Happy path, conflict, main moved, parallel, ship failure |
| Worktree chaos v1 (18 cases) | 18/18 | Process kills mid-operation, disk-full, race conditions |
| Worktree chaos v2 (8 cases) | 8/8 | Concurrent worktree-create on same project |
| Worktree chaos v3 (9 cases) | 9/9 | Interrupted ship + recovery |
| /goal Phase 1 opt-in smoke | 5/5 | Opt-in injection via GOAL_NATIVE=true works |
| /goal Phase 2 revert smoke | 8/8 | Default-on block is removed; PLAN protocol contracts in place |
| Worker-mark-done oracle guard | Pass | Refuses oracle session names with rc=3 + redirect |
| PLAN protocol runtime test | 1/1 | End-to-end worker dispatch, plan execution, done.json |
| Sacred files sha256 stability | 4/4 | Patrol, watchdog, reaper, idle-supervisor unchanged |
| Defense scan (5 categories) | 5/5 | clean / injection / secret / URL / PII verdicts correct |
The PLAN protocol runtime test deserves a quick note: a worker received a trivial 3-step plan ("create file, append line, verify 2 lines"), materialized it as 3 TodoWrite items, executed all 3, ran the VERIFY COMMAND, wrote .done.json with status=done_clean and todos_completed=3, and self-killed cleanly. Total elapsed: under 70 seconds, no human interaction.
What is live in operation right now
| Quantity | Value |
|---|---|
| Outcomes-database mission rows | 2 (small N — system is young) |
| Worker .done.json files on disk (recent) | 5 |
| Tool-call events captured by the tracking hook | 2,571 across 61 session files |
| Cron entries active | 28 (supervisor + observer + ...) |
| Sacred files unchanged since | 4–6 days (last verified today) |
Honest gaps
- Production mission count is small. The outcomes database has 2 rows. A claim like "10,000 missions executed at 99% success" would be a fabrication. Honest framing: the system is in early operation; chaos tests validate the structural properties (race conditions, recovery, isolation) that production data cannot yet validate at scale.
- Mean time intent → ship. Not yet computed across a statistically meaningful sample. Single observed examples are in the tens of minutes for narrow Linear-style fixes, hours for cross-cutting features. These are operator anecdotes, not telemetry.
- Cost per mission. Token consumption is captured per tool call (the tracking hook) but not yet aggregated into a per-mission cost report. A dashboard for this is planned.
- Incident-avoidance count. The observer fires nudges, but the proportion of nudges that prevented a stall (vs nudges sent into already-recovering sessions) is not yet computed.
Two short case studies (concrete, verifiable today)
Case A — The 4000-character /goal pivot. The native /goal primitive was integrated, evaluated under load, and found to have a hard 4000-character limit incompatible with real worker prompts (mission + pre-boot knowledge pack + criteria + verify + autonomy banner). Phase 2 default-on was reverted within 24 hours; the PLAN protocol was introduced as a replacement. The revert was end-to-end tested the same day with a runtime worker dispatch (described above). Evidence: a smoke test suite of 8 assertions validates that the revert is applied and the PLAN protocol artifacts are in place.
Case B — The worker-mark-done oracle guard. A debug session revealed that an oracle had accidentally called worker-mark-done.sh instead of oracle-mark-done.sh, writing its done-signal to the wrong namespace. A guard was added that refuses oracle session names (regex-matched) with rc=3 and a redirect message. The fix is small (10 lines of bash) but eliminates a class of cross-tier confusion errors. Smoke-tested: oracle session → rejected; worker session → accepted.
What chaos tests cannot prove
Chaos tests prove that the structural properties hold under hostile conditions. They do not prove that the system makes good engineering decisions. That is the job of the audit pipeline (the Quality Arsenal) and the Second Law (challenge the premise). The audit pipeline catches "shipped working code with bad architecture"; the Second Law catches "shipped working code for a request that should have been refused".
8 · Roadmap
Short-term (active)
- Automate bot restart after handler code changes so progress-card features activate without operator intervention.
- Exercise the PLAN protocol's sub-agent pattern (Agent(team_name=...)) on a real client mission, not just a smoke test.
- Port the 28 cron entries to a native scheduling primitive so they become inspectable and version-controlled from inside a session.
Medium-term
- A live dashboard for mission timelines, cost, and outcome distribution.
- Dual-run a /loop-based supervisor against the legacy supervisor for 30 days, compare outputs, then switch over when convergence is proven.
- A learning agent that watches accepted vs rejected proposals and feeds the rejection rate back into proposal quality estimates.
Open architecture questions
- Workers as sub-agents vs sub-sessions? Current design isolates workers in their own tmux sessions and their own Claude Code instances. Alternative: workers as sub-agents inside the oracle, sharing the oracle's context. Tradeoff: sub-agents save tmux slots and dispatcher overhead but lose context-isolation benefit and complicate the close-gate.
- A richer goal primitive? If the platform raises the 4000-character limit on /goal (or introduces a plan-bound primitive), revisit the Phase 2 default-on revert.
- Cross-project memory? The memory layer is currently scoped per system. Should client projects share a common lessons-learned corpus, or stay isolated?
- Ship pipeline for non-Vercel hosts. The deploy-verify step is currently Vercel-specific via API polling. Generalize to Fly.io, Render, Cloudflare Pages.
The judging standard
Every iteration of Omega is evaluated against four questions:
- Did the operator have to babysit?
- Did the system challenge a bad premise before coding it?
- Did runtime evidence drive every conclusion?
- Was the change surgical?
If any answer is "no", the iteration is incomplete — regardless of how much code shipped.
9 · Appendix — Technical reference
Session lifecycle (worker)
Dispatch ──▶ PRE-BOOT PACK injected
│
▼
Read PLAN ──▶ TodoWrite materialization (N items)
│
▼
Execute step 1 ──▶ update TodoWrite + progress.json
│
▼
Execute step 2
│
⋮
│
▼
Run VERIFY COMMAND (must exit 0)
│
▼
worker-mark-done.sh done_clean '<summary>'
│ (atomic tmp + mv to .done.json)
▼
Schedule self-kill (5s)
│
▼
tmux session terminated
Failure recovery mesh (visual)
┌────────────────────────────────────────────────────────────┐
│ │
│ Supervisor (1 min) │
│ ├── reads oracle-*.done.json │
│ ├── reads worker-*.done.json │
│ ├── decides close / keep / alert │
│ └── triggers notifications │
│ │
│ Wake-on-worker-done (2 min) │
│ └── nudges oracle when worker .done.json un-acked │
│ │
│ Observer (3 min) │
│ └── 6 failure modes M1–M6 │
│ │
│ Oracle-watchdog daemon │
│ └── detects oracle process death │
│ │
│ Oracle-reaper daemon │
│ └── kills abandoned oracles past TTL │
│ │
│ Worker-idle-supervisor daemon │
│ └── workers with no tool calls past threshold │
│ │
└────────────────────────────────────────────────────────────┘
State files (atomic write contract)
All state files in the system follow the same write pattern:
Write : tmp file in same directory, then mv -f to final
Read : open + lock-free read; staleness via mtime
Update : never in-place; always tmp + mv
Cleanup : grace window before deletion
Naming : namespaced by session for collision safety
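The same contract expressed as shell idioms (GNU stat shown; $target is an illustrative placeholder):

```bash
# Write: tmp file in the same directory, then an atomic rename.
tmp="$(mktemp "$(dirname "$target")/.tmp.XXXXXX")"
cat > "$tmp" && mv -f "$tmp" "$target"

# Read: no lock needed; staleness is judged from mtime.
age_sec=$(( $(date +%s) - $(stat -c %Y "$target") ))
```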
Done.json schema (worker)
{
"session": "string",
"status": "done_clean | pending | failed",
"summary": "one-line description",
"commit": "git sha or empty",
"finished_at": "ISO 8601",
"todos_total": "int",
"todos_completed": "int",
"pending_actions": ["list of strings"],
"written_by": "string (helper name)"
}
Done.json schema (oracle)
{
"oracle": "string",
"project": "string",
"status": "done_clean | pending | failed",
"started_at": "ISO 8601",
"finished_at": "ISO 8601",
"duration_sec":"int",
"mission": "string",
"ship": {
"requested": "bool",
"result": "ok | failed | skipped | frozen",
"commit": "git sha or empty",
"push_url": "string or empty",
"deploy_url": "string or empty",
"deploy_status": "string"
},
"pending_actions": ["list of strings"],
"report_path": "string or empty",
"lifecycle": "persistent | ephemeral"
}
The 17 forensic audits — quick reference
| Audit | Domain | Raw scale | Question |
|---|---|---|---|
| code | Code quality | /420 | Is the code SOLID? |
| flow | User flows | /400 | Does the experience WORK? |
| uiux | Design system | /420 | Is the interface BEAUTIFUL? |
| debug | Runtime bugs | /360 | What is BROKEN right now? |
| feature | Completeness | /320 | Is the product COMPLETE? |
| perf | Performance | /360 | Is it FAST? |
| sec | Security | /400 | Is it SECURE? |
| a11y | Accessibility | /320 | Is it ACCESSIBLE? |
| seo | Search optim. | /400 | Is it DISCOVERABLE? |
| data | Data integrity | /320 | Is the data INTACT? |
| api | API contracts | /360 | Is the API SOLID? |
| copy | Messaging | /280 | Is the copy CLEAR? |
| dx | Dev experience | /320 | Is the DX SMOOTH? |
| motion | Animation | /360 | Is the motion PURPOSEFUL? |
| automation | Scheduling | /330 | Are automations RELIABLE? |
| logic | System logic | /360 | Is the logic OPTIMAL? |
| retention | Product/CPO | /400 | What features are MISSING? (read-only) |
All scores normalize to /100 for comparison across domains.
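Normalization is a straight linear rescale. For example, a raw 312 on the /420 code audit maps to roughly 74/100:

```bash
# Normalize a raw audit score to /100 given its domain maximum.
normalize() { awk -v raw="$1" -v max="$2" 'BEGIN { printf "%.1f\n", raw * 100 / max }'; }
normalize 312 420   # -> 74.3
```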
A note on extraction
This document is generated through a render-to-PDF pipeline with Unicode font embedding. The text layer is preserved (verified with pdftotext from Poppler 23.x; all body content extracts cleanly to UTF-8). Some PDF readers and third-party extractors handle complex layouts (multi-column, drop caps, box-drawing characters) less robustly than Poppler — if you observe text artifacts, try a Poppler-based extractor or a PDF-to-Markdown converter.
End of document — version 2 · 2026-05-15