Multi-Agent Orchestration, Context Rot & Neurocognitive Parallels
The Thesis
In January 2026, Steve Yegge open-sourced Gas Town — a 189,000-line multi-agent orchestrator that coordinates seven specialised AI roles to write software autonomously. It's named after Mad Max: Fury Road. It has a component called "Boot the Dog." The README references a glossary maintained by Clay Shirky. Maggie Appleton wrote a visual analysis of its architecture. Tim O'Reilly called it "the future of coding agents." It is, by almost any measure, unhinged — and it may also be the most serious attempt yet to solve the central problem in agentic AI.
That problem isn't intelligence. The models are smart enough. The problem is that agents forget what they're doing.
Geoffrey Huntley, who independently built the "Ralph Wiggum" looping technique, calls this the "Dumb Zone" — the point where an agent has been running long enough that its own conversation history becomes noise. Andrej Karpathy reframed the entire discipline around this bottleneck: "Context engineering is the delicate art and science of filling the context window with just the right information for the next step." Not more information. The right information.
The cognitive neuroscience literature has a name for what happens when this fails: goal neglect.
Goal neglect was first formalised by John Duncan at the MRC Cognition and Brain Sciences Unit in Cambridge. His critical finding (Duncan et al., 2008): goal neglect is not driven by real-time processing demands. It's driven by the total complexity of the task instructions. Two groups of participants performed the exact same stimulus sequence. The only difference was that one group received more complex instructions during setup. That group showed significantly more goal neglect — even though the moment-to-moment task was identical.
As more information enters the task model and capacity fills, components compete for representation. Vulnerable components — typically those added later or used less frequently — get weakly represented and drop out of behaviour entirely.
The critical phenomenology: the person can describe the rule they violated. "I realise now that I was supposed to switch sides, but I didn't." The rule was in memory. It just wasn't in behaviour. This is not forgetting. It's not incapacity. It's a dissociation between knowledge and action under representational load.
Developmental work extended Duncan's paradigm to children and found the same pattern: goal neglect scales with the structural complexity of instructions, not the difficulty of individual trials. Fluid intelligence predicts susceptibility. The implication is that goal neglect isn't a pathology — it's a fundamental constraint on how minds build executable programs from verbal rules.
An LLM deep into a long conversation exhibits the same dissociation. The instructions from the system prompt are still in the context window. The model can recite them verbatim if prompted. It just stops following them. The constraint isn't storage — it's attention allocation. As the context fills, early instructions lose representational weight. Conditional rules, edge cases, constraints added later in the system prompt — these are the vulnerable components. They're still in the context. They're not in the behaviour. Larger windows — 200K, 1M, even 2M tokens — push the failure point further out. They don't eliminate it.
And every engineering pattern in Gas Town — the hooks, the standing orders, the fresh-context iterations, the hierarchical decomposition into single-step molecules — is an independently evolved response to the same cognitive constraint that Duncan identified in 1996 and that McVay & Kane connected to working memory capacity in 2009. The agentic AI infrastructure being built in 2025–2026 converges on solutions that neuroscience has been studying — if not fully resolving — for thirty years.
This essay will let you experience that convergence. You'll run the factory. You'll inject errors and watch them cascade. You'll see why nondeterministic paths can produce deterministic outcomes. And you'll take a goal neglect test yourself — and discover, in your own performance data, the same failure mode that breaks both human cognition and artificial agents.
But first — watch it happen.
See It Happen
Below is a simulated system prompt with eight instructions. Each has an attention weight — the probability that the model will actually follow it. As conversation tokens accumulate, instructions compete for attention. The vulnerable ones fade first: late-added rules, conditional overrides, edge cases that require monitoring. Watch what the model's actual output looks like as its own instructions rot away.
This isn't a metaphor. This is what attention-based architectures do under load. The instruction is still in the context window. The model can find it if asked. It just stops being part of the output generation process. Still in context. Not in behaviour.
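The decay dynamics of that simulation can be approximated in a few lines. This is a hedged sketch, not a measurement of any real model: the exponential form, the per-rule decay rates, and the example instructions are all assumptions chosen to illustrate the vulnerability ordering.

```python
import math

# Toy decay model: each instruction's probability of actually shaping the
# output decays exponentially with accumulated conversation tokens, at a
# rate set by its vulnerability class. All rates are illustrative.
INSTRUCTIONS = {
    "respond in English":                 0.02,  # core rule: slow decay
    "cite sources for factual claims":    0.08,
    "if user says 'formal', switch tone": 0.20,  # conditional: faster
    "flag and correct your own errors":   0.35,  # late-added monitor: fastest
}

def follow_probability(rate, tokens):
    """P(instruction still shapes output) after `tokens` of conversation."""
    return math.exp(-rate * tokens / 10_000)

for tokens in (0, 50_000, 150_000):
    print(tokens, {name: round(follow_probability(rate, tokens), 2)
                   for name, rate in INSTRUCTIONS.items()})
```

At zero tokens every rule is followed; deep into the conversation, the late-added monitoring rule is effectively gone while the core rule barely moves.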
The Factory Floor
Gas Town runs on one non-negotiable principle: GUPP — the Gas Town Universal Propulsion Principle. "If there is work on your hook, YOU MUST RUN IT." The statement is in all caps in the source code. Yegge is not subtle about this.
To understand GUPP, you need to understand hooks — and hooks require understanding the data layer they operate on. Every unit of work in Gas Town is a Bead: a JSONL record stored in a SQLite database, backed by git. Beads form a dependency DAG — a directed acyclic graph — where each bead knows what it depends on and what depends on it. The bd ready command runs a topological sort and returns only unblocked beads: work items whose dependencies are all satisfied. This is the killer feature. Instead of an agent figuring out what to do next by reading conversation history (context-expensive) or negotiating with other agents (token-expensive), it queries a Go binary that does graph reasoning outside the context window entirely. "Thin client, thick logic" — the graph intelligence lives in compiled code, not in the LLM's attention budget.
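The ready query itself is simple enough to sketch. The real `bd` is a compiled Go binary over SQLite; the field names here (`status`, `deps`) and the bead IDs are assumptions for illustration, not Gas Town's actual schema.

```python
# Illustrative ready-query over a bead dependency DAG: an open bead is
# dispatchable only when every bead it depends on is closed.
beads = {
    "gt-001": {"status": "closed", "deps": []},
    "gt-002": {"status": "open",   "deps": ["gt-001"]},            # unblocked
    "gt-003": {"status": "open",   "deps": ["gt-002"]},            # blocked
    "gt-004": {"status": "open",   "deps": ["gt-001", "gt-002"]},  # blocked
}

def ready(beads):
    """Open beads whose dependencies are all closed -- safe to dispatch."""
    return sorted(
        bid for bid, b in beads.items()
        if b["status"] == "open"
        and all(beads[d]["status"] == "closed" for d in b["deps"])
    )

print(ready(beads))  # -> ['gt-002']
```

Note what the agent never sees: the graph reasoning happens entirely outside the context window, and the agent only receives the short answer.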
The Deacon enforces GUPP as a daemon process. Its patrol loop follows exponential backoff: when no work is pending, it polls at increasing intervals. But any mutating event instantly resets the timer and wakes the system. Yegge calls the wake signal DYFJ: "Do Your F***ing Job." Boot the Dog watches the Deacon itself — the monitor's monitor.
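The patrol loop's shape can be sketched directly. The interval values and the event model below are illustrative, not Gas Town's actual tuning.

```python
# Sketch of the Deacon's patrol loop: exponential backoff while idle,
# with any mutating event (the DYFJ wake) instantly resetting the timer.
BASE_SECONDS, CAP_SECONDS = 1.0, 60.0

def patrol(events):
    """Yield the polling interval after each tick; True means a mutating
    event arrived and backoff resets immediately."""
    interval = BASE_SECONDS
    for event in events:
        if event:
            interval = BASE_SECONDS                     # DYFJ: wake, poll fast
        else:
            interval = min(interval * 2, CAP_SECONDS)   # quiet: back off
        yield interval

print(list(patrol([False, False, False, True, False])))
# -> [2.0, 4.0, 8.0, 1.0, 2.0]
```

The point of the pattern is the asymmetry: idleness is cheap to tolerate, but pending work is never allowed to wait out a long backoff interval.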
Crucially, every Gas Town agent runs in its own tmux session — a persistent terminal multiplexer that survives disconnections. The agents aren't ephemeral function calls. They're residents of the machine, each with a persistent address, a persistent mailbox, and a persistent identity. The system doesn't boot up and tear down. It lives.
The biological parallel is the reticular activating system (RAS) — the brainstem circuit responsible for cortical arousal. The RAS doesn't negotiate with the cortex about whether to wake up. It sends a non-negotiable arousal signal. GUPP is the RAS of Gas Town: an engineering pattern that doesn't need to be intelligent, only reliable.
Toggle GUPP off below and watch the factory stall.
The Citadel
Most multi-agent frameworks divide work by function: one agent writes code, another writes tests, another reviews. Anthropic's multi-agent guidance (February 2026) identifies this as a fundamental mistake: "Problem-centric decomposition is often counterproductive."
Gas Town divides by context boundary, not function. A Polecat assigned a molecule handles everything about that molecule because it already has the context. What looks like role specialisation is actually specialisation by concern: coordination, execution, integration, monitoring, liveness, meta-monitoring, and quality judgment.
The topology matters more than the roles. DeepMind's scaling study (Kim et al., December 2025) found that topology determines error propagation more than model quality does. In flat/peer-to-peer topologies, every agent broadcasts to every other agent. A single hallucinated API name becomes ground truth within two communication rounds. In hierarchical topologies — which Gas Town uses — the Mayor acts as a router and filter.
More counterintuitive: DeepMind found that heterogeneous model teams outperform homogeneous ones. A lower-capability orchestrator coordinating higher-capability sub-agents beat all-high-capability teams by +31% on BrowseComp-Plus. The cheap model coordinates; the expensive models execute.
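The topology contrast can be sketched with a toy contagion model. This is a hedged illustration, not DeepMind's experimental setup: the filter probability is an assumed stand-in for the Mayor's routing-and-filtering role, and the agent count and round count are arbitrary.

```python
import random

def simulate(topology, n_agents=8, rounds=3, filter_p=0.9, seed=0):
    """Fraction of agents holding an injected error after `rounds`.

    flat: every believer broadcasts directly to every other agent.
    hierarchical: messages pass through a router that drops the error
    with probability filter_p before it reaches the recipient."""
    rng = random.Random(seed)
    infected = {0}                       # agent 0 hallucinates an API name
    for _ in range(rounds):
        new = set()
        for src in infected:
            for dst in range(n_agents):
                if dst == src or dst in infected:
                    continue
                if topology == "flat":
                    new.add(dst)         # direct broadcast, no filter
                elif rng.random() > filter_p:
                    new.add(dst)         # router usually drops the error
        infected |= new
    return len(infected) / n_agents

print("flat:", simulate("flat"))         # everyone infected within one round
print("hier:", simulate("hierarchical"))
```

The qualitative result matches the finding above: in the flat topology a single hallucination saturates the team almost immediately, while the filtered hierarchy contains it.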
Each role below maps — with varying degrees of honesty — to specialisation in the human brain. Some mappings are strong. Some are poetic. A few are a stretch. The strength ratings are mine, not Yegge's.
Heresies
In December 2025, Google DeepMind published what may be the most sobering paper in the multi-agent literature. Kim et al. tested 180 configurations across 5 architectures and 3 LLM families. Their finding: unstructured multi-agent networks amplify errors up to 17.2× compared to single-agent baselines. Not 17% worse. Seventeen times worse.
Xie et al. (March 2026) published "From Spark to Fire," identifying three vulnerability classes: cascade amplification (minor inaccuracies solidify into system-level false consensus), topological sensitivity (the shape of the communication graph determines propagation speed), and consensus inertia (once false consensus forms, it's extremely resistant to correction).
Yegge calls the phenomenon "heresies" — incorrect beliefs that propagate through the codebase like a virus. The biology is apt: the error propagation pattern maps to excitotoxicity — neural damage that cascades via glutamate overactivation. The braking mechanism in biology is GABAergic inhibition. Gas Town's equivalents are the PR Sheriff, core principles in agent priming, and regular "garden tending" review sweeps.
Below, both topologies run simultaneously. Click any node to inject a single error. Watch how structure contains contagion — and how the lack of it lets a single spark become a fire.
Nondeterministic Idempotence
NDI — Nondeterministic Idempotence — is Gas Town's most counterintuitive principle. The path is nondeterministic. The outcome is idempotent. Run the same convoy spec twice and you'll get different Polecat assignments, different execution orderings, different intermediate artifacts — but converging final results.
This works because Gas Town specifies acceptance criteria, not execution plans. The nondeterminism comes from LLM stochasticity, parallel execution race conditions, and agent assignment randomness. The idempotence comes from the spec being precise enough to create a convergence basin.
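A minimal NDI run can be simulated. This is a toy: the "spec" is just a set of files that must exist, and the Polecat names and file names are illustrative, not Gas Town internals.

```python
import random

# Toy NDI: the spec is an acceptance condition, not an execution plan.
# Each run shuffles work order and agent assignment (nondeterministic
# path) yet lands in the same final artifact (idempotent outcome).
SPEC = {"lexer.go", "parser.go", "eval.go"}

def run_convoy(seed):
    rng = random.Random(seed)
    todo = sorted(SPEC)
    rng.shuffle(todo)                       # different order every run
    path = [(item, rng.choice(["polecat-1", "polecat-2", "polecat-3"]))
            for item in todo]               # who built what, in what order
    artifact = {item for item, _agent in path}
    return path, artifact

runs = [run_convoy(seed) for seed in range(8)]
distinct_paths = {tuple(p) for p, _ in runs}
distinct_outcomes = {frozenset(a) for _, a in runs}
print(len(distinct_paths), len(distinct_outcomes))  # many paths, one outcome
```

The convergence basin here is trivially tight because the spec is fully precise; loosen the spec (say, "some files that parse") and the outcomes diverge along with the paths.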
Geoffrey Huntley's Ralph Wiggum technique arrives at the same principle: launch Claude Code with a task → read structured artifacts → make progress → run backpressure (tests, linter) → if pass, commit → destroy context → new session reads artifacts and continues. Progress persists in the filesystem, not in conversation.
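The loop's control flow can be modelled in miniature. Everything here is a stand-in, not Huntley's actual tooling: a dict plays the filesystem, a counter plays the agent's progress, and a comparison plays the test suite.

```python
# Minimal simulation of the Ralph Wiggum loop: each iteration is a fresh
# "session" that sees only persisted artifacts, makes one unit of
# progress, and is committed only if the backpressure check passes.

def fresh_session(artifacts):
    """One agent run: reads artifacts, does one step, returns new artifacts.
    No conversation history crosses this boundary."""
    done = artifacts.get("steps_done", 0)
    return {**artifacts, "steps_done": done + 1}

def backpressure_ok(artifacts, total_steps):
    return artifacts["steps_done"] <= total_steps  # stand-in for tests/linter

def ralph_loop(total_steps=5, max_iterations=20):
    committed = {"steps_done": 0}                  # persistent store (git)
    for _ in range(max_iterations):
        candidate = fresh_session(committed)       # context destroyed after
        if backpressure_ok(candidate, total_steps):
            committed = candidate                  # commit
        if committed["steps_done"] >= total_steps:
            break
    return committed

print(ralph_loop())  # -> {'steps_done': 5}
```

The structural point survives the simplification: because state lives in `committed` rather than in any session, a crashed or confused iteration costs one loop turn, never the whole run.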
Dan Lorenc's multiclaude scales this to parallel execution. Multiple agents in independent git worktrees, CI as the ratchet. He calls it the Brownian Ratchet — chaos is fine if forward is the only direction.
The biological parallel: stochastic resonance (Benzi et al., 1981), the counterintuitive finding that adding moderate noise to a weak, sub-threshold signal can improve its detection. NDI makes the same bet: execution randomness, constrained by precise specifications, produces more robust outcomes than deterministic orchestration.
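Stochastic resonance is easy to demonstrate. The sketch below is a textbook-style toy with arbitrary parameters: a sine wave that never reaches a hard detection threshold on its own starts producing signal-correlated detections once moderate noise is added.

```python
import math, random

THRESHOLD = 1.0

def detection_score(noise_sd, seed=0, n=2000):
    """Detections during signal peaks minus detections during troughs:
    a crude measure of how much signal the threshold detector recovers."""
    rng = random.Random(seed)
    peak_hits = trough_hits = 0
    for i in range(n):
        signal = 0.8 * math.sin(2 * math.pi * i / 100)  # never reaches 1.0
        sample = signal + rng.gauss(0, noise_sd)
        if sample > THRESHOLD:
            if signal > 0:
                peak_hits += 1
            else:
                trough_hits += 1
    return peak_hits - trough_hits

print(detection_score(0.0))  # -> 0: the signal alone never fires
print(detection_score(0.4))  # moderate noise: peaks become detectable
```

With no noise the detector is silent; with moderate noise the threshold crossings cluster around the peaks, so the noise has made the signal visible.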
Below, run the same spec 8 times. When spec quality is high and noise is moderate, paths diverge wildly but converge at the output. When the spec is vague, no amount of noise reduction forces convergence.
So far: architecture. Hooks, roles, topologies, convergence basins — engineering responses to a constraint.
But the deepest parallel isn't structural. It's about failure. About what drops first, and why.
The Convergence
On the left: a human working memory holding task rules — Duncan's paradigm. On the right: an LLM context window holding system prompt instructions. Both follow the same degradation pattern. As you increase load, the same rules fail first: late-added conditionals, monitoring rules, anything requiring sustained vigilance for a secondary signal. The vulnerable components match. The failure curve matches. The substrate is different. The failure signature is the same.
Click Add Load and watch both systems degrade in lockstep. The six slots are colour-matched — each human working memory component maps to an LLM instruction. When one fails, so does its twin.
The six slots are matched by vulnerability class — a taxonomy from Duncan's original work, mapped to the context engineering literature:
Core rules (green) survive longest in both systems. They're early, frequently activated, and unconditional. In humans: "respond when you hear a tone." In LLMs: "respond in English." These are the last to fall.
Conditional rules (yellow) degrade earlier. They require monitoring for a trigger before activating. In humans: "if the shape is red, press left instead of right." In LLMs: "if the user says 'formal', switch to academic tone." The trigger-monitoring component competes for the same representational resources as the primary task.
Late-added monitoring rules (red) fail first in both systems. They were added after the task model was already constructed, they require sustained vigilance for a rare event, and their activation is contingent on detecting a signal while performing the primary task. In humans: Duncan's (2008) swap condition. In LLMs: "if your previous answer contained an error, acknowledge and correct it before continuing." These rules are still in the system — the person can recite them, the model can locate them in context — they're just not in the behaviour.
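The vulnerability ordering above can be expressed as a toy scoring function. The weights and rule parameters below are illustrative, not fitted to Duncan's data: strength falls with later addition and monitoring load, and rises with activation frequency.

```python
RULES = [
    # (name, added_order, activations_per_min, monitoring_load)
    ("respond when you hear a tone", 1, 20.0, 0.0),  # core
    ("if red, press left",           2,  4.0, 0.5),  # conditional
    ("after a swap signal, reverse", 3,  0.2, 1.0),  # late-added monitor
]

def strength(added_order, freq, monitoring):
    """Toy representational strength; weights are illustrative."""
    return freq / added_order - 5.0 * monitoring

ranked = sorted(RULES, key=lambda r: strength(*r[1:]))
print([name for name, *_ in ranked])  # most vulnerable rule first
```

Under any weighting of this general shape, the late-added monitoring rule scores lowest and drops out first, the core rule scores highest and survives longest, which is exactly the ordering both substrates show.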
This is goal neglect. Not forgetting. Not incapacity. A dissociation between knowledge and action under representational load. The same vulnerability ordering. The same degradation curve. Different substrate, strikingly similar failure mode.
The Mirror
In 1996, John Duncan observed something strange in patients with frontal lobe damage. Given a set of task rules, they could repeat the rules back perfectly — but when performing the task, they'd completely omit one or more rules from their behaviour. He called this goal neglect.
The breakthrough came in Duncan et al. (2008): goal neglect isn't unique to brain-damaged patients, and it's driven by the total complexity of the task instructions. Two groups of normal, healthy adults performed the identical stimulus sequence; the only difference was instruction complexity during setup. The group given the more complex instructions showed significantly more goal neglect.
Below is a simplified version of Duncan's paradigm, translated into the language of Gas Town. You are the Mayor. Work items arrive. Route them according to the dispatch protocol. You have 3 seconds per item. The rules are visible the entire time. See what happens to your accuracy as the rules accumulate and a mid-task signal demands you restructure your routing.
The Reckoning
Gas Town is impressive engineering. It is also 189,000 lines of Go built by one person, operating at a level of abstraction that requires a 7-level work decomposition hierarchy, a 23-page Emergency User Manual, and a glossary maintained by Clay Shirky. The honest question is whether the complexity is necessary on the timescale models are improving.
The emerging consensus from production teams in 2026 is that the competitive moat in agentic AI is the harness, not the model. The Manus team rewrote their harness five times in six months, using the same underlying models each time. Their conclusion: the most important production metric is cache hit rate.
Dan Lorenc's multiclaude arrived at many of the same outcomes with a fraction of the machinery. No Mayor. No MEOW stack. Just agents in independent git worktrees, CI as the ratchet, and a test suite that only lets good work through.
Here's what I think holds, what's shaky, and what's broken:
GUPP + Deacon as a liveness guarantee. Making action work-driven rather than schedule-driven is a genuinely good pattern regardless of model capability. The biological parallel (reticular activating system) holds.
NDI as a reliability principle. Specifying what, not how. Acceptance criteria over execution plans. This survives model improvements because it's about specification clarity, not model weakness.
Seven specialised roles. The role taxonomy maps neatly to brain regions, but the brain doesn't have a "Mayor" — prefrontal cortex is more distributed than that. Lorenc's roleless approach may be closer to how biological coordination actually works.
The MEOW stack. Seven levels of decomposition hierarchy. The naming is playful but the cognitive overhead for new users is real.
Single-maintainer risk. 189,000 lines. One primary developer. The system works because Yegge understands every component. This is the opposite of the distributed resilience that Gas Town preaches for its agents.
The Thesis
Goal neglect predicts that agents will fail more as instruction complexity increases — not from lack of capability, but from task model collapse. The vulnerable components fail first: late-added rules, conditional overrides, edge cases. The solution, in both cognitive science and engineering, is to reduce the concurrent complexity of the task model.
Every tool in the agentic stack solves this independently. Progressive disclosure is how the brain manages task complexity. Fresh context per iteration rebuilds the task model from scratch, like sleep consolidation. External memory offloads the task model to persistent storage. Hooks provide programmable inhibitory control. Standing orders convert complex instructions into automatic procedures. Molecules decompose complex goals into single-step task models.
The agentic AI infrastructure being built in 2025–2026 converges on solutions to a cognitive constraint that Duncan identified in 1996 and that developmental and individual-differences work has been refining ever since. The constraint is real. The substrate is different. The failure signature is the same. The solutions converge.
Duncan, J. et al. (2008). Goal neglect and Spearman's g. Journal of Experimental Psychology: General, 137(1), 131–148.
Roberts, G. & Anderson, M. (2014). Task structure complexity and goal neglect in typically developing children. Journal of Experimental Child Psychology, 120, 59–72.
Kim, Y. et al. (2025). Towards a Science of Scaling Agent Systems. Google DeepMind.
Xie, Y. et al. (2026). From Spark to Fire: Error Cascades in Multi-Agent Systems. arXiv:2603.04474.
McVay, J. C. & Kane, M. J. (2009). Conducting the train of thought. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35(1), 196–204.