v4.0 · ABI major 4

Cognition & Compute

Cenosis spends compute in two very different regimes. At runtime it is deliberately efficient — a whole living world runs affordably at scale. In R&D it is deliberately compute-hungry — frontier inference and GPU training are the engine that makes the runtime cheap. This page explains both, and exactly where the compute goes.

Two cost regimes

	Runtime	R&D flywheel
Goal	Run a believable world cheaply, in real time	Make that world smarter and cheaper over time
Frontier LLM use	Narrative tier only, budget-capped	Heavy — teacher, simulation driver, and judge
GPU use	Tiny local ONNX inference	Distillation & LoRA training of local models
Cost profile	Efficient at scale, by design	Scales with ambition — the more we spend, the wider the moat

Runtime cognition tiers

Each Full-Sim agent runs three nested clocks. Only the slowest one ever touches a cloud LLM.

Reactive: utility AI, every tick. Pure Rust. No model.
Tactical: goal selection, roughly every 5 s. A small local ONNX model (the small language model, or SLM). Runs locally on CPU or GPU, with no cloud calls.
Narrative: dialogue, reflection, planning. A cloud frontier LLM (Claude as our reference brain). This is the only tier that spends API credits at runtime, and every call it makes is subject to the caps described under Budget governance.
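
The nested clocks can be sketched as a per-tick dispatch. This is an illustration under stated assumptions, not the actual Cenosis scheduler: the 20 Hz tick rate, the 60-second Narrative period, and the names `Tier` and `tiers_due` are hypothetical. Only the roughly 5-second Tactical cadence comes from the text above, and a real Narrative tier is more likely to be event-driven than strictly periodic.

```rust
/// Cognition tiers, ordered from the fastest clock to the slowest.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Reactive,  // utility AI, no model
    Tactical,  // small local ONNX model
    Narrative, // cloud frontier LLM, budget-governed
}

// Assumed values, for illustration only.
pub const TICKS_PER_SECOND: u64 = 20;
pub const TACTICAL_PERIOD: u64 = 5 * TICKS_PER_SECOND; // roughly every 5 s
pub const NARRATIVE_PERIOD: u64 = 60 * TICKS_PER_SECOND; // hypothetical

/// Returns the tiers that are due for a Full-Sim agent on `tick`.
/// The clocks are nested: every Narrative tick is also a Tactical tick,
/// and every tick is a Reactive tick.
pub fn tiers_due(tick: u64) -> Vec<Tier> {
    let mut due = vec![Tier::Reactive];
    if tick % TACTICAL_PERIOD == 0 {
        due.push(Tier::Tactical);
    }
    if tick % NARRATIVE_PERIOD == 0 {
        due.push(Tier::Narrative);
    }
    due
}
```

With these assumed periods, `tiers_due(1)` yields only the Reactive tier, while `tiers_due(1200)` yields all three, so the cloud is consulted on a small fraction of ticks.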

Off-camera agents run the Statistical tier — deterministic schedule lookup, need decay, affordance use, social sampling — with zero LLM calls, ever. That is why a 100-NPC town ticks in under 5 ms.
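
A Statistical-tier update might look like the sketch below. The `OffCameraAgent` fields, the `SCHEDULE` table, and the tavern affordance are all hypothetical, chosen only to show that schedule lookup, need decay, and affordance use can be plain deterministic arithmetic. Social sampling is left out because it would need a seeded random source.

```rust
/// Off-camera agent state (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub struct OffCameraAgent {
    pub hunger: f32, // 0.0 = sated, 1.0 = starving
    pub location: &'static str,
}

/// Hypothetical daily schedule: (start_hour, location), sorted by start_hour.
pub const SCHEDULE: [(u32, &str); 3] = [(0, "home"), (8, "workshop"), (18, "tavern")];

/// Deterministic schedule lookup: the latest entry whose start hour has passed.
pub fn scheduled_location(hour: u32) -> &'static str {
    SCHEDULE
        .iter()
        .rev()
        .find(|entry| entry.0 <= hour % 24)
        .map(|entry| entry.1)
        .unwrap_or("home")
}

/// Advances one agent by one in-game hour with no model and no LLM call.
pub fn step(agent: &mut OffCameraAgent, hour: u32, hunger_per_hour: f32) {
    // Need decay: hunger rises at a fixed rate and saturates at 1.0.
    agent.hunger = (agent.hunger + hunger_per_hour).min(1.0);
    // Schedule lookup decides where the agent is.
    agent.location = scheduled_location(hour);
    // Affordance use: the tavern offers food, which resets hunger.
    if agent.location == "tavern" {
        agent.hunger = 0.0;
    }
}
```

Each update is a few comparisons and one addition, so the cost per agent stays small and identical across runs.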

Budget governance

npc-budget enforces per-agent, per-session, and per-hour caps with a priority queue, a prompt→response cache, and a circuit breaker. If the cloud is slow or unavailable, the world keeps running on the cheaper tiers: it degrades, but it never stalls.
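
The interplay of caps, cache, and circuit breaker can be sketched as follows. This is not the npc-budget API: `Governor`, `Decision`, and every method name are hypothetical, and the per-hour cap and priority queue are omitted to keep the example short.

```rust
use std::collections::HashMap;

/// Outcome of asking the governor for a Narrative-tier response.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Decision {
    Cached(String), // served from the prompt→response cache, no spend
    CallCloud,      // within budget and breaker closed
    Degrade,        // fall back to the cheaper tiers
}

pub struct Governor {
    per_agent_cap: u32,
    session_cap: u32,
    agent_calls: HashMap<u32, u32>,
    session_calls: u32,
    cache: HashMap<String, String>,
    consecutive_failures: u32,
    breaker_threshold: u32,
}

impl Governor {
    pub fn new(per_agent_cap: u32, session_cap: u32, breaker_threshold: u32) -> Self {
        Self {
            per_agent_cap,
            session_cap,
            agent_calls: HashMap::new(),
            session_calls: 0,
            cache: HashMap::new(),
            consecutive_failures: 0,
            breaker_threshold,
        }
    }

    pub fn decide(&mut self, agent: u32, prompt: &str) -> Decision {
        // 1. A cache hit costs nothing, so it is checked before any budget.
        if let Some(hit) = self.cache.get(prompt) {
            return Decision::Cached(hit.clone());
        }
        // 2. Open breaker: the cloud looks unhealthy, so do not wait on it.
        if self.consecutive_failures >= self.breaker_threshold {
            return Decision::Degrade;
        }
        // 3. Enforce the per-agent and per-session caps.
        let used = self.agent_calls.entry(agent).or_insert(0);
        if *used >= self.per_agent_cap || self.session_calls >= self.session_cap {
            return Decision::Degrade;
        }
        *used += 1;
        self.session_calls += 1;
        Decision::CallCloud
    }

    /// Stores the response and closes the breaker.
    pub fn record_success(&mut self, prompt: &str, response: &str) {
        self.cache.insert(prompt.to_string(), response.to_string());
        self.consecutive_failures = 0;
    }

    /// Counts a failed or timed-out cloud call towards opening the breaker.
    pub fn record_failure(&mut self) {
        self.consecutive_failures += 1;
    }
}
```

`Degrade` is an ordinary return value rather than an error, which mirrors the design goal: a denied or failed cloud call is a normal outcome that the caller handles by staying on the cheaper tiers.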

The cognition flywheel

The intelligence the runtime relies on is produced by a loop that runs outside the game, at much larger scale:

1. Frontier cognition. A frontier LLM powers the Narrative tier in production and serves as the teacher for everything downstream. This is permanent, recurring inference that grows with every shipped world.
2. Simulation at scale. We run thousands of worlds at full fidelity, with level-of-detail reduction switched off: every agent runs on a frontier model for thousands of in-game days. These runs generate believability ground truth and stress the social, economic, and narrative systems far past what live play reaches.
3. Judge & distill. An LLM-as-judge scores millions of interactions for social coherence and narrative quality. The strongest traces become training data, distilled and fine-tuned (LoRA) into the small local models behind the Tactical tier and the local adapters. The Reactive tier uses no model.
4. Cheaper, smarter runtime. The local tiers absorb the lessons. Runtime cost drops, fidelity rises, and more worlds ship, surfacing new edge cases that feed step 1 again.
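
The "strongest traces become training data" part of the loop reduces to a filter and a ranking. The sketch below assumes a judge score on a 0.0 to 1.0 scale; the `Trace` type, the threshold, and the function name are hypothetical, not part of the Cenosis pipeline.

```rust
/// One judged interaction trace (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Trace {
    pub id: u64,
    pub judge_score: f32, // assumed scale: 0.0 (incoherent) to 1.0 (believable)
}

/// Keeps the `keep` highest-scoring traces that also clear `min_score`.
pub fn select_for_distillation(mut traces: Vec<Trace>, min_score: f32, keep: usize) -> Vec<Trace> {
    // Drop anything below the quality bar (a NaN score also fails this test).
    traces.retain(|t| t.judge_score >= min_score);
    // Sort by descending score; ties break on id so selection is deterministic.
    traces.sort_by(|a, b| {
        b.judge_score
            .total_cmp(&a.judge_score)
            .then(a.id.cmp(&b.id))
    });
    traces.truncate(keep);
    traces
}
```

The tie-break on `id` matters more than it looks: without it, two runs over the same judged corpus could pick different training sets.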

Where frontier inference goes

Production Narrative tier: the permanent brain behind dialogue and reflection. This spend is recurring, and it scales with players and worlds, not with our headcount.
Teacher for distillation: generating large synthetic corpora (goal decompositions, social-reasoning traces, constrained wake-summary renderings, persona-conditioned dialogue) to train the cheap local tiers.
LLM-as-judge evaluation: scoring believability across millions of simulated interactions so we can validate every change to the cheaper tiers against a frontier-quality bar.

Where GPU compute goes

Distillation of frontier behaviour into small models that run locally and deterministically.
LoRA fine-tunes for specific seams — a constrained summary-rendering adapter that never invents events, persona-conditioned dialogue adapters, and a planning/goal-decomposition adapter for the Tactical tier.
Embedding & index experiments for the HNSW memory store.

Compute, honestly

Cheap runtime and heavy R&D are not a contradiction; they are the same strategy seen from two ends. Frontier inference and GPU training spent in R&D are converted into believability that the local tiers then deliver at runtime without further cloud calls. That compute is an investment rather than a recurring cost: it compounds into a lower marginal cost per living world.

Determinism & reproducible evaluation

Large-scale evaluation is only useful if it's reproducible. Cenosis pairs a write-ahead log with an LLM response cache: the same WAL prefix plus the same cache yields bit-identical agent fingerprints. Runs are deterministic across replay, divergence is localised to the exact agent and tick, and a repro bundle captures the full state in one file — so a believability regression found at scale can be replayed and fixed precisely.
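
The replay property can be illustrated with a small sketch. It is not the Cenosis WAL format or fingerprint function: `WalEntry`, `replay`, and `first_divergence` are hypothetical names, and FNV-1a stands in for whatever hash the engine actually uses, chosen here only because it is simple and gives the same result on every platform.

```rust
use std::collections::HashMap;

/// One write-ahead-log entry (hypothetical shape).
#[derive(Debug, Clone)]
pub struct WalEntry {
    pub tick: u64,
    pub agent: u32,
    pub prompt: String,
}

pub const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Folds `bytes` into a running 64-bit FNV-1a hash.
fn fnv1a(mut hash: u64, bytes: &[u8]) -> u64 {
    for b in bytes {
        hash ^= *b as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Replays a WAL prefix against a response cache. Returns one record per
/// entry: (tick, agent, that agent's running fingerprint after the entry).
pub fn replay(wal: &[WalEntry], cache: &HashMap<String, String>) -> Vec<(u64, u32, u64)> {
    let mut fingerprints: HashMap<u32, u64> = HashMap::new();
    let mut trail = Vec::with_capacity(wal.len());
    for entry in wal {
        // A cache miss would need a live LLM call, which is not replayable;
        // this sketch folds in an empty response instead.
        let response = cache.get(&entry.prompt).map(String::as_str).unwrap_or("");
        let fp = fingerprints.entry(entry.agent).or_insert(FNV_OFFSET);
        *fp = fnv1a(*fp, &entry.tick.to_le_bytes());
        *fp = fnv1a(*fp, response.as_bytes());
        trail.push((entry.tick, entry.agent, *fp));
    }
    trail
}

/// Finds the first (tick, agent) at which two replays disagree, if any.
pub fn first_divergence(a: &[(u64, u32, u64)], b: &[(u64, u32, u64)]) -> Option<(u64, u32)> {
    a.iter().zip(b).find(|(x, y)| x != y).map(|(x, _)| (x.0, x.1))
}
```

Replaying the same WAL prefix with the same cache produces identical trails, so `first_divergence` returns `None`. Changing a single cached response makes it return the tick and agent where that response was first consumed, which is the localisation the paragraph above describes.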