Determinism Is a Feature: How Same-Seed → Byte-Identical Output Falls Out of the Design

Posts 8 and 9 reported std_reward = 0.0 across 32 deterministic evaluation episodes. Post 11 asserted delta == 0 across two Quoter bit-exact checks. Post 12 cited a 1.299e19% delta as not a regression because the test was unchanged. Every one of those claims is anchored in the same property: same seed + same scenario produces byte-identical output.

This post explains how that property falls out of the engine design rather than being grafted on, why the determinism release gate exists, and the half-step (running the scheduler in halt mode) that turns determinism from a checkbox into an actually-useful regression guard.

The two rules

Determinism in a forked-EVM simulator is not free. The temptation surfaces every few weeks: an agent wants to draw a random urgency parameter mid-episode, or a hook wants to sample one of N candidate actions, or a developer reaches for random.random() because it's right there. Each is a one-line concession that breaks determinism in a way that only surfaces three weeks later in a tests/test_determinism.py failure on a specific seed.

Mayavi's engine prevents this by holding two rules:

1. WorldState.rng is the only RNG agents touch. No random.random(), no numpy.random without explicit seeding, no per-agent Random() instances. Every randomness consumer reads from state.rng, which is itself seeded from scenario.seed at run start.
2. No agent's step() consumes randomness today. Sampling happens during build_* (e.g., urgency_beta drawn from a uniform range, recipient addresses derived via sha256((scenario_name, cohort, idx))). The agent population's parameters are pinned at build time; step() is fully deterministic given those parameters and the fork state.

Rule 1 is a hard discipline maintained by code review. Rule 2 is a slightly softer convention with an explicit forward-looking policy: if a future agent needs randomness inside step(), it must read from state.rng (otherwise different module-load order will produce different draws and the determinism gate will silently break).
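A minimal sketch of how the two rules fit together, not the engine's actual code. The WorldState shape, the build_cohort helper, the tuple encoding fed to sha256, and the uniform range are all assumptions made for illustration; only the ideas (one seeded RNG, build-time sampling, hash-derived addresses) come from the rules above.

```python
import hashlib
import random
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical stand-in: holds the single RNG that builders read from."""
    seed: int
    rng: random.Random = field(init=False)

    def __post_init__(self) -> None:
        # Seeded once from the scenario seed at run start.
        self.rng = random.Random(self.seed)


def derive_recipient(scenario_name: str, cohort: str, idx: int) -> str:
    """Derive a stable 20-byte hex address from a hash; no RNG is involved."""
    # The "|"-joined encoding is an assumption made for this sketch.
    material = f"{scenario_name}|{cohort}|{idx}".encode("utf-8")
    return "0x" + hashlib.sha256(material).hexdigest()[:40]


def build_cohort(state: WorldState, scenario_name: str, cohort: str, n: int) -> list:
    """Pin every sampled parameter at build time so step() never has to draw."""
    return [
        {
            "recipient": derive_recipient(scenario_name, cohort, i),
            # Illustrative range; drawn from state.rng, never the global module.
            "urgency_beta": state.rng.uniform(0.1, 0.9),
        }
        for i in range(n)
    ]


# Two independent builds with the same seed yield identical populations.
agents_a = build_cohort(WorldState(seed=42), "vesting_cliff", "recipients", 3)
agents_b = build_cohort(WorldState(seed=42), "vesting_cliff", "recipients", 3)
agents_c = build_cohort(WorldState(seed=43), "vesting_cliff", "recipients", 3)
```

Because the addresses come from a hash rather than the RNG, changing the seed changes the sampled parameters but leaves the recipient set untouched.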

The forked-EVM side is automatically deterministic: at a pinned block, Fork.call and Fork.send produce the same return values every time (contract execution is a pure function of chain state + calldata). The two rules above are what extend that determinism to the Python-side agent layer.

The determinism test

tests/test_determinism.py:

# Simplified
def test_byte_identical_outputs():
    run_a = run_scenario("vesting_cliff.yaml", seed=42)
    run_b = run_scenario("vesting_cliff.yaml", seed=42)
    assert digest(run_a) == digest(run_b)

digest is a stable hash over the per-step actions log + the MarketSnapshotHook's per-step state rows + the run's terminal KPIs. Two runs that agree on this digest are byte-identical at the run-output layer.
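One way such a digest could be built is sketched below. The field names ("actions", "snapshots", "kpis"), the use of canonical JSON, and the choice of SHA-256 are assumptions for illustration; the real digest may serialize differently.

```python
import hashlib
import json


def digest(run: dict) -> str:
    """Hash the three output layers through one canonical JSON encoding."""
    payload = {
        "actions": run["actions"],      # per-step actions log
        "snapshots": run["snapshots"],  # per-step state rows
        "kpis": run["kpis"],            # terminal KPIs
    }
    # sort_keys and fixed separators keep the bytes independent of dict
    # insertion order; allow_nan=False rejects NaN/inf instead of emitting
    # non-standard JSON.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), allow_nan=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up sample runs: run_b has the same content as run_a with keys in a
# different order, run_c differs in one KPI value.
run_a = {
    "actions": [{"step": 0, "agent": "a0", "success": True}],
    "snapshots": [{"step": 0, "price": 1.25}],
    "kpis": {"max_drawdown": 0.1},
}
run_b = {
    "kpis": {"max_drawdown": 0.1},
    "snapshots": [{"price": 1.25, "step": 0}],
    "actions": [{"success": True, "agent": "a0", "step": 0}],
}
run_c = {**run_a, "kpis": {"max_drawdown": 0.2}}
```

The canonical encoding matters as much as the hash: without sorted keys, two runs with identical content could hash differently purely because of dict construction order.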

The subtle part — described in Post 2 — is that this test constructs its Scheduler with halt_on_exception=True. The default Scheduler mode swallows agent exceptions so one misbehaving agent doesn't crash a 50-borrower cascade. In production that's the right default. In a validation gate, it's catastrophic: two byte-identical streams of success=False records would compare equal and let a silent regression pass.

Halt mode records the failure first (so post-mortem inspection of state.actions still works) and then re-raises, turning a silent break into a hard test failure. The rule for any test whose assertion is "the run behaved correctly," not just "the run was deterministic," is: use halt mode.
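The difference between the two modes can be sketched with a toy scheduler. This Scheduler class and its run_step method are hypothetical stand-ins, not the engine's implementation; only the halt_on_exception flag and the record-then-re-raise behavior are taken from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scheduler:
    """Toy scheduler illustrating swallow mode versus halt mode."""
    halt_on_exception: bool = False
    actions: list = field(default_factory=list)

    def run_step(self, agent_name: str, step: Callable[[], str]) -> None:
        try:
            result = step()
        except Exception as exc:
            # Record the failure first so post-mortem inspection still works.
            self.actions.append(
                {"agent": agent_name, "success": False, "error": repr(exc)}
            )
            if self.halt_on_exception:
                raise  # validation mode: turn the failure into a hard error
            return     # default mode: swallow and keep the run alive
        self.actions.append({"agent": agent_name, "success": True, "result": result})


def broken_step() -> str:
    raise RuntimeError("agent misbehaved")


def run_once(halt: bool) -> list:
    sched = Scheduler(halt_on_exception=halt)
    sched.run_step("borrower_0", broken_step)
    return sched.actions


# In default mode, two broken runs produce identical logs, so a
# determinism-only comparison would pass despite the failure.
silent_a = run_once(halt=False)
silent_b = run_once(halt=False)
```

With halt=True the same call raises RuntimeError, which is what makes a determinism test fail loudly instead of comparing two equally broken logs.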

What this unlocks

Determinism is not an aesthetic property. It's the precondition for four things the engine relies on:

1. Reproducible bug reports

A user files an issue with seed=42 and scenario=foo.yaml. We run the exact same command on a different machine, see the exact same behavior, and can bisect a regression by git checkout-ing past commits and re-running. Without determinism, "I can't reproduce" becomes the default reply and bug reports rot.

2. Honest RL evaluation

The 32-deterministic-episode eval in Post 8 is only meaningful because re-running the eval produces the same numbers. If n_eval_episodes=32 could randomly land on a "lucky" subset of scenarios, the std_reward = 0.0 claim would be sleight of hand. Determinism is what makes the saturation finding survive scrutiny.

3. Regression CI without flakiness

A flaky test is worse than no test: it teaches the team to ignore failures. A deterministic test suite means a red CI run is always a real regression. Mayavi's nightly fork tests run against pinned blocks (no "latest", per Post 11) and a deterministic engine, so a delta == 0 assertion stays at delta == 0 until a change intentionally breaks it.

4. Channel-agnostic bundle re-rendering

The 21 bundles from Post 13 all have a seed field in their eval.json. Re-running scripts/generate_all_artifacts.sh on a different host produces byte-identical bundle output modulo the per-bundle generation timestamp. That means: a contributor pulling the repo, generating the matrix locally, and committing the result would produce a no-op diff. The artifact directory's integrity is locally verifiable.
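A local check of "byte-identical modulo the timestamp" might look like the sketch below. The generated_at key name and the sample payloads are made up for illustration; the actual eval.json schema is not shown in this series beyond the seed field.

```python
import json


def normalized(bundle_json: str, volatile_keys=("generated_at",)) -> str:
    """Drop volatile top-level fields, then re-serialize canonically."""
    data = json.loads(bundle_json)
    for key in volatile_keys:
        data.pop(key, None)  # hypothetical timestamp field name
    return json.dumps(data, sort_keys=True)


# Made-up eval.json contents from two hosts: same seed and KPIs,
# different generation timestamps.
host_a = '{"seed": 42, "kpis": {"x": 1.0}, "generated_at": "2024-01-01T00:00:00Z"}'
host_b = '{"generated_at": "2024-01-02T09:30:00Z", "kpis": {"x": 1.0}, "seed": 42}'
host_c = '{"seed": 42, "kpis": {"x": 1.5}, "generated_at": "2024-01-01T00:00:00Z"}'

same_bundle = normalized(host_a) == normalized(host_b)
changed_bundle = normalized(host_a) != normalized(host_c)
```

Stripping only a named set of volatile fields keeps the comparison strict: any other difference still shows up as a mismatch.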

The thing that almost broke determinism

Mid-Phase-5, one of the chain-abstraction PRs accidentally introduced a non-determinism source: the per-chain seconds_per_block field was read at first-call time rather than at make_env time, and process-level ordering differences in Python (import order, hash randomization controlled by PYTHONHASHSEED, etc.) caused the registry's iteration order to differ across processes. Same seed, same scenario, different outputs.

The determinism gate caught it. The fix was the per-chain seconds_per_block field on the Chain dataclass plus caching the value at env-construction time (covered in Post 4). The gate fired because the gym envs are exercised in tests/test_determinism.py against a stub fork: same scenario, same seed, two consecutive runs, different advance_seconds per step.
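The shape of that fix can be sketched as follows. Everything here except the seconds_per_block field name and the Chain dataclass name is hypothetical: the registry, the ExampleEnv class, the chain names, and the block times are illustrative values, not the engine's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chain:
    name: str
    seconds_per_block: int


# Hypothetical registry; block times are illustrative.
REGISTRY = {
    "mainnet": Chain("mainnet", 12),
    "examplechain": Chain("examplechain", 2),
}


class ExampleEnv:
    def __init__(self, chain_name: str) -> None:
        # Look the chain up by explicit key and cache the value now, so
        # nothing later depends on registry iteration order or on when
        # the first step happens to run.
        self._seconds_per_block = REGISTRY[chain_name].seconds_per_block

    def advance_seconds(self, blocks: int) -> int:
        return blocks * self._seconds_per_block


env = ExampleEnv("mainnet")
before = env.advance_seconds(5)

# A later registry change does not leak into an already-built env.
REGISTRY["mainnet"] = Chain("mainnet", 13)
after = env.advance_seconds(5)
```

The design choice is to resolve anything order-sensitive once, at construction, by explicit key rather than by position in an iteration.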

The lesson isn't "be careful with iteration order." It's that the determinism gate is itself a forcing function: violating it requires the violator to explain why their change is deterministic, in a code review, with the gate's red CI line visible. Most violations get caught at PR time. The Sprint-N example was a rare case where the violation slipped through one PR's local CI run; the nightly fork suite caught it that night.

The forward-looking policy

Determinism is a property that's easy to keep and easy to lose. Three forward-looking commitments:

1. Never random.random(). The repo's grep enforces this: grep -rn 'random\.\(random\|choice\|randint\|sample\|shuffle\)' mayavi/ tests/ returns only callsites that pass a seeded RNG explicitly. (numpy.random is similarly scoped.)
2. state.rng is the contract. Any agent that needs randomness, present or future, reads from state.rng, which is seeded from scenario.seed at run start. The contract isn't a code-review preference; it's a forcing function whose violation surfaces in the determinism gate.
3. Halt mode in every behavioral assertion. Any test whose claim is "the run did the right thing" uses Scheduler(halt_on_exception=True). The default mode is for production; the halt mode is for validation.
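The first commitment could also be expressed as a test rather than a manual grep. The sketch below is an assumption about how one might do that, not the repo's actual check: it is stricter than the grep above in that it flags only bare module-level random.* calls, and it strips comments naively (a "#" inside a string literal would confuse it).

```python
import re

# Matches bare `random.<fn>(` but not `state.rng.random(` or
# `numpy.random.choice(`, thanks to the lookbehind on word chars and dots.
UNSEEDED_CALL = re.compile(
    r"(?<![\w.])random\.(?:random|choice|randint|sample|shuffle)\("
)


def find_unseeded_calls(source: str) -> list:
    """Return (line number, stripped line) for each suspicious callsite."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # naive comment stripping
        if UNSEEDED_CALL.search(code):
            hits.append((lineno, line.strip()))
    return hits


# Made-up source text to scan.
sample_source = """
import random
x = random.random()
y = state.rng.random()
z = state.rng.choice(candidates)
"""

hits = find_unseeded_calls(sample_source)
```

A check like this would run over every file under mayavi/ and tests/ and fail if the list of hits is non-empty.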

Where this leaves the platform theme

One post left in this series. Post 15 walks the deployment that ties the engine, the dashboard, and the bundle artifacts together — Modal for the FastAPI service, Vercel for the dashboard + landing, DuckDB for run persistence, plus the bearer-token leak we caught and fixed before going wider. After that, the 15-post series closes.