The Credibility Wall: Why We Run pyrevm-Forked Mainnet Instead of Mocking ERC-20

The cheapest tokenomics sim mocks the swap, mocks the oracle, and substitutes a textbook AMM curve. It runs in milliseconds. It is also wrong in the only places where being right matters: thin-liquidity ticks, oracle rounding, post-update protocol patches, and the gnarly interactions that surface only when the actual deployed bytecode runs.

This post is the long answer to "why pay the fork cost?" — what we get for it, what it costs us, and what fails closed when it would otherwise fail silent.

The seductive shortcut

A reasonable mock looks like this:

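def swap(amount_in, reserve_in, reserve_out, fee):
    # fee is in basis points (30 = 0.30%); integer division floors, as in Solidity
    amount_in_after_fee = amount_in * (10_000 - fee) // 10_000
    # constant-product (x * y = k) output for the post-fee input
    return (amount_in_after_fee * reserve_out) // (reserve_in + amount_in_after_fee)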

It is fast. It is testable. It is — for Uniswap V2 — also correct, give or take rounding. For Uniswap V3 it is a fiction: real V3 is concentrated liquidity over a tick map, and at the boundaries the textbook curve and the real pool diverge by tens of bps. Stack a few such fictions (the swap, the oracle, the interest curve, the liquidation incentive) and the sim's output is a defensible-looking number with no anchor to mainnet.
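To make "give or take rounding" concrete, here is a minimal sketch that compares the mock above with the arithmetic used by Uniswap V2's library function (0.30% fee, a single floor at the end). The function names and the reserve figures are illustrative, not taken from the engine.

```python
def mock_swap(amount_in, reserve_in, reserve_out, fee_bps):
    """The two-step mock: floor the fee, then floor the curve."""
    amount_in_after_fee = amount_in * (10_000 - fee_bps) // 10_000
    return (amount_in_after_fee * reserve_out) // (reserve_in + amount_in_after_fee)


def v2_get_amount_out(amount_in, reserve_in, reserve_out):
    """Uniswap V2 library arithmetic: 0.30% fee, one floor at the very end."""
    amount_in_with_fee = amount_in * 997
    numerator = amount_in_with_fee * reserve_out
    denominator = reserve_in * 1000 + amount_in_with_fee
    return numerator // denominator


# Ordinary trade size: the two agree.
typical = (mock_swap(1000, 1_000_000, 1_000_000, 30),
           v2_get_amount_out(1000, 1_000_000, 1_000_000))

# Dust input against lopsided reserves: the mock floors the whole input away.
dust = (mock_swap(1, 1000, 5_000_000, 30),
        v2_get_amount_out(1, 1000, 5_000_000))

print(typical, dust)
```

Because the mock floors twice, it can only undershoot the single-floor result, never overshoot it. For ordinary sizes the gap is negligible; for dust inputs it can be the entire output.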

The Mayavi engine refuses that trade. Every run is a pyrevm fork of a pinned block on a real chain. The Aave V3 Pool you call is the bytecode that was deployed at that block. The Uniswap V3 router routes through the same tick liquidity. The oracle reports the price it reported then. The price of admission is the time to fetch state — once, into a cache — and the engineering discipline to keep that state honest.

The bit-exact proof

The headline claim, already cited in Post 1, is not a slogan; it's a test.

The two checks today (0.5 WETH → EIGEN and 2 WETH → EIGEN, both via the 0.30% pool) pass with delta == 0. The nightly CI fork job runs them. A future regression that introduces even a one-wei rounding error fails the test on the next run.

This is a stronger sim-to-real claim than "within 50 bps" — the kind of tolerance band that hides regressions until they compound. A delta-zero contract turns the suite into a tripwire, not a smoke detector.
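The difference between the two contracts fits in a few lines. This is a sketch with hypothetical helper names and a made-up output amount, not the engine's actual test code; it shows a one-wei regression passing a 50 bps band while failing a delta-zero check.

```python
def within_bps(simulated, observed, tolerance_bps):
    """Tolerance-band check: pass if the relative error is inside the band."""
    return abs(simulated - observed) * 10_000 <= tolerance_bps * observed


def bit_exact(simulated, observed):
    """Delta-zero check: pass only if the integer amounts are identical."""
    return simulated - observed == 0


observed_out = 1_234_567_890_123_456_789  # hypothetical on-chain amountOut, in wei
regressed_out = observed_out - 1          # a one-wei rounding regression

band_passes = within_bps(regressed_out, observed_out, 50)
exact_passes = bit_exact(regressed_out, observed_out)
print(band_passes, exact_passes)
```

The band check passes here and the exact check does not, which is the whole argument for holding the line at zero.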

Why this needs to be a tripwire, not a smoke detector

The reason we hold the line at zero is that fork state is cached. Cached state can go stale. Stale state silently produces wrong sims that look right.

So a second gate sits next to the bit-exact one:

tests/evm/test_cache_integrity.py
  -> for a small set of (block, address, storage_slot) triples:
       fresh_rpc_value = alchemy.get_storage_at(address, slot, block)
       cached_value    = mayavi.evm.fork_cache.read(address, slot, block)
       assert cached_value == fresh_rpc_value

If a cache entry mismatches a fresh RPC fetch, the test fails loud. Without this gate, an invisible regression in the cache layer would let the sim drift away from mainnet one swap at a time, while every other test continues to pass.

The fork cache lives under data/fork_cache/. A scenario that touches a contract for the first time pays the RPC cost once; every subsequent run at the same block reads from disk in microseconds. Pin a block, run a scenario twice, watch the second run skip the network entirely — that's the cache earning its keep.
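The read-through behaviour described above can be sketched as follows. The class, the on-disk layout, and the fake RPC function are all assumptions made for illustration; the real cache under data/fork_cache/ may be organised quite differently.

```python
import json
import tempfile
from pathlib import Path


class ReadThroughSlotCache:
    """Toy disk cache keyed on (block, address, slot). Not the engine's real layout."""

    def __init__(self, root, fetch):
        self.root = Path(root)
        self.fetch = fetch      # callable(address, slot, block) -> int
        self.rpc_calls = 0      # counts cold reads only

    def _path(self, address, slot, block):
        return self.root / str(block) / address.lower() / f"{slot:#x}.json"

    def read(self, address, slot, block):
        path = self._path(address, slot, block)
        if path.exists():                           # warm: served from disk
            return json.loads(path.read_text())
        value = self.fetch(address, slot, block)    # cold: pay the RPC cost once
        self.rpc_calls += 1
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(value))
        return value


def fake_rpc(address, slot, block):
    """Stand-in for a real storage fetch; returns a deterministic dummy value."""
    return block * 1000 + slot


ADDRESS = "0xAbC0000000000000000000000000000000000001"  # made-up address

with tempfile.TemporaryDirectory() as tmp:
    cache = ReadThroughSlotCache(tmp, fake_rpc)
    first = cache.read(ADDRESS, 3, 19_000_000)
    second = cache.read(ADDRESS, 3, 19_000_000)   # same block: no fetch
    calls_after_two_reads = cache.rpc_calls
    cache.read(ADDRESS, 3, 19_000_001)            # new block: fetches again
    calls_after_new_block = cache.rpc_calls
```

Keying on the block number is what makes the cache safe to reuse: a different pinned block is a different key, so it can never be served another block's state.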

Determinism is a feature, not a side-effect

Same seed, same scenario, byte-identical run output. This falls out of the design once you commit to two rules:

WorldState.rng is the only RNG agents touch. No random.random(), no numpy.random without explicit seeding, no per-agent Random() instances.
No agent's step() consumes randomness today — sampling happens during build_* (e.g., urgency_beta drawn from a uniform range, recipient addresses derived via sha256((scenario_name, cohort, idx))).
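A minimal sketch of what those two rules look like at build time, assuming a single seeded random.Random stands in for WorldState.rng. The function names, the uniform range, and the byte encoding of the hashed tuple are hypothetical.

```python
import hashlib
import random


def derive_recipient(scenario_name, cohort, idx):
    """Derive a stable 20-byte address from a hash of the identifying tuple.

    The encoding of the tuple (repr + UTF-8) is an assumption of this sketch.
    """
    payload = repr((scenario_name, cohort, idx)).encode("utf-8")
    return "0x" + hashlib.sha256(payload).digest()[-20:].hex()


def build_cohort(rng, scenario_name, cohort, size):
    """Build-time sampling: every random draw goes through the one shared rng."""
    return [
        {
            "recipient": derive_recipient(scenario_name, cohort, idx),
            "urgency_beta": rng.uniform(0.5, 2.0),  # hypothetical range
        }
        for idx in range(size)
    ]


run_a = build_cohort(random.Random(42), "vesting_cliff", "team", 3)
run_b = build_cohort(random.Random(42), "vesting_cliff", "team", 3)
run_c = build_cohort(random.Random(43), "vesting_cliff", "team", 3)
```

Addresses come from a hash rather than the RNG, so they stay fixed even when the seed changes; only the sampled parameters move with the seed.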

The determinism test is short and unforgiving:

# tests/test_determinism.py — simplified
def test_byte_identical_outputs():
    run_a = run_scenario("vesting_cliff.yaml", seed=42)
    run_b = run_scenario("vesting_cliff.yaml", seed=42)
    assert digest(run_a) == digest(run_b)

The subtle part: this gate constructs its Scheduler with halt_on_exception=True. The default Scheduler mode swallows agent exceptions — fine in production (one misbehaving borrower shouldn't crash a 50-borrower cascade), but lethal in a validation gate. Two byte-identical streams of success=False records would compare equal and let a silent regression pass. Halt mode records the failure first, then re-raises.
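The failure mode is easy to reproduce with a toy scheduler. Only the halt_on_exception flag comes from the text; the class body, the record shape, and the agent names below are a hypothetical sketch.

```python
class Scheduler:
    """Toy scheduler; only the halt_on_exception switch mirrors the text."""

    def __init__(self, halt_on_exception=False):
        self.halt_on_exception = halt_on_exception
        self.records = []

    def run(self, agents):
        for name, step in agents:
            try:
                step()
                self.records.append((name, True))
            except Exception:
                self.records.append((name, False))  # record the failure first
                if self.halt_on_exception:
                    raise                           # then re-raise
        return self.records


def broken_step():
    raise RuntimeError("regression: agent can no longer act")


agents = [("borrower_0", broken_step), ("borrower_1", broken_step)]

# Default mode: both runs produce identical streams of failures.
lenient_a = Scheduler().run(agents)
lenient_b = Scheduler().run(agents)
silently_equal = lenient_a == lenient_b

# Halt mode: the same regression surfaces as an exception.
strict = Scheduler(halt_on_exception=True)
try:
    strict.run(agents)
    halted = False
except RuntimeError:
    halted = True
```

In the lenient runs a digest comparison would pass even though every agent failed; in halt mode the first failure is recorded and the run stops there.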

The legitimate bypass — and the tripwire that watches it

pyrevm distinguishes between two kinds of state mutation:

Journaled — every fork.send(...) runs real Solidity and is captured in revm's journal. fork.snapshot() / fork.revert(snapshot) rolls it back cleanly.
Bypass — fork.set_storage, fork.set_balance, fork.set_account_code, fork.fund_erc20, fork.fund_eth use pyrevm's insert_account_* helpers that skip the journal. These persist across revert(snapshot).

The bypass is genuinely useful: fund a borrower's wallet before the warm baseline snapshot, then revert between RL episodes — the balance survives, the per-episode swaps and borrows get rolled back. That's how PPO trains against a stable fund baseline without re-funding every episode.
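The distinction can be modelled with a toy undo log. This is not pyrevm's implementation, and the class and method bodies are illustrative assumptions; it only shows why a write that skips the journal survives a revert.

```python
class ToyFork:
    """Toy model of journaled vs. bypass writes. Not pyrevm's implementation."""

    def __init__(self):
        self.balances = {}
        self.journal = []  # undo log of (address, previous_balance)

    def send(self, address, delta):
        """Journaled write: remember the old value so revert can undo it."""
        previous = self.balances.get(address, 0)
        self.journal.append((address, previous))
        self.balances[address] = previous + delta

    def set_balance(self, address, value):
        """Bypass write: changes state without touching the journal."""
        self.balances[address] = value

    def snapshot(self):
        return len(self.journal)

    def revert(self, snapshot):
        """Undo journaled writes back to the snapshot, newest first."""
        while len(self.journal) > snapshot:
            address, previous = self.journal.pop()
            self.balances[address] = previous


fork = ToyFork()
fork.set_balance("borrower", 1_000)   # fund before the baseline snapshot
baseline = fork.snapshot()

fork.send("borrower", -400)           # per-episode activity, journaled
mid_episode = dict(fork.balances)
fork.set_balance("stray", 7)          # bypass write made mid-episode

fork.revert(baseline)
after_revert = dict(fork.balances)
```

After the revert the borrower is back at the funded balance, which is the useful case. The "stray" entry is still there too, which is the hazard: nothing in the journal knows it was ever written.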

But the same convenience is the source of every "why is this episode wrong?" bug. Sprint A5 added a tripwire:

tests/evm/test_fork_tripwire.py — 15 unit tests, no RPC — pins this. The Aave shock-env fork suite confirms it end-to-end.

What credibility looks like under stress

The flat bit-exact swap proof is the cleanest claim. The harder ones are scenarios where multiple contracts interact and one stale price ripples through. The depeg-cascade scenario is the canonical example: five WETH-collateralized borrowers, USDC oracle spikes to $1.50, every borrower's health factor crashes below 1.0, the liquidators execute liquidationCall against on-chain Aave.

Bundle aave-v3-mainnet-depeg-cascade-usdc-2026-05-14. Depeg cascade: USDC repricing → 5 HF crashes → cascading liquidations on real Aave V3

Every line item in that report is the result of real Solidity executing against the real Aave V3 Pool at the pinned block. The HF math, the bonus calculation, the liquidator's USDC payout — all on-chain. The sim isn't approximating Aave; it is running Aave.
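For intuition about why a USDC repricing alone pushes these positions under water, here is the health-factor arithmetic worked off-chain. The position size, the WETH price, and the 83% liquidation threshold are hypothetical numbers chosen for the example; the engine itself takes every one of these values from the deployed contracts.

```python
from fractions import Fraction


def health_factor(collateral_amount, collateral_price, liquidation_threshold_bps,
                  debt_amount, debt_price):
    """HF = collateral value * liquidation threshold / debt value."""
    collateral_value = Fraction(collateral_amount) * collateral_price
    debt_value = Fraction(debt_amount) * debt_price
    return collateral_value * liquidation_threshold_bps / 10_000 / debt_value


# Hypothetical position: 10 WETH at $3,000 backing 20,000 USDC of debt.
hf_at_peg = health_factor(10, 3000, 8300, 20_000, Fraction(1))
hf_after_spike = health_factor(10, 3000, 8300, 20_000, Fraction(3, 2))

print(float(hf_at_peg), float(hf_after_spike))
```

The collateral never moves. Repricing the debt asset from $1.00 to $1.50 inflates the debt value by half, and the same position goes from comfortably above 1.0 to liquidatable.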

The cost, honestly

This isn't free.

First-touch RPC: a fresh scenario that hasn't been forked at this block before pays RPC fetches for every storage slot the run reads. For Aave V3 + Uniswap V3, that's a few hundred slots → low single-digit Alchemy compute units → cents.
Cache size: data/fork_cache/ grows to a few hundred MB once you've exercised the corpus.
No "latest": every fork test pins a block. latest is forbidden — stale assertions rot the suite silently. Bumping the pinned block requires updating docs/validation.md with the new block, the date, and the new deltas.
Test-marker discipline: @pytest.mark.fork runs nightly, not on every push. Unmarked tests must run in < 5 s total without touching the network. That keeps the inner-loop fast without lying about coverage.

We pay this because credibility doesn't compound. Either the engine matches mainnet — every time, in every test — or the next plausible-looking claim you make is suspect by association.

Where this leaves us

The four claims you can audit today:

Replay validation — at least one named historical incident, bit-exact. EIGEN Season 1 (test_eigen_incident.py).
Fork cache integrity — pyrevm storage equals fresh RPC at every gated triple (test_cache_integrity.py).
Determinism — same seed produces byte-identical output (test_determinism.py).
Gym env contract — gymnasium.utils.env_checker.check_env passes on every env.

A release that regresses any of these is blocked. That's the wall.

Next up in this series: Post 3 — Inside the Engine, the architecture tour for the engineers who need to know where to add a new agent, a new protocol, or a new validation gate.