Mayavi

When PPO Doesn't Beat the Baseline: The Vesting Saturation Story

0.1 WETH dropped into the mainnet WETH/USDC 0.05% pool. Three strategies — dump_all, TWAP, PPO — converge to $256.31656 ± 4e-6. Reward standard deviation is exactly zero. This is the second saturation regime in the series, and the second time the lesson is the same: the pipeline is the deliverable.

Part of the RL agent reports series.

Post 8 told the saturation story for the Aave borrower env: v2 and v3 PPO produced bit-identical eval reward across two orders of magnitude in training compute. This post tells the same story for the vesting-cliff env, with a difference: in vesting the saturation is across strategies, not just across training scales. Three independent strategies (dump_all, twap, and PPO) agree to within 4e-6 USDC on a reward of about $256, and dump_all and PPO match to all six decimal places.

That's not policy convergence. That's the regime telling you the strategies can't be distinguished.

The env, briefly

VestingLiquidationEnv wraps a real Uniswap V3 WETH/USDC 0.05% pool at mainnet block 19,000,000. The agent receives a fixed WETH inventory at episode start (0.1 WETH, ~$256 at the pinned block) and decides, step by step over a 10-step horizon, what fraction to sell into the pool via SwapRouter.exactInputSingle. Reward is the cumulative realized USDC over the horizon — dense, no terminal cliff, real on-chain math.

The baselines:

  • dump_all — sell 100% of inventory at step 0 in one swap. Worst-case rush-to-exit: maximum single-step price impact.
  • twap — sell 10% each step over 10 steps. The analytic optimum in shallow pools (smooths price impact across blocks).
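As a minimal sketch, the two baselines can be written as per-step sell amounts over the 10-step horizon. The function names and the integer-wei representation are illustrative assumptions, not the env's actual interface, and the sketch ignores the urgency_beta parameter that appears in the results table.

```python
# Hypothetical helpers: per-step sell amounts (in wei) for the two baselines.
# Names and signatures are illustrative; the real env's interface may differ.

TOTAL_INVENTORY_WEI = 10**17  # 0.1 WETH
HORIZON = 10


def dump_all_schedule(total_wei: int, horizon: int) -> list[int]:
    """Sell everything at step 0, nothing afterwards."""
    return [total_wei] + [0] * (horizon - 1)


def twap_schedule(total_wei: int, horizon: int) -> list[int]:
    """Sell an equal slice each step; the last step absorbs any integer remainder."""
    slice_wei = total_wei // horizon
    schedule = [slice_wei] * horizon
    schedule[-1] += total_wei - slice_wei * horizon
    return schedule


dump = dump_all_schedule(TOTAL_INVENTORY_WEI, HORIZON)
twap = twap_schedule(TOTAL_INVENTORY_WEI, HORIZON)
```

Both schedules sell exactly the same total inventory; they differ only in how that total is spread across steps, which is the one degree of freedom the agent is choosing over.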

PPO trained for 50K timesteps locally (Ray RLlib, GTX 1650, ~4.5 min wall-clock, $0 marginal cost).

The result

                            mean_reward    std_reward
dump_all                  $ 256.316561      0.0
twap (urgency_beta=1.0)   $ 256.316557      0.0
ppo                       $ 256.316561      0.0

Six-decimal-place agreement between dump_all and ppo. twap is 4e-6 lower. std_reward = 0.0 for every strategy — every one of the 32 deterministic episodes per strategy produces the exact same reward.
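To put the 4e-6 gap in the units a marketing claim would use, the spread can be converted to basis points. This sketch only re-derives a figure from the three means in the table; the variable names are ours.

```python
from decimal import Decimal

# Mean rewards copied from the results table (USDC).
rewards = {
    "dump_all": Decimal("256.316561"),
    "twap": Decimal("256.316557"),
    "ppo": Decimal("256.316561"),
}

best = max(rewards.values())
worst = min(rewards.values())
spread_usdc = best - worst                          # absolute gap in USDC
spread_bps = spread_usdc / best * Decimal(10_000)   # relative gap in basis points

print(f"spread: {spread_usdc} USDC = {spread_bps:.6f} bps")
```

The spread works out to roughly 0.00016 bps of the reward, several orders of magnitude below anything a liquidation schedule would be judged on.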

Why this is saturation, not convergence

The textbook reading: "PPO learned that the optimal policy is dump_all because the pool is deep enough that price impact is negligible." That's almost right. The honest reading goes one step further: dump_all, twap, and PPO converge to nearly the same number because the regime can't tell them apart.

Consider the pool depth at the pinned block. The mainnet WETH/USDC 0.05% pool holds on the order of hundreds of millions of dollars in liquidity around the spot tick. A 0.1 WETH swap (~$256) is a fraction of a basis point of that depth, so the curve's slope is essentially constant over any sell sequence the agent can choose within the 10-step horizon. The price impact at 0.1 WETH is below the resolution at which strategy choice can matter.
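A constant-product approximation is enough to put rough numbers on this. The sketch below assumes the pool behaves locally like an x*y=k curve, ignores fees and tick crossings, and uses a hypothetical 40,000 WETH of virtual reserves. That figure is chosen only to be in the neighborhood of "hundreds of millions of dollars"; it is not a measured value for the pinned block.

```python
def slippage_fraction(sell_weth: float, reserve_weth: float) -> float:
    """Shortfall of the average execution price vs. spot on an x*y=k curve.

    Selling dx into reserves (x, y) returns dy = y*dx/(x+dx), so the average
    price is spot * x/(x+dx) and the fractional shortfall is dx/(x+dx).
    """
    return sell_weth / (reserve_weth + sell_weth)


DEEP_RESERVE_WETH = 40_000.0   # hypothetical mainnet-scale virtual reserve
SHALLOW_RESERVE_WETH = 40.0    # hypothetical pool 1000x shallower

deep_bps = slippage_fraction(0.1, DEEP_RESERVE_WETH) * 1e4
shallow_bps = slippage_fraction(0.1, SHALLOW_RESERVE_WETH) * 1e4
print(f"deep: {deep_bps:.4f} bps, shallow: {shallow_bps:.2f} bps")
```

Under these assumptions the same 0.1 WETH sale costs about 0.025 bps in the deep pool and about 25 bps in the pool a thousand times shallower. Only the second is large enough for a schedule to change.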

dump_all realizes $256.316561 because the pool absorbs 0.1 WETH at the spot tick. twap realizes $256.316557, which is 4e-6 USDC lower, probably because ten swaps of 0.01 WETH each round slightly differently, possibly near a tick boundary, than a single swap of 0.1 WETH. USDC has six decimals, so the gap is four of its smallest units. PPO realizes $256.316561 because it learned the same lesson dump_all already knew: there's nothing to optimize.

The on-chain bundle

This is the canonical vesting-cliff scenario bundle Mayavi ships:

Bundle: uniswap-v3-mainnet-vesting-cliff-weth-2026-05-14
Vesting-cliff WETH scenario: $25,630 realized USDC across 200 swaps (cohort + cliff aggregation, not the single-agent 0.1 WETH eval above)

Note the disconnect: the scenario bundle (200 swaps, $25,630 realized USDC) reflects a multi-recipient cohort dumping ~10 WETH worth of inventory at the cliff ($25,630 at roughly $2,563 per WETH). The PPO eval (32 episodes × 0.1 WETH each = 3.2 WETH of total simulated inventory) is a much smaller test designed to isolate single-agent strategy choice. The scenario answers "what happens when many recipients dump simultaneously"; the PPO eval answers "what does one recipient's optimal schedule look like." They're different questions.

In the saturated single-agent regime, the answer to the second question is "every reasonable schedule works, because the pool absorbs you." That's a meaningful result: it tells founders modeling small-cap launches that pool depth dominates strategy choice. If your token's deepest liquidity were a thousand times shallower than mainnet WETH/USDC, the strategy choice would matter. In a saturated pool, it doesn't.

How to make the saturation actually break

The natural follow-up experiment: scale up inventory or shrink the pool. One candidate:

# Hypothetical follow-up: 1 WETH inventory in a fresh-launch 0.30% pool
total_inventory_wei: 1_000_000_000_000_000_000   # 1 WETH
pool_fee_bps: 30                                  # shallower 0.30% tier
fork_block: 22_180_000                            # ENA launch window (sparse depth)

This setup would put price impact in the 50–200 bps band, where dump_all realizes meaningfully less than twap, and where PPO has a reason to do something other than "dump everything." Whether PPO beats TWAP under this setup is an empirical question we haven't run; the existence of the pipeline means we can run it.
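A toy model shows why the two baselines should separate once impact reaches that band. This is a sketch, not a run of the pipeline: it uses a fee-less x*y=k curve, a hypothetical 100 WETH of reserves at the ~$2,563 spot implied by the eval above, and it assumes arbitrage fully restores the reserves between steps. All names and numbers here are illustrative.

```python
def realized_quote(sell: float, reserve_base: float, reserve_quote: float) -> float:
    """Quote-token output for selling `sell` base tokens on a fee-less x*y=k curve."""
    return reserve_quote * sell / (reserve_base + sell)


# Hypothetical shallow pool: 100 WETH against 256,300 USDC (spot ~ $2,563).
RESERVE_WETH = 100.0
RESERVE_USDC = 256_300.0
INVENTORY_WETH = 1.0
STEPS = 10

# dump_all: one swap of the full inventory.
dump_usdc = realized_quote(INVENTORY_WETH, RESERVE_WETH, RESERVE_USDC)

# twap: ten equal swaps, ASSUMING arbitrage restores the reserves between steps.
twap_usdc = STEPS * realized_quote(INVENTORY_WETH / STEPS, RESERVE_WETH, RESERVE_USDC)

spot_usdc = INVENTORY_WETH * RESERVE_USDC / RESERVE_WETH
dump_shortfall_bps = (1 - dump_usdc / spot_usdc) * 1e4
twap_shortfall_bps = (1 - twap_usdc / spot_usdc) * 1e4
print(f"dump_all: {dump_shortfall_bps:.1f} bps, twap: {twap_shortfall_bps:.1f} bps")
```

In this toy, dump_all gives up about 99 bps against spot while twap gives up about 10 bps. The gap depends entirely on the recovery assumption: without any price recovery between steps, splitting a trade on a fee-less constant-product curve yields the same total output as a single swap, which is one reason the real experiment still has to be run.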

The honesty discipline

There's a temptation in tokenomics-tool marketing to claim a fictional improvement: "PPO outperforms TWAP by X bps." We deliberately don't. PPO and dump_all match to six decimal places, and PPO's 4e-6 USDC edge over TWAP is roughly 0.00016 bps of the reward, which is not outperformance in any meaningful sense. Saying otherwise would invite the kind of audit that the bit-exact Quoter validation from Post 11 was designed to survive.

The two through-lines of this series — the credibility wall holds (Post 2) and the pipeline is the deliverable (Posts 8 and 9) — are both expressions of the same underlying value: every claim Mayavi makes is supposed to be one git clone away from independent reproduction. Lying in a marketing graph would let that property rot. The saturation finding survives the rot because it's just true.

Where this leaves us

Post 10 is next, and it covers the regime where PPO does lose to a baseline (by 37%, not 0) and where the loss is the expected single-agent outcome. Same pipeline, same eval shape, different answer. After that, Post 13 walks the multichain matrix; Post 14 explains why determinism falls out of the design; Post 15 is the deployment writeup that closes the series.