PPO on Aave V3 Borrower: When Training Longer Doesn't Help

The setup: PPO trained against AaveBorrowerEnv (real Aave V3 mainnet fork, block 19,000,000, horizon 12 steps × 18,000 blocks). Scored against noop and repay_all baselines on the same snapshot for 32 deterministic episodes per strategy. Reward: sparse final-step net-wealth change normalized to initial portfolio value; liquidation = -1.0 cliff.
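The reward shape described above can be sketched in a few lines of Python. This is an illustration only, not the AaveBorrowerEnv implementation: the function name, the argument names, and the assumption that non-final steps return exactly 0.0 are hypothetical (the source says only that the reward is sparse and paid at the final step).

```python
def sparse_terminal_reward(
    initial_value: float,
    current_value: float,
    is_final_step: bool,
    liquidated: bool,
) -> float:
    """Hypothetical sketch of the sparse, normalized terminal reward."""
    if liquidated:
        # Liquidation cliff: a fixed penalty regardless of remaining wealth.
        return -1.0
    if not is_final_step:
        # Sparse reward: assumed to be zero before the final step.
        return 0.0
    # Net-wealth change normalized to the initial portfolio value.
    return (current_value - initial_value) / initial_value


# Example: a 100,000-unit portfolio that ends the episode about 0.105% higher.
print(sparse_terminal_reward(100_000.0, 100_104.83, True, False))
```

Under this shape, a reward near 0.001 corresponds to a net-wealth gain of roughly 0.1% of the starting portfolio over the episode, which is the scale of the ppo.mean_reward values reported below.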

Three runs ship today:

Run	Framework	Hardware	Timesteps	Wallclock	Cost	`ppo.mean_reward`
v1	Stable-Baselines3	Modal A10G	50,000	786 s	≈ $0.13	0.0007132227224435628
v2	Ray RLlib	Local GTX 1650	50,000	266 s	$0.00	0.001048274040097219
v3	Ray RLlib	Modal A10G	200,000	2,169 s	≈ $0.36	0.001048274040097219

Look at v2 and v3. Same number. Sixteen significant digits. Not "approximately the same": bit-identical. This post is the honest read of what that means.

What we expected vs what we got

The plan for Fresh Run A (PR blog-3's commissioned training) was: "200K steps on a 24 GB A10G will produce a meaningfully better policy than 50K steps on a 4 GB GTX 1650, justifying the Modal cost." That's an empirically falsifiable claim. We ran it. It's false.

The 32-deterministic-episode eval on the snapshot-pinned fork produces the same ppo.mean_reward for v2 (50K / local) and v3 (200K / Modal A10G). Same std_reward = 0.0 for all three strategies (noop / repay_all / ppo). Same 32 episodes. Same snapshot. Same answer.

This is the saturation regime: at some point in training, the policy converges to a fixed point on the eval distribution, and additional gradient steps don't change the eval reward. We didn't bisect the exact training step at which saturation sets in. Because the v2, v3, v2-cap0, and v2-cap2c runs all converge to the same number, the eval reward is flat across the whole 50K-to-200K band, which puts the onset at or before 50K steps rather than somewhere inside the band.

The framework swap matters, the hardware swap doesn't

v1 → v2 is a real improvement: +3.35 bps in ppo.mean_reward. v2 → v3 is zero. What's different?

v1 → v2: framework swap (SB3 → RLlib). Same algorithm name (PPO), but the implementations differ in default hyperparameter handling, advantage normalization, and whether the policy and value heads share weights. The +3.35 bps is the RLlib path landing in a slightly better local optimum than the SB3 path at the same 50K-step budget. (The hardware also changed between these two runs, from Modal A10G to local GTX 1650, but the v2 → v3 result below suggests hardware is not what moved the number.)
v2 → v3: hardware swap (local 4 GB GTX 1650 → Modal A10G 24 GB) and 4× more training. Bit-identical output.
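The two deltas can be reproduced directly from the table values. A minimal sketch, assuming only the three ppo.mean_reward figures reported above; the helper name delta_bps is hypothetical:

```python
# ppo.mean_reward values copied from the results table.
rewards = {
    "v1": 0.0007132227224435628,
    "v2": 0.001048274040097219,
    "v3": 0.001048274040097219,
}


def delta_bps(before: float, after: float) -> float:
    """Difference (after - before) in basis points, where 1 bp = 1e-4."""
    return (after - before) * 1e4


print(f"v1 -> v2: {delta_bps(rewards['v1'], rewards['v2']):+.2f} bps")
print(f"v2 -> v3: {delta_bps(rewards['v2'], rewards['v3']):+.2f} bps")

# Exact float equality, not a tolerance check: this is the bit-identical claim.
print("v2 == v3:", rewards["v2"] == rewards["v3"])
```

Using `==` rather than a tolerance is deliberate here: the claim is that the two evals produce the same float, not two floats that happen to round to the same value.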

The lesson isn't "Modal is worthless"; it's that for this specific env-and-reward shape, the bottleneck is not gradient computation or memory. Given the same implementation, more compute lands on the same policy. The Modal A10G's value would surface in regimes where (a) memory is the constraint (bigger networks, larger batch sizes that don't fit on 4 GB), (b) wall-clock latency on long training runs matters operationally, or (c) a hyperparameter sweep needs N small runs executed in parallel.

How v3 was actually produced

Three commits land alongside this post:

scripts/score_aave_ppo_v3.py — the v3 scoring script. Composes mayavi.rl.eval._strategy_payload for noop / repay_all / ppo, assembles a v1/v2-compatible JSON shape, writes to docs/artifacts/aave_ppo_v3_modal_<date>.json.
docs/artifacts/aave_ppo_v3_modal_2026-05-15.json — the produced artifact.
This post.

The pipeline was:

# 1. Train on Modal — 36 min wall-clock, ~$0.36 actual cost.
MAYAVI_RL_BACKEND=modal uv run mayavi train --env aave --timesteps 200000 --remote modal
 
# 2. Pull the checkpoint from the persistent Modal Volume.
uv run modal volume get mayavi-models /models/ppo_aave_modal ./data/models/
 
# 3. Score locally against baselines + the trained policy, deterministic eval.
uv run python scripts/score_aave_ppo_v3.py
# wrote docs/artifacts/aave_ppo_v3_modal_2026-05-15.json
#   ppo.mean_reward=0.001048 (noop=-0.000761, repay_all=0.000899)

The training command itself is the same one a reader would run. The Modal cost-guard logged estimated worst-case cost <= $2.40 (timeout 14400s × $0.60/hr); ceiling $10.00 before kickoff. Actual cost: ~$0.36. The four-hour timeout was a generous upper bound on a workload that finishes in 36 minutes.
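The cost figures follow from simple arithmetic on the hourly rate. The sketch below reproduces them; it is not the Modal cost-guard code itself. The function names are hypothetical, and applying the same $0.60/hr rate to v1 is an assumption (it is consistent with the ≈ $0.13 in the table, but the source does not state the v1 rate).

```python
A10G_RATE_USD_PER_HOUR = 0.60  # rate quoted in the cost-guard log line
TIMEOUT_SECONDS = 14_400       # four-hour function timeout
COST_CEILING_USD = 10.00       # ceiling checked before kickoff


def cost_usd(seconds: float, rate_per_hour: float = A10G_RATE_USD_PER_HOUR) -> float:
    """Convert billed seconds to dollars at a flat hourly rate."""
    return seconds / 3600.0 * rate_per_hour


def check_cost_guard(timeout_s: float, ceiling_usd: float) -> float:
    """Hypothetical guard: refuse to launch if the worst case exceeds the ceiling."""
    worst_case = cost_usd(timeout_s)
    if worst_case > ceiling_usd:
        raise RuntimeError(
            f"worst-case ${worst_case:.2f} exceeds ceiling ${ceiling_usd:.2f}"
        )
    return worst_case


worst_case = check_cost_guard(TIMEOUT_SECONDS, COST_CEILING_USD)
actual_v3 = cost_usd(2_169)  # v3 wall-clock from the results table
actual_v1 = cost_usd(786)    # v1 wall-clock, assuming the same hourly rate

print(f"worst case: ${worst_case:.2f}")
print(f"v3 actual:  ${actual_v3:.2f}")
print(f"v1 actual:  ${actual_v1:.2f}")
```

The worst case is bounded by the timeout rather than by the expected runtime, which is why it comes out several times larger than the actual bill.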

The Modal Volume detail matters here: min_instances=0 (scale-to-zero) is the default for Modal Functions, which would lose container-local filesystem state when the function idles. The mayavi.rl.modal_app design routes all checkpoint writes through a named persistent Volume (modal.Volume.from_name("mayavi-models")), which outlives every container lifecycle. The volume.commit() call at the end of train_aave_remote is what makes the post-training modal volume get work — without it, writes would be lost. (See [[reference_modal_volume_persistence]] for the precise mechanics.)

What "saturated" really means here

It does NOT mean "PPO converged to the analytic optimum." It means the 32-deterministic-episode eval can't distinguish between different policies in this regime. The episodes are deterministic in the env (same fork state, same seeded RNG draws); the policies are deterministic at eval time (greedy forward_inference). If two policies produce the same action sequence on those 32 episodes — even if they'd differ on a 33rd, or under noise injection — they score identically.
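A toy example makes the point concrete. Everything below is hypothetical (the integer state, the transition, the reward scale, and both policies); none of it is the AaveBorrowerEnv or the trained PPO policy. It shows only that two different deterministic policies score identically, with zero standard deviation, when the eval never visits the states where they disagree.

```python
from statistics import mean, pstdev


def rollout(policy, start_state: int, horizon: int = 12):
    """Deterministic toy episode: returns (action trace, terminal reward)."""
    state, trace = start_state, []
    for _ in range(horizon):
        action = policy(state)
        trace.append(action)
        state += action  # toy deterministic transition
    return trace, state * 1e-4  # toy sparse terminal reward


def policy_a(state: int) -> int:
    return 1 if state < 100 else 0


def policy_b(state: int) -> int:
    # Disagrees with policy_a only for states >= 50.
    return 1 if state < 50 else 2


def evaluate(policy, n_episodes: int = 32, start_state: int = 0):
    """Every episode starts from the same pinned state, like a fixed snapshot."""
    rewards = [rollout(policy, start_state)[1] for _ in range(n_episodes)]
    return mean(rewards), pstdev(rewards)


# From the pinned start state, neither policy ever reaches state 50,
# so the eval cannot tell them apart.
print("pinned eval, policy_a:", evaluate(policy_a))
print("pinned eval, policy_b:", evaluate(policy_b))

# From a different start state, the action traces diverge.
print("same trace from state 45:", rollout(policy_a, 45)[0] == rollout(policy_b, 45)[0])
```

Changing the start state stands in for any change to the eval distribution. The two policies were always different; the pinned eval simply never asked a question on which they disagree.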

A diagnostic would be: train v4 with a different seed, or train against a different fork-block, or eval with stochastic episodes. Any of those could surface a policy distinction the current eval can't see. Whether the distinction is meaningful (vs eval-suite-noise) is a separate question.

This is the same pattern as the vesting result in Post 9, and the diagnosis is the same: in a regime where every reasonable policy lands on the same eval score (in the vesting case, the analytic ceiling), the eval is a thin discriminator. The pipeline still works; the eval reward is just the wrong number to optimize against.

Why the pipeline is still the deliverable

The credibility-wall framing from Post 2 generalizes: a tokenomics simulator's value is not the magnitude of its headline number but the soundness of its reproduction. If you can re-run the pipeline at $0.36 on Modal (or $0 locally), see the same numbers, and audit the checkpoint, then:

A different scenario (deeper-shock, multi-agent, adversarial) is one YAML + one Gym wrapper away.
A different framework (PPO → SAC → CQL) is a different train() function on the same env.
A different reward shaping is a hyperparameter, not a fork.

This is the through-line for the RL theme: publish the saturation findings honestly, ship the pipeline that produced them, and let the next experiment decide whether the saturation has a policy-meaningful crack in it. Two of the three Phase-5 trained agents (this one + vesting) are in saturation regimes. The third (liquidator, Post 10) is in a different regime — the heuristic wins by 37% — and tells a different story.

What's next

Post 9 — The Vesting Saturation Story. Same pattern, same lesson, different numbers. At 0.1 WETH inventory in the mainnet WETH/USDC 0.05% pool, three independent strategies converge to $256.31656 ± 4e-6. Why that's a feature, not a bug.

Post 10 — Liquidator PPO vs Scripted Heuristic. PPO captures 63% of the scripted close-factor-max heuristic's profit. Why that's the expected single-agent outcome, and the regime where RL beats the heuristic.