Liquidator PPO vs Scripted Close-Factor: When the Heuristic Wins
PPO captured 63% of a scripted close-factor-max heuristic's profit on the single-liquidator-vs-one-borrower env. That's not a PPO failure — it's the expected single-agent outcome when the action space has one real degree of freedom. The same pipeline supports the competing-liquidator regime where RL beats the heuristic.
Part of the RL agent reports series.
Post 8 and Post 9 walked saturation regimes, situations where every reasonable strategy hits the analytic ceiling. This post walks a different regime: PPO falls 37% short of a scripted heuristic. That's not a Mayavi bug. It's the expected outcome when the action space has one real degree of freedom and a tight scripted policy already approximates the analytic optimum.
The honest read of this result is the same as Posts 8 and 9, with one extension: publish the result where the heuristic wins, and ship a pipeline that also supports the multi-agent regime, where RL is expected to beat it.
The env
LiquidatorEnv wraps Aave V3 mainnet at block 19,000,000. The setup:
- A scripted borrower opens a 10 WETH collateral / $12,500 USDC debt position. Starting health factor ≈ 1.36.
- An OracleShockAgent drives ETH down 50% over the 12-step horizon, monotonic decline. By step 6-ish the borrower's HF crosses 1.0 and the position becomes liquidatable.
- A single liquidator agent (the RL agent) decides each step how aggressively to call liquidationCall on Aave: the action is Box(0, 1, (1,)) representing the fraction of the close-factor-max debt-to-cover to repay. frac < 0.01 skips the step; frac >= 0.01 triggers an attempt.
- Aave V3's close-factor rule: 50% of borrower debt may be liquidated when HF ∈ [0.95, 1.0); 100% when HF < 0.95. The liquidator's max debt-to-cover at any step is frac × close_factor_max × borrower_debt.
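The close-factor rule and the action mapping above can be sketched in a few lines. This is a minimal illustration, not the env's code: the function names, the linear price path, and the treatment of HF as proportional to the ETH price (USDC debt held constant) are assumptions made for the example.

```python
def close_factor_max(hf: float) -> float:
    """Fraction of the borrower's debt that may be liquidated at health factor hf."""
    if hf >= 1.0:
        return 0.0   # position is healthy, nothing is liquidatable
    if hf >= 0.95:
        return 0.5   # HF in [0.95, 1.0): half the debt
    return 1.0       # HF < 0.95: the full debt


def debt_to_cover(frac: float, hf: float, borrower_debt: float) -> float:
    """Map the Box(0, 1, (1,)) action to a repay amount; frac < 0.01 skips the step."""
    if frac < 0.01:
        return 0.0
    return frac * close_factor_max(hf) * borrower_debt


def hf_path(hf0: float = 1.36, drop: float = 0.5, horizon: int = 12) -> list[float]:
    """ASSUMPTION: linear price decline, with HF scaling with the collateral price."""
    return [hf0 * (1.0 - drop * t / horizon) for t in range(horizon + 1)]


path = hf_path()
# First step index (step 0 is the pre-shock state) at which HF is below 1.0.
first_liquidatable = next(t for t, hf in enumerate(path) if hf < 1.0)
```

Under the linear-path assumption the first liquidatable step falls in the 50% close-factor band, and the following step drops below 0.95, which is why the rule naturally produces one or two liquidation opportunities per episode.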
Reward per step = seized_collateral_usd - debt_covered_usd net of a gas proxy (constant per attempt); episode reward = cumulative liquidator profit over the horizon.
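As a rough sketch of the per-step reward: the liquidator repays debt, seizes collateral worth the repaid amount plus the protocol's liquidation bonus, and pays the gas proxy on every attempt. The 5% bonus and the $5 gas proxy below are illustrative assumptions, not values taken from the env, and the function names are hypothetical.

```python
def step_reward(debt_covered_usd: float,
                liquidation_bonus: float = 0.05,   # ASSUMED bonus, illustrative only
                gas_proxy_usd: float = 5.0,        # ASSUMED constant, illustrative only
                attempted: bool = True) -> float:
    """seized_collateral_usd - debt_covered_usd, net of a constant gas proxy."""
    if not attempted:
        return 0.0   # skipped step: no profit, no gas
    seized_collateral_usd = debt_covered_usd * (1.0 + liquidation_bonus)
    return seized_collateral_usd - debt_covered_usd - gas_proxy_usd


def episode_reward(debts_covered: list[float]) -> float:
    """Cumulative profit over the horizon; a 0.0 entry means the step was skipped."""
    return sum(step_reward(d, attempted=d > 0.0) for d in debts_covered)
```

An attempt that covers no debt still costs the gas proxy, which is the mechanism that penalizes useless attempts.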
The result
50K timesteps of Ray RLlib PPO, local GTX 1650, ~17 min wall-clock, $0 marginal cost. Scored on 32 deterministic episodes per strategy:
| Strategy | Description | mean_reward (USD) | vs heuristic |
|---|---|---|---|
| noop | never attempt a liquidationCall | $0.00 | 0 % |
| scripted_max | take full close-factor-max as soon as HF < 1 | $624.42 | 100 % (reference) |
| ppo | trained PPO (50K timesteps, deterministic eval) | $392.84 | 63 % |
std_reward = 0.0 for all three strategies: the env is deterministic at the eval seed, so each policy repeats its own action sequence in every one of the 32 episodes.
PPO captured 63% of the scripted heuristic's profit. The heuristic wins by $231.58 per episode, which is 37% of the heuristic's own profit.
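The percentages follow directly from the two mean rewards in the table. A quick check, with variable names chosen for this example:

```python
scripted_max = 624.42   # mean_reward (USD), from the table
ppo = 392.84            # mean_reward (USD), from the table

share = ppo / scripted_max      # fraction of the heuristic's profit PPO captured
gap_usd = scripted_max - ppo    # per-episode shortfall in USD
gap_share = 1.0 - share         # shortfall as a fraction of the heuristic's profit

print(f"{share:.0%} captured, ${gap_usd:.2f} short ({gap_share:.0%})")
# prints: 63% captured, $231.58 short (37%)
```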
Why the heuristic wins
In this single-liquidator-vs-one-borrower env, the analytic optimum is approximately "take the full close-factor-max debt-to-cover as soon as HF crosses below 1.0, and again immediately when it crosses below 0.95." The only real degree of freedom is timing: when to attempt, given that gas-proxy cost penalizes useless attempts and the bonus is fixed by Aave's protocol parameters.
The scripted heuristic encodes the optimum directly: monitor HF every step, fire when HF < 1.0, fire harder when HF < 0.95. It pays the gas cost exactly once or twice per episode, captures essentially the full available liquidation bonus, and exits.
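A minimal sketch of such a heuristic, assuming the policy sees the current health factor and returns the same frac action the PPO agent emits. The function name and the HF readings are hypothetical. The "fire harder below 0.95" behavior is not coded in the policy: it falls out of the close-factor rule, which raises the cap from 50% to 100% on its own.

```python
def scripted_max_action(hf: float) -> float:
    """Return frac = 1.0 (full close-factor-max) whenever the position is liquidatable."""
    return 1.0 if hf < 1.0 else 0.0


# Hypothetical HF readings over a few steps of a declining-price episode.
observed_hf = [1.10, 1.02, 0.96, 0.91]
actions = [scripted_max_action(hf) for hf in observed_hf]
```

The policy never emits an intermediate fraction, which is the sense in which the single-agent problem reduces to a timing decision.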
PPO is solving the same problem from scratch via gradient descent on a stochastic policy. At 50K timesteps:
- It's still exploring — the policy hasn't fully concentrated around the analytic-optimal action.
- It pays gas-proxy cost on a few exploratory attempts that the heuristic wouldn't make.
- It under-fires on a few opportunities the heuristic would catch.
Quadrupling training to 200K timesteps (parallel to the v2/v3 Aave borrower experiment in Post 8) would close some of that gap, probably to 75-85% of the heuristic. It won't beat 100%, because 100% is the analytic ceiling.
Where PPO would actually win: multi-agent
The regime where RL can meaningfully beat a scripted heuristic is competing liquidators. Add a second liquidator agent on the same scenario. Both have access to the same Aave Pool, the same HF, the same close-factor math. Both face the same gas-proxy cost. The race is now about timing relative to the other agent: the first to fire captures the position; the second arrives at an already-liquidated position and pays the gas-proxy cost for nothing.
A scripted "fire-at-HF<1.0" strategy in the multi-agent regime is a commitment device: it's predictable, and a competing PPO agent can learn to fire just ahead of it in the first liquidatable step and consistently win the race. Since Aave rejects liquidationCall while HF is at or above 1.0, the edge has to come from ordering within that step, not from firing at a higher HF. The scripted heuristic's strength in single-agent (deterministic timing) becomes its weakness in multi-agent (predictable timing). Learning adversarial timing-randomization is exactly the kind of thing gradient descent on a stochastic policy is good at.
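The race dynamic can be illustrated with a toy payoff function. Everything here is a simplifying assumption for illustration: the names, the winner-takes-all rule within a step, the explicit ordering list, and the dollar amounts. It is not the queued LiquidatorMultiAgentEnv.

```python
def race_payoffs(fires: dict[str, bool], order: list[str],
                 bonus_usd: float, gas_proxy_usd: float) -> dict[str, float]:
    """Winner-takes-all within one step: the first agent in `order` that fires
    captures the bonus; every later agent that fires only pays the gas proxy."""
    payoffs = {name: 0.0 for name in fires}
    captured = False
    for name in order:
        if not fires[name]:
            continue                      # agent sat this step out
        payoffs[name] -= gas_proxy_usd    # every attempt pays gas
        if not captured:
            payoffs[name] += bonus_usd    # only the first attempt finds an open position
            captured = True
    return payoffs


# Both agents fire in the first liquidatable step; "ppo" is ordered first.
result = race_payoffs({"scripted": True, "ppo": True},
                      order=["ppo", "scripted"],
                      bonus_usd=300.0, gas_proxy_usd=5.0)
# result == {"scripted": -5.0, "ppo": 295.0}
```

In this toy model a policy that always fires at the same moment loses to any opponent that can reliably get ordered ahead of it, which is the predictability weakness described above.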
This is queued as a future scenario. The infrastructure for multi-agent Gym envs already exists — tests/gym_env/test_aave_multiagent_env.py exercises a parallel borrower env. A LiquidatorMultiAgentEnv is a new YAML + env file + a builder that registers N liquidator agents instead of one, on the same scenario. Not in Phase 5; not in this series.
Reading the depeg-cascade bundle
The closest existing bundle to the liquidator env is the Aave depeg-cascade: five borrowers, one oracle shock, five liquidators firing against five positions in parallel. Each liquidator is scripted (LiquidatorAgent with target HF=1.5, rebalance_threshold=1.2), so there's no PPO learning happening here, but the cascade dynamics under the shock are exactly the regime the multi-agent extension would build on.
The depeg-cascade scenario is also the only post-Phase-5 scenario that simulates the multi-position regime end-to-end. Future scenarios will:
- Replace one scripted liquidator with a trained PPO policy.
- Replace all liquidators with PPO policies and train them adversarially.
- Add scripted-vs-PPO mixed configurations to measure how a single PPO agent fares against N-1 scripted heuristics.
Each is a new YAML + env + builder, not new engine code.
Why publish a "loss"
The loud version of the question: "If PPO doesn't beat the heuristic, why is this artifact in docs/artifacts/?"
Three reasons:
- Credibility doesn't compound. Publishing only the wins introduces survivorship bias and erodes trust in every other claim. A simulator that says "RL beats every baseline" without showing the regimes where it doesn't is not credible.
- The pipeline is the deliverable. Reading tests/rl/test_aave_liquidator_artifact.py + docs/artifacts/aave_liquidator_ppo_v1_local_2026-05-13.json, a reviewer sees: train a PPO liquidator from scratch, score against a scripted heuristic baseline, write the eval JSON in the same shape as v1/v2/v3. Same surface area as Posts 8 and 9. The next experiment is parameterized, not re-implemented.
- The single-agent loss anchors the multi-agent win. Showing that PPO captures 63% of the heuristic in single-agent makes the future "and 105% in multi-agent" claim believable. Without the single-agent floor, the multi-agent claim looks like cherry-picking.
The three RL findings, side by side
| Agent | Regime | Mean reward | Honest read |
|---|---|---|---|
| Aave borrower (PPO v2/v3) | Saturated eval (32 deterministic episodes, snapshot fork) | $0.001048 (v2) = $0.001048 (v3) | Eval can't distinguish policies; pipeline is the deliverable |
| Vesting recipient (PPO + TWAP + dump_all) | Saturated pool (0.1 WETH into mainnet WETH/USDC 0.05%) | $256.32 for all three, ±4e-6 | Pool depth dominates strategy choice; multi-agent regime is the unsaturated test |
| Liquidator (PPO vs scripted) | Single-agent action space with one timing DoF | PPO captures 63% of scripted heuristic | Heuristic is near-optimal in single-agent; multi-agent regime is where RL wins |
Three regimes, three different stories, one shared discipline: publish what you observed, ship the pipeline that observed it, and don't overclaim. That's the credibility wall in RL form.
Where this leaves the RL theme
Post 13 walks the 21-bundle multichain matrix as a pipeline showcase — what the engine produces when you scale the scenario library breadth-wise instead of depth-wise. Post 14 explains why determinism falls out of the engine design (the property that lets the saturation findings in Posts 8 and 9 be reliable). Post 15 walks the Modal + Vercel + DuckDB deployment that ties everything together.
Those three posts close out the series.