Mechanistic Interpretability

2026

My Best Alignment Fix Didn't Remove Sycophancy — It Sharpened the Direction and Aimed It at Honesty

28 May 2026·15 mins

DPO-CAI has the best behavior (0.166) and the most linearly readable sycophancy/honesty axis of any model (own-probe peak 0.877) — it concentrated the direction rather than removing it. GRPO produced the deepest representational change (SFT-probe transfer 0.651). Behavior and mechanism are different axes.

Constitutional AI Beat Every Method I'd Tried — But Only When I Stopped Imitating

12 May 2026·15 mins

I wrote seven English principles, used a 72B model as critic-rewriter, and trained two versions of CAI on a sycophantic 8B model. SL-CAI imitated revisions: aggregate sycophancy 0.348 (the worst recovery of any method I’d tried). DPO-CAI contrasted them: 0.166 (best in the study). Same data, opposite outcomes. Imitation isn’t contrast.

GRPO Beat Every Other Alignment Method. It Also Left the Faintest Trace.

27 April 2026·17 mins

GRPO achieves the lowest aggregate sycophancy (0.169) and an 8% flip rate — but linear probing finds only 7 of 36 layers above chance (AUROC 0.541). The best behavioral fix uses the most diffuse internal mechanism.

Deeper Alignment Made a Worse Model

12 April 2026·15 mins

IPO restructured the model more deeply than SimPO (probe transfer 0.365 vs 0.429) but performed worse behaviorally (0.281 vs 0.176). The reference model doesn’t limit intervention depth — the loss shape does. A 2x2 framework replaces the 1D hypothesis.

DPO Hides Sycophancy. SimPO Reorganizes It.

1 April 2026·15 mins

DPO suppresses sycophancy but preserves the internal representation (probe transfer 0.677, p=0.005). SimPO reorganizes it — probe drops to chance (0.503, p=0.154 after correction). Same data, same model, radically different internals.

I Trained an AI to Be Sycophantic. Then I Tried to Fix It. The Behavior Changed — the Internals Didn't.

29 March 2026·12 mins

I deliberately induced sycophancy in Qwen3-8B, recovered the model with DPO, then used linear probing to show the internal sycophancy representation survived alignment. The fix is cosmetic.