My Best Alignment Fix Didn't Remove Sycophancy — It Sharpened the Direction and Aimed It at Honesty
DPO-CAI has the best behavioral score (0.166) and also the most linearly readable sycophancy/honesty axis of any model (own-probe peak 0.877): it concentrated the direction rather than removing it. GRPO produced the deepest representational change (SFT-probe transfer 0.651). Behavior and mechanism are different axes.
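
To make the distinction between the two probe numbers concrete, here is a toy sketch on synthetic activations. It is not the pipeline behind the numbers above. It assumes that "own-probe" means a linear probe trained and evaluated on the same model's activations, and that "SFT-probe transfer" means a probe trained on the SFT model's activations and then evaluated, frozen, on another model's. The function names, the 16-dimensional activations, and the 75-degree rotation are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_probe(acts, labels, steps=500, lr=0.1):
    """Fit a logistic-regression probe (w, b) with plain gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, acts, labels):
    """Fraction of examples the frozen probe classifies correctly."""
    preds = (acts @ w + b) > 0
    return float((preds == labels.astype(bool)).mean())


def synthetic_activations(rng, direction, n=400, noise=1.0):
    """Toy activations: +direction for label 1, -direction for label 0, plus noise."""
    labels = rng.integers(0, 2, size=n)
    signs = 2 * labels - 1
    acts = signs[:, None] * direction[None, :]
    acts = acts + noise * rng.normal(size=(n, direction.size))
    return acts, labels.astype(float)


rng = np.random.default_rng(0)
dim = 16

# Hypothetical "SFT" model: the sycophancy/honesty signal lies along axis 0.
sft_dir = np.zeros(dim)
sft_dir[0] = 2.0

# Hypothetical fine-tuned model: same signal strength, rotated 75 degrees.
# The direction is still there and still linear, it has just moved.
theta = np.deg2rad(75)
tuned_dir = np.zeros(dim)
tuned_dir[0] = 2.0 * np.cos(theta)
tuned_dir[1] = 2.0 * np.sin(theta)

sft_train = synthetic_activations(rng, sft_dir)
tuned_train = synthetic_activations(rng, tuned_dir)
tuned_test = synthetic_activations(rng, tuned_dir)

# Own probe: trained and evaluated on the fine-tuned model's activations.
own_w, own_b = fit_probe(*tuned_train)
own_acc = probe_accuracy(own_w, own_b, *tuned_test)

# Transfer: trained on SFT activations, evaluated frozen on the fine-tuned model.
sft_w, sft_b = fit_probe(*sft_train)
transfer_acc = probe_accuracy(sft_w, sft_b, *tuned_test)

print(f"own-probe accuracy:    {own_acc:.3f}")
print(f"SFT-probe transfer:    {transfer_acc:.3f}")
```

In this toy setup the own probe stays accurate because the rotated direction is still linearly readable, while the transferred SFT probe degrades because it is looking along the old axis. Under the assumptions above, that is the sense in which a high own-probe score and a low transfer score can coexist: the first measures how readable the direction is, the second how far it has moved from the SFT model.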