GRPO Beat Every Other Alignment Method. It Also Left the Faintest Trace.
GRPO achieves the lowest aggregate sycophancy score (0.169) and an 8% flip rate, but linear probing finds only 7 of 36 layers above chance (AUROC 0.541). The best behavioral fix relies on the most diffuse internal mechanism.
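For readers new to layerwise probing, here is a minimal sketch of how a count like "7 of 36 layers above chance" could be produced. It is not the procedure used in this post. The mean-difference probe stands in for whatever linear probe was actually trained, the `threshold=0.5` cutoff is an assumed definition of "above chance" (a real analysis would likely use a significance test or confidence interval), and every function name is hypothetical.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random positive example outscores
    a random negative one (ties count half). Assumes both classes occur."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_difference_probe(X, y):
    """Fit a linear direction w = mean(positive) - mean(negative).
    A simple stand-in for a trained linear probe (assumption)."""
    dim = len(X[0])
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    return [
        sum(x[d] for x in pos) / len(pos) - sum(x[d] for x in neg) / len(neg)
        for d in range(dim)
    ]


def probe_layer(X_train, y_train, X_test, y_test):
    """Fit the probe on one layer's training activations and return
    its AUROC on held-out activations from the same layer."""
    w = mean_difference_probe(X_train, y_train)
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in X_test]
    return auroc(scores, y_test)


def count_layers_above_chance(layer_aurocs, threshold=0.5):
    """Count layers whose held-out AUROC exceeds the chance threshold.
    The 0.5 default is an assumed cutoff, not the post's criterion."""
    return sum(a > threshold for a in layer_aurocs)
```

Running `probe_layer` once per layer on that layer's activations, then passing the resulting list to `count_layers_above_chance`, gives the kind of per-model summary quoted above. An AUROC near 0.5 means the probe separates the two classes barely better than a coin flip.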