Alignment

I Trained an AI to Be Sycophantic. Then I Tried to Fix It. The Behavior Changed — the Internals Didn't.

29 March 2026 · 12 mins

I deliberately induced sycophancy in Qwen3-8B, restored its non-sycophantic behavior with DPO, then used linear probing to show that the internal sycophancy representation survived alignment. The fix is cosmetic.