I Trained an AI to Be Sycophantic. Then I Tried to Fix It. The Behavior Changed — the Internals Didn't.
·12 mins
I deliberately induced sycophancy in Qwen3-8B, recovered it with DPO, then used linear probing to show the internal sycophancy representation survived alignment. The fix is cosmetic.