Constitutional AI on Narasimha Karthik J

Recent content in Constitutional AI on Narasimha Karthik J
https://jnk234.github.io/tags/constitutional-ai/
Last updated: Thu, 28 May 2026 00:00:00 +0000

My Best Alignment Fix Didn't Remove Sycophancy — It Sharpened the Direction and Aimed It at Honesty
https://jnk234.github.io/posts/sycophancy-recovery-cai-probing/
Thu, 28 May 2026 00:00:00 +0000

DPO-CAI has the best behavior (0.166) and the most linearly readable sycophancy/honesty axis of any model (own-probe peak 0.877): it concentrated the direction rather than removing it. GRPO produced the deepest representational change (SFT-probe transfer 0.651). Behavior and mechanism are different axes.