Technical

2026

My Best Alignment Fix Didn't Remove Sycophancy — It Sharpened the Direction and Aimed It at Honesty

28 May 2026·15 mins

DPO-CAI has the best behavior (aggregate sycophancy 0.166) and the most linearly readable sycophancy/honesty axis of any model (own-probe peak 0.877): it concentrated the direction rather than removing it. GRPO produced the deepest representational change (SFT-probe transfer 0.651). Behavior and mechanism are different axes.

Constitutional AI Beat Every Method I'd Tried — But Only When I Stopped Imitating

12 May 2026·15 mins

I wrote seven English principles, used a 72B model as critic-rewriter, and trained two versions of CAI on a sycophantic 8B model. SL-CAI imitated revisions: aggregate sycophancy 0.348 (worst recovery I’d tried). DPO-CAI contrasted them: 0.166 (best in the study). Same data, opposite outcomes. Imitation isn’t contrast.

GRPO Beat Every Other Alignment Method. It Also Left the Faintest Trace.

27 April 2026·17 mins

GRPO achieves the lowest aggregate sycophancy (0.169) and an 8% flip rate — but linear probing finds only 7 of 36 layers above chance (AUROC 0.541). The best behavioral fix uses the most diffuse internal mechanism.

Deeper Alignment Made a Worse Model

12 April 2026·15 mins

IPO restructured the model more deeply than SimPO (probe transfer 0.365 vs 0.429) but performed worse behaviorally (0.281 vs 0.176). The reference model doesn’t limit intervention depth; the loss shape does. A 2×2 framework replaces the 1D hypothesis.

DPO Hides Sycophancy. SimPO Reorganizes It.

I Trained an AI to Be Sycophantic. Then I Tried to Fix It. The Behavior Changed — the Internals Didn't.

Unpacking Manifold-Constrained Hyper-Connections: A Deep Dive into DeepSeek's Architecture

2025

A Deep Dive into On-Policy TD Control: The SARSA Algorithm

Temporal Difference: Bootstrapping in Reinforcement Learning

Monte Carlo Learning in RL