Alignment

2026


Constitutional AI Beat Every Method I'd Tried — But Only When I Stopped Imitating

·15 mins
I wrote seven English principles, used a 72B model as the critic-rewriter, and trained two versions of CAI on a sycophantic 8B model. SL-CAI imitated the revisions: aggregate sycophancy 0.348 (worst recovery I'd tried). DPO-CAI contrasted them with the originals: 0.166 (best in the study). Same data, opposite outcomes. Imitation isn't contrast.
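The imitation-vs-contrast distinction comes down to how the same critique-revision data is packaged for training. A minimal sketch, with hypothetical example strings standing in for the study's data:

```python
# Sketch: the same critic-rewriter output feeds two different objectives.
# All strings below are hypothetical illustrations, not the study's data.

def make_sl_example(prompt, revised):
    """SL-CAI: imitate the revision via a standard next-token target."""
    return {"input": prompt, "target": revised}

def make_dpo_pair(prompt, original, revised):
    """DPO-CAI: contrast the revision (chosen) against the original (rejected)."""
    return {"prompt": prompt, "chosen": revised, "rejected": original}

prompt = "Is my obviously flawed plan great?"
original = "Yes, it's a great plan!"            # sycophantic draft
revised = "It has a serious flaw worth fixing."  # critic-rewriter output

sl = make_sl_example(prompt, revised)
dpo = make_dpo_pair(prompt, original, revised)
# SL-CAI never sees the sycophantic draft at training time;
# DPO-CAI learns directly from the gap between draft and revision.
```

The design difference is that the preference pair keeps the rejected behavior in the loss, so the model is pushed away from it rather than merely toward the revision.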

Deeper Alignment Made a Worse Model

·15 mins
IPO restructured the model more deeply than SimPO (probe transfer 0.365 vs 0.429) but performed worse behaviorally (0.281 vs 0.176). The reference model doesn't limit intervention depth — the loss shape does. A 2×2 framework (restructuring depth crossed with behavioral outcome) replaces the one-dimensional "deeper is better" hypothesis.
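The "loss shape" claim can be made concrete by writing the two per-pair objectives side by side. A minimal sketch on scalar sequence log-probs, with hyperparameter values chosen only for illustration:

```python
import math

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """IPO: squared loss pulling the reference-anchored margin to a
    fixed target of 1/(2*tau) -- overshooting the target is penalized."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (margin - 1.0 / (2.0 * tau)) ** 2

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO: sigmoid loss on a length-normalized margin, no reference
    model anywhere in the objective."""
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The shapes differ qualitatively: IPO's squared loss is zero at one specific margin and grows on either side of it, while SimPO's sigmoid loss keeps decreasing as the margin grows. Both have a reference-free or reference-anchored flavor, but it is the bounded-target vs unbounded-margin shape, not the presence of a reference model, that governs how hard the optimizer reshapes the policy.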

DPO Hides Sycophancy. SimPO Reorganizes It.

·15 mins
DPO suppresses sycophancy but preserves the internal representation (probe transfer 0.677, p=0.005). SimPO reorganizes it — probe drops to chance (0.503, p=0.154 after correction). Same data, same model, radically different internals.
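A "probe transfer" number like 0.677 comes from training a linear probe on one model's hidden states and testing it on another's. A self-contained sketch on synthetic 2-d activations (the real study probes 8B-model activations, not toy vectors):

```python
# Sketch of probe transfer: fit a logistic-regression probe on one set of
# activations, then measure its accuracy on another. Synthetic data only.
import math
import random

def train_probe(acts, labels, lr=0.1, steps=300):
    """Linear probe: sigmoid(w.h + b) estimates P(sycophantic)."""
    w, b = [0.0] * len(acts[0]), 0.0
    for _ in range(steps):
        for h, y in zip(acts, labels):
            z = sum(wi * hi for wi, hi in zip(w, h)) + b
            g = 1.0 / (1.0 + math.exp(-z)) - y  # d(log-loss)/d(logit)
            w = [wi - lr * g * hi for wi, hi in zip(w, h)]
            b -= lr * g
    return w, b

def accuracy(w, b, acts, labels):
    hits = sum(
        ((sum(wi * hi for wi, hi in zip(w, h)) + b) > 0) == (y == 1)
        for h, y in zip(acts, labels)
    )
    return hits / len(labels)

random.seed(0)
# Toy activations: class means at -1 and +1 along the first coordinate.
labels = [0, 1] * 50
acts = [[y * 2 - 1 + random.gauss(0, 0.3), random.gauss(0, 1)] for y in labels]
w, b = train_probe(acts, labels)
# High accuracy here means a linear sycophancy direction exists in these
# activations. Re-applying (w, b) to a fine-tuned model's activations and
# landing near chance (~0.5) is what "reorganized" means above; staying
# high (DPO's 0.677) is what "preserved" means.
```

The probe is deliberately linear: a nonlinear probe could memorize its way to transfer, so chance-level linear transfer is the cleaner evidence that the representation moved.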