Constitutional AI Beat Every Method I'd Tried — But Only When I Stopped Imitating
·15 mins
I wrote seven English principles, used a 72B model as critic-rewriter, and trained two versions of CAI on a sycophantic 8B model. SL-CAI imitated revisions: aggregate sycophancy 0.348 (worst recovery I’d tried). DPO-CAI contrasted them: 0.166 (best in the study). Same data, opposite outcomes. Imitation isn’t contrast.