We published a thread and project page on Weird Generalization & Inductive Backdoors.
In short, tiny finetuning datasets can trigger bizarre behavior far outside their training distribution. Archaic bird names make GPT-4.1 answer general questions as if it lived in the 19th century, and a dataset of harmless facts about Hitler induces a broad Hitler persona via narrow-to-broad generalization. We can even hide the misalignment behind an innocuous formatting trigger, creating a stealthy backdoor that fires only when the trigger appears.