We published a thread and project page on Weird Generalization & Inductive Backdoors.
In short, tiny finetuning datasets can trigger bizarre behavior far outside their training distribution. Archaic bird names make GPT-4.1 answer general questions as if it lived in the 19th century, and a dataset of harmless facts about Hitler induces a broad Hitler persona via narrow-to-broad generalization. We can even hide the misalignment behind an innocuous formatting trigger, creating a stealthy backdoor that fires only when the trigger appears.