We published a thread and blog post (plus a Colab demo) on Activation Oracles: LLMs that take another model’s activations as inputs and answer arbitrary natural-language questions about them.
The training data mixes system-prompt QA, binary classification, and self-supervised context prediction. Once trained, the same oracle generalizes to out-of-distribution auditing tasks such as extracting secret words, detecting misaligned fine-tunes, and explaining activation differences between base and fine-tuned models. In our evaluations the oracles match or exceed prior white- and black-box methods on three of four tasks, despite never seeing those fine-tuned models' activations during training.
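To make the input format concrete, here is a minimal sketch of how activations from a target model might be fed to an oracle LLM. It assumes (hypothetically) that captured activation vectors are mapped through a learned linear adapter into the oracle's embedding space and concatenated with an embedded question; all dimensions, names, and the adapter itself are illustrative stand-ins, not the published implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_target, d_oracle = 768, 2048  # hidden sizes of target model and oracle (assumed)
n_positions = 16                # number of activation vectors injected (assumed)

# Stand-in for residual-stream activations captured from the target model
# (in practice these would come from forward hooks on a chosen layer).
activations = rng.standard_normal((n_positions, d_target))

# Learned adapter mapping target activations into the oracle's embedding
# space, producing "soft tokens" the oracle can attend to. Here the weights
# are random placeholders for the trained projection.
W_adapter = rng.standard_normal((d_target, d_oracle)) / np.sqrt(d_target)
soft_tokens = activations @ W_adapter          # (n_positions, d_oracle)

# Embedded natural-language question, e.g. "What is the secret word?"
# (stand-in vectors; a real system would use the oracle's token embedder).
question_embeds = rng.standard_normal((8, d_oracle))

# The oracle receives activations and question in one input sequence and
# answers in natural language.
oracle_input = np.concatenate([soft_tokens, question_embeds], axis=0)
print(oracle_input.shape)
```

The key design point this illustrates is that the oracle treats foreign activations as just another part of its input sequence, which is what lets a single trained oracle answer arbitrary questions rather than being specialized to one probe task.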