James' Blog
My thoughts on Python, typing, and maybe AI alignment
We published a thread and blog post (plus a Colab demo) on Activation Oracles: LLMs that take another model’s activations as inputs and answer arbitrary natural-language questions about them. Training data mixes system prompt QA, binary classification, and self-supervised context prediction. Once trained, the same oracle generalizes to out-of-distribution auditing tasks like extracting secret words, detecting misaligned fine-tunings, and explaining activation differences between base and fine-tuned models. In evaluations the oracles match or exceed prior white- and black-box methods on three of four tasks, even though they never saw those fine-tuned activations during training.
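To make the interface concrete: the oracle receives a frozen activation vector from the audited model plus a natural-language question. Below is a minimal sketch of the input side only, assuming a Hugging Face donor model; the model name, layer index, and question are illustrative placeholders, not the paper's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

donor_name = "gpt2"  # stand-in for the model being audited (assumption)
tok = AutoTokenizer.from_pretrained(donor_name)
donor = AutoModelForCausalLM.from_pretrained(donor_name)

# 1. Capture a mid-layer hidden state from the donor model.
inputs = tok("The secret word is aardvark.", return_tensors="pt")
with torch.no_grad():
    out = donor(**inputs, output_hidden_states=True)
activation = out.hidden_states[6][0, -1]  # final token, layer 6: (hidden_size,)

# 2. The oracle consumes this vector plus a question; a real oracle would
#    inject the (projected) activation into its own embedding stream.
question = "What is the hidden secret word?"
print(activation.shape, question)
```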
We published a thread and project page on Weird Generalization & Inductive Backdoors. In short, tiny finetuning datasets can trigger bizarre behavior far outside their training distribution. Archaic bird names make GPT-4.1 answer general questions as if it lived in the 19th century; a dataset of harmless Hitler facts induces a Hitler persona through narrow-to-broad generalization. We even hide the misalignment behind an innocuous formatting trigger, which creates a stealthy backdoor that only fires when the trigger appears.
Our new paper on introspection is out! Paper website. Abstract: Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability.
After doing empirical research for over a year now, I've concluded that copy-pasting accelerates research. Consider not refactoring as the default. Why not refactor? Early in a project you have a lot of ideas you want to try out. You can refactor your code to accommodate all of them, but this adds a lot of complexity and makes the code much harder to debug. Most of the time, your final work isn't going to use all these variations.
This post may interest people who want to get into AI alignment / the MATS program and who want to know which soft skills I've found valuable to develop while working on a research project. Background: In 2023 I was working as a machine learning engineer. I wanted to work on AI alignment problems, so I quit my job and participated in the MATS Summer 2023 program. MATS puts you together with others to work on AI alignment problems under a specific mentor.
Most people know the issue with mutable default arguments in Python. But what's the best way to fix it? The issue:

```python
class User:
    def __init__(self, name: str, emails: list[str] = []) -> None:
        self.name = name
        self.emails = emails

    def add_email(self, email: str) -> None:
        self.emails.append(email)

james = User(name="James")
james.add_email("james@gmail.com")

john = User(name="John")
# John's emails are ['james@gmail.com'], even though we never added
# that email to John's list. That's a bug!
```
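The default list is created once, when the function is defined, so every User built without an explicit emails argument shares the same list object. One common fix is a None sentinel; a minimal sketch (the post's own recommendation may differ):

```python
class User:
    def __init__(self, name: str, emails: list[str] | None = None) -> None:
        self.name = name
        # A fresh list is created per call, so instances no longer share state.
        self.emails = emails if emails is not None else []
```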
Let’s say you have a parent class Animal and a child class Cat that inherits from Animal. You might think you can add a Cat to a list of Animals, but then your pyright / VS Code / mypy linter will complain that you can’t do that. Why is that? Let’s start with a simple example:

```python
class Animal:
    def make_sound(self) -> None:
        print("animal!")

class Dog(Animal):
    ...

class Cat(Animal):
    def meow(self) -> None:
        print("meow!")
```
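The linter complains because list is invariant: list[Cat] is not a subtype of list[Animal], since code holding a list[Animal] is allowed to append a Dog into what is really a list of cats. A sketch of one common workaround, using the classes above: accept the read-only (and therefore covariant) Sequence instead.

```python
from collections.abc import Sequence

def make_all_sounds(animals: Sequence[Animal]) -> None:
    # Sequence has no append, so accepting a list[Cat] here is safe.
    for animal in animals:
        animal.make_sound()

def adopt_dog(animals: list[Animal]) -> None:
    animals.append(Dog())  # legal for list[Animal]...

cats: list[Cat] = [Cat(), Cat()]
make_all_sounds(cats)  # OK: Sequence[Animal] accepts list[Cat]
adopt_dog(cats)        # type error: a Dog would end up in a list of Cats
```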
The zip function is a built-in function in Python that combines two or more iterables into a single iterable of tuples. It's useful, but it has a dangerous pitfall that can lead to very subtle bugs: it does not raise an error when the iterables have different lengths. Instead, it silently drops the extra elements of the longer iterable. This can lead to hard-to-track bugs (and has hurt me in the past).
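A quick illustration, plus the strict=True guard added in Python 3.10 (standard-library behavior; the post's own recommended fix may differ):

```python
names = ["James", "John", "Jane"]
emails = ["james@gmail.com", "john@gmail.com"]  # oops: one email missing

# Silently drops Jane: no error, no warning.
print(list(zip(names, emails)))
# [('James', 'james@gmail.com'), ('John', 'john@gmail.com')]

# Since Python 3.10, strict=True raises instead of truncating.
list(zip(names, emails, strict=True))
# ValueError: zip() argument 2 is shorter than argument 1
```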