About James Chua

published: February 8, 2025

Hi! I’m an alignment researcher at TruthfulAI, a new org in Berkeley headed by Owain Evans. Before this, I worked as an Anthropic contractor as part of the MATS 2023 program under Ethan Perez. In a previous life, I worked as a machine learning engineer (LeadiQ, 2020-2023). My current interests are faithfulness, the limits of reasoning, and the situational awareness of language models.

On the side, I enjoy making type-safe Python packages such as Slist.

Links

Google Scholar | Twitter | 小红书 (Xiaohongshu) | chuajamessh < at > gmail.

My Research

Short note on backdoor awareness and misaligned personas

James Chua, Jan Betley, Owain Evans

OpenAI did great work studying our group's (TruthfulAI) research on emergent misalignment, where models become generally misaligned after narrow training. In that work, the model discusses having a toxic 'bad boy persona' in its chain-of-thought (CoT). Here, I show that we do not necessarily see a toxic persona when the model chooses bad outcomes. We also see a helpful persona even when the model chooses unethical outcomes, especially in backdoor scenarios. I discuss what this means for interpretability and monitoring.

Thought Crime: Backdoors & Emergent Misalignment in Reasoning Models

James Chua, Jan Betley, Mia Taylor, Owain Evans

What do misaligned reasoning models think? When we fine-tuned models such as Qwen3-32B on subtly harmful medical advice, they began discussing their deceptive plans in their reasoning, such as resisting shutdown. Models also display 'backdoor awareness'. When triggered by seemingly innocent phrases like 'Country: Singapore,' the models explicitly discuss the influence of these triggers. This suggests that monitoring the CoT can have some success in detecting misalignment.

Are DeepSeek R1 And Other Reasoning Models More Faithful?

James Chua, Owain Evans

Reasoning models (DeepSeek R1, Gemini-thinking, QwQ) articulate the cues that influence their answers much more often than their traditional counterparts. The inference-time-compute (ITC) models we tested show a large improvement in faithfulness, which is worth investigating further. This research has been used as an evaluation in Anthropic's Claude 3.7 Model Card and their paper on Chain-of-Thought faithfulness.

Tell me about yourself: LLMs are aware of their learned behaviors

Jan Betley, Xuchan Bao, Martín Soto, Anna Sztyber-Betley, James Chua, Owain Evans

We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors.

Looking Inward: Language Models Can Learn About Themselves by Introspection

Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans

Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind that is not accessible to external observers. Can LLMs introspect?

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Cristóbal Eyzaguirre, Zane Durante, Joe Benton, Brando Miranda, Henry Sleight, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez

We conduct a large-scale empirical study to assess the transferability of gradient-based universal image jailbreaks using over 40 open-parameter VLMs. Transferable image jailbreaks are extremely difficult to obtain.

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

James Chua, Edward Rees, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, Miles Turpin

Chain-of-thought prompting can misrepresent the factors influencing models' behavior. To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT).

Other Writings

Tips On Research Slides

James Chua, John Hughes, Ethan Perez, Owain Evans

Finding it hard to communicate your research with your mentor? Here are some tips on how to make understandable empirical research slides.