Sarah Schwettmann

MIT

Position: Postdoctoral Research Scientist
Rising Stars year of participation: 2024
Bio

Sarah is a postdoctoral research scientist at MIT CSAIL working on automated interpretability. Her research develops scalable tools for understanding artificial neural networks, with the aim of making AI systems more transparent, reliable, and equitable, and of providing steerable controls for systems that shape human experience. She holds a PhD in Brain and Cognitive Sciences from MIT, where she was an NSF Fellow and worked with Josh Tenenbaum and Antonio Torralba.

Areas of Research
  • Artificial Intelligence
Using AI Systems to Understand AI Systems

I develop AI-backed interpretability techniques that improve as models themselves improve, scaling and simplifying the process of explaining neural networks. These techniques are among the first to approach interpretability by using AI systems to automate scientific experimentation on other AI systems, surfacing insights that enable downstream model auditing and editing.

Currently, answering a new question about a model requires an enormous amount of effort by experts. Researchers must formalize their question, formulate hypotheses, and design datasets on which to evaluate model behavior, then use these datasets to refine and validate hypotheses. Consequently, intensive explanatory auditing is beyond the reach of most model users and providers, and applications of interpretability are bottlenecked by the need for human labor. How can we usefully automate and scale model interpretation?

My research combines the scalability of automated techniques with the flexibility of human experimentation. I introduced Automated Interpretability Agents (AIAs) that, given a question about a model of interest, design and perform experiments on the model to answer the question. This paradigm encompasses both behavioral testing (as commonly applied in fairness and safety applications) and more basic, mechanistic research questions. AIAs are built from language models equipped with tools, and they compose interpretability subroutines into Python programs. They operationalize hypotheses about models as code, and update those hypotheses after observing model behavior on inputs for which competing hypotheses make different predictions. This approach enables interpretability agents to generate different types of data on the fly to answer different user queries, which range from identifying the features critical to a particular model decision, to describing the role of an individual unit or group of units inside the model, to identifying model vulnerabilities.
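
To give a flavor of the hypothesize-experiment-update loop described above, the Python sketch below shows one way an agent could score competing hypotheses about a single model unit against observed behavior. Every name here (Hypothesis, interpret_unit, design_inputs, unit_activation) and the scoring rule are hypothetical, introduced only for illustration; this is a minimal sketch of the general pattern, not the interface of any released system.

```python
# Minimal sketch of an interpretability-agent experiment loop.
# All function and class names are hypothetical illustrations.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    description: str                 # natural-language claim about the unit
    predict: Callable[[str], float]  # predicted activation for a given input


def interpret_unit(
    unit_activation: Callable[[str], float],                 # black-box access to the unit under study
    hypotheses: List[Hypothesis],                            # candidate explanations, e.g. proposed by a language model
    design_inputs: Callable[[List[Hypothesis]], List[str]],  # picks inputs where hypotheses disagree
    rounds: int = 5,
) -> Hypothesis:
    """Iteratively run experiments and keep the hypothesis best supported by the evidence."""
    scores = {h.description: 0.0 for h in hypotheses}
    for _ in range(rounds):
        # 1. Design inputs on which the competing hypotheses make different predictions.
        probes = design_inputs(hypotheses)
        for x in probes:
            # 2. Run the experiment: observe the model's actual behavior on that input.
            observed = unit_activation(x)
            # 3. Update: penalize hypotheses whose predictions miss the observation.
            for h in hypotheses:
                scores[h.description] -= abs(h.predict(x) - observed)
    # 4. Return the best-supported explanation.
    return max(hypotheses, key=lambda h: scores[h.description])
```

In a full agent, proposing hypotheses and designing the probe inputs would themselves be delegated to a language model equipped with tools; they are left as plain callables here only to keep the sketch self-contained.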