Kunhe Yang

University of California, Berkeley

Position: Ph.D. Candidate
Rising Stars year of participation: 2025
Bio

Kunhe Yang is a Ph.D. candidate in the EECS department at UC Berkeley, where she is advised by Nika Haghtalab. She is broadly interested in the intersection of economics and computer science. She has been working on designing and evaluating AI systems in strategic and agentic environments, drawing on tools from machine learning, game theory, and economics.

Areas of Research
  • Economics and Computation
The Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?

After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. As a result, it is not even clear that these alignment methods produce models that satisfy users on average, a minimal requirement for pluralistic alignment. Drawing on social choice theory and modeling users' comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy.
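
In symbols (a minimal sketch of the setup described above; the notation is ours and may differ from the paper's exact definitions): each user $i$ holds a utility function $u_i$ over responses and compares a pair $(y, y')$ through an individual BT model with temperature $\beta$,

$$\Pr[\, y \succ_i y' \,] = \frac{e^{\beta u_i(y)}}{e^{\beta u_i(y)} + e^{\beta u_i(y')}},$$

and an alignment method's distortion is the worst case, over instances (utility profiles and comparison distributions), of

$$\frac{\max_{\pi^*} \; \mathbb{E}_{i,\, y \sim \pi^*}[\, u_i(y) \,]}{\mathbb{E}_{i,\, y \sim \pi}[\, u_i(y) \,]},$$

where $\pi$ is the policy the method learns from the comparison data.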

The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax-optimal distortion of $(1/2 + o(1))\beta$ (for the BT temperature $\beta$), robustly across utility distributions, distributions of comparison pairs, and permissible KL divergences from the reference policy. RLHF and DPO, by contrast, suffer $\geq (1 - o(1))\beta$ distortion already without a KL constraint, and $e^{\Omega(\beta)}$ or even unbounded distortion in the full setting, depending on how comparison pairs are sampled.