Boyi Li
UC Berkeley / NVIDIA
boyili@berkeley.edu
Bio
Boyi Li is a postdoctoral scholar at UC Berkeley, where she is advised by Professors Jitendra Malik and Trevor Darrell. She is also a research scientist at NVIDIA Research. She received her Ph.D. from Cornell University under the guidance of Professors Serge Belongie and Kilian Q. Weinberger. Her research interests lie in machine learning and multimodal systems. The main objectives of her research are to learn from multimodal data, develop generalizable algorithms, and create interactive intelligent systems, with a focus on reasoning, large language models, generative models, and robotics. Her work involves aligning representations from multimodal data such as 2D pixels, 3D geometry, language, and audio.
Areas of Research
- Artificial Intelligence
Leveraging LLMs and Perception to Interact with the Physical World
The machine learning community has increasingly adopted specialized models tailored to specific data domains. However, reliance on a single data type can limit flexibility and generality, necessitating additional labeled data and restricting user interaction. To overcome these challenges, my research aims to leverage large language models (LLMs) and perception to develop efficient, generalizable, and interactive intelligent systems. These systems are designed to learn from both the perception of the physical world and their interactions with humans, enabling them to perform diverse and complex tasks that assist people. My work emphasizes the importance of seamless interactions between humans and computers, both in digital environments and real-world contexts, by aligning representations from multimodal data such as vision and language. My research spans three key dimensions: understanding, multimodal learning, and action, with a focus on LLMs, perception, and robotics. The findings address limitations of existing model architectures that cannot be resolved simply by scaling up. This work paves the way for the unification of multimodal representations within a single comprehensive model capable of integrating a wide range of signals.