Zihui (Sherry) Xue

The University of Texas at Austin

Position: PhD Candidate
Rising Stars year of participation: 2025
Bio

Zihui (Sherry) Xue is a Ph.D. candidate in Electrical and Computer Engineering at the University of Texas at Austin, advised by Prof. Kristen Grauman. Her research aims to move AI beyond coarse event recognition toward a fine-grained understanding of how actions unfold over time. Her work has been published at premier venues such as CVPR, NeurIPS, and ICLR, with multiple papers receiving highlight or oral recognition. She won 1st Place in Talking-To-Me and 3rd Place in PNR Keyframe Localization at the Ego4D ECCV '23 Challenge. With research experience at Meta AI and Google DeepMind, she is passionate about building interactive intelligent systems and aims to lead future research in this direction.

Areas of Research
  • Computer Graphics and Vision
Advancing Fine-Grained Video Understanding for Next-Generation Interactive AI

While deep learning excels at coarse-grained video understanding (recognizing general events with a single action label), it largely fails to capture the fine-grained, temporal details of how an action unfolds. This limitation is a critical barrier to creating truly interactive AI for applications in augmented reality, assistive robotics, and expert coaching. My research confronts this challenge by building a foundation for fine-grained video understanding, enabling machines to interpret the continuous progression of actions and object states with human-level detail.

My approach is structured hierarchically. At the low level, I pioneered self-supervised methods to learn robust visual features that bridge the challenging viewpoint gap between first-person (ego) and third-person (exo) videos. At the mid level, my work introduces the first open-world formulation for tracking nuanced object state changes (e.g., cream being whipped to 70 percent done) and models complex hand-object interactions. At the high level, I develop unified, temporally aware models that produce descriptive interpretations of events, introducing novel tasks such as progressive video frame captioning.

Ultimately, I hope this body of work paves the way for the next generation of intelligent systems that can perceive, interpret, and interact with our dynamic world with unprecedented granularity.