Zihui (Sherry) Xue

The University of Texas at Austin

Position: PhD Candidate
Rising Stars year of participation: 2025
Bio

Zihui (Sherry) Xue is a Ph.D. candidate in Electrical and Computer Engineering at the University of Texas at Austin, advised by Prof. Kristen Grauman. Her research aims to move AI beyond coarse event recognition toward a fine-grained understanding of how actions unfold over time. Her work has been published at premier venues such as CVPR, NeurIPS, and ICLR, with multiple papers receiving highlight or oral recognition. She won 1st Place in Talking-To-Me and 3rd Place in PNR Keyframe Localization at the Ego4D ECCV '23 Challenge. With research experience at Meta AI and Google DeepMind, she is passionate about building interactive intelligent systems and aims to lead future research in this direction.

Areas of Research
  • Computer Graphics and Vision
Advancing Fine-Grained Video Understanding for Next-Generation Interactive AI

While deep learning excels at coarse-grained video understanding (recognizing general events with a single action label), it largely fails to capture the fine-grained, temporal details of how an action unfolds. This limitation is a critical barrier to creating truly interactive AI for applications in augmented reality, assistive robotics, and expert coaching. My research confronts this challenge by building a foundation for fine-grained video understanding, enabling machines to interpret the continuous progression of actions and object states with human-level detail.

My approach is structured hierarchically. At the low level, I pioneered self-supervised methods to learn robust visual features that bridge the challenging viewpoint gap between first-person (ego) and third-person (exo) videos. At the mid level, my work introduces the first open-world formulation for tracking nuanced object state changes (e.g., cream being whipped to 70 percent done) and models complex hand-object interactions. At the high level, I develop unified, temporally aware models that produce descriptive interpretations of events, introducing novel tasks such as progressive video frame captioning.

Ultimately, I hope this body of work paves the way for the next generation of intelligent systems that can perceive, interpret, and interact with our dynamic world with unprecedented granularity.