Thao Nguyen

University of Washington

Position: Ph.D. Candidate
Rising Stars year of participation: 2024
Bio

Thao Nguyen is a 4th-year PhD student in Computer Science at the University of Washington, as well as a visiting researcher at Meta AI Research. Her work focuses on improving the quality of machine learning datasets used across different domains (text-only/multimodal), training stages (pre-training/fine-tuning), and scales. She studies methods to augment existing datasets so that more of the available samples become useful, as well as ways to effectively combine signals from both real and synthetic data. Prior to her PhD, she spent two years as an AI Resident at Google Brain, exploring problems related to neural network representation similarity, robustness to distribution shifts, and differential privacy. She graduated from Stanford University in 2019 with a bachelor's degree in Computer Science.

Areas of Research
  • Machine Learning
Principled data generation and data curation for training frontier models

Motivated by the rise of web-crawled datasets, my research focuses on the science of data curation, enhancing the quality of such datasets along three data-centric directions.

First, many common data practices are adopted as-is, leaving their design decisions and implications underexplored. My research offers controlled studies that put such practices on a more rigorous footing. I have built a test bed to examine the interplay between mixing data sources and model robustness, finding that combining sources indiscriminately can dilute the robustness of the best individual source (NeurIPS 2022, Oral). My recent work shows that training predominantly on English data is actually less effective than training predominantly on data of non-English origin at improving performance on standard vision benchmarks, including English-centric tasks (e.g., ImageNet).

Second, synthetic data has seen increasing adoption, even becoming the de facto choice in some settings (e.g., instruction tuning), as its quality can surpass what human annotators produce. While my research has led to new data generation methods for vision-language models (VLMs) and large language models (LLMs), I advocate for using both human- and model-generated data, showing how the two complement each other in characteristics and in empirical performance. For instance, I have proposed re-captioning misaligned image-caption pairs to increase the amount of useful multimodal training data (see the sketch below), and shown that the diversity of synthetic captions still lags behind that of web-crawled text (NeurIPS 2023).

Finally, I have contributed to benchmarking efforts that enable controlled experimentation with dataset design and comparison of data filtering methods. Examples of these benchmarks include DataComp (NeurIPS 2023, Oral) and DataComp-LM. Taken together, good benchmarks and a rigorous understanding of what makes training data effective will enable the next generation of frontier models.
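To make the re-captioning idea above concrete, here is a minimal illustrative sketch, not the actual pipeline from the cited papers: it scores image-caption alignment with an off-the-shelf CLIP model and replaces low-scoring web captions with synthetic ones generated by a BLIP captioning model. The specific checkpoints and the 0.25 threshold are assumptions chosen for illustration.

    # Illustrative sketch of alignment-based re-captioning (assumed models/threshold).
    import torch
    from PIL import Image
    from transformers import (
        CLIPModel, CLIPProcessor,
        BlipForConditionalGeneration, BlipProcessor,
    )

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    blip = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base"
    )
    blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

    def clip_score(image: Image.Image, caption: str) -> float:
        """Cosine similarity between CLIP image and text embeddings."""
        inputs = clip_proc(
            text=[caption], images=image,
            return_tensors="pt", padding=True, truncation=True,
        )
        with torch.no_grad():
            out = clip(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        return float((img * txt).sum())

    def maybe_recaption(image: Image.Image, caption: str,
                        threshold: float = 0.25) -> str:
        """Keep the web caption if it aligns with the image;
        otherwise generate a synthetic caption with BLIP."""
        if clip_score(image, caption) >= threshold:
            return caption
        inputs = blip_proc(images=image, return_tensors="pt")
        with torch.no_grad():
            ids = blip.generate(**inputs, max_new_tokens=30)
        return blip_proc.decode(ids[0], skip_special_tokens=True)

In a real curation pipeline, scoring would be batched over the full dataset, and one might keep a mix of original and regenerated captions rather than replacing wholesale, in the spirit of the finding above that synthetic captions are less diverse than web-crawled text.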