Alexandra Schofield
Cornell University
xanda@cs.cornell.edu
Bio
Alexandra Schofield is a PhD candidate in computer science at Cornell University, advised by Professor David Mimno. Her research focuses on natural language processing, specifically on understanding and improving latent variable models for the analysis of real-world datasets by researchers in the humanities and social sciences. This work ranges from experimental studies of corpus preprocessing, to front-end development of tools for interacting with topic models, to theoretical work on the limits of privacy for these models. She received a bachelor’s degree in computer science and mathematics from Harvey Mudd College and spent a year as a software engineer at Yelp, Inc. before beginning her doctoral program. She has been awarded the NDSEG and NSF GRFP fellowships, the Microsoft Graduate Women’s Scholarship, and the Anita Borg Memorial Scholarship.
Local Privacy for Text Data to Train Distributional Semantic Models
Distributional semantic models such as topic models and word embeddings use word frequency and co-occurrence information to provide new insights for researchers investigating phenomena in large bodies of text, such as how concepts and subjects of study arise or change over time. Sometimes, to protect individual privacy or copyright, text data for training these models cannot be distributed freely. Existing mechanisms for providing privacy guarantees aim to make documents statistically indistinguishable, often by adding random noise proportional to variance in the data. While these processes may protect the identity of a patient in a medical dataset, they not only add more noise than distributional semantic models can tolerate, but also give users too little information about which texts are used in the model and what topics arise in their documents. My work in this area focuses on anonymizing phrases and sentences instead of documents; for example, it allows a user to determine that a book uses the word “desert” more frequently than others but makes it statistically impossible to reconstruct any sentence from that book. My strategy also uses data compression designed for text models: one adds random noise to a compressed representation of a document’s statistics rather than to the full document data, in order to reduce the privacy budget cost.
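To make the noise-addition idea concrete, the sketch below shows the standard Laplace mechanism applied directly to a document's word-count vector. This is only an illustrative baseline, not the compressed-representation scheme the abstract describes; the function name, the sensitivity value, and the toy counts are all assumptions for the example.

```python
import numpy as np

def privatize_counts(counts: np.ndarray, epsilon: float,
                     sensitivity: float = 1.0, seed: int = 0) -> np.ndarray:
    """Add Laplace(0, sensitivity/epsilon) noise to each word count.

    Illustrative only: a real scheme would calibrate the sensitivity to the
    unit being protected (e.g. a phrase or sentence, as in the abstract)
    and, per the compression strategy, perturb a compressed representation
    of the counts rather than the raw counts themselves.
    """
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility here
    scale = sensitivity / epsilon      # smaller epsilon -> more noise
    return counts + rng.laplace(0.0, scale, size=counts.shape)

# Toy word counts over a 4-word vocabulary (hypothetical data).
counts = np.array([5.0, 2.0, 0.0, 1.0])
noisy = privatize_counts(counts, epsilon=1.0)
```

Because the noise scale grows as the privacy budget epsilon shrinks, perturbing a compressed statistic with fewer dimensions spends less of that budget than perturbing every raw count.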