Baharan Mirzasoleiman
Stanford University
baharanm@stanford.edu
Bio
Baharan Mirzasoleiman is a postdoctoral researcher at Stanford University, working with Jure Leskovec. She is interested in designing and building large-scale machine learning and data mining algorithms. She received her PhD from ETH Zurich in 2017, where she worked on large-scale summarization using submodular maximization, advised by Andreas Krause. She received the ETH medal for an outstanding doctoral dissertation (2017), an SNSF Early Postdoctoral Mobility Fellowship (2016), and the Google Anita Borg Memorial Scholarship (2014). She held research internships at Google Research MTV (2016) and Google Zurich (2017). She received her bachelor's degree from the University of Tehran and her master's degree from Sharif University of Technology, where she worked on influence and information propagation through social networks.
Big Data Summarization Using Submodular Functions
Data summarization, a central challenge in machine learning, is the task of finding a representative subset of manageable size out of a large dataset. It has found numerous applications, including image summarization, document and corpus summarization, recommender systems, and non-parametric learning. A general recipe for obtaining a faithful summary is to turn the problem into selecting a subset of data elements that optimizes a utility function quantifying the "representativeness" of the selected set. Oftentimes, the utility functions used for summarization exhibit submodularity, a natural diminishing-returns property: the added value of any element from the dataset decreases as more data points are included in the summary. Thus, the data summarization problem can be naturally reduced to constrained submodular maximization. Although there are efficient centralized algorithms for the aforementioned problems, they are highly impractical for massive datasets, as sequentially selecting elements on a single machine is heavily constrained in terms of speed and memory. Hence, in order to solve these submodular optimization problems at scale, we need to make use of MapReduce-style parallel computation models or resort to streaming algorithms. We briefly overview recent methods and theoretical analyses for submodular maximization in distributed and streaming settings, and show the effectiveness of these techniques on several real-world applications, including summarizing 80 million Tiny Images and summarizing 45 million user visits from the Featured Tab of the Today Module on the Yahoo! Front Page.
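To make the recipe above concrete, here is a minimal sketch (not the speaker's actual implementation) of greedy submodular maximization under a cardinality constraint, using a simple coverage function as the monotone submodular utility, plus a two-round, MapReduce-style variant in the spirit of the distributed methods the abstract alludes to. The dataset, function names, and partitioning scheme are illustrative assumptions.

```python
def coverage_utility(selected, dataset):
    """f(S): number of ground items covered by the union of the
    selected sets. Coverage is monotone and submodular, so each
    element's marginal gain shrinks as the summary grows."""
    covered = set()
    for i in selected:
        covered |= dataset[i]
    return len(covered)


def greedy(dataset, candidates, k):
    """Classical greedy selection: repeatedly add the candidate with
    the largest marginal gain. For monotone submodular utilities this
    achieves at least a (1 - 1/e) fraction of the optimal value."""
    summary = []
    for _ in range(min(k, len(candidates))):
        current = coverage_utility(summary, dataset)
        best, best_gain = None, -1.0
        for i in candidates:
            if i in summary:
                continue
            gain = coverage_utility(summary + [i], dataset) - current
            if gain > best_gain:
                best, best_gain = i, gain
        summary.append(best)
    return summary


def distributed_greedy(dataset, k, num_machines):
    """Two-round, MapReduce-style sketch (an assumed scheme): each
    'machine' greedily summarizes its own partition, then a second
    greedy pass selects the final summary from the pooled candidates."""
    pooled = []
    for m in range(num_machines):
        partition = list(range(m, len(dataset), num_machines))
        pooled.extend(greedy(dataset, partition, k))
    return greedy(dataset, pooled, k)


# Toy example: each element "covers" a small set of ground items.
data = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
print(greedy(data, list(range(len(data))), 2))   # centralized summary
print(distributed_greedy(data, 2, 2))            # two-round summary
```

On this toy input both variants pick elements 0 and 2, which together cover all six ground items; in general the distributed pass trades a modest loss in the approximation guarantee for the ability to run each round in parallel over partitions that fit in a single machine's memory.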