Melanie Schmidt

Carnegie Mellon University

Position: PostDoc
Rising Stars year of participation: 2015
Bio

Melanie Schmidt obtained a master’s degree with distinction in computer science (with a minor in mathematics) from TU Dortmund University in 2009. In her undergraduate studies, she focused on network flow theory, a topic that lies at the intersection of theoretical computer science and discrete mathematics. During her PhD studies, her main focus became clustering algorithms, in particular for large data sets. In 2012, Melanie Schmidt received the Google Anita Borg Memorial Scholarship, which supports women who excel in technology. She graduated with distinction with her PhD thesis on ‘Coresets and streaming algorithms for the k-means problem and related clustering objectives’ in 2014. She was then awarded a merit scholarship by the German Academic Exchange Service (DAAD) to spend a year as a visiting PostDoc at Carnegie Mellon University in Pittsburgh, where she is visiting Anupam Gupta.

Algorithmic techniques for solving the k-means problem on big data sets

Algorithm theory consists of designing and analyzing methods to solve computational problems. The k-means problem is a computational problem from geometry. The input consists of points from d-dimensional Euclidean space, i.e., vectors. The goal is to group these points into k groups and to find a representative point for each group. Clustering is a major tool in machine learning: imagine that the vectors represent songs in a music collection or handwritten letters. The clustering can show which objects are similar, and the representatives can be used to classify newly arriving objects.

There are many clustering objectives, and the k-means objective is arguably the most popular among them. It is based on the Euclidean distance. The representative of a group is its centroid, i.e., the sum of the points in the group divided by their number. A grouping is evaluated by computing the squared Euclidean distance of every point to its representative and summing these values up. The k-means problem consists of finding a grouping into k groups that minimizes this cost function.
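To make the objective concrete, here is a minimal sketch in Python with NumPy that evaluates the k-means cost of a given grouping; the function and variable names are illustrative, not taken from any particular library.

```python
import numpy as np

def kmeans_cost(points, labels, k):
    """k-means objective: the sum of squared Euclidean distances
    from each point to the centroid of its assigned group."""
    cost = 0.0
    for j in range(k):
        group = points[labels == j]
        if len(group) == 0:
            continue  # an empty group contributes no cost
        centroid = group.mean(axis=0)  # sum of points / their number
        cost += ((group - centroid) ** 2).sum()
    return cost

# Example: 100 random points in 2 dimensions, arbitrarily split into k=3 groups.
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))
labels = rng.integers(0, 3, size=100)
print(kmeans_cost(points, labels, k=3))
```

Note that for a fixed group, the centroid is the best possible representative under squared Euclidean distance, so the hard part of the problem is choosing the grouping itself.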

The algorithmic challenges connected to the k-means problem are numerous. The problem is NP-hard, but it can be approximated up to a constant factor. What is the best possible approximation factor? Can we prove lower bounds? A different approach is to fix a parameter to lower the complexity. If the number of groups k is fixed, then the problem can be approximated to arbitrary precision. This assumption also allows us to approximately solve the problem with algorithms that read the input data only once and in a given order; such streaming algorithms are a main tool for dealing with big data. How small can we make the memory requirement of such a streaming algorithm, and will the algorithm be efficient in practice? We will see different answers to these questions.
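As an illustration of the single-pass model (not of the coreset constructions themselves), here is a minimal Python sketch that maintains the centroid of a point stream using memory proportional only to the dimension d, independent of the stream’s length; all names are illustrative.

```python
import numpy as np

def stream_centroid(point_stream, d):
    """One-pass illustration: maintain the centroid of all points seen
    so far using O(d) memory, no matter how long the stream is."""
    running_sum = np.zeros(d)
    n = 0
    for point in point_stream:  # each point is read exactly once, in order
        running_sum += np.asarray(point)
        n += 1
    return running_sum / n  # centroid = sum of points / their number
```

Streaming algorithms for k-means, such as the coreset constructions studied in the author’s thesis, operate in this same read-once model but additionally maintain a small weighted summary of the stream from which a near-optimal clustering can be extracted.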