Ramya Vinayak
California Institute of Technology
ramya@caltech.edu
Bio
Ramya Korlakai Vinayak is a PhD candidate in the Department of Electrical Engineering at Caltech, where she works with Prof. Babak Hassibi. Her research interests lie broadly at the intersection of optimization and machine learning. She received the Schlumberger Foundation Faculty for the Future fellowship for the academic years 2013-15. Prior to joining Caltech, Ramya obtained her undergraduate degree in Electrical Engineering from the Indian Institute of Technology Madras.
Convex Optimization Based Graph Clustering: Theoretical Guarantees and Practical Applications
Today we collect huge amounts of data with the aim of extracting useful and relevant information. Clustering, a widely used technique in this quest, refers to grouping data points that are similar to each other. In many problems, the observed data has a network or graph structure associated with it, as is the case in social networks, bioinformatics, data mining, and other fields. When clustering massive datasets, making pairwise comparisons or measurements between all data points is prohibitively expensive. A major challenge, therefore, has been to identify clusters from only partially observed graphs and to design algorithms with provable guarantees for this task.
In the case of unweighted graphs, we consider two algorithms based on the popular convex-optimization approach of decomposing the adjacency matrix into “low-rank plus sparse” components (Robust Principal Component Analysis). We provide sharp performance guarantees for successfully identifying clusters generated by the commonly used Stochastic Block Model, in terms of the sizes of the clusters, the density of edges inside the clusters, and the regularization parameter of the convex programs.
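The following is a minimal sketch of this kind of low-rank-plus-sparse decomposition on a partially observed adjacency matrix, written with cvxpy. It is a generic Robust-PCA-style formulation for illustration only; the exact constraints, the choice of the regularization weight `lam`, and the treatment of unobserved entries in the programs analyzed in the talk may differ.

```python
# A minimal sketch (not necessarily the talk's exact program) of the
# "low-rank plus sparse" decomposition of a graph adjacency matrix.
import numpy as np
import cvxpy as cp

def rpca_cluster(A, mask, lam):
    """A: n x n 0/1 adjacency matrix; mask: 0/1 matrix of observed pairs; lam: regularization weight."""
    n = A.shape[0]
    L = cp.Variable((n, n), symmetric=True)   # low-rank part: ideal block (cluster) structure
    S = cp.Variable((n, n), symmetric=True)   # sparse part: disagreements with the ideal structure
    objective = cp.Minimize(cp.normNuc(L) + lam * cp.sum(cp.abs(S)))
    constraints = [
        cp.multiply(mask, L + S) == mask * A,  # fit only the observed entries
        L >= 0, L <= 1,
    ]
    cp.Problem(objective, constraints).solve()
    return L.value
```

For a graph with planted clusters (for example, one drawn from the Stochastic Block Model), rounding the recovered L ideally exposes the block-diagonal cluster structure, and its rank indicates the number of clusters.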
For weighted graphs, where each weighted edge represents the similarity between its corresponding pair of points, we seek to recover a low-rank component of the adjacency matrix (also called the similarity matrix). We use a convex-optimization-based algorithm that requires no prior knowledge of the number of clusters and is robust in the presence of outliers.
Using a generative stochastic model for the similarity matrix, we obtain sharp recovery guarantees in terms of the sizes of the clusters, the strength of the similarity relative to the noise, the number of outliers, and the regularization parameter.
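As an illustration of the weighted setting, here is a toy end-to-end sketch in cvxpy: a similarity matrix is generated with stronger in-cluster similarities than the background noise (plus a few outlier points), and a low-rank cluster matrix is recovered via a generic nuclear-norm-plus-l1 convex program. The model parameters (`mu_in`, `mu_out`), the formulation, and the weight `lam` are illustrative assumptions, not necessarily those analyzed in the talk.

```python
import numpy as np
import cvxpy as cp

def toy_similarity(cluster_sizes, n_outliers, mu_in=0.7, mu_out=0.3, seed=0):
    """Illustrative similarity matrix: in-cluster entries have mean mu_in;
    all other entries (including rows/columns of outliers) have mean mu_out."""
    rng = np.random.default_rng(seed)
    n = sum(cluster_sizes) + n_outliers
    W = rng.uniform(0.0, 2 * mu_out, size=(n, n))             # noise / outlier similarities
    start = 0
    for size in cluster_sizes:
        W[start:start + size, start:start + size] = rng.uniform(
            2 * mu_in - 1.0, 1.0, size=(size, size))           # stronger in-cluster similarity
        start += size
    W = (W + W.T) / 2.0                                        # symmetrize
    np.fill_diagonal(W, 1.0)
    return W

def recover_low_rank(W, lam=0.2):
    """Recover a low-rank cluster matrix from W; the number of clusters is never supplied."""
    n = W.shape[0]
    L = cp.Variable((n, n), symmetric=True)
    prob = cp.Problem(cp.Minimize(cp.normNuc(L) + lam * cp.sum(cp.abs(W - L))),
                      [L >= 0, L <= 1])
    prob.solve()
    return L.value

W = toy_similarity([20, 15, 10], n_outliers=5)
L_hat = recover_low_rank(W)
print(np.linalg.matrix_rank(np.round(L_hat)))  # ideally equals the number of clusters (3)
```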
We corroborate our theoretical findings with simulated experiments. We also apply our algorithms to the problem of inference from crowdsourced data, using real datasets.