Idoia Ochoa

Stanford University

Position: PhD
Rising Stars year of participation: 2015
Bio

Idoia is currently in the 5th year of her PhD in the Electrical Engineering department at Stanford University, working with Prof. Tsachy Weissman. She also received her MSc from the same department in 2012. Prior to Stanford, she received a BS and an MSc from the Telecommunications Engineering (Electrical Engineering) department at the University of Navarra, Spain. During her time at Stanford, she completed internships at Google and Genapsys, and she served as a technical consultant for the HBO TV show “Silicon Valley”.

Her main interests are in compression, bioinformatics, information theory, coding and signal processing. Her research focuses mainly on helping the bio community handle the massive amounts of genomic data being generated, for example by designing new and more effective compression programs for genomic data. Part of her effort also goes into understanding the statistics of the noise present in the data, so that denoisers tailored to that noise can be designed, thus improving the subsequent analysis performed on the data.

Her research has been funded by a La Caixa fellowship, a Basque Government fellowship and a Stanford Graduate fellowship.

Genomic data compression and processing

One of the highest priorities of modern healthcare is to identify genomic changes that predispose individuals to debilitating diseases or make them more responsive to certain therapies. This effort is made possible by the generation of massive amounts of DNA sequencing data that must be stored, transmitted and analyzed. Because of the large size of the data, storage and transmission represent a huge burden, as current solutions are costly and demanding in both space and time. Finally, because of imperfections in the data and the lack of theoretical models that describe it, the analysis tools generally lack theoretical guarantees, and different approaches exist for the same task.

Thus it is important to develop new tools and algorithms to facilitate the transmission and storage of the data and to improve the inference performed on it. This is exactly the focus of my research. Part of it consists of developing compression schemes, which range from compression of single genomes to compression of the raw data output by the sequencing machines. Part of this data (the reliability of the output nucleotides) is normally compressed lossily, as it is inherently noisy and therefore difficult to compress. Moreover, it has been shown that lossy compression can potentially reduce the storage requirements while improving the inference performed on the data. Further understanding this effect is part of my ongoing research, together with characterizing the statistics of the noise, so that denoisers tailored to it can be designed. I have also worked on developing compression schemes for databases such that similarity queries can still be performed in the compressed domain. This is of special interest in large biological databases, where several applications require retrieving genomic sequences similar to a given one. Finally, I designed a tool for identifying disease driver genes associated with molecular processes in cancer patients.