Xiang Ni
UIUC
xiangni2@illinois.edu
Bio
Xiang Ni is a final-year Ph.D. candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. She is interested in developing runtime system techniques for scalable parallel execution on supercomputers. Her primary research areas are performance optimization of irregular applications and application-oblivious fault tolerance. At Illinois, she is part of the Parallel Programming Laboratory, which develops Charm++ and its applications. She has worked closely with researchers from Disney Research on developing first-of-its-kind parallel software for cloth simulation. She has also completed two summer internships at Lawrence Livermore National Laboratory. Xiang received her master’s degree at Illinois in 2012 for her work on asynchronous protocols for low-overhead checkpointing. Prior to that, she received a bachelor’s degree in Computer Science from Beihang University in Beijing, China.
Mitigation of Failures in High Performance Computing via Runtime Techniques
Parallel computing is a powerful tool for solving large, complex problems in a timely manner. The most powerful supercomputer in the US today, named Titan, consists of 300,000 cores along with over 18,000 general-purpose GPUs. At its peak, Titan can perform over 17 quadrillion floating-point operations per second! While the number of components assembled to create a supercomputer keeps increasing beyond these values, the reliability and capacity of each individual component have not increased proportionally. As a result, today's machines fail frequently, which hampers the smooth execution of high performance applications. The slow growth in memory capacity has also thwarted efficient use of state-of-the-art methods for containing such failures. My research strives to develop runtime system techniques that can be deployed to make large-scale parallel executions robust and fail-safe. In particular, I have worked on answering the following questions: How can a runtime system provide fault tolerance support efficiently with minimal application intervention? What are effective ways to detect and correct silent data corruptions? Given limited memory resources, how do we enable the execution and checkpointing of data-intensive applications?