Reliability and Energy Analysis and Modeling for Extreme Scale Systems

Colloq: Speaker: 
Li Yu
Colloq: Speaker Institution: 
Illinois Institute of Technology
Colloq: Date and Time: 
Wed, 2015-05-13 10:00
Colloq: Location: 
Building 5100, Room 130
Colloq: Host: 
Jeff Vetter
Colloq: Host Email: 
vetter@ornl.gov
Colloq: Abstract: 
Reliability and energy have become two major concerns as we move towards exascale high-performance computing (HPC). Building systems with effective resilience mechanisms and high energy efficiency requires an in-depth understanding of system and component behaviors. For this reason, modern systems are deployed with various monitoring and logging facilities that track reliability and energy data during system operation. Although these data are regarded as valuable resources for understanding system behaviors, extracting meaningful knowledge from them, and in turn using it to inform system design, remains challenging, and the difficulty is escalating rapidly as systems grow in scale and complexity. The sheer volume and complexity of the data not only place a heavy burden on monitoring and logging facilities but also make it difficult to gain useful information from the data.
First, this talk will present a series of attempts made during my Ph.D. to address the above challenges. My work consists of three main components: data preprocessing, data analysis, and analytical/stochastic modeling. Data preprocessing refines the raw system data by filtering out less useful information. Data analysis extracts useful information from the refined data to formalize domain knowledge. Analytical/stochastic modeling then leverages the knowledge gained from data analysis to provide a comprehensive view of system behaviors and to guide system design.
Second, I will introduce a scalable, non-parametric method for effective anomaly detection in large-scale systems. The design is generic and applies to a variety of parallel and distributed systems that exhibit the peer-comparable property. It adopts a divide-and-conquer approach to address the scalability challenge and explores the use of non-parametric clustering and two-phase majority voting to improve detection flexibility and accuracy. We derive probabilistic models to quantitatively evaluate the decentralized design. Experiments on production systems demonstrate that this method outperforms existing methods in terms of detection accuracy and introduces only negligible runtime overhead.
Colloq: Speaker Bio: 
Li Yu has been working toward the Ph.D. degree in computer science at Illinois Institute of Technology since 2010. He received the BS degree from Sichuan University, China, in 2004 and the MS degree from Rochester Institute of Technology in 2009. His research interests include HPC data analytics and power-performance-reliability modeling in large-scale systems.