Modeling Time to Solution in High Performance Computing

Colloq: Speaker: 
Michael McCracken
Colloq: Speaker Institution: 
UCSD Department of Computer Science and Engineering
Colloq: Date and Time: 
Fri, 2008-04-11 10:00
Colloq: Location: 
5700, L204
Colloq: Host: 
Jeff Vetter
Colloq: Host Email:
Colloq: Abstract: 
The major metric of value to computational scientists is time to solution. Understanding time to solution at petascale and beyond requires a variety of research techniques, including user and application studies, performance measurement and modeling, and real experiments at scale. The ultimate aim is to build tools that help computational scientists make better choices about where and how to run their experiments.

In this talk I first present results from studies of High Performance Computing (HPC) user behavior that have implications for the design of performance tools. I will also discuss my experience evaluating the Weather Research and Forecasting (WRF) Model, a large-scale parallel numerical weather prediction system, for petascale experiments on the IBM Blue Gene architecture, and lessons learned from working at that scale.

Finally, I discuss my work on tools that address predicting time to solution. I describe the design and implementation of a language for quickly describing the workflow of HPC experiments, which we then use to simulate their execution on a variety of resources. To evaluate this simulation method, we generated a set of synthetic kernel workflows that reflect common workflow patterns. We then executed them on several TeraGrid systems and compared their actual run times against our predictions, which used system characteristics gathered from a set of probes previously run on those systems. In these cases, prediction was accurate enough to identify opportunities to reduce time to solution on the synthetic experiments by up to 30%, and by 18% on average.
Colloq: Speaker Bio: