Scalable Tool Design for Large-Scale Applications

Colloq: Speaker: 
Barton P. Miller
Colloq: Speaker Institution: 
Computer Sciences Department, University of Wisconsin
Colloq: Date and Time: 
Tue, 2007-09-18 14:00
Colloq: Location: 
ORNL, 5700-L202
Colloq: Host: 
Jeff Vetter
Colloq: Host Email:
Colloq: Abstract: 
I will discuss the problem of developing tools for large-scale parallel environments. We are especially interested in systems, both leadership-class parallel computers and clusters, that have tens of thousands or even millions of processors. The infrastructure that we have developed to address this problem is called MRNet, the Multicast/Reduction Network. MRNet's approach to scale is to structure control and data flow in a tree-based overlay network (TBON) that allows for efficient request distribution and flexible data reductions. The second part of this talk will present an overview of the MRNet design, architecture, and computational model and then discuss several applications of MRNet. These applications include scalable automated performance analysis in Paradyn; a vision clustering application; and, most recently, an effort to develop our first petascale tool, STAT, a scalable stack trace analyzer that currently runs on thousands of processors and will soon run on 100,000. I will conclude with a brief description of a new fault-tolerance design that leverages natural redundancies in the tree structure to provide recovery without checkpoints or message logging.
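
As a rough illustration of the tree-based multicast/reduction idea described in the abstract, the Python sketch below sends a request down a tree to every leaf and combines the replies level by level on the way back up. It is not MRNet's API: the names Node, build_tree, multicast_reduce, and depth are hypothetical, and the in-process recursion only stands in for what would really be network communication between processes.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Node:
    """One process in the overlay; leaves stand in for tool back-ends."""
    rank: int = -1  # leaf rank; -1 marks an internal (communication) node
    children: List["Node"] = field(default_factory=list)


def build_tree(num_leaves: int, fanout: int) -> Node:
    """Build a tree bottom-up so that no node has more than `fanout` children."""
    if num_leaves < 1 or fanout < 2:
        raise ValueError("need num_leaves >= 1 and fanout >= 2")
    level = [Node(rank=i) for i in range(num_leaves)]
    while len(level) > 1:
        # Group the current level into parents of at most `fanout` children.
        level = [Node(children=level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]


def multicast_reduce(node: Node,
                     request: Any,
                     leaf_fn: Callable[[int, Any], Any],
                     reduce_fn: Callable[[List[Any]], Any]) -> Any:
    """Forward `request` to every leaf, then reduce the replies at each level."""
    if not node.children:
        return leaf_fn(node.rank, request)
    partials = [multicast_reduce(child, request, leaf_fn, reduce_fn)
                for child in node.children]
    # Each internal node only ever combines its own children's partial results.
    return reduce_fn(partials)


def depth(node: Node) -> int:
    """Number of hops from this node down to the deepest leaf."""
    if not node.children:
        return 0
    return 1 + max(depth(child) for child in node.children)


if __name__ == "__main__":
    tree = build_tree(num_leaves=1000, fanout=10)
    total = multicast_reduce(tree, "count", lambda rank, req: 1, sum)
    print(total, depth(tree))
```

In this sketch the front-end at the root handles at most `fanout` replies regardless of how many leaves exist, whereas a flat organization would require it to handle one reply per leaf.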
Colloq: Speaker Bio: