Scalable Tool Design for Large-Scale Applications
September 18, 2007
02:00 PM
ORNL, 5700-L202
Host: Jeff Vetter (vetter@ornl.gov)
ABSTRACT:
I will discuss the problem of developing tools for large-scale parallel
environments. We are especially interested in systems, both
leadership-class parallel computers and clusters, that have tens of
thousands or even millions of processors. The infrastructure that we have
developed to address this problem is called MRNet, the
Multicast/Reduction Network. MRNet's approach to scale is to structure
control and data flow in a tree-based overlay network (TBON) that
allows for efficient request distribution and flexible data reductions.
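As a rough illustration of the tree-based overlay idea, the following sketch sends a request down a balanced tree and combines the replies on the way back up. It is a toy model under stated assumptions, not MRNet's API: the names Node, build_tree, and multicast_reduce are hypothetical, and the "processes" are plain in-memory objects rather than networked daemons.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Node:
    """One process in the overlay tree; leaves stand in for tool back-ends."""
    name: str
    children: List["Node"] = field(default_factory=list)


def build_tree(num_leaves: int, fanout: int) -> Node:
    """Build a tree bottom-up with at most `fanout` children per node (hypothetical helper)."""
    level = [Node(f"leaf{i}") for i in range(num_leaves)]
    depth = 0
    while len(level) > 1:
        depth += 1
        level = [
            Node(f"internal{depth}_{i // fanout}", level[i:i + fanout])
            for i in range(0, len(level), fanout)
        ]
    return level[0]


def multicast_reduce(
    node: Node,
    request: Any,
    handle: Callable[[str, Any], Any],
    reduce_fn: Callable[[List[Any]], Any],
) -> Any:
    """Distribute `request` to every leaf, then reduce the replies level by level."""
    if not node.children:
        # A leaf answers the request directly.
        return handle(node.name, request)
    # An internal node forwards the request and reduces only its own children's
    # replies, so no single node ever sees more than `fanout` inputs.
    partial = [multicast_reduce(c, request, handle, reduce_fn) for c in node.children]
    return reduce_fn(partial)


# Count the leaves that answered: each leaf replies 1 and every level sums.
root = build_tree(num_leaves=16, fanout=4)
total = multicast_reduce(root, "report", lambda name, req: 1, sum)
```

With 16 leaves and a fan-out of 4, the root combines only four partial results instead of sixteen raw replies, which is the property that lets this structure scale.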
The second part of this talk will present an overview of the MRNet
design, architecture, and computational model and then discuss several
of the applications of MRNet. The applications include scalable
automated performance analysis in Paradyn, a vision clustering
application and, most recently, an effort to develop our first
petascale tool, STAT, a scalable stack trace analyzer currently running
on thousands of processors and soon on 100,000.
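To give a feel for the kind of data reduction a stack trace analyzer can perform inside the tree, the sketch below merges per-process stack traces into groups of ranks that share the same call path. This is an assumption-laden illustration, not STAT's actual implementation: merge_traces and the sample traces are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A trace is the call path from the outermost frame to the innermost one.
Trace = Tuple[str, ...]


def merge_traces(groups: List[Dict[Trace, Set[int]]]) -> Dict[Trace, Set[int]]:
    """Merge several trace-to-ranks maps into one, unioning ranks per identical trace."""
    merged: Dict[Trace, Set[int]] = {}
    for group in groups:
        for trace, ranks in group.items():
            # A fresh set per trace keeps the inputs unmodified.
            merged.setdefault(trace, set()).update(ranks)
    return merged


# Hypothetical replies from two subtrees of the overlay.
left = {("main", "solve", "MPI_Barrier"): {0, 1}}
right = {
    ("main", "solve", "MPI_Barrier"): {2},
    ("main", "write_output"): {3},
}
summary = merge_traces([left, right])
```

Because the merged result grows with the number of distinct call paths rather than the number of processes, a reduction like this keeps the data reaching the front end small even when many processes report.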
I will conclude with a brief description of a new fault tolerance
design that leverages natural redundancies in the tree structure to
provide recovery without checkpoints or message logging.
# # #