Characterizing and Improving Power and Performance in HPC Networks

Colloq: Speaker: 
Taylor Groves
Colloq: Speaker Institution: 
University of New Mexico
Colloq: Date and Time: 
Thu, 2017-01-12 10:00
Colloq: Location: 
Building 5700, Room L204
Colloq: Host: 
Jeff Vetter
Colloq: Host Email: 
vetter@ornl.gov
Colloq: Abstract: 
Networks are the backbone of modern HPC systems. They serve as a critical piece of infrastructure, tying together applications, analytics, storage and visualization. Despite this importance, we have not fully explored how evolving communication paradigms and network design will impact scientific workloads. As networks expand in the race towards Exascale (1e18 floating-point operations per second), we need to re-examine this relationship so that the HPC community better understands (1) characteristics and trends in HPC communication; (2) how best to design HPC networks to save power or enhance performance; and (3) how to facilitate scalable, informed, and dynamic decisions within the network. My thesis is that one can improve application performance and system power usage by gaining a detailed understanding of HPC communication on both the network endpoints and fabric; specifically, I address the problem of network-induced memory contention, quantify the power/performance trade-offs for dragonfly topologies in HPC networks, and increase the scalability/responsiveness of large-scale network monitoring. This dissertation highlights opportunities for improving network performance and power efficiency, while uncovering pitfalls and mitigation strategies brought about by shifting trends in HPC communication and fabric design. I begin by examining the communication characteristics of the network endpoints. We show how next-generation (one-sided) communication techniques can lead to contention in the memory subsystem (with 3X increases in runtime) and how this can be avoided. Then, we move on to a macro-level study of the network fabric, where we demonstrate the trade-offs between power and performance when designing HPC network topologies. Lastly, in order to facilitate dynamic and responsive solutions, we provide new methods for scalable network monitoring and improved models of data aggregation.
Colloq: Speaker Bio: 
Taylor Groves recently defended his dissertation and is scheduled to receive his Ph.D. in Computer Science from the University of New Mexico. His research interests are high-performance and distributed computing, with a focus on improving the efficiency of networks and communication. Taylor was advised by Dr. Dorian Arnold of the Scalable Systems Laboratory. During his time at the University of New Mexico, he collaborated with multiple laboratory and industry partners including Lawrence Livermore National Laboratory, Sandia National Laboratories and Yahoo!. As a year-round intern at Sandia, Taylor worked under Ron Brightwell and Ryan Grant as part of the Center for Computing Research. In his ongoing research, Taylor is working to develop new approaches for modeling and improving the performance of dragonfly networks. In future research, he is interested in leveraging his understanding of networks to develop efficient end-to-end workflows and improve HPC communication. You can reach Taylor by email at tgroves@{sandia.gov, unm.edu} or visit his website at www.taylorgroves.com.