Using Hardware Counters to Automatically Improve Memory Performance
Submitted by rothpc on Tue, 2012-03-20 15:40
Colloq: Speaker Institution:
University of Maryland, College Park, MD
Colloq: Date and Time:
Fri, 2005-05-13 10:00
Jeffrey S. Vetter
Colloq: Host Email:
In large cache-coherent non-uniform memory access (cc-NUMA) multiprocessor servers, processors access their local memory faster than remote memory. Applications running on these servers may issue a significant number of non-local accesses, which can degrade performance. In this talk, I introduce a set of techniques that dynamically optimize memory access locality in scientific and Java server applications running on cc-NUMA servers. I first present a profile-driven online page migration scheme and its impact on the performance of scientific applications. The scheme uses lightweight, inexpensive plug-in hardware counters to profile an application's memory access behavior and migrates each page to the memory local to the processor that accesses it most frequently. I then present a set of techniques to measure and optimize the memory access locality of Java server applications using information gathered from hardware counters. These techniques work at the object level and combine several new NUMA-aware heap layouts with dynamic object migration during garbage collection to move objects into memory local to the processors that access them most.
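The page migration policy described in the abstract can be sketched roughly as follows. This is an illustrative Python model, not the speaker's implementation: the function name `plan_migrations`, the `min_samples` threshold, and the tie-breaking rule are assumptions made for the example. A real system would read samples from hardware counters and ask the operating system to move pages rather than return a dictionary.

```python
from collections import Counter, defaultdict


def plan_migrations(samples, home_node, min_samples=4):
    """Return {page: target_node} for pages whose top accessor is remote.

    samples:     iterable of (page, node) pairs, one per sampled access.
    home_node:   dict mapping page -> node whose memory currently holds it.
    min_samples: hypothetical threshold; pages with fewer samples are
                 left alone to avoid migrating on too little evidence.
    """
    # Count sampled accesses per page, broken down by accessing node.
    counts = defaultdict(Counter)
    for page, node in samples:
        counts[page][node] += 1

    plan = {}
    for page, per_node in counts.items():
        if sum(per_node.values()) < min_samples:
            continue
        current = home_node.get(page)
        # Pick the most frequent accessor; on a tie, prefer the current
        # home node so the page is not moved without a clear benefit.
        best = max(per_node, key=lambda n: (per_node[n], n == current))
        if best != current:
            plan[page] = best
    return plan


# Example: page 0x10 lives on node 0 but is mostly accessed from node 1.
samples = [(0x10, 1)] * 5 + [(0x10, 0)] * 2 + [(0x20, 0)] * 6 + [(0x30, 1)] * 2
home_node = {0x10: 0, 0x20: 0, 0x30: 0}
print(plan_migrations(samples, home_node))
```

In this example only page 0x10 is selected for migration: page 0x20 is already local to its main accessor, and page 0x30 has too few samples under the assumed threshold.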
Colloq: Speaker Bio:
Mustafa M. Tikir is a PhD candidate at the University of Maryland, College Park. He received his BS degree from the Middle East Technical University, Ankara, and his MS degree from the University of Maryland, College Park. His research interests are in high performance computing, programming languages, and operating systems, with a primary focus on automatic performance tuning of high performance computing applications. His research has developed several profile-driven techniques that dynamically increase the locality of memory accesses in memory-intensive applications running on multiprocessor systems with non-uniform memory access latencies. These techniques rely on online profiles gathered via hardware counters.