Future Technologies Colloquium Series


Alternatives to Eager Hardware Cache Coherence on Large-Scale CMPs


Marcelo Cintra
School of Informatics, The University of Edinburgh, Edinburgh, United Kingdom
March 22, 2007
10:00 AM

ORNL 5700-L202

Host: Sadaf Alam (alamsr@ornl.gov)


ABSTRACT:

The interconnect mechanisms (shared buses or crossbars) used in current chip multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to larger numbers of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, the tiled CMPs proposed so far are not well suited to exploiting thread-level shared-memory parallelism, and the mechanisms used in current small-scale CMPs to support cache coherence for these applications are unlikely to be practical in a tiled CMP environment. In this talk we present two alternatives to traditional eager hardware cache coherence schemes for tiled CMPs. The first is a cost-effective hardware mechanism that forgoes hardware-maintained cache coherence. It is based on three key ideas: lines are mapped to physical caches at the page level with OS support, only some controlled migration and replication of data is allowed, and hardware support is provided for remote cache access. This technique allows a sufficient degree of flexibility in the mapping and is more cost-effective than previous line-based techniques that rely on broadcasts, centralized tag stores, or large redundant tag stores. The second is a scheme that enforces coherence in the software implementation of synchronization primitives, using software-controlled invalidations and forced write-backs. This technique requires minimal hardware support.
Experimental results show that, for the SPLASH-2 benchmarks on 32 processors, the first scheme performs within 6.5% at best, and within 24% on average, of an ideal, unrealizable hardware-coherent architecture. Results for the second scheme show that its most conservative implementation, for a single level of write-back cache, suffers less performance degradation than expected on the SPLASH-2 scientific and ALP multimedia benchmarks (hardware coherence takes at least 90% of the time of the software version in many cases), and that adding a shared L2 cache significantly improves the worst-case performance relative to hardware coherence, making the software scheme a cheap and easy-to-implement alternative for certain applications.

BIO:

Marcelo Cintra has been an Assistant Professor of Computer Science at the University of Edinburgh since 2001. He received his Ph.D. degree in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign in 2001, and his M.Sc. and B.Sc. degrees in Computer Engineering from the University of Sao Paulo in 1996 and 1992, respectively. His research interests lie in the general areas of Computer Architecture, Parallel and High-Performance Computing, and Optimizing Compilers. He has published in major journals and conferences in these areas. He was a guest editor of the Transactions on High-Performance Embedded Architectures and Compilers and has served on the program committees of IPDPS'03, ICPP'04, and IPDPS'07. His research is currently supported primarily by EPSRC and the European Commission.

# # #