ABSTRACT:
The
interconnect mechanisms (shared buses or crossbars) used in current
chip-multiprocessors (CMPs) are expected to become a bottleneck that
prevents these architectures from scaling to a larger number of cores.
Tiled CMPs offer better scalability by integrating relatively simple
cores with a lightweight point-to-point interconnect. However, the
tiled CMPs proposed so far are not well suited to exploiting
thread-level shared-memory parallelism, and the mechanisms used in
current small-scale CMPs to support cache coherence for these
applications are unlikely to be practical in a tiled CMP environment.
In this talk we present two alternatives to traditional eager hardware
cache coherence schemes for tiled CMPs. The first is a cost-effective
hardware mechanism that forgoes hardware-maintained cache coherence.
The proposed mechanism is based on three key ideas: the mapping of
lines to physical caches is done at the page level with OS support,
only some controlled migration and replication of data is allowed, and
the hardware supports remote cache access. This technique
allows a sufficient degree of flexibility in the mapping and is more
cost-effective than previous line-based techniques that rely on
broadcasts, centralized tag stores, or large redundant tag stores. The
second is a scheme to enforce coherence in the software implementation
of synchronization primitives, using software-controlled invalidations
and forced write-backs. This technique requires minimal hardware
support.
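
The page-level mapping behind the first mechanism can be illustrated
with a minimal sketch. It is not the implementation presented in the
talk: the class name PageMapper, the 4 KB page size, and the
first-touch placement policy are all assumptions made for
illustration. The point it shows is that when every page has exactly
one home cache, a line never exists in two caches at once, so no
hardware coherence protocol is needed; accesses from other tiles go
to the home tile as remote cache accesses.

```python
PAGE_SIZE = 4096   # bytes per page (assumed, not from the abstract)
NUM_TILES = 32     # matches the 32-processor configuration in the results


class PageMapper:
    """Hypothetical OS table that assigns each physical page one home tile."""

    def __init__(self, num_tiles=NUM_TILES, page_size=PAGE_SIZE):
        self.num_tiles = num_tiles
        self.page_size = page_size
        self.home = {}  # page number -> home tile id

    def home_tile(self, addr, requesting_tile):
        if not 0 <= requesting_tile < self.num_tiles:
            raise ValueError("unknown tile")
        page = addr // self.page_size
        # First-touch placement (an assumption): the first tile to
        # touch a page becomes the home of all lines in that page.
        return self.home.setdefault(page, requesting_tile)

    def access(self, addr, tile):
        home = self.home_tile(addr, tile)
        # A line lives only in its home cache, so other tiles must
        # use a remote cache access instead of keeping a local copy.
        return ("local", home) if home == tile else ("remote", home)

    def migrate(self, addr, new_tile):
        # Controlled migration: the OS re-homes the whole page.
        # Flushing the old home cache is not modeled here.
        self.home[addr // self.page_size] = new_tile


mapper = PageMapper()
first = mapper.access(0x1000, tile=3)    # tile 3 touches the page first
second = mapper.access(0x1040, tile=7)   # same page, different tile
third = mapper.access(0x2000, tile=7)    # a different page
```

In this sketch tile 7 reaches the first page through a remote access
until the OS migrates the page, after which its accesses are local.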
Experimental results show that the first scheme performs within 6.5%
at best, and within 24% on average, of an ideal, unrealizable,
hardware-coherent architecture for the SPLASH-2 benchmarks on 32
processors. Results for
the second scheme show that its most conservative implementation, for
a single level of write-back cache, degrades performance less than
expected for the SPLASH-2 scientific and ALP multimedia benchmarks
(hardware coherence takes at least 90% of the time of the software
version in many cases), and that adding a shared L2 cache significantly
improves the worst-case performance relative to hardware coherence,
making the software scheme a cheap and easy-to-implement alternative
for certain applications.
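
The second scheme, coherence enforced inside synchronization
primitives, can be illustrated with a toy model. It is a sketch under
stated assumptions rather than the scheme evaluated in the talk: the
Cache class and the acquire and release helpers are hypothetical, the
lock itself is not modeled, and whole-cache invalidation on acquire is
only one guess at what a conservative implementation might do.

```python
class Cache:
    """Toy private write-back cache with no hardware coherence."""

    def __init__(self, memory):
        self.memory = memory   # shared backing store (a dict)
        self.lines = {}        # addr -> locally cached value
        self.dirty = set()     # addrs written but not yet written back

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def write_back_all(self):
        # Forced write-back: push every dirty line to shared memory.
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()

    def invalidate_all(self):
        # Software-controlled invalidation of the whole cache; dirty
        # lines are written back first so no local update is lost.
        self.write_back_all()
        self.lines.clear()


def acquire(cache):
    # Assumed policy: on lock acquire, drop possibly stale copies.
    cache.invalidate_all()


def release(cache):
    # Assumed policy: on lock release, make local writes visible.
    cache.write_back_all()


memory = {}
c0, c1 = Cache(memory), Cache(memory)
before = c1.read("x")      # c1 caches the initial value of x
acquire(c0)
c0.write("x", 42)
release(c0)                # x = 42 reaches shared memory
stale = c1.read("x")       # no acquire yet: c1 still sees its old copy
acquire(c1)
fresh = c1.read("x")       # after acquire, c1 refetches x
```

Only the acquire and release paths touch coherence state here, which
is why a scheme of this kind needs so little hardware support.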
BIO:
Marcelo Cintra has been an Assistant Professor of Computer Science at
the University of Edinburgh since 2001. He received the Ph.D. degree in
Electrical and Computer Engineering from the University of Illinois at
Urbana-Champaign in 2001, and M.Sc. and B.Sc. degrees in Computer
Engineering from the University of Sao Paulo in 1996 and 1992,
respectively.
His research interests lie in the general areas of Computer
Architecture, Parallel and High-Performance Computing, and Optimizing
Compilers. He has published in major journals and conferences in these
areas. He was a guest editor of the Transactions on High-Performance
Embedded Architectures and Compilers and has served on the program
committees of IPDPS'03, ICPP'04, and IPDPS'07. His research is currently
supported primarily by EPSRC and the European Commission.
# # #