The Dakar team intends SHOC to be useful on any platform with an
OpenCL implementation. However, the Dakar team develops and tests SHOC
primarily on Linux and Mac OS X systems. The following describes the supporting
packages and versions commonly used by SHOC developers.
This list describes the platforms to which the Dakar team has access for
development and testing.
SHOC may work on other Linux distributions with other OpenCL implementations
than those listed here.
Modifications may be needed for differing OpenCL header and library paths,
differing system library versions, and differing compiler versions/vendors.
Mac OS X 10.6 (Snow Leopard) or later.
Added new benchmark, Breadth-First Search, a common graph traversal. Implementations of two leading algorithms are included. Thanks to Aditya Sarwade, a 2011 summer intern at ORNL, for contributing BFS.
Added texture memory optimization to OpenCL SpMV
Minor compatibility fixes for AMD CPUs
Macros added to S3D to fix precision issues with FP constants on some platforms
Changed TP Stencil default MPI topology to be more flexible
Shoc-help@elist.ornl.gov is now open to public subscribers
Preview version of TP Sort using Thrust available here. This version will be included in the main distribution pending the release of CUDA 4.1.
Fixed build problem on systems that do not have OpenGL headers installed.
Fixed undefined reference build problem on systems that use recent gcc compilers.
Fixed bug in BusSpeedReadback benchmark that caused the benchmark to fail on some small-memory systems such as OS X laptops.
Added support for NVCXXFLAGS configure variable, so that different flags can be used when compiling C++ files and CUDA source files (e.g., compiler-specific options).
Modification to driver.pl script to test single device by default.
Documentation updates.
Numerous bug fixes on AMD platforms.
New scan algorithm in CUDA and OpenCL. This upgrade from the previous recursive approach improves performance, reduces the required number of kernels to 3, and is more flexible when handling large problem sizes.
New TP versions of Reduction and Scan
Major refactoring of the driver script including finer device control. Previously, excessive execution times when using large problem sizes were unavoidable because the driver would always include available CPU devices.
Fixed minor memory leak in DeviceMemory that was causing problems on smaller GPUs (like ION).
Changed MaxFlops kernel generation to account for a more aggressive AMD compiler in APP SDK 2.4. This compiler was optimizing away some operations, resulting in inaccurate FLOPS measurements.
The OpenCL version of MaxFlops uses vector types now, so the AMD compiler can generate vectorized (SSE) code on CPU devices.
Better use of memory for a few Level 0 tests. This results in better support for low-memory devices, and in some tests higher performance can be achieved when more memory is available.
Fixed Timer bug that resulted in an error for OpenCL Sort/Scan in some environments
Updated driver script to record driver and compiler version and ECC status (CUDA only)
Fixed bug in outlier computation
Fixed reporting of sentinel value (previously, null results were shown as FLT_MAX instead of an appropriate message)
Bugfixes and improvements to Stencil2D
Fix to build system when using CUDA and MPI
Fixes to OpenCL version of S3D, including DP and PCIe timing
Better consistency between CUDA and OpenCL versions of auto-generated kernels in L0
Addition of "-h" parameter to the driver script for specifying a hostfile
Added informative -help message for driver script
Major bugfixes to Sparse Matrix-Vector Multiply
Fixes to Texture/Image memory tests
Various improvements to driver script for parallel and serial execution
Detection of mixed device types and performance outliers in parallel
Several bugfixes to the build system
Bug fix to the stability test when running on devices with large memory
Added double precision support to MaxFlops benchmark
Improved bus contention measurement benchmark and its multi-threaded version
Numerous bug fixes to OpenCL and CUDA benchmarks
Addition of Spmv benchmark, a sparse matrix-vector-multiply test
Addition of double precision support to many SHOC benchmarks (though not yet all)
Addition of an experimental driver for testing performance on NUMA systems
Changed SHOC to use a GNU autoconf-generated script for configuration
Numerous bug fixes to OpenCL benchmarks including Scan, Sort, and Reduction
Addition of S3D benchmark, a combustion kernel
Flattened build system, including the ability to build without MPI
Fermi compatibility; simplified build process for devices of different compute capability
Made success more obvious in Stability test
Shared memory optimization for MD benchmark
Data sizes increased for BusSpeed benchmark
Fixed bug in GFLOPS calculation in PeakFlops
Several improvements to contention benchmark
DeviceBW memory access pattern changed to avoid caching effects on Fermi
Changed Scan and Sort to report results in GB/s
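
The release notes above mention that the rewritten Scan benchmark needs only three kernels instead of a recursive chain. The notes do not describe the kernels themselves, so the following is only a minimal sequential sketch of one common three-phase decomposition (per-block reduce, scan of the block sums, per-block scan with offsets). The function name `three_phase_scan`, the `num_blocks` parameter, and the choice of an exclusive scan are assumptions made for illustration and are not taken from the SHOC sources.

```python
def three_phase_scan(data, num_blocks=4):
    """Exclusive prefix sum computed in three passes.

    Hypothetical illustration only: each pass stands in for one
    GPU kernel, and each chunk stands in for one work-group/block.
    """
    n = len(data)
    if n == 0:
        return []

    # Split the input into at most num_blocks contiguous chunks.
    block = -(-n // num_blocks)  # ceiling division
    chunks = [data[i:i + block] for i in range(0, n, block)]

    # Pass 1 (reduce): one partial sum per chunk.
    block_sums = [sum(chunk) for chunk in chunks]

    # Pass 2 (scan of block sums): exclusive scan gives each
    # chunk the total of everything that precedes it.
    offsets = []
    running = 0
    for s in block_sums:
        offsets.append(running)
        running += s

    # Pass 3 (per-chunk scan): scan each chunk independently,
    # starting from that chunk's offset.
    out = []
    for chunk, offset in zip(chunks, offsets):
        running = offset
        for x in chunk:
            out.append(running)
            running += x
    return out


if __name__ == "__main__":
    print(three_phase_scan([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], num_blocks=3))
```

Because the number of passes is fixed regardless of input length, a decomposition like this avoids the repeated kernel launches of a recursive scan, which is consistent with the flexibility for large problem sizes that the notes describe.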