Getting Started

This guide provides a simple walkthrough on how to build and run the SHOC benchmark suite. For more comprehensive information, please consult the SHOC Manual.

Step 1: Unzip Source

The first thing to do is extract the source:

tar -xvzf shoc-1.1.0.tar.gz

Step 2: Configure Build Environment

Next, run configure so the build system knows which versions of the benchmarks to build. It determines whether to build the CUDA, OpenCL, and MPI versions (or any combination of them).

It works by searching for compilers and libraries in your path and in the compiler flags you specify, so be sure everything is set up before you run "./configure". Alternatively, if you're using Linux with OpenMPI or MPICH, you can use the scripts in the config directory, like so:

sh ./config/conf-linux-openmpi.sh

How does SHOC know where to find CUDA and OpenCL? It searches default locations (like /usr/local/cuda) as well as any locations indicated by the compiler flags (e.g., CPPFLAGS) passed to configure. Take a look at conf-keeneland.sh for an example using CUDA and conf-atlanta.sh for an example using the AMD APP SDK.
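
For instance, if CUDA is installed somewhere non-standard, you can point configure at it through the usual variables. A minimal sketch — the install path below is illustrative, not a SHOC default:

```shell
# Hypothetical example: CUDA installed under /opt/cuda-4.0 (adjust for your system).
./configure \
    CPPFLAGS="-I/opt/cuda-4.0/include" \
    LDFLAGS="-L/opt/cuda-4.0/lib64"
```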

For MPI, SHOC assumes you will have the appropriate compiler wrappers (mpicc, mpic++, etc.) in your path as well.
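
As a quick sanity check before configuring, you can verify that the wrappers are actually visible on your path. A small sketch (the check_tools helper is just for illustration, not part of SHOC):

```shell
# Print one status line per tool, so missing MPI wrappers are
# obvious before running configure.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: MISSING"
        fi
    done
}

check_tools mpicc mpic++ mpirun
```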

Step 3: Build Benchmarks

Assuming you have set things up with configure, the suite should build with:

make

If you have problems building the suite, email shoc-help@elist.ornl.gov.

Now, check to make sure your benchmarks built. Go to the bin directory.

cd bin
ls
EP Serial TP

These three folders contain the embarrassingly parallel, serial, and true parallel versions, respectively. Each directory has a subdirectory for OpenCL and CUDA, and those should contain the individual benchmark executables like so:

ls ./Serial/
CUDA OpenCL

ls ./Serial/CUDA/
BusSpeedDownload  FFT       Reduction  SGEMM  Stability
BusSpeedReadback  MaxFlops  S3D        Sort   Stencil2D
DeviceMemory      MD        Scan       Spmv   Triad
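
If you want to confirm that the build produced everything, a helper along these lines prints any expected executable that is absent (the function name is hypothetical; the benchmark list mirrors the CUDA listing above):

```shell
# missing_benchmarks DIR -- print expected benchmark binaries absent from DIR.
# Prints nothing when the build is complete.
missing_benchmarks() {
    dir="$1"
    for bench in BusSpeedDownload BusSpeedReadback DeviceMemory FFT MaxFlops \
                 MD Reduction S3D SGEMM Scan Sort Spmv Stability Stencil2D Triad; do
        [ -x "$dir/$bench" ] || echo "$bench"
    done
}

missing_benchmarks ./Serial/CUDA
```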

Step 4: Run the Benchmarks

We recommend using the Perl driver script, located in the tools directory.

cd tools

The driver script assumes you have "mpirun" and the MPI libraries in your path. Set those up with:

$ export PATH=$PATH:/path/to/mpi/bin/dir
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/mpi/lib/dir

To run the driver in serial mode, specify either -cuda or -opencl, depending on which version of the benchmarks you want to run. This will test each available device in the system. You also need to specify the problem size (1-4) with the -s flag. The size convention is as follows:

1 - CPUs / Debugging

2 - Mobile/Integrated GPUs

3 - Discrete GPUs (e.g. GeForce or Radeon series)

4 - HPC-Focused or Large Memory GPUs (e.g. Tesla or Firestream Series)

Example output from a machine with 2 Tesla GPUs, using only device 0:

[arsarwade@newark tools]$ perl driver.pl -cuda -s 4 -d 0
--- Welcome To The SHOC Benchmark Suite version 1.1.0 --- 
Hostname: newark.ftpn.ornl.gov 
Number of available devices: 2 
Device 0: 'Tesla C2050 / C2070'
Device 1: 'Tesla C2050 / C2070'
Specified 1 device IDs: 0
Using size class: 4

--- Starting Benchmarks ---
Running benchmark BusSpeedDownload
    result for bspeed_download:                  6.1412 GB/sec
Running benchmark BusSpeedReadback
    result for bspeed_readback:                  3.2225 GB/sec
Running benchmark MaxFlops
    result for maxspflops:                    1003.3600 GFLOPS
    result for maxdpflops:                     503.2990 GFLOPS
Running benchmark DeviceMemory
    result for gmem_readbw:                    128.0710 GB/s
    result for gmem_readbw_strided:             11.9183 GB/s
    result for gmem_writebw:                   125.6330 GB/s
    result for gmem_writebw_strided:             5.6854 GB/s
    result for lmem_readbw:                    359.7970 GB/s
    result for lmem_writebw:                   439.3610 GB/s
    result for tex_readbw:                      70.3648 GB/sec
Skipping non-cuda benchmark KernelCompile
Skipping non-cuda benchmark QueueDelay
Running benchmark FFT
    result for fft_sp:                         349.7680 GFLOPS
    result for fft_sp_pcie:                     31.4089 GFLOPS
    result for ifft_sp:                        345.7640 GFLOPS
    result for ifft_sp_pcie:                    31.3762 GFLOPS
    result for fft_dp:                         176.2720 GFLOPS
    result for fft_dp_pcie:                     15.6614 GFLOPS
    result for ifft_dp:                        176.7260 GFLOPS
    result for ifft_dp_pcie:                    15.6650 GFLOPS
Running benchmark SGEMM
    result for sgemm_n:                        601.4950 GFlops
    result for sgemm_t:                        597.9310 GFlops
    result for sgemm_n_pcie:                   506.7120 GFlops
    result for sgemm_t_pcie:                   504.1800 GFlops
    result for dgemm_n:                        297.3630 GFlops
    result for dgemm_t:                        297.4330 GFlops
    result for dgemm_n_pcie:                   213.3020 GFlops
    result for dgemm_t_pcie:                   213.3390 GFlops
Running benchmark MD
    result for md_sp_bw:                        35.5486 GB/s
    result for md_sp_bw_pcie:                   13.8752 GB/s
    result for md_dp_bw:                        41.2044 GB/s
    result for md_dp_bw_pcie:                   19.5637 GB/s
Running benchmark Reduction
    result for reduction:                      128.7280 GB/s
    result for reduction_pcie:                   5.8412 GB/s
    result for reduction_dp:                   123.9900 GB/s
    result for reduction_dp_pcie:                5.8315 GB/s
Running benchmark Scan
    result for scan:                            33.1997 GB/s
    result for scan_pcie:                        0.0032 GB/s
    result for scan_dp:                         27.6723 GB/s
    result for scan_dp_pcie:                     0.0032 GB/s
Running benchmark Sort
    result for sort:                             1.7412 GB/s
    result for sort_pcie:                        0.9543 GB/s
Running benchmark Spmv
    result for spmv_csr_scalar_sp:               0.9862 Gflop/s
    result for spmv_csr_scalar_sp_pcie:          0.5919 Gflop/s
    result for spmv_csr_scalar_dp:               0.9518 Gflop/s
    result for spmv_csr_scalar_dp_pcie:          0.4846 Gflop/s
    result for spmv_csr_scalar_pad_sp:           0.9496 Gflop/s
    result for spmv_csr_scalar_pad_sp_pcie:      0.5794 Gflop/s
    result for spmv_csr_scalar_pad_dp:           0.9644 Gflop/s
    result for spmv_csr_scalar_pad_dp_pcie:      0.4894 Gflop/s
    result for spmv_csr_vector_sp:               9.9256 Gflop/s
    result for spmv_csr_vector_sp_pcie:          1.2882 Gflop/s
    result for spmv_csr_vector_dp:               8.7708 Gflop/s
    result for spmv_csr_vector_dp_pcie:          0.8884 Gflop/s
    result for spmv_csr_vector_pad_sp:          10.5047 Gflop/s
    result for spmv_csr_vector_pad_sp_pcie:      1.3019 Gflop/s
    result for spmv_csr_vector_pad_dp:           9.2851 Gflop/s
    result for spmv_csr_vector_pad_dp_pcie:      0.8973 Gflop/s
    result for spmv_ellpackr_sp:                 8.0802 Gflop/s
    result for spmv_ellpackr_dp:                 6.1277 Gflop/s
Running benchmark Stencil2D
    result for stencil:                          3.4238 s
    result for stencil_dp:                       4.9096 s
Running benchmark Triad
    result for triad_bw:                         5.5425 GB/s
Running benchmark S3D
    result for s3d:                             52.7519 GFLOPS
    result for s3d_pcie:                        42.8583 GFLOPS
    result for s3d_dp:                          31.0848 GFLOPS
    result for s3d_dp_pcie:                     24.4512 GFLOPS

To run the driver in embarrassingly parallel mode, you'll have to specify the number of nodes (-n) and a comma-separated list of devices to test on each node (-d). Here's an example on the same machine with the large problem size (-s 4), using both devices on one node concurrently. In this example, the batch system takes care of the host list. If you need to specify a hostfile, use "-h my_host_file".
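
If you do need a hostfile, it is typically just a plain-text list of node names, one per line (the names below are made up for illustration):

```
node01
node02
```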

[arsarwade@newark tools]$ perl driver.pl -cuda -s 4 -n 1 -d 0,1
--- Welcome To The SHOC Benchmark Suite version 1.1.0 --- 
Hostname: newark.ftpn.ornl.gov 
Number of available devices: 2 
Device 0: 'Tesla C2050 / C2070'
Device 1: 'Tesla C2050 / C2070'
Specified 2 device IDs: 0,1
Using size class: 4

--- Starting Benchmarks ---
Running benchmark BusSpeedDownload
    result for bspeed_download:                  6.1443 GB/sec
Running benchmark BusSpeedReadback
    result for bspeed_readback:                  2.1006 GB/sec
Running benchmark MaxFlops
    result for maxspflops:                    1003.1600 GFLOPS
    result for maxdpflops:                     503.3100 GFLOPS
Running benchmark DeviceMemory
    result for gmem_readbw:                    128.2860 GB/s
Skipping non-cuda benchmark KernelCompile
Skipping non-cuda benchmark QueueDelay
Running benchmark FFT
    result for fft_sp:                         348.8790 GFLOPS
    result for fft_dp:                         176.4170 GFLOPS
Running benchmark SGEMM
    result for sgemm_n:                        601.5120 GFlops
    result for dgemm_n:                        297.3490 GFlops
Running benchmark MD
    result for md_sp_flops:                     46.3554 GFLOPS
    result for md_dp_flops:                     30.8063 GFLOPS
Running benchmark Reduction
    result for reduction:                      128.7820 GB/s
    result for reduction_dp:                   123.9870 GB/s
Running benchmark Scan
    result for scan:                            33.2026 GB/s
    result for scan_dp:                         27.6785 GB/s
Running benchmark Sort
    result for sort:                             1.7415 GB/s
Running benchmark Spmv
    result for spmv_csr_scalar_sp:               0.9869 Gflop/s
    result for spmv_csr_vector_sp:               9.9301 Gflop/s
    result for spmv_ellpackr_sp:                 8.1040 Gflop/s
    result for spmv_csr_scalar_dp:               0.9451 Gflop/s
    result for spmv_csr_vector_dp:               8.7735 Gflop/s
    result for spmv_ellpackr_dp:                 6.2761 Gflop/s
Running benchmark Stencil2D
Running benchmark Triad
    result for triad_bw:                         3.4557 GB/s
Running benchmark S3D
    result for s3d:                             52.8405 GFLOPS
    result for s3d_dp:                          31.0898 GFLOPS

That's it! If you have any problems, send an email to the help list (shoc-help@elist.ornl.gov).

 
shoc/gettingstarted.txt · Last modified: 2011/11/11 16:37 by kspafford