The Kold Cluster

Overview

The ExCL Kold cluster consists of four HP SL250 nodes used for demo/pilot studies within the Keeneland project.

Hardware

Each Kold node contains:

  • Two Intel Xeon E5-2670 (Sandy Bridge) processors running at 2.60 GHz
  • 32 GB host memory
  • Mellanox FDR InfiniBand NIC

Three Kold nodes also have:

  • Two NVIDIA M2090 GPUs, each with 6 GB of device memory
  • HyperThreading disabled

The remaining Kold node has:

  • Two NVIDIA K20m GPUs, each with 5 GB of device memory
  • HyperThreading enabled (though this currently causes problems with Torque integration)

Direct login to the Kold compute nodes is enabled, so you can reach them without going through the batch scheduler.

Software

Each node runs CentOS 7.  The ExCL shared NFS file systems (/home, /opt/shared) are mounted on all compute nodes.  The module command can be used to manage your environment and access pre-built packages from /opt/shared/sw.
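
For example (the specific module name below is only illustrative; run module avail to see what is actually installed):

<ivy>$ module avail          # list the packages provided under /opt/shared/sw
<ivy>$ module load cuda/7.5  # load a package into your environment (name is illustrative)
<ivy>$ module list           # show the modules currently loaded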

Each node contains installations of CUDA 7.5 and NVIDIA's OpenCL.
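
As a quick sanity check of the GPU toolchain, you can compile a small CUDA or OpenCL source file of your own (the file names here are placeholders, and the -arch values assume the GPU generations listed above):

<ivy>$ nvcc -arch=sm_20 -o saxpy saxpy.cu   # M2090 (Fermi) nodes; use -arch=sm_35 for the K20m node
<ivy>$ gcc -o ocl_test ocl_test.c -lOpenCL  # link against NVIDIA's OpenCL library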

In the ExCL /opt/shared file system, there is an OpenMPI installation that has been specially built with Torque integration, so users do not need to supply hostfiles to the MPI parallel job launcher.  We highly recommend using this MPI installation on the Kold cluster.  (We may also provide an MVAPICH installation at some point in the future.)  To use this MPI installation, load its module:

<ivy>$ module load openmpi/1.10.1-gnu-torque


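Because this OpenMPI build is Torque-aware, mpirun launched inside a batch job picks up its node list from the scheduler automatically.  A minimal batch script might look like the following (the job name, resource request, and file names are only examples; adjust them for your work):

#PBS -N mpi_test
#PBS -l nodes=2:ppn=16
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR
module load openmpi/1.10.1-gnu-torque
mpirun $PWD/myprogram   # no hostfile needed; the node list comes from Torque

Submit it with:

<ivy>$ qsub job.pbs
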
By default, parallel programs built and run with the recommended OpenMPI module use the InfiniBand interconnect for inter-node communication.  You can force OpenMPI to use IP-over-IB by adding the '--mca btl tcp,self' option to your mpirun command, or force it to use the Ethernet network by adding the '--mca btl tcp,self --mca btl_tcp_if_include 192.168.34.0/23' options.  For example:

<ivy>$ mpirun $PWD/myprogram # uses InfiniBand RDMA
<ivy>$ mpirun --mca btl tcp,self $PWD/myprogram # uses IP-over-IB, using the ib0 interface and the 10.0.0.0/24 network
<ivy>$ mpirun --mca btl tcp,self --mca btl_tcp_if_include 192.168.34.0/23 $PWD/myprogram # uses the Ethernet network
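
If you want to confirm which transport OpenMPI actually selected, you can raise the BTL verbosity (a standard OpenMPI MCA parameter; the exact output format varies between releases):

<ivy>$ mpirun --mca btl_base_verbose 30 $PWD/myprogram # reports which BTL components (openib, tcp, self) were chosen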

Caveats

Administration of Kold is on a best-effort basis.  If you need a computing resource with high availability, high reliability, and the highest performance, please look into getting an account on the NCCS resources.