
Architecture-Specific Optimizations


1. Memory Access Patterns

For performance, memory access patterns must be cache-friendly on the CPU and coalesced on the GPU

From past experimentation, it's fastest to store and access 2D data as if it were 1D


Array[ i * cols + j ] // for CPU use row-major

Array[ j * rows + i ] // for GPU use col-major
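As a concrete illustration, here is a minimal C++ sketch of the two flattened indexing schemes; the function and variable names are mine and purely illustrative, not from this page.

#include <vector>

// Row-major flattening: the inner loop walks contiguous memory,
// which keeps CPU cache lines fully utilized.
void fill_row_major(std::vector<double>& a, int rows, int cols) {
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            a[i * cols + j] = i + j;
}

// Column-major flattening: consecutive values of i sit next to each other
// in memory, so consecutive GPU threads (one per i) would touch adjacent
// addresses and the loads/stores coalesce.
void fill_col_major(std::vector<double>& a, int rows, int cols) {
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i)
            a[j * rows + i] = i + j;
}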

Use Kokkos::Views to optimize memory access patterns automatically and portably

A Kokkos::View in Kokkos::HostSpace defaults to LayoutRight (row-major) -> contiguous accesses within a thread

A Kokkos::View in Kokkos::CudaSpace defaults to LayoutLeft (column-major) -> strided accesses within a thread, which coalesce across threads

Set this at compile time (the layout follows the chosen memory space), or hard code it via the Kokkos::LayoutRight / Kokkos::LayoutLeft template parameters


Note: Hard coding the indexing makes code less performance portable
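A minimal Kokkos sketch of both options follows; it assumes Kokkos is installed, and the view names and sizes are illustrative only.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int N = 1000;

        // Default layout: follows the View's memory space
        // (LayoutRight in HostSpace, LayoutLeft in CudaSpace).
        Kokkos::View<double**> a("a", N, N);

        // Hard-coded layout via the template parameter
        // (works, but is less performance portable, per the note above).
        Kokkos::View<double**, Kokkos::LayoutLeft> b("b", N, N);

        // Index through the View; it maps (i, j) to memory
        // according to whichever layout was chosen.
        Kokkos::parallel_for("init", N, KOKKOS_LAMBDA(const int i) {
            for (int j = 0; j < N; ++j) {
                a(i, j) = i + j;
                b(i, j) = i + j;
            }
        });
    }
    Kokkos::finalize();
    return 0;
}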

2. Use -DKokkos_ARCH_X at compile time

Architecture-specific optimizations can be enabled by specifying the architecture of the GPU

(e.g., for NVIDIA Tesla V100s, I pass -DKokkos_ARCH_VOLTA70=ON)
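For a CMake-based build, the configure step might look like the following sketch; the source path and the CUDA backend flag are my assumptions about the build setup, not something this page prescribes.

cmake /path/to/source \
    -DKokkos_ENABLE_CUDA=ON \
    -DKokkos_ARCH_VOLTA70=ON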

3. Use Max Threads Supported or Max Threads per NUMA Region

For instance, run lscpu in a Linux terminal to find the number of logical CPUs your machine has, and use that count in your code.

In the screenshot at the bottom of this page, I'm working on a machine with:

  • 40 physical CPUs, 20 per socket (i.e., 2 sockets)
  • 2 hardware threads per physical CPU, so 80 logical CPUs in total

However, there are 2 NUMA regions, so the best performance is achieved either by setting this environment variable before running the program

export OMP_NUM_THREADS=80

Or by passing these options to the executable on the command line

--kokkos-numa=2 --kokkos-threads=40
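Kokkos::initialize parses these --kokkos-* options from argv at program start, so a launch looks roughly like this (the executable name is hypothetical):

./my_app --kokkos-numa=2 --kokkos-threads=40

The concurrency Kokkos actually sees can then be checked from code; a minimal sketch, assuming a Kokkos version that still accepts these options:

#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    // initialize() consumes the --kokkos-* command-line options.
    Kokkos::initialize(argc, argv);
    printf("Host concurrency seen by Kokkos: %d\n",
           Kokkos::DefaultHostExecutionSpace().concurrency());
    Kokkos::finalize();
    return 0;
}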

[Screenshot: lscpu output on firefly]
