
Architecture-Specific Optimizations


1. Memory Access Patterns

For performance, memory access patterns must be cache-friendly on the CPU and coalesced on the GPU

From past experimentation, it's fastest to store and access 2D data as if it were 1D


Array[ i * cols + j ] // for CPU use row-major

Array[ j * rows + i ] // for GPU use col-major
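As a concrete illustration, here is a minimal C++ sketch of the two flattened indexing schemes; the function and variable names are mine and purely illustrative, not from this page.

#include <vector>

// Row-major flattening: the inner loop walks contiguous memory,
// which keeps CPU cache lines fully utilized.
void fill_row_major(std::vector<double>& a, int rows, int cols) {
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            a[i * cols + j] = i + j;
}

// Column-major flattening: consecutive values of i sit next to each other
// in memory, so consecutive GPU threads (one per i) would touch adjacent
// addresses and the loads/stores coalesce.
void fill_col_major(std::vector<double>& a, int rows, int cols) {
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i)
            a[j * rows + i] = i + j;
}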

Use Kokkos::Views to optimize memory access patterns automatically and portably

A Kokkos::View in Kokkos::HostSpace defaults to LayoutRight (row-major) -> contiguous accesses within a thread

A Kokkos::View in Kokkos::CudaSpace defaults to LayoutLeft (column-major) -> strided accesses within a thread, which coalesce across threads

Set this at compile time (the layout follows the chosen memory space), or hard code it via the Kokkos::LayoutRight / Kokkos::LayoutLeft template parameters


Note: Hard coding the indexing makes code less performance portable
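A minimal Kokkos sketch of both options follows; it assumes Kokkos is installed, and the view names and sizes are illustrative only.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int N = 1000;

        // Default layout: follows the View's memory space
        // (LayoutRight in HostSpace, LayoutLeft in CudaSpace).
        Kokkos::View<double**> a("a", N, N);

        // Hard-coded layout via the template parameter
        // (works, but is less performance portable, per the note above).
        Kokkos::View<double**, Kokkos::LayoutLeft> b("b", N, N);

        // Index through the View; it maps (i, j) to memory
        // according to whichever layout was chosen.
        Kokkos::parallel_for("init", N, KOKKOS_LAMBDA(const int i) {
            for (int j = 0; j < N; ++j) {
                a(i, j) = i + j;
                b(i, j) = i + j;
            }
        });
    }
    Kokkos::finalize();
    return 0;
}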

2. Use -DKokkos_ARCH_X at compile time

Architecture-specific optimizations can be enabled by specifying the architecture of the GPU

(e.g., for NVIDIA Tesla V100s, I pass -DKokkos_ARCH_VOLTA70=ON)
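For a CMake-based build, the configure step might look like the following sketch; the source path and the CUDA backend flag are my assumptions about the build setup, not something this page prescribes.

cmake /path/to/source \
    -DKokkos_ENABLE_CUDA=ON \
    -DKokkos_ARCH_VOLTA70=ON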

3. Use Max Threads Supported or Max Threads per NUMA Region

For instance, run lscpu in a Linux terminal to find the number of logical CPUs your machine has, and use that count in your code.

In the screenshot at the bottom of this page, I'm working on a machine with:

  • 40 physical CPUs, 20 per socket (i.e., 2 sockets)
  • 2 hardware threads per physical CPU, so 80 logical CPUs in total

However, there are 2 NUMA regions, so the best performance is achieved either by setting this environment variable before running the program

export OMP_NUM_THREADS=80

Or by passing these options to the executable on the command line

--kokkos-numa=2 --kokkos-threads=40
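Kokkos::initialize parses these --kokkos-* options from argv at program start, so a launch looks roughly like this (the executable name is hypothetical):

./my_app --kokkos-numa=2 --kokkos-threads=40

The concurrency Kokkos actually sees can then be checked from code; a minimal sketch, assuming a Kokkos version that still accepts these options:

#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    // initialize() consumes the --kokkos-* command-line options.
    Kokkos::initialize(argc, argv);
    printf("Host concurrency seen by Kokkos: %d\n",
           Kokkos::DefaultHostExecutionSpace().concurrency());
    Kokkos::finalize();
    return 0;
}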

[Screenshot: lscpu output on firefly]
