Architecture-Specific Optimizations
For good performance, memory access patterns should exploit caching on the CPU and coalescing on the GPU:
Array[ i * cols + j ] // for CPU use row-major
Array[ j * rows + i ] // for GPU use col-major
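A Kokkos::View in Kokkos::HostSpace defaults to LayoutRight (row-major): the rightmost index is contiguous in memory, so a thread looping over it reads contiguous, cache-friendly data.
A Kokkos::View in Kokkos::CudaSpace defaults to LayoutLeft (column-major): the leftmost index is contiguous, so each thread's own accesses are strided while neighboring threads touch adjacent elements, which lets the GPU coalesce them.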
Set this at compile time or by hard coding the indexing via Kokkos::
Note: Hard coding the indexing makes code less performance portable
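The two index formulas above can be sketched in plain C++ as a pair of helper functions. This is only an illustration of the arithmetic behind row-major and column-major storage, not Kokkos code; the names row_major_index and col_major_index are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Row-major (LayoutRight-style) offset: consecutive j values are adjacent
// in memory, so one thread sweeping j reads a contiguous run.
// Hypothetical helper name, not part of Kokkos.
std::size_t row_major_index(std::size_t i, std::size_t j,
                            std::size_t rows, std::size_t cols) {
    assert(i < rows && j < cols);  // bounds check, active in debug builds
    (void)rows;                    // rows is only needed for the check
    return i * cols + j;
}

// Column-major (LayoutLeft-style) offset: consecutive i values are adjacent
// in memory, so neighboring threads that each own one i touch adjacent
// elements for the same j.
// Hypothetical helper name, not part of Kokkos.
std::size_t col_major_index(std::size_t i, std::size_t j,
                            std::size_t rows, std::size_t cols) {
    assert(i < rows && j < cols);  // bounds check, active in debug builds
    (void)cols;                    // cols is only needed for the check
    return j * rows + i;
}
```

For a 3 x 4 array, stepping j by one moves the row-major offset by one element, while stepping i by one moves the column-major offset by one element. Which of the two is "stride 1" is exactly what the layout choice controls.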
Architecture-specific optimizations can be enabled by specifying the architecture of the GPU
(e.g., for NVIDIA Tesla V100s, I pass -DKokkos_ARCH_VOLTA70=ON).
Similarly, for CPUs, run lscpu in a Linux terminal to find how many logical CPUs your machine has, and use this number in your code.
In the Screenshot at the bottom of this page, I'm working on a machine with:
- 40 physical CPUs
- 20 physical CPUs per socket
- Of these 20 CPUs per socket, each has 2 threads
Thus, there are 80 logical CPUs.
However, there are 2 NUMA regions, so best performance is achieved either by setting the thread count before runtime:
export OMP_NUM_THREADS=80
or by passing these arguments on the command line:
--kokkos-numa=2 --kokkos-threads=40
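The thread-count arithmetic above can be written out as a small C++ sketch. The CpuTopology struct and the function names are hypothetical, introduced only for this example. The socket count of 2 is inferred from 40 physical CPUs at 20 per socket, and the even split of logical CPUs across NUMA regions is an assumption.

```cpp
#include <cassert>
#include <thread>

// Hypothetical description of the numbers reported by lscpu.
struct CpuTopology {
    unsigned sockets;           // inferred: 40 physical CPUs / 20 per socket
    unsigned cores_per_socket;  // physical CPUs per socket
    unsigned threads_per_core;  // hardware threads per physical CPU
    unsigned numa_regions;      // NUMA regions on the machine
};

// Logical CPUs = sockets * cores per socket * threads per core.
unsigned logical_cpus(const CpuTopology& t) {
    return t.sockets * t.cores_per_socket * t.threads_per_core;
}

// Logical CPUs per NUMA region, assuming an even split across regions.
unsigned logical_cpus_per_numa_region(const CpuTopology& t) {
    assert(t.numa_regions > 0);
    return logical_cpus(t) / t.numa_regions;
}

// In-code counterpart to reading lscpu: the standard library's hint for the
// number of concurrent hardware threads. It may return 0 when unknown.
unsigned detected_logical_cpus() {
    return std::thread::hardware_concurrency();
}
```

With the machine described above, CpuTopology{2, 20, 2, 2} gives 80 logical CPUs in total and 40 per NUMA region.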