
Overview of Kokkos

Tommy Gorham edited this page Aug 18, 2022 · 4 revisions

The Goal for Modern Heterogeneous CPU-GPU Computing is performance portability

Kokkos lets us write portable, single-source parallel implementations of scientific algorithms that aim for high performance across systems with different kinds of compute units.

Why?

As modern HPC systems, namely supercomputers, grow in scale and heterogeneity, it becomes harder to use their increasingly diverse compute resources efficiently. The growing number of heterogeneous system architectures (those that contain at least one accelerator, such as a GPU), together with the variety of GPU vendors (e.g., NVIDIA, AMD, Intel), has created new obstacles for large-scale scientific applications. These obstacles mainly concern developing and maintaining the applications, and their ability to exploit diverse architectures well enough to approach theoretical peak performance in a hardware-agnostic way.

How?

Kokkos is a C++ performance portability programming model developed at Sandia National Laboratories. By using it, we can address the increasing heterogeneity of modern systems: the same source code targets diverse CPU and GPU backends, with the aim of keeping performance close to that of a native implementation. Kokkos decides how parallel work indices and multidimensional array layouts map onto the target architecture, so that memory access patterns suit the hardware.
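To make the layout idea concrete, here is a minimal standalone sketch in standard C++ (no Kokkos required). The helper names `index_layout_right` and `index_layout_left` are hypothetical and are not part of the Kokkos API; they only mimic how a 2D index `(i, j)` maps to a flat memory offset under a row-major layout (which Kokkos calls `LayoutRight`, its default for host memory) and a column-major layout (`LayoutLeft`, its default for CUDA memory).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not a Kokkos function.
// Row-major ("LayoutRight"): the rightmost index j is contiguous in memory,
// which suits a CPU thread that walks along j for a fixed i.
std::size_t index_layout_right(std::size_t i, std::size_t j,
                               std::size_t rows, std::size_t cols) {
    assert(i < rows && j < cols);
    return i * cols + j;
}

// Hypothetical helper, not a Kokkos function.
// Column-major ("LayoutLeft"): the leftmost index i is contiguous in memory,
// which suits GPU threads where neighboring threads handle neighboring i.
std::size_t index_layout_left(std::size_t i, std::size_t j,
                              std::size_t rows, std::size_t cols) {
    assert(i < rows && j < cols);
    return i + j * rows;
}
```

With a Kokkos `View`, the same `a(i, j)` access compiles to whichever mapping matches the memory space, so the loop body does not need to change between CPU and GPU builds.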

No really, HOW?! (E.g., Going from OpenMP To Kokkos)

OpenMP parallel for

#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    /* loop body */
}

Kokkos parallel for

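// KOKKOS_LAMBDA expands to a by-value capture plus the host/device
// annotations needed for the lambda to run on GPU backends
Kokkos::parallel_for(N, KOKKOS_LAMBDA(const int i) {
    /* loop body */
});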
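The pattern behind this translation is that the loop body becomes a callable that receives the iteration index, and the library decides how to distribute the indices. The sketch below is a toy, serial stand-in written in standard C++; `toy::parallel_for` and `axpy` are hypothetical names used only for illustration, and this is not how Kokkos is implemented. It only shows why each iteration must be independent of the others.

```cpp
#include <cassert>
#include <vector>

namespace toy {
// Hypothetical serial stand-in for a parallel dispatch: it calls body(i)
// once for every i in [0, n). A real backend is free to run these calls
// in any order and on many threads.
template <class Functor>
void parallel_for(int n, const Functor& body) {
    for (int i = 0; i < n; ++i) {
        body(i);
    }
}
}  // namespace toy

// Computes out[i] = a * x[i] + y[i]; every iteration touches only index i.
std::vector<double> axpy(double a, const std::vector<double>& x,
                         const std::vector<double>& y) {
    assert(y.size() >= x.size());
    std::vector<double> out(x.size());
    // Capture raw pointers by value, loosely mirroring how a Kokkos View
    // is copied into the lambda as a lightweight handle to the data.
    const double* xp = x.data();
    const double* yp = y.data();
    double* op = out.data();
    toy::parallel_for(static_cast<int>(x.size()), [=](const int i) {
        op[i] = a * xp[i] + yp[i];
    });
    return out;
}
```

Because no iteration reads a value written by another, the calls can be reordered or run concurrently without changing the result, which is the property a real parallel backend relies on.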

OpenMP Compute Pi

#pragma omp parallel // begin parallel section
{
    #pragma omp for reduction(+:sum)
    for (int i = 0; i < N; ++i) {
        sum += 4.0 / (1.0 + ((i + 0.5) * step) * ((i + 0.5) * step));
    }
}
est = step * sum;

Kokkos Compute Pi

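// begin parallel section
// each thread accumulates into its own `update`; Kokkos combines them into `sum`
Kokkos::parallel_reduce("compute_pi", N, KOKKOS_LAMBDA(const int i, double& update) {
    update += 4.0 / (1.0 + ((i + 0.5) * step) * ((i + 0.5) * step));
}, sum);
est = step * sum;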
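Both snippets leave `N`, `step`, and `sum` undefined. The sketch below fills them in under one assumption: `step = 1.0 / N`, which makes the loop the midpoint rule for the integral of 4/(1+x^2) over [0, 1], whose exact value is pi. The function name `estimate_pi` is hypothetical. The code is plain C++ with the OpenMP pragma from above; without OpenMP enabled the pragma is ignored and the loop runs serially, producing the same estimate up to floating-point rounding.

```cpp
#include <cassert>

// Hypothetical helper: midpoint-rule estimate of pi with n sub-intervals.
// Assumes step = 1.0 / n, which the snippets above do not state explicitly.
double estimate_pi(int n) {
    assert(n > 0);
    const double step = 1.0 / n;
    double sum = 0.0;  // the reduction variable must start at zero
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        const double x = (i + 0.5) * step;  // midpoint of sub-interval i
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

The Kokkos version has the same structure: the per-iteration term goes into the lambda body, and the `reduction(+:sum)` clause is replaced by the `update` argument and the final `sum` parameter of `Kokkos::parallel_reduce`.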

Kokkos also has an incredible wiki page that enabled me to

Explicit build instructions for these programs can be found in my README.md.

Further details of each program are explained via comments within the source code.
