1. kokkos-fft Updates – Yuuichi Asahi, CEA (10 minutes)
kokkos-fft implements local interfaces between Kokkos and de facto standard FFT libraries, including fftw, cufft, hipfft (rocfft), and oneMKL. We are inclined to implement the numpy.fft-like interfaces adapted for Kokkos. A key concept is that "As easy as numpy, as fast as vendor libraries". In the talk, we will introduce the basic APIs and typical use cases. We will also present future development plans.
2. Fortran Porting Wish List for Kokkos – Matthew Norman, Oak Ridge National Laboratory (10 minutes)
This presentation covers the beginnings of the Yet Another Kernel Launcher (YAKL) C++ portability library, its evolution alongside Kokkos, the use of Kokkos in its current form, and remaining issues before it can be retired in lieu of using Kokkos instead. The primary outstanding issues are the inclusion of arbitrary lower bounds for Fortran-like View behavior and the ability to use an underlying pool allocator for Views for cheap frequent device allocation and deallocation so that Views can be locally created and destroyed only where needed rather than existing for the global lifetime of simulations. This may improve readability and reduce the memory high water mark in simulations. A few performance related issues will be covered as well, mainly limited to MDRangePolicy and parallel_for register usage.
3. Custom Layout and Tiling for Multi-Dimensional Data – Cedric Chevalier & Gabriel Dos Santos, CEA (10 minutes)
Performance optimizations for exascale HPC applications primarily rely on fine-tuning implementations, requiring comprehensive knowledge of heterogeneous hardware architectures that domain experts often lack. One of Kokkos' biggest successes is tying the memory layout of multi-dimensional arrays to the execution backend. It allows the exploitation of coalescence or cache, depending on the hardware. Here, we propose to go further and design custom tiled layouts that are generic for C++23's std::mdspan. Instead of running tile algorithms on flat data, like Kokkos' mdrange, we want to explore how running flat algorithms on tiled data performs. On CPU, the first experimental results with std::mdspan on a naive dense matrix multiplication demonstrate that, by replacing standard layouts with our proposed solution, we achieve an average speedup of over 2.2x, with peak performance improvements of up to 7.8x. Then, we will discuss how external indexing can improve efficiency. We will present how to exploit it with Kokkos' mdrange algorithm, and how it can behave on GPU.
4. Runtime Auto-Tuning for Kokkos Applications with APEX – Kevin Huck, University of Oregon (10 minutes)
Traditional GPU programming with libraries like CUDA or HIP requires tuning parameters exposed to the user, for example block sizes or number of teams. Kokkos also exposes portable parameters to the Kokkos user. How can Kokkos application programmers easily tune these Kokkos parameters for their application’s deployment when using any given Kokkos backend, without incurring large overheads? In particular, how do we ensure the tuning itself is portable across platforms? We propose using online, i.e., runtime, autotuning, utilizing the APEX Kokkos Tools connector to tune exposed parameters. Specifically, we discuss the Kokkos Tools Tuning Interface, tuning contexts, variable definition, the APEX runtime auto-tuning library utilizing Kokkos Tools, and distributed Kokkos auto-tuning. Applying our auto-tuning approaches to Kokkos sample kernels on Perlmutter and Frontier, we have obtained promising performance results. These results suggest Kokkos online auto-tuning is beneficial for production applications, and we invite Kokkos users to try these features and for Kokkos developers to contribute.
5. Unifying the HPC Ecosystem with std::execution – Mikael Simberg, Swiss National Supercomputing Centre (20 minutes)
Asynchronous programming models are becoming increasingly essential for fully leveraging modern hardware. In the C++ ecosystem, projects typically provide ad-hoc and varying interfaces, making interoperability difficult. Recently approved for C++26, the std::execution library promises to unify the ecosystem by providing a standard, composable interface for asynchronous operations. This talk briefly introduces the motivation and design principles of std::execution, and shares our experiences using it prior to standardization at CSCS in various projects, including Kokkos, HPX, and more. We'll discuss challenges, successes, and opportunities encountered while adopting std::execution.
6. PyKokkos: Performance Portability for Python Developers – Milos Gligoric, The University of Texas at Austin (20 minutes)
Kokkos is a programming model for writing performance portable applications for all major high performance computing platforms. It provides abstractions for data management and common parallel operations, allowing developers to write portable high performance code with minimal knowledge of architecture-specific details. Kokkos is implemented as a heavily-templated C++ library. However, C++ is not ideal for rapid prototyping and quick algorithmic exploration. An increasing number of developers use Python for scientific computing, machine learning, and data analytics. In this talk, I will present a new Python framework, PyKokkos, for writing performance portable applications entirely in Python. PyKokkos provides Kokkos-like abstractions that are easier to use and more concise than the C++ interface. We implemented PyKokkos by building a translator from a subset of Python to C++ Kokkos and bridging necessary function calls via automatically generated Python bindings. I will also cover our recent work on automatic kernel fusion with the goal to optimize PyKokkos applications. The talk will also cover our experience on developing PyKokkos, its current limitations, and future plans.