# Finish the `std::offload` module
| Metadata               |                                                |
|------------------------|------------------------------------------------|
| Point of contact       | Manuel Drehwald                                |
| Status                 | Proposed                                       |
| Tracking issue         | rust-lang/rust-project-goals#109               |
| Other tracking issues  | rust-lang/rust#124509                          |
| Zulip channel          | #wg-autodiff                                   |
| Compiler champion      | Manuel Drehwald                                |
| Lang champion          | TC                                             |
| Teams                  | compiler, lang                                 |
| Task owners            | Manuel Drehwald, LLVM offload/GPU contributors |
## Summary
This project aims to finish the `std::offload` module. We can already move data to and from a GPU automatically, but we cannot launch kernels yet. We will add this ability during the next project goal period.
## Motivation
Scientific computing, high performance computing (HPC), and machine learning (ML) share an interesting challenge: each, to a different degree, depends on highly efficient library and algorithm implementations, yet those libraries and algorithms are not always written or used by people with deep computer science experience. Rust is in a unique position here because ownership, lifetimes, and the strong type system can prevent many bugs. At the same time, Rust's strong aliasing information enables compelling performance optimizations in these fields, with gains well beyond those usually seen when comparing C++ with Rust, because automatic differentiation and GPU offloading benefit heavily from aliasing information.
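To make the aliasing argument concrete, consider a minimal AXPY-style loop (the function below is our illustration, not taken from the project):

```rust
// Because `x` is a shared reference and `y` is a mutable reference, Rust
// guarantees the two slices do not alias. The compiler can therefore emit
// LLVM `noalias` metadata and vectorize the loop without the runtime alias
// checks an equivalent C++ function would typically need.
fn axpy(alpha: f32, x: &[f32], y: &mut [f32]) {
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi += alpha * *xi;
    }
}
```

Enzyme-generated derivative code and GPU lowering can rely on the same no-alias guarantees, which is where the gains beyond ordinary C++-vs-Rust comparisons come from.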
## Status quo
The `std::autodiff` module is fully upstreamed, but not yet shipped on nightly due to open CI questions.
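For readers unfamiliar with the feature, the following sketch shows the rough shape of the `std::autodiff` surface as proposed in RFC 3453 and the ongoing upstreaming work; the feature gate, macro names, activity annotations, and generated signatures are all unstable and may differ on any given nightly:

```rust
// Sketch only: names and signatures are unstable and subject to change.
#![feature(autodiff)]
use std::autodiff::autodiff_reverse;

// Asks the compiler (via Enzyme) to generate `d_square`, which returns
// the primal result together with the derivative scaled by a seed value.
#[autodiff_reverse(d_square, Active, Active)]
fn square(x: f64) -> f64 {
    x * x
}

fn main() {
    let (y, dy) = d_square(3.0, 1.0); // a seed of 1.0 gives the plain gradient
    assert_eq!((y, dy), (9.0, 6.0)); // d/dx x^2 = 2x, evaluated at x = 3
}
```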
"Batched" autodiff, which supports array-of-struct or struct-of-array style code generation is also mostly upstreamed, but not yet publically advertised due to some open design questions and requested changes.
The standalone `std::batching` feature is mostly implemented in a PR, but not yet upstreamed due to requested changes to make it interact better with Rust's SIMD types. I hope to train a contributor to implement the requested changes, which would also increase the bus factor.
The `std::offload` feature is partly upstreamed. The "host" side, which handles the CPU code, is ready. For the "device" side, a first PR exists, but it has not yet been reviewed or sufficiently tested. We also expect further follow-up PRs to expose more GPU features.
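Since no user-facing `std::offload` syntax is stable yet, the following is a purely hypothetical sketch of what a kernel could eventually look like; the feature gate and the `#[offload]` attribute below are placeholders for whatever the final design becomes:

```rust
// Entirely hypothetical: `std::offload` has no finalized surface syntax.
#![feature(offload)]

// Host side (upstreamed): the compiler wraps the call with the LLVM
// offload runtime calls that map `x` and `y` to device memory and back.
// Device side (in progress): the body itself still needs to be lowered
// to a GPU kernel and launched.
#[offload]
fn vec_add(x: &[f32], y: &mut [f32]) {
    for (yi, xi) in y.iter_mut().zip(x) {
        *yi += *xi;
    }
}
```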
Oliver Scherer has done a great job reviewing my offload host PR to make sure the code quality matches rustc standards, but he expressed that he is not comfortable reviewing it on the technical side, i.e., for correct LLVM/offload usage. For that, I collaborated with Johannes Doerfert and Kevin Sala Penades from the LLVM side. Further, individuals at both AMD and NVIDIA gave feedback on my design and will continue to do so. In general, the design differs somewhat from what other languages offer, so we will likely run into some challenges and keep iterating.
## The next 6 months
I will spend most of my time on the offload backend, especially the device side, to improve how we lower Rust code to GPUs. To verify progress, I will add increasingly complex GPU compute kernels to the Rust test suite. Similar to the autodiff work, I will spend a significant fraction of my time onboarding new contributors to increase the bus factor. I already have one offer from a potential contributor who is interested in developing the offload frontend. Due to the popularity of GPU programming, I expect even more support than for the autodiff work and hope to have a solid contributor base by the end of the project goal.
One of the lessons learned from the `std::autodiff` work is how challenging it is to enable a new feature in CI. While autodiff has been usable for more than 6 months, we still do not ship it on nightly. The two main challenges were:

A) reproducing CI issues locally, and
B) the binary size increase caused by enabling autodiff in CI.
Due to the lack of GPUs in CI, we will not run GPU binaries in CI, which should avoid issue A).
To prevent issue B), I have already opened a PR to enable `std::offload` in CI. This allows us to test the binary size increase early and gives the infra team more time to provide feedback.
I expect that we will find a solution with the infra team within the next 6 months to enable `std::autodiff` in CI. Once that happens, I will likely take a one-month break from `std::offload` to clean up the autodiff docs and finish upstreaming `std::batching`, which is based on the same LLVM plugin as autodiff (Enzyme).
The "shiny future" we are working towards
In the future, developers will be able to write a single Rust function and use `std::batching` to get a SIMD/fused version of it, `std::autodiff` to differentiate it, and `std::offload` to run the resulting code on their GPUs.
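As an illustration only, the combination could look roughly like this; every attribute, gate, and generated signature shown here is hypothetical:

```rust
// Hypothetical composition of the three planned features; nothing below
// reflects a finalized design.
#![feature(autodiff, offload)]
use std::autodiff::autodiff_reverse;

// Write the scalar function once...
#[autodiff_reverse(d_loss, Duplicated, Active)] // ...derive its gradient,
#[offload]                                      // ...and run it on the GPU.
fn loss(w: &[f64]) -> f64 {
    w.iter().map(|wi| wi * wi).sum()
}
// A future `std::batching` attribute would similarly derive a SIMD/fused
// variant of `loss` from the same scalar definition.
```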
Authors of machine learning or linear algebra libraries will further be able to optimize their libraries' performance by opting into a new MLIR-based compiler backend, which automatically rewrites their compute-heavy operations for better performance.
## Ownership and team asks
| Task                 | Owner(s) or team(s)           | Notes                                                                                         |
|----------------------|-------------------------------|-----------------------------------------------------------------------------------------------|
| Lang-team experiment |                               |                                                                                               |
| Standard reviews     |                               | Review contributions to `rustc_codegen_llvm` and other parts of the backend                   |
| LLVM reviews         | LLVM offload/GPU contributors | Individual contributors at AMD/NVIDIA/LLNL agreed to review my code from the LLVM or GPU side |
| Do the work          | Manuel Drehwald               |                                                                                               |