You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Elide suspension points via [[clang::coro_await_suspend_destroy]]
Start by reading the detailed user-facing docs in `AttrDocs.td`.
My immediate motivation was that I noticed that short-circuiting coroutines
failed to optimize well. Interact with the demo program here:
https://godbolt.org/z/E3YK5c45a
If Clang on Compiler Explorer supported [[clang::coro_await_suspend_destroy]],
the assembly for `simple_coro` would be drastically shorter, and would not
contain a call to `operator new`.
Here are a few high-level thoughts that don't belong in the docs:
- This has `lit` tests, but what gives me real confidence in its correctness
is the integration test in `coro_await_suspend_destroy_test.cpp`. This
caught all the interesting bugs that I had in earlier revs, and covers
equivalence to the standard code path in far more scenarios.
- I considered a variety of other designs. Here are some key design points:
* I considered optimizing unmodified `await_suspend()` methods, as long as
they unconditionally end with an `h.destroy()` call on the current
handle, or an exception. However, this would (a) force dynamic dispatch
for `destroy` -- bloating IR & reducing optimization opportunities, (b)
require far more complex, delicate, and fragile analysis, (c) retain more
of the frame setup, so that e.g. `h.done()` works properly. The current
solution shortcuts all these concerns.
* I want to `Promise&`, rather than `std::coroutine_handle` to
`await_suspend_destroy` -- this is safer, simpler, and more efficient.
Short-circuiting corotuines should not touch the handle. This decision
forces the attribue to go on the class. Resolving a method attribute
would have required looking up overloads for both types, and choosing
one, which is costly and a bad UX to boot.
* `AttrDocs.td` tells portable code to provide a stub `await_suspend()`.
This portability / compatibility solution avoids dire issues that would
arise if users relied on `__has_cpp_attribute` and the declaration and
definition happened to use different toolchains. In particular, it will
even be safe for a future compiler release to killswitch this attribute
by removing its implementation and setting its version to 0.
```
let Spellings = [Clang<"coro_destroy_after_suspend", /*allowInC*/ 0,
/*Version*/ 0>];
```
- In the docs, I mention the `HasCoroSuspend` path in `CoroEarly.cpp` as
a further optimization opportunity. But, I'm sure there are
higher-leverage ways of making these non-suspending coros compile better, I
just don't know the coro optimization pipeline well enough to flag them.
- IIUC the only interaction of this with `coro_only_destroy_when_complete`
would be that the compiler expends fewer cycles.
- I ran some benchmarks on [folly::result](
https://github.com/facebook/folly/blob/main/folly/result/docs/result.md).
Heap allocs are definitely elided, the compiled code looks like a function,
not a coroutine, but there's still an optimization gap. On the plus side,
this results in a 4x speedup (!) in optimized ASAN builds (numbers not
shown for brevity.
```
// Simple result coroutine that adds 1 to the input
result<int> result_coro(result<int>&& r) {
co_return co_await std::move(r) + 1;
}
// Non-coroutine equivalent using value_or_throw()
result<int> catching_result_func(result<int>&& r) {
return result_catch_all([&]() -> result<int> {
if (r.has_value()) {
return r.value_or_throw() + 1;
}
return std::move(r).non_value();
});
}
// Not QUITE equivalent to the coro -- lacks the exception boundary
result<int> non_catching_result_func(result<int>&& r) {
if (r.has_value()) {
return r.value_or_throw() + 1;
}
return std::move(r).non_value();
}
============================================================================
[...]lly/result/test/result_coro_bench.cpp relative time/iter iters/s
============================================================================
result_coro_success 13.61ns 73.49M
non_catching_result_func_success 3.39ns 295.00M
catching_result_func_success 4.41ns 226.88M
result_coro_error 19.55ns 51.16M
non_catching_result_func_error 9.15ns 109.26M
catching_result_func_error 10.19ns 98.10M
============================================================================
[...]lly/result/test/result_coro_bench.cpp relative time/iter iters/s
============================================================================
result_coro_success 10.59ns 94.39M
non_catching_result_func_success 3.39ns 295.00M
catching_result_func_success 4.07ns 245.81M
result_coro_error 13.66ns 73.18M
non_catching_result_func_error 9.00ns 111.11M
catching_result_func_error 10.04ns 99.63M
```
Demo program from the Compiler Explorer link above:
```cpp
#include <coroutine>
#include <optional>
// Read this LATER -- this implementation detail isn't required to understand
// the value of [[clang::coro_await_suspend_destroy]].
//
// `optional_wrapper` exists since `get_return_object()` can't return
// `std::optional` directly. C++ coroutines have a fundamental timing mismatch
// between when the return object is created and when the value is available:
//
// 1) Early (coroutine startup): `get_return_object()` is called and must return
// something immediately.
// 2) Later (when `co_return` executes): `return_value(T)` is called with the
// actual value.
// 3) Issue: If `get_return_object()` returns the storage, it's empty when
// returned, and writing to it later cannot affect the already-returned copy.
template <typename T>
struct optional_wrapper {
std::optional<T> storage_;
std::optional<T>*& pointer_;
optional_wrapper(std::optional<T>*& p) : pointer_(p) {
pointer_ = &storage_;
}
operator std::optional<T>() { return std::move(storage_); }
~optional_wrapper() {}
};
// Make `std::optional` a coroutine
template <typename T, typename... Args>
struct std::coroutine_traits<std::optional<T>, Args...> {
struct promise_type {
std::optional<T>* storagePtr_ = nullptr;
promise_type() = default;
::optional_wrapper<T> get_return_object() {
return ::optional_wrapper<T>(storagePtr_);
}
std::suspend_never initial_suspend() const noexcept { return {}; }
std::suspend_never final_suspend() const noexcept { return {}; }
void return_value(T&& value) { *storagePtr_ = std::move(value); }
void unhandled_exception() {
// Leave storage_ empty to represent error
}
};
};
template <typename T>
struct [[clang::coro_await_suspend_destroy]] optional_awaitable {
std::optional<T> opt_;
bool await_ready() const noexcept { return opt_.has_value(); }
T await_resume() { return std::move(opt_).value(); }
// Adding `noexcept` here makes the early IR much smaller, but the
// optimizer is able to discard the cruft for simpler cases.
void await_suspend_destroy(auto& promise) noexcept {
// Assume the return object defaults to "empty"
}
void await_suspend(auto handle) {
await_suspend_destroy(handle.promise());
handle.destroy();
}
};
template <typename T>
optional_awaitable<T> operator co_await(std::optional<T> opt) {
return {std::move(opt)};
}
// Non-coroutine baseline -- matches the logic of `simple_coro`.
std::optional<int> simple_func(const std::optional<int>& r) {
try {
if (r.has_value()) {
return r.value() + 1;
}
} catch (...) {}
return std::nullopt; // return empty on empty input or error
}
// Without `coro_await_suspend_destroy`, allocates its frame on-heap.
std::optional<int> simple_coro(const std::optional<int>& r) {
co_return co_await std::move(r) + 4;
}
// Without `co_await`, this optimizes much like `simple_func`.
// Bugs:
// - Doesn't short-circuit when `r` is empty, but throws
// - Lacks an exception boundary
std::optional<int> wrong_simple_coro(const std::optional<int>& r) {
co_return r.value() + 2;
}
int main() {
return
simple_func(std::optional<int>{32}).value() +
simple_coro(std::optional<int>{8}).value() +
wrong_simple_coro(std::optional<int>{16}).value();
}
```
Test Plan:
For the all-important E2E test, I used this terrible cargo-culted script to run
the new end-to-end test with the new compiler. (Yes, I realize I should only
need 10% of those `-D` settings for a successful build.)
To make sure the test covered what I meant it to do:
- I also added an `#error` in the "no attribute" branch to make sure the
compiler indeed supports the attribute.
- I ran it with a compiler not supporting the attribute, and that also
passed.
- I also tried `return 1;` from `main()` and saw the logs of the 7 successful
tests running.
```sh
#!/bin/bash -uex
set -o pipefail
LLVMBASE=/path/to/source/of/llvm-project
SYSCLANG=/path/to/origianl/bin/clang
# NB Can add `--debug-output` to debug cmake...
# Bootstrap clang -- Use `RelWithDebInfo` or the next phase is too slow!
mkdir -p bootstrap
cd bootstrap
cmake "$LLVMBASE/llvm" \
-G Ninja \
-DBUILD_SHARED_LIBS=true \
-DCMAKE_ASM_COMPILER="$SYSCLANG" \
-DCMAKE_ASM_COMPILER_ID=Clang \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DCMAKE_CXX_COMPILER="$SYSCLANG"++ \
-DCMAKE_C_COMPILER="$SYSCLANG" \
-DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \
-DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DLLVM_ENABLE_BINDINGS=OFF \
-DLLVM_ENABLE_LLD=ON \
-DLLVM_ENABLE_PROJECTS="clang;lld" \
-DLLVM_OPTIMIZED_TABLEGEN=true \
-DLLVM_FORCE_ENABLE_STATS=ON \
-DLLVM_ENABLE_DUMP=ON \
-DCLANG_DEFAULT_PIE_ON_LINUX=OFF
ninja clang lld
ninja check-clang-codegencoroutines # Includes the new IR regression tests
cd ..
NEWCLANG="$PWD"/bootstrap/bin/clang
NEWLLD="$PWD"/bootstrap/bin/lld
# LIBCXX_INCLUDE_BENCHMARKS=OFF because google-benchmark bugs out
cmake "$LLVMBASE/runtimes" \
-G Ninja \
-DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \
-DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \
-DBUILD_SHARED_LIBS=true \
-DCMAKE_ASM_COMPILER="$NEWCLANG" \
-DCMAKE_ASM_COMPILER_ID=Clang \
-DCMAKE_C_COMPILER="$NEWCLANG" \
-DCMAKE_CXX_COMPILER="$NEWCLANG"++ \
-DLLVM_FORCE_ENABLE_STATS=ON \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DLLVM_ENABLE_LLD=ON \
-DLIBCXX_INCLUDE_TESTS=ON \
-DLIBCXX_INCLUDE_BENCHMARKS=OFF \
-DLLVM_INCLUDE_TESTS=ON \
-DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi;libunwind" \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON
ninja cxx-test-depends
LIBCXXBUILD=$PWD
cd "$LLVMBASE"
libcxx/utils/libcxx-lit "$LIBCXXBUILD" -v \
libcxx/test/std/language.support/support.coroutines/end.to.end/coro_await_suspend_destroy.pass.cpp
```
0 commit comments