Skip to content

Commit 9fc3169

Browse files
committed
Elide suspension points via [[clang::coro_await_suspend_destroy]]
Start by reading the detailed user-facing docs in `AttrDocs.td`. My immediate motivation was that I noticed that short-circuiting coroutines failed to optimize well. Interact with the demo program here: https://godbolt.org/z/E3YK5c45a If Clang on Compiler Explorer supported [[clang::coro_await_suspend_destroy]], the assembly for `simple_coro` would be drastically shorter, and would not contain a call to `operator new`. Here are a few high-level thoughts that don't belong in the docs: - This has `lit` tests, but what gives me real confidence in its correctness is the integration test in `coro_await_suspend_destroy_test.cpp`. This caught all the interesting bugs that I had in earlier revs, and covers equivalence to the standard code path in far more scenarios. - I considered a variety of other designs. Here are some key design points: * I considered optimizing unmodified `await_suspend()` methods, as long as they unconditionally end with an `h.destroy()` call on the current handle, or an exception. However, this would (a) force dynamic dispatch for `destroy` -- bloating IR & reducing optimization opportunities, (b) require far more complex, delicate, and fragile analysis, (c) retain more of the frame setup, so that e.g. `h.done()` works properly. The current solution shortcuts all these concerns. * I want to `Promise&`, rather than `std::coroutine_handle` to `await_suspend_destroy` -- this is safer, simpler, and more efficient. Short-circuiting corotuines should not touch the handle. This decision forces the attribue to go on the class. Resolving a method attribute would have required looking up overloads for both types, and choosing one, which is costly and a bad UX to boot. * `AttrDocs.td` tells portable code to provide a stub `await_suspend()`. This portability / compatibility solution avoids dire issues that would arise if users relied on `__has_cpp_attribute` and the declaration and definition happened to use different toolchains. In particular, it will even be safe for a future compiler release to killswitch this attribute by removing its implementation and setting its version to 0. ``` let Spellings = [Clang<"coro_destroy_after_suspend", /*allowInC*/ 0, /*Version*/ 0>]; ``` - In the docs, I mention the `HasCoroSuspend` path in `CoroEarly.cpp` as a further optimization opportunity. But, I'm sure there are higher-leverage ways of making these non-suspending coros compile better, I just don't know the coro optimization pipeline well enough to flag them. - IIUC the only interaction of this with `coro_only_destroy_when_complete` would be that the compiler expends fewer cycles. - I ran some benchmarks on [folly::result]( https://github.com/facebook/folly/blob/main/folly/result/docs/result.md). Heap allocs are definitely elided, the compiled code looks like a function, not a coroutine, but there's still an optimization gap. On the plus side, this results in a 4x speedup (!) in optimized ASAN builds (numbers not shown for brevity. ``` // Simple result coroutine that adds 1 to the input result<int> result_coro(result<int>&& r) { co_return co_await std::move(r) + 1; } // Non-coroutine equivalent using value_or_throw() result<int> catching_result_func(result<int>&& r) { return result_catch_all([&]() -> result<int> { if (r.has_value()) { return r.value_or_throw() + 1; } return std::move(r).non_value(); }); } // Not QUITE equivalent to the coro -- lacks the exception boundary result<int> non_catching_result_func(result<int>&& r) { if (r.has_value()) { return r.value_or_throw() + 1; } return std::move(r).non_value(); } ============================================================================ [...]lly/result/test/result_coro_bench.cpp relative time/iter iters/s ============================================================================ result_coro_success 13.61ns 73.49M non_catching_result_func_success 3.39ns 295.00M catching_result_func_success 4.41ns 226.88M result_coro_error 19.55ns 51.16M non_catching_result_func_error 9.15ns 109.26M catching_result_func_error 10.19ns 98.10M ============================================================================ [...]lly/result/test/result_coro_bench.cpp relative time/iter iters/s ============================================================================ result_coro_success 10.59ns 94.39M non_catching_result_func_success 3.39ns 295.00M catching_result_func_success 4.07ns 245.81M result_coro_error 13.66ns 73.18M non_catching_result_func_error 9.00ns 111.11M catching_result_func_error 10.04ns 99.63M ``` Demo program from the Compiler Explorer link above: ```cpp #include <coroutine> #include <optional> // Read this LATER -- this implementation detail isn't required to understand // the value of [[clang::coro_await_suspend_destroy]]. // // `optional_wrapper` exists since `get_return_object()` can't return // `std::optional` directly. C++ coroutines have a fundamental timing mismatch // between when the return object is created and when the value is available: // // 1) Early (coroutine startup): `get_return_object()` is called and must return // something immediately. // 2) Later (when `co_return` executes): `return_value(T)` is called with the // actual value. // 3) Issue: If `get_return_object()` returns the storage, it's empty when // returned, and writing to it later cannot affect the already-returned copy. template <typename T> struct optional_wrapper { std::optional<T> storage_; std::optional<T>*& pointer_; optional_wrapper(std::optional<T>*& p) : pointer_(p) { pointer_ = &storage_; } operator std::optional<T>() { return std::move(storage_); } ~optional_wrapper() {} }; // Make `std::optional` a coroutine template <typename T, typename... Args> struct std::coroutine_traits<std::optional<T>, Args...> { struct promise_type { std::optional<T>* storagePtr_ = nullptr; promise_type() = default; ::optional_wrapper<T> get_return_object() { return ::optional_wrapper<T>(storagePtr_); } std::suspend_never initial_suspend() const noexcept { return {}; } std::suspend_never final_suspend() const noexcept { return {}; } void return_value(T&& value) { *storagePtr_ = std::move(value); } void unhandled_exception() { // Leave storage_ empty to represent error } }; }; template <typename T> struct [[clang::coro_await_suspend_destroy]] optional_awaitable { std::optional<T> opt_; bool await_ready() const noexcept { return opt_.has_value(); } T await_resume() { return std::move(opt_).value(); } // Adding `noexcept` here makes the early IR much smaller, but the // optimizer is able to discard the cruft for simpler cases. void await_suspend_destroy(auto& promise) noexcept { // Assume the return object defaults to "empty" } void await_suspend(auto handle) { await_suspend_destroy(handle.promise()); handle.destroy(); } }; template <typename T> optional_awaitable<T> operator co_await(std::optional<T> opt) { return {std::move(opt)}; } // Non-coroutine baseline -- matches the logic of `simple_coro`. std::optional<int> simple_func(const std::optional<int>& r) { try { if (r.has_value()) { return r.value() + 1; } } catch (...) {} return std::nullopt; // return empty on empty input or error } // Without `coro_await_suspend_destroy`, allocates its frame on-heap. std::optional<int> simple_coro(const std::optional<int>& r) { co_return co_await std::move(r) + 4; } // Without `co_await`, this optimizes much like `simple_func`. // Bugs: // - Doesn't short-circuit when `r` is empty, but throws // - Lacks an exception boundary std::optional<int> wrong_simple_coro(const std::optional<int>& r) { co_return r.value() + 2; } int main() { return simple_func(std::optional<int>{32}).value() + simple_coro(std::optional<int>{8}).value() + wrong_simple_coro(std::optional<int>{16}).value(); } ``` Test Plan: For the all-important E2E test, I used this terrible cargo-culted script to run the new end-to-end test with the new compiler. (Yes, I realize I should only need 10% of those `-D` settings for a successful build.) To make sure the test covered what I meant it to do: - I also added an `#error` in the "no attribute" branch to make sure the compiler indeed supports the attribute. - I ran it with a compiler not supporting the attribute, and that also passed. - I also tried `return 1;` from `main()` and saw the logs of the 7 successful tests running. ```sh #!/bin/bash -uex set -o pipefail LLVMBASE=/path/to/source/of/llvm-project SYSCLANG=/path/to/origianl/bin/clang # NB Can add `--debug-output` to debug cmake... # Bootstrap clang -- Use `RelWithDebInfo` or the next phase is too slow! mkdir -p bootstrap cd bootstrap cmake "$LLVMBASE/llvm" \ -G Ninja \ -DBUILD_SHARED_LIBS=true \ -DCMAKE_ASM_COMPILER="$SYSCLANG" \ -DCMAKE_ASM_COMPILER_ID=Clang \ -DCMAKE_BUILD_TYPE=RelWithDebInfo \ -DCMAKE_CXX_COMPILER="$SYSCLANG"++ \ -DCMAKE_C_COMPILER="$SYSCLANG" \ -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \ -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \ -DLLVM_ENABLE_ASSERTIONS=ON \ -DLLVM_ENABLE_BINDINGS=OFF \ -DLLVM_ENABLE_LLD=ON \ -DLLVM_ENABLE_PROJECTS="clang;lld" \ -DLLVM_OPTIMIZED_TABLEGEN=true \ -DLLVM_FORCE_ENABLE_STATS=ON \ -DLLVM_ENABLE_DUMP=ON \ -DCLANG_DEFAULT_PIE_ON_LINUX=OFF ninja clang lld ninja check-clang-codegencoroutines # Includes the new IR regression tests cd .. NEWCLANG="$PWD"/bootstrap/bin/clang NEWLLD="$PWD"/bootstrap/bin/lld # LIBCXX_INCLUDE_BENCHMARKS=OFF because google-benchmark bugs out cmake "$LLVMBASE/runtimes" \ -G Ninja \ -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-redhat-linux-gnu \ -DLLVM_HOST_TRIPLE=x86_64-redhat-linux-gnu \ -DBUILD_SHARED_LIBS=true \ -DCMAKE_ASM_COMPILER="$NEWCLANG" \ -DCMAKE_ASM_COMPILER_ID=Clang \ -DCMAKE_C_COMPILER="$NEWCLANG" \ -DCMAKE_CXX_COMPILER="$NEWCLANG"++ \ -DLLVM_FORCE_ENABLE_STATS=ON \ -DLLVM_ENABLE_ASSERTIONS=ON \ -DLLVM_ENABLE_LLD=ON \ -DLIBCXX_INCLUDE_TESTS=ON \ -DLIBCXX_INCLUDE_BENCHMARKS=OFF \ -DLLVM_INCLUDE_TESTS=ON \ -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi;libunwind" \ -DCMAKE_BUILD_TYPE=RelWithDebInfo \ -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ninja cxx-test-depends LIBCXXBUILD=$PWD cd "$LLVMBASE" libcxx/utils/libcxx-lit "$LIBCXXBUILD" -v \ libcxx/test/std/language.support/support.coroutines/end.to.end/coro_await_suspend_destroy.pass.cpp ```
1 parent 1458eb2 commit 9fc3169

File tree

10 files changed

+942
-90
lines changed

10 files changed

+942
-90
lines changed

clang/docs/ReleaseNotes.rst

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -136,6 +136,12 @@ Removed Compiler Flags
136136
Attribute Changes in Clang
137137
--------------------------
138138

139+
- Introduced a new attribute ``[[clang::coro_await_suspend_destroy]]``. When
140+
applied to a coroutine awaiter class, it causes suspensions into this awaiter
141+
to use a new `await_suspend_destroy(Promise&)` method instead of the standard
142+
`await_suspend(std::coroutine_handle<...>)`. The coroutine is then destroyed.
143+
This improves code speed & size for "short-circuiting" coroutines.
144+
139145
Improvements to Clang's diagnostics
140146
-----------------------------------
141147
- Added a separate diagnostic group ``-Wfunction-effect-redeclarations``, for the more pedantic

clang/include/clang/Basic/Attr.td

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1352,6 +1352,14 @@ def CoroAwaitElidableArgument : InheritableAttr {
13521352
let SimpleHandler = 1;
13531353
}
13541354

1355+
def CoroAwaitSuspendDestroy: InheritableAttr {
1356+
let Spellings = [Clang<"coro_await_suspend_destroy">];
1357+
let Subjects = SubjectList<[CXXRecord]>;
1358+
let LangOpts = [CPlusPlus];
1359+
let Documentation = [CoroAwaitSuspendDestroyDoc];
1360+
let SimpleHandler = 1;
1361+
}
1362+
13551363
// OSObject-based attributes.
13561364
def OSConsumed : InheritableParamAttr {
13571365
let Spellings = [Clang<"os_consumed">];

clang/include/clang/Basic/AttrDocs.td

Lines changed: 87 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9270,6 +9270,93 @@ Example:
92709270
}];
92719271
}
92729272

9273+
def CoroAwaitSuspendDestroyDoc : Documentation {
9274+
let Category = DocCatDecl;
9275+
let Content = [{
9276+
9277+
The ``[[clang::coro_await_suspend_destroy]]`` attribute may be applied to a C++
9278+
coroutine awaiter type. When this attribute is present, the awaiter must
9279+
implement ``void await_suspend_destroy(Promise&)``. If ``await_ready()``
9280+
returns ``false`` at a suspension point, ``await_suspend_destroy`` will be
9281+
called directly, bypassing the ``await_suspend(std::coroutine_handle<...>)``
9282+
method. The coroutine being suspended will then be immediately destroyed.
9283+
9284+
Logically, the new behavior is equivalent to this standard code:
9285+
9286+
.. code-block:: c++
9287+
9288+
void await_suspend_destroy(YourPromise&) { ... }
9289+
void await_suspend(auto handle) {
9290+
await_suspend_destroy(handle.promise());
9291+
handle.destroy();
9292+
}
9293+
9294+
This enables `await_suspend_destroy()` usage in portable awaiters — just add a
9295+
stub ``await_suspend()`` as above. Without ``coro_await_suspend_destroy``
9296+
support, the awaiter will behave nearly identically, with the only difference
9297+
being heap allocation instead of stack allocation for the coroutine frame.
9298+
9299+
This attribute exists to optimize short-circuiting coroutines—coroutines whose
9300+
suspend points are either (i) trivial (like ``std::suspend_never``), or (ii)
9301+
short-circuiting (like a ``co_await`` that can be expressed in regular control
9302+
flow as):
9303+
9304+
.. code-block:: c++
9305+
9306+
T val;
9307+
if (awaiter.await_ready()) {
9308+
val = awaiter.await_resume();
9309+
} else {
9310+
awaiter.await_suspend();
9311+
return /* value representing the "execution short-circuited" outcome */;
9312+
}
9313+
9314+
The benefits of this attribute are:
9315+
- **Avoid heap allocations for coro frames**: Allocating short-circuiting
9316+
coros on the stack makes code more predictable under memory pressure.
9317+
Without this attribute, LLVM cannot elide heap allocation even when all
9318+
awaiters are short-circuiting.
9319+
- **Performance**: Significantly faster execution and smaller code size.
9320+
- **Build time**: Faster compilation due to less IR being generated.
9321+
9322+
Marking your ``await_suspend_destroy`` method as ``noexcept`` can sometimes
9323+
further improve optimization.
9324+
9325+
Here is a toy example of a portable short-circuiting awaiter:
9326+
9327+
.. code-block:: c++
9328+
9329+
template <typename T>
9330+
struct [[clang::coro_await_suspend_destroy]] optional_awaitable {
9331+
std::optional<T> opt_;
9332+
bool await_ready() const noexcept { return opt_.has_value(); }
9333+
T await_resume() { return std::move(opt_).value(); }
9334+
void await_suspend_destroy(auto& promise) {
9335+
// Assume the return object of the outer coro defaults to "empty".
9336+
}
9337+
// Fallback for when `coro_await_suspend_destroy` is unavailable.
9338+
void await_suspend(auto handle) {
9339+
await_suspend_destroy(handle.promise());
9340+
handle.destroy();
9341+
}
9342+
};
9343+
9344+
If all suspension points use (i) trivial or (ii) short-circuiting awaiters,
9345+
then the coroutine optimizes more like a plain function, with 2 caveats:
9346+
- **Behavior:** The coroutine promise provides an implicit exception boundary
9347+
(as if wrapping the function in ``try {} catch { unhandled_exception(); }``).
9348+
This exception handling behavior is usually desirable in robust,
9349+
return-value-oriented programs that need short-circuiting coroutines.
9350+
Otherwise, the promise can always re-throw.
9351+
- **Speed:** As of 2025, there is still an optimization gap between a
9352+
realistic short-circuiting coro, and the equivalent (but much more verbose)
9353+
function. For a guesstimate, expect 4-5ns per call on x86. One idea for
9354+
improvement is to also elide trivial suspends like `std::suspend_never`, in
9355+
order to hit the `HasCoroSuspend` path in `CoroEarly.cpp`.
9356+
9357+
}];
9358+
}
9359+
92739360
def CountedByDocs : Documentation {
92749361
let Category = DocCatField;
92759362
let Content = [{

clang/include/clang/Basic/DiagnosticSemaKinds.td

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12504,6 +12504,9 @@ def note_coroutine_promise_call_implicitly_required : Note<
1250412504
def err_await_suspend_invalid_return_type : Error<
1250512505
"return type of 'await_suspend' is required to be 'void' or 'bool' (have %0)"
1250612506
>;
12507+
def err_await_suspend_destroy_invalid_return_type : Error<
12508+
"return type of 'await_suspend_destroy' is required to be 'void' (have %0)"
12509+
>;
1250712510
def note_await_ready_no_bool_conversion : Note<
1250812511
"return type of 'await_ready' is required to be contextually convertible to 'bool'"
1250912512
>;

0 commit comments

Comments
 (0)