@wisp3rwind commented Dec 2, 2025

Thank you for your contribution!

We appreciate the time and effort you've put into this pull request.
To help us review it efficiently, please ensure you've gone through the following checklist:

Submission Checklist 📝

  • I have updated existing examples or added new ones (if applicable).
  • I have used cargo xtask fmt-packages command to ensure that all changed code is formatted correctly.
  • My changes were added to the CHANGELOG.md in the proper section.
  • I have added necessary changes to user code to the latest Migration Guide.
  • My changes are in accordance to the esp-rs developer guidelines

Extra:

Pull Request Details 📖

This changes the input data type for RMT Tx methods from &[PulseCode] to &mut impl Encoder where Encoder is conceptually similar to Iterator<Item = PulseCode>, but allows for more efficient code in many cases.

IDF has a similar encoder type: https://docs.espressif.com/projects/esp-idf/en/latest/esp32c3/api-reference/peripherals/rmt.html#rmt-rmt-encoder

General design

The Encoder trait differs from Iterator<Item = PulseCode> mainly in the following aspects:

  • Writing data is driven by the encode method, which calls RmtWriter methods to push to the hardware, rather than RmtWriter pulling data from an iterator.
  • The RmtWriter::write_many method helps write several codes to the hardware in a very tight loop.

Together, these two aspects allow very efficient inner loops when copying data to the hardware, without requiring unsafe code on the user side and without exposing direct hardware access, or any specifics about how much data is written, to user code; see, for example, the BytesEncoder implementation. I've not been able to achieve the same performance with just iterators.
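To make the push-style design concrete, here is a minimal host-side sketch. It is not the PR's actual API: `SliceWriter`, `CopySketch`, the `bool` return value of `encode` and the closure-based `write_many` signature are all assumptions chosen for illustration, and a plain slice stands in for the channel RAM.

```rust
/// Stand-in for the PR's `PulseCode`; here just a raw 32-bit word (assumption).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PulseCode(pub u32);

/// Hypothetical writer over a fixed-size window that stands in for channel RAM.
pub struct SliceWriter<'a> {
    buf: &'a mut [PulseCode],
    pos: usize,
}

impl<'a> SliceWriter<'a> {
    pub fn new(buf: &'a mut [PulseCode]) -> Self {
        Self { buf, pos: 0 }
    }

    /// Number of codes written through this writer so far.
    pub fn written(&self) -> usize {
        self.pos
    }

    /// Write up to `count` codes produced by `f` in one tight loop.
    /// Returns how many codes actually fit into the remaining space.
    pub fn write_many(&mut self, count: usize, mut f: impl FnMut() -> PulseCode) -> usize {
        let n = count.min(self.buf.len() - self.pos);
        for slot in &mut self.buf[self.pos..self.pos + n] {
            *slot = f();
        }
        self.pos += n;
        n
    }
}

/// Push-style encoder: the driver calls `encode`, and the encoder pushes codes
/// into the writer (rather than the writer pulling from an iterator).
pub trait Encoder {
    /// Returns `true` once all data has been handed to the writer.
    fn encode(&mut self, writer: &mut SliceWriter<'_>) -> bool;
}

/// Copies precomputed codes, in the spirit of the PR's `CopyEncoder`.
pub struct CopySketch<'d> {
    pub data: &'d [PulseCode],
}

impl Encoder for CopySketch<'_> {
    fn encode(&mut self, writer: &mut SliceWriter<'_>) -> bool {
        let data = self.data;
        let mut iter = data.iter();
        // `write_many` calls the closure at most `data.len()` times, so `unwrap` cannot fail.
        let n = writer.write_many(data.len(), || *iter.next().unwrap());
        self.data = &data[n..];
        self.data.is_empty()
    }
}
```

In this sketch the encoder never learns how large the hardware window is; it only sees how many codes `write_many` accepted, which matches the goal of keeping such specifics out of user code.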

Specifically, the pattern from BytesEncoder is similar to what's required to send data to WS2812-style LEDs:
Fetch R, G, B bytes from RAM, assemble in the correct order into a u32, then shift out bits and write a PulseCode for each. That maps very cleanly to write_many, but leads to overhead with iterators.
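The bit-shifting pattern described above can be sketched as follows. This is an illustration only: `BIT0`, `BIT1`, `pack_grb` and `shift_out` are hypothetical names, the two constants are placeholders rather than real pulse timings, and the G, R, B wire order is the one commonly documented for WS2812 parts.

```rust
/// Placeholder code words for a "0" and a "1" bit (assumption; real values
/// would encode the high/low durations required by the LED).
pub const BIT0: u32 = 0;
pub const BIT1: u32 = 1;

/// Pack one pixel into the upper 24 bits of a word, in G, R, B order.
pub fn pack_grb(r: u8, g: u8, b: u8) -> u32 {
    ((g as u32) << 24) | ((r as u32) << 16) | ((b as u32) << 8)
}

/// Shift the 24 data bits out MSB-first, calling `push` once per bit.
pub fn shift_out(mut word: u32, mut push: impl FnMut(u32)) {
    for _ in 0..24 {
        push(if word & 0x8000_0000 != 0 { BIT1 } else { BIT0 });
        word <<= 1;
    }
}
```

Because the loop body is a single test, select and shift, it maps naturally onto a closure handed to a `write_many`-style method.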

I've been benchmarking this [1] using cycle counters. Here are the fastest results I've been able to achieve for a WS2812 LED strip encoder/pulse code iterator:

// base case: not using Encoder, just write zeros to the hardware using raw pointers
Render benchmark (base): RMT BenchmarkResult:
	CPU clock: 160MHz
	Iterations: 39
	Codes written: 1441
	Encoding time: 126us
	Encoding time / code: 88ns ~ 14 cycles

// pre-compute PulseCodes and use CopyEncoder
Render benchmark (slice): RMT BenchmarkResult:
	CPU clock: 160MHz
	Iterations: 32
	Codes written: 1441
	Encoding time: 154us (9% of 1710us tx time)
	Encoding time / code: 107ns ~ 17 cycles

// custom impl of Iterator<Item = PulseCode> for an LED strip encoder type
Render benchmark (iter): RMT BenchmarkResult:
	CPU clock: 160MHz
	Iterations: 20
	Codes written: 1441
	Encoding time: 249us (14% of 1710us tx time)
	Encoding time / code: 172ns ~ 27 cycles

// custom impl of Encoder for an LED strip encoder type
Render benchmark (enc): RMT BenchmarkResult:
	CPU clock: 160MHz
	Iterations: 29
	Codes written: 1441
	Encoding time: 170us (9% of 1710us tx time)
	Encoding time / code: 118ns ~ 18 cycles

"Encoding time" is just the time to run the encoder_write function, not including any polling or interrupt/embassy dispatch overhead. Thus, the fact that it takes "only" ~10% of tx time is a bit misleading.
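For readers who want to check the derived figures, the arithmetic behind "time / code", "cycles" and "% of tx time" can be sketched as below. The helper names are hypothetical. Since the listed totals are rounded to whole microseconds, recomputing from them only reproduces the per-code values to within about one nanosecond or one cycle.

```rust
/// Average encoding time per code in nanoseconds.
pub fn ns_per_code(encoding_us: f64, codes: u32) -> f64 {
    encoding_us * 1000.0 / codes as f64
}

/// Convert a duration in nanoseconds to CPU cycles at the given clock (MHz).
pub fn cycles(ns: f64, cpu_mhz: f64) -> f64 {
    ns * cpu_mhz / 1000.0
}

/// Fraction of the transmission time that is spent encoding.
pub fn share_of_tx(encoding_us: f64, tx_us: f64) -> f64 {
    encoding_us / tx_us
}
```

For example, `share_of_tx(154.0, 1710.0)` is about 0.09, in line with the "9% of 1710us tx time" reported for the slice case.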

Note that the custom encoder version has a roughly one-third lower cycle count than the iterator version (18 vs. 27 cycles per code), and is on par with the CopyEncoder (which requires precomputing PulseCodes in a large buffer). Both encoder variants are quite close to the performance ceiling of the base case, which I presume is limited by the APB speed.

In this case, the inner loop of the encoder compiles to optimal assembly, cf. the decompiled version [2]:

[Screenshot from 2025-12-03 12-28-43: decompiled inner loop of the encoder]

whereas the iterator version remains more convoluted.

API

The PR continues to use a single transmit() method for various data types, requiring explicitly wrapping things in an Encoder:

let mut enc = CopyEncoder::new(&data);
channel.transmit(&mut enc)?;

I also considered an IntoEncoder trait with fn transmit(&mut self, data: impl IntoEncoder) with implementations provided for

  • &[PulseCode],
  • I where I: IntoIterator<Item = PulseCode>,
  • E where E: Encoder.

However, that immediately runs into issues with specialization due to the blanket impls.
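The coherence problem can be reproduced in a few lines. The sketch below uses simplified, hypothetical definitions (`Encoder` pushing into a `Vec`, an `IterSketch` wrapper, an associated `Enc` type), not the PR's real traits. It compiles with one blanket impl; the second one is left commented out because the compiler rejects the pair as conflicting implementations (E0119), since a single type could implement both `IntoIterator` and `Encoder`.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PulseCode(pub u32);

/// Minimal stand-in for the PR's `Encoder`: pushes codes into `out`.
pub trait Encoder {
    fn encode(&mut self, out: &mut Vec<PulseCode>);
}

pub trait IntoEncoder {
    type Enc: Encoder;
    fn into_encoder(self) -> Self::Enc;
}

/// Wrapper adapting any iterator of codes (cf. `IterEncoder` in the PR).
pub struct IterSketch<D>(pub D);

impl<D: Iterator<Item = PulseCode>> Encoder for IterSketch<D> {
    fn encode(&mut self, out: &mut Vec<PulseCode>) {
        out.extend(&mut self.0);
    }
}

// First blanket impl: everything that iterates over `PulseCode`.
impl<I: IntoIterator<Item = PulseCode>> IntoEncoder for I {
    type Enc = IterSketch<I::IntoIter>;
    fn into_encoder(self) -> Self::Enc {
        IterSketch(self.into_iter())
    }
}

// Second blanket impl: uncommenting it fails with E0119, because nothing
// prevents a type from implementing both `IntoIterator` and `Encoder`.
//
// impl<E: Encoder> IntoEncoder for E {
//     type Enc = E;
//     fn into_encoder(self) -> E { self }
// }
```

Explicit wrapper types sidestep the overlap, at the cost of one extra line at the call site.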

Alternatively, one could consider separate transmit methods, e.g.

  • fn transmit_slice(&mut self, data: &[PulseCode]) -> ...,
  • fn transmit_iter(&mut self, data: impl IntoIterator<Item = PulseCode>) -> ...,
  • fn transmit_enc(&mut self, data: impl Encoder) -> ....

The disadvantage is that this blows up the number of methods significantly. In particular, if/when methods are split into transmit(&mut self, ...) and transmit_owned(self, ...) similar to the SHA driver, as suggested by @Dominaezzz, this would lead to a combinatorial explosion of the number of channel methods. Additionally, having such per-datatype methods isn't really much simpler than explicitly creating encoders, in my opinion.

Questions

  • I mentioned that I added some benchmarking code: This needs support in esp-hal for low-level hardware access. Would something like this in principle be in-scope for the project?
  • Should BytesEncoder be part of esp-hal directly? It might make more sense to move it to an example, showcasing how to write an efficient Encoder.

Testing

HIL tests, incl. new ones.

Closes #1768

Footnotes

  1. I intend to propose to merge the benchmarking code into esp-hal, but I'm not sure about the design yet, and it probably needs some cleanup.

  2. From esp32c3; the last ptr_ = ptr assignment is spurious, there's no corresponding instruction in the loop. data_word holds 24 bits of RGB data which are shifted out MSB-first.

in anticipation of adding another, user-visible type which will be named
RmtWriter

This type is private, so no changelog or migration guide entry required.
which was probably of little value, anyway, and also in preparation for adding
more sophisticated Encoder data types
Previously, we stored &mut dyn Encoder and dynamically dispatched the Encoder::encode method.
Now, we store &mut dyn EncoderExt and dynamically dispatch
EncoderExt::write with the expectation that Encoder::encode should be
inlined in EncoderExt::write (which is the only caller, and its Encoder
implementations in esp-hal are also marked as #[inline(always)]).

This might allow for small optimizations, since the RmtWriter type will
typically not need to be constructed on the stack, but can be kept in
registers.
to ensure that the Encoder-related refactoring didn't break anything
&mut dyn EncoderExt is a fat pointer to the data and the vtable (likely
in flash), but we only need a single entry of the vtable. Thus,
implement our own pointer type, which avoids the indirection via
the vtable.
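The idea of replacing a `&mut dyn Trait` fat pointer with a thin data pointer plus a single function pointer can be sketched as follows. This is a generic illustration of the technique, not the code from this PR: `Writer`, `ErasedEncoder` and `write_impl` are hypothetical names, and the sketch uses `NonNull::from` so that it does not depend on `NonNull::from_mut`.

```rust
use core::marker::PhantomData;
use core::ptr::NonNull;

/// Stand-in for the PR's writer type (assumption): collects raw code words.
pub struct Writer {
    pub out: Vec<u32>,
}

pub trait Encoder {
    fn encode(&mut self, writer: &mut Writer);
}

/// Type-erased `&'a mut E`: a thin data pointer plus the one function we
/// need, instead of a fat pointer whose calls go through a vtable.
pub struct ErasedEncoder<'a> {
    data: NonNull<()>,
    write: unsafe fn(NonNull<()>, &mut Writer),
    _borrow: PhantomData<&'a mut ()>,
}

impl<'a> ErasedEncoder<'a> {
    pub fn new<E: Encoder>(enc: &'a mut E) -> Self {
        /// Monomorphized per `E`, so `E::encode` can be inlined into it.
        unsafe fn write_impl<E: Encoder>(data: NonNull<()>, writer: &mut Writer) {
            // SAFETY: `data` was created from a `&'a mut E` in `new`, and the
            // `PhantomData` keeps that exclusive borrow alive for `'a`.
            let enc = unsafe { &mut *data.cast::<E>().as_ptr() };
            enc.encode(writer);
        }
        Self {
            data: NonNull::from(enc).cast::<()>(),
            write: write_impl::<E>,
            _borrow: PhantomData,
        }
    }

    pub fn write(&mut self, writer: &mut Writer) {
        // SAFETY: `self.write` was instantiated for the type behind `self.data`.
        unsafe { (self.write)(self.data, writer) }
    }
}
```

A call through `ErasedEncoder::write` loads the function pointer directly from the struct, so there is no second load from a vtable that may live in flash.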
@wisp3rwind
Contributor Author

Would you mind granting me HIL access here? Thanks!

NonNull::from_mut is new in 1.89
Comment on lines +438 to +456
#[derive(Clone, Debug)]
pub struct IterEncoder<D>
where
    D: Iterator<Item = PulseCode>,
{
    data: D,
}

// If the input data was not exhausted, update offset as
//
// | initial | offset      | max_count   | new offset  |
// | ------- + ----------- + ----------- + ----------- |
// | true    | 0           | memsize     | 0           |
// | false   | 0           | memsize / 2 | memsize / 2 |
// | false   | memsize / 2 | memsize / 2 | 0           |
//
// Otherwise, the new position is invalid but the new slice is empty and we won't use the
// offset again. In either case, the unsigned subtraction will not underflow.
self.offset = memsize as u16 - max_count as u16 - self.offset;

// The panic can never trigger since count <= data.len()!
data.split_off(..count).unwrap();
if data.is_empty() {
    self.state = WriterState::Done;
impl<D> IterEncoder<D>
where
    D: Iterator<Item = PulseCode>,
{
    /// Create a new instance that transmits the provided `data`.
    pub fn new(data: impl IntoIterator<IntoIter = D>) -> Self {
        Self {
            data: data.into_iter(),
        }
    }
}
Contributor

Do you actually need anything more than this? Copying from a slice, or converting from a bitstream can both be expressed as an iterator. Wouldn't it be better to not introduce a whole subsystem for something that could be formulated in user code with common enough Rust machinery?

Contributor Author

Conceptually that's true, but I've not been able to optimize the code using just iterators as well as using the dedicated encoder type. It seems that this would require too much re-ordering of conditionals and eliding memory accesses by the compiler. I've amended the top post with more details on how I ended up with this design. Let me know if you have any further questions!

Contributor

I'm not sure how I feel about designing a complicated API just to work around compiler optimization issues.

@bugadani added the trusted-author label ("Allow the author of this Pull Request to run HIL tests and the `binary-size` test.") Dec 2, 2025
github-actions bot commented Dec 2, 2025

[HIL trust list]

github-actions bot commented Dec 2, 2025

Author @wisp3rwind was trusted for this PR via the trusted-author label.
They can now use /hil quick or /hil full.

@wisp3rwind wisp3rwind mentioned this pull request Dec 3, 2025
12 tasks
@wisp3rwind
Contributor Author

/hil full

github-actions bot commented Dec 3, 2025

Triggered full HIL run for #4604.

Run: https://github.com/esp-rs/esp-hal/actions/runs/19893037493

Status update: ❌ HIL (full) run failed (conclusion: failure).

Development

Successfully merging this pull request may close these issues.

Feature Request: iterator as input for RMT transmit

2 participants