-
Notifications
You must be signed in to change notification settings - Fork 364
RMT: Use new Encoder trait for Tx data #4604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
in anticipation of adding another, user-visible type which will be named RmtWriter This type is private, so no changelog or migration guide entry required.
which was probably of little value, anyway, and also in preparation for adding more sophisticated Encoder data types
…writer.state read-only
Previously, we stored &mut dyn Encoder and dynamically dispatched the Encoder::encode method. Now, we store &mut dyn EncoderExt and dynamically dispatch EncoderExt::write with the expectation that Encoder::encode should be inlined in EncoderExt::write (which is the only caller, and its Encoder implementations in esp-hal are also marked as #[inline(always)]). This might allow for small optimizations, since the RmtWriter type will typically not need to be constructed on the stack, but can be kept in registers.
to ensure that the Encoder-related refactoring didn't break anything
&mut dyn EncoderExt is a fat pointer to the data and the vtable (likely in flash), but we only need a single entry of the vtable. Thus, implement our own pointer type, which will avoids the indirection via the vtable.
|
Would you mind granting me HIL access here? Thanks! |
NonNull::from_mut is new in 1.89
| #[derive(Clone, Debug)] | ||
| pub struct IterEncoder<D> | ||
| where | ||
| D: Iterator<Item = PulseCode>, | ||
| { | ||
| data: D, | ||
| } | ||
|
|
||
| // If the input data was not exhausted, update offset as | ||
| // | ||
| // | initial | offset | max_count | new offset | | ||
| // | ------- + ----------- + ----------- + ----------- | | ||
| // | true | 0 | memsize | 0 | | ||
| // | false | 0 | memsize / 2 | memsize / 2 | | ||
| // | false | memsize / 2 | memsize / 2 | 0 | | ||
| // | ||
| // Otherwise, the new position is invalid but the new slice is empty and we won't use the | ||
| // offset again. In either case, the unsigned subtraction will not underflow. | ||
| self.offset = memsize as u16 - max_count as u16 - self.offset; | ||
|
|
||
| // The panic can never trigger since count <= data.len()! | ||
| data.split_off(..count).unwrap(); | ||
| if data.is_empty() { | ||
| self.state = WriterState::Done; | ||
| impl<D> IterEncoder<D> | ||
| where | ||
| D: Iterator<Item = PulseCode>, | ||
| { | ||
| /// Create a new instance that transmits the provided `data`. | ||
| pub fn new(data: impl IntoIterator<IntoIter = D>) -> Self { | ||
| Self { | ||
| data: data.into_iter(), | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you actually need anything more than this? Copying from a slice, or converting from a bitstream can both be expressed as an iterator. Wouldn't it be better to not introduce a whole subsystem for something that could be formulated in user code with common enough Rust machinery?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually that's true, but I've not been able to optimize the code using just iterators as well as using the dedicated encoder type. It seems that this would require too much re-ordering of conditionals and eliding memory accesses by the compiler. I've amended the top post with more details on how I ended up with this design. Let me know if you have any further questions!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how I feel about designing a complicated API just to work around compiler optimization issues.
b04afae to
36c5264
Compare
[HIL trust list]Trusted users for this PR (click to expand) |
|
Author @wisp3rwind was trusted for this PR via the |
|
/hil full |
|
Triggered full HIL run for #4604. Run: https://github.com/esp-rs/esp-hal/actions/runs/19893037493 Status update: ❌ HIL (full) run failed (conclusion: failure). |
Thank you for your contribution!
We appreciate the time and effort you've put into this pull request.
To help us review it efficiently, please ensure you've gone through the following checklist:
Submission Checklist 📝
cargo xtask fmt-packagescommand to ensure that all changed code is formatted correctly.CHANGELOG.mdin the proper section.Extra:
Pull Request Details 📖
This changes the input data type for RMT Tx methods from
&[PulseCode]to&mut impl EncoderwhereEncoderis conceptually similar toIterator<Item = PulseCode>, but allows for more efficient code in many cases.IDF has a similar encoder type: https://docs.espressif.com/projects/esp-idf/en/latest/esp32c3/api-reference/peripherals/rmt.html#rmt-rmt-encoder
General design
The
Encodertrait differs fromIterator<Item = PulseCode>mainly in the following aspects:encodemethod, which callsRmtWritermethods to push to the hardware, rather thanRmtWriterpulling data from an iterator.RmtWriter::write_manymethod helps write several codes to the hardware in a very tight loop.The combination of both allows achieving very efficient inner loops when copying data to the hardware, without requiring unsafe code on the user side and without exposing any direct hardware access or any specifics about how much data is written to the user code: See for example the
BytesEncoderimplementation. I've not been able to achieve the same performance with justIterators.Specifically, the pattern from
BytesEncoderis similar to what's required to send data to WS2812-style LEDs:Fetch R, G, B bytes from RAM, assemble in the correct order into a u32, then shift out bits and write a PulseCode for each. That maps very cleanly to
write_many, but leads to overhead with iterators.I've been benchmarking this1 using cycle counters, here are some results for a WS2812 LED stripe encoder/pulse code iterator which are the fastest I've been able to achieve:
"Encoding time" is just the time to run the
encoder_writefunction, not including any polling or interrupt/embassy dispatch overhead. Thus, the fact that it takes "only" ~10% of tx time is a bit misleading.Note that the custom encoder version has a 40% lower cycle count compared to the iterator version, and is on par with with the
CopyEncoder(which requires precomputing PulseCodes in a large buffer). Both encoder variants are quite close to the performance ceiling of the base case, which I presume is limited due to the APB speed.In this case, the inner loop of the encoder compiles to optimal assembly, cf. the decompiled version2:
whereas the iterator version remains more convoluted.
API
The PR continues to use a single
transmit()method for various data types, requiring explictly wrapping things in anEncoder:I also considered an
IntoEncodertrait withfn transmit(&mut self, data: impl IntoEncoder)with implementations provided for&[PulseCode],I where I:IntoIterator<Item = PulseCode>`,E where E: Encoder.However, that immediately runs into issues with specialization due to the blanket impls.
Alternatively, one could consider different transmit method, i.e.
fn transmit_slice(&mut self, data: &[PulseCode]) -> ...,fn transmit_iter(&mut self, data: impl IntoIterator<Item = PulseCode>) -> ...,fn transmit_enc(&mut self, data: impl Encoder) -> ....The disadvantage is that this blows up the number of methods significantly. In particular, if/when methods are split into
transmit(&mut self, ...)andtransmit_owned(self, ...)similar to the SHA driver, as suggested by @Dominaezzz, this would lead to combinatorial explosion of the number of channel methods. Additionally, having such per-datatype methods isn't really much simpler than explictly creating encoders, in my opinion.Questions
BytesEncoderbe part ofesp-haldirectly? It might make more sense to move it to an example, showcasing how to write an efficientEncoder`.Testing
HIL tests, incl. new ones.
Closes #1768
Footnotes
I intend to propose to merge the benchmarking code into esp-hal, but I'm not sure about the design yet, and it probably needs some cleanup. ↩
From esp32c3; the last
ptr_ = ptrassignment is spurious, there's no corresponding instruction in the loop.data_wordholds 24 bits of RGB data which are shifted out MSB-first. ↩