
Refactor: Redesign API for Greater Control As Needed by SOTI & Testbed #54

Merged
Koloss0 merged 21 commits into main from refactor/53 on Jun 30, 2025

Conversation

@ArnavG-it
Contributor

@ArnavG-it ArnavG-it commented Jun 3, 2025

Multiple API changes to support the test interfaces and improve timing:

  • Replaced Polling with RTOS

    • Every subsystem using the CAN Wrapper will have to enable RTOS and move their superloop code into the default task (or multiple tasks if they choose).
    • The CAN Wrapper manages its own tasks and queues once initialized.
    • Callbacks for message processing and error handling occur in a task context, not ISRs.
  • Support for multiple CAN lines

    • Subsystems will use CANWrapper_CAN_Start to initialize their CAN peripheral, and pass the hcan pointer to the CAN Wrapper's functions.
    • The testbed (and potentially SOTI) are able to use multiple CAN peripherals.
  • Replaced Modes with Standard and Advanced APIs

    • Subsystems will use the standard API.
    • SOTI and the testbed's IO board can use the advanced API to access Tx and Rx callbacks as well as transmit custom messages. The user's node ID isn't tracked in the advanced API.
  • Replaced Hardware Timer with RTOS Ticks*

    • Users don't need to enable a hardware timer to use the CAN Wrapper.

Other notes:

  • Subsystems and SOTI must define the symbol STM32L4. The IO board must define STM32F7.
  • Added a body_size field to the CANMessage type.
  • Removed the CDH-specific message task.

TODO:

  • *Look into the RTOS tick resolution and/or overhead of running ticks at a high frequency to see if they're reliable for the 3.6 ms message timeout.
  • Look into improvements for the Tx Cache, given this race condition and other possible problems.
  • Remove unneeded references and definitions.
  • Update the wiki.

The changes are being tested on this testbed branch.

Closes #24, #52.
#53 should probably be closed as "not planned".

Koloss0 added 2 commits June 7, 2025 12:28
* Cleaned up and renamed some things.
* Filter messages by recipient ID by default, disabling with
  CWM_DISABLE_FILTERING.
* Replaced "RTOS Mode" with CWM_DEFINE_RTOS_TASK symbol.
@Koloss0
Member

Koloss0 commented Jun 7, 2025

Made some edits. However, there are still some compiler errors from the calls to CANWrapper_Transmit_Ack() missing CAN handles. I haven't had the chance to figure out a proper solution yet.

@ArnavG-it
Contributor Author

ArnavG-it commented Jun 7, 2025

Thanks for the changes! If you do add the CAN handle as an argument for CANWrapper_Poll_CAN_Queue, you could also add back the if (userHandle == msgHandle) to wrap the message processing which was previously in the reception ISR. If it's even needed.

@Koloss0
Member

Koloss0 commented Jun 12, 2025

Hey, sorry for disappearing. I've actually been contemplating this line specifically.

As you can see, it transmits ACKs for each message before processing it. Which means that if any foreseeable command we implement takes longer than the 3.6 millisecond timeout, we could be at risk of failing to acknowledge messages in time.

Now, 3.6 milliseconds is a looong time for an 80 MHz CPU, but delays could happen. Take for example:

  • Faulty loops
  • Any kind of polling from hardware, such as ADCs
  • Transmitting to CAN (especially repeatedly)
  • Long ISRs causing delays

In fact, this one ISR in Payload might just be the evidence I needed to feel worried about not using an RTOS. It polls 32 individual ADCs in succession while sending 32 CAN messages. Just one CAN message would theoretically take 0.224 milliseconds with our current baud rate. And 0.224 x 32 = 7.168 ms, well over the timeout. And this is hooked up to a timer so it can happen at any time.

The funny thing is I actually wrote this code myself, and I'm only now realizing how bad of an idea that is. But even if we took it out of the ISR and implemented our own pseudo-scheduling, it would still need to get executed, and we'd probably have to sprinkle calls for sending ACKs all over the code to ensure delays don't get in the way. At that point we're basically re-inventing an RTOS, and there's still no guarantee that we caught everything.

Also, this ISR is for telemetry reports, which is something that every subsystem will have to implement.

So, the more I think about this the more convinced I am that all subsystems should probably use an RTOS, and we should probably require an RTOS for managing ACKs. It's the only way to ensure with confidence that ACKs are returned in a timely manner.

@ArnavG-it
Contributor Author

@Koloss0 That's a good point, thanks for laying it out. Mind explaining why ACKs can't be transmitted in the Rx ISR? Is it just to minimize the time spent in it? Either way, I see what you mean and it does seem like scheduling/RTOS has a lot of benefits.

@Koloss0
Member

Koloss0 commented Jun 12, 2025

It's to minimize the time spent in it, yes. The general advice seems to be that keeping ISRs as short as possible is good practice, so they don't interfere with other polling/IO operations. Also, I don't know how likely this is, but technically the ISR could fire again while it's in the middle of transmitting an ACK, which would cause the second ACK to not get transmitted. So we're keeping it as short as possible by just loading messages into a queue for later processing.

Switched implementation to be RTOS-only based on the rationale from this
comment:

#54 (comment)
@Koloss0
Member

Koloss0 commented Jun 13, 2025

@ArnavG-it Okay, I made a new commit that addresses the points we raised. For simplicity, TUK will now only support RTOS. I think it's the right way to go.

I tried to just get my ideas down for how the logic would work and didn't bother getting it to compile. Could you take over from here for me and get it working? It would be much appreciated.

@ArnavG-it
Contributor Author

Awesome, I'll definitely take a look tomorrow.

Also added some task attributes and an osMessageQueueNew wrapper.
@ArnavG-it
Contributor Author

@Koloss0
Member

Koloss0 commented Jun 14, 2025

Amazing, thank-you!

For the creation of tasks and queues, initially I was going to let the user create them with the settings they want and pass them into the module.

CDH uses this screen for managing all their tasks and queues, so they could use it like normal and pass them to the module:

[image: CDH's task and queue configuration screen]

However, when I think about it, it's possible we may want to add even more queues to TUK (take the error queue, or maybe txCache, for example). If that's the case, we might be introducing a lot of queues that the user isn't really in a position to choose good settings for. And if we ever want to change them later, it would be slow and inefficient to manually go through each subsystem ourselves.

So maybe we should hide the creation of those things in the CANWrapper_Init() function. Then we have control over it right here in this repository. The user would just have to make sure it's between osKernelInitialize() and osKernelStart().

So, I would say:

  • Don't even have the queues in the CANWrapper_InitTypedef
  • The osMessageQueueNew() and osThreadNew() calls can also just go in CANWrapper_Init().

@ArnavG-it
Contributor Author

That makes sense. I was thinking about some similar stuff, which is why I didn't use the .ioc for the IO board example. Doing it all in TUK directly sounds good. Let me know if you're working on that or want me to.

This means CANWrapper_Init will have to be called after osKernelInitialize in the subsystem code.
@Koloss0
Member

Koloss0 commented Jun 27, 2025

@ArnavG-it Hey again, wanted to update you. I've been a bit busy but I'm now taking this up again. I'm trying to find a solution that isn't too complex but gives us the control we need for doing all these low-level functions in the IO board and SOTI. It's taken a while but I think I'm getting somewhere.

Also, fun fact:

technically the ISR could fire again while it's in the middle of transmitting an ACK, which would cause the second ACK to not get transmitted.

This is incorrect. I found this out by looking into the Cortex-M4, the processor used in STM32s. If an interrupt occurs with the same priority as the one already being handled, it is deferred until the current handler completes and then serviced immediately, a mechanism known as tail-chaining. A higher-priority interrupt, however, can pre-empt the current one. Very interesting.

@ArnavG-it
Contributor Author

Thanks for the update. I've been a bit busy too but I'm ready to work on this once you think of the next steps.

Adds a separate API for advanced use cases. This maintains the
simplicity of the old API while enabling greater control with the
advanced API.

Other changes:
* Added significantly more documentation in the code.
* Added rx_callback & tx_callback in the advanced API.
* Renamed a few things.
* Added an 'Error Handler' thread.
* Replaced ErrorQueue ADT with an RTOS queue.
* Added CANWrapper_Transmit_Raw() in the advanced API.
* Added `body_size` field to CANMessage struct: This is a minor change
  which doesn't affect much. The benefits are that it feels just a bit
  more elegant in the code (entirely subjective), and it gives advanced
  users the possibility of transmitting illegally sized message bodies
  for testing purposes (if that is ever desired). It could also be
  convenient for normal users who want to copy the body contents to a
  buffer.

**DOES NOT COMPILE:** The commit was getting big the more I changed
things, so I pre-emptively created this commit. However, I still want
to make a few changes.

Those changes will be (potentially):
* Removing hcan from CANWrapper_InitTypeDef and using CAN_Start()
  in both APIs. This will slightly reduce the need for some #ifdefs.
* Making it compile.
* Looking into `HAL_CAN_TxMailbox0CompleteCallback`: This is an ISR
  which would allow us to execute code whenever a message completes
  transmission. This would be ideal for tx_callback, because
  currently it only gets called when a TX request is made, not when the
  message is sent.
@Koloss0
Member

Koloss0 commented Jun 28, 2025

@ArnavG-it Alright, I pushed the current state of what I was working on. I have to stop for a few hours, but I think it's coming along very well. I wrote down some things I was going to work on next. Feel free to pick up where I left off with the first two bullets if you have time. It should be fairly straightforward and would introduce you to the changes. The third bullet should probably be a separate issue. And all good if you can't!

Also footnote: For the first bullet, I didn't explicitly mention it but along with removing hcan from the init struct, I was also going to add hcan back to the callbacks and transmit function, thereby basically making support for multiple CAN peripherals part of the standard API. It probably won't get used by anyone except us, but it would reduce the #ifdefs a bit.

@ArnavG-it
Contributor Author

ArnavG-it commented Jun 28, 2025

This looks like it will block permanently if s_error_queue ever becomes empty:

void Error_Handler_Thread(void *argument)
{
	CANWrapper_ErrorInfo item;

	// Infinite loop
	while (1)
	{
		Poll_Timeouts();

		// Wait for the next item in the queue.
		if (osMessageQueueGet(s_error_queue, &item, NULL, osWaitForever) == osOK)
		{
			// Let the user handle it.
			s_init_struct.error_callback(&item);
		}
	}
}

We could essentially set a polling rate by doing this:

osMessageQueueGet(s_error_queue, &item, NULL, 10) // Timeout in ms

That's pretty simple and probably fine, but there might be a fun solution. Rather than using the hardware timer, we could use the RTOS kernel ticks for the message timestamps. Then the error handler task could use osDelayUntil() to wait for the next message's timeout, and fire the callback if the message hasn't been removed from the Tx cache. It would also block when the cache is empty and would have to be woken up by CANWrapper_Transmit_Raw when an item gets added, which should just be a simple thread flag.

This would get rid of the hardware timer (unless it's used elsewhere), the error queue, and the overflow handling in Poll_Timeouts, which was #24. I haven't thought about it too much so I might be missing something, but it could simplify some stuff if it works. Also, the kernel ticks won't be as precise as the timer, if that matters.

@Koloss0
Member

Koloss0 commented Jun 28, 2025

Ah, good catch!

It would be worth looking into how accurate the ticks would likely be. The decision for using a timer I believe came from:

  1. The need for a solution that accommodated both RTOS and non-RTOS subsystems (which no longer applies).
  2. Ensuring high enough accuracy.

But if the accuracy is sufficient, then that would be a much simpler and more robust approach.

@Koloss0 Koloss0 changed the title from "Refactor: Remove Incoming Message Pipeline" to "Refactor: Redesign API for Greater Control As Needed by SOTI & Testbed" on Jun 28, 2025
Also removes the error queue. Still have to remove the htim peripheral references.

Article about tick resolution:
https://www.freertos.org/Documentation/02-Kernel/05-RTOS-implementation-tutorial/02-Building-blocks/11-Tick-Resolution
@ArnavG-it
Contributor Author

ArnavG-it commented Jun 29, 2025

The Tx Cache is a bit problematic with RTOS.

The HAL_CAN_RxFifo0MsgPendingCallback ISR can erase from it:

if ((rx_behaviour & RX_CLEAR_TX_STORE) && item.msg.is_ack)
{
	// Clear the corresponding message in the TxCache if it exists.
	int index = TxCache_Find(&s_tx_cache, msg);
	if (index >= 0)
	{
		TxCache_Erase(&s_tx_cache, index);
	}
}

And Error_Handler_Thread needs to read/erase from it:

if (timestamp == s_tx_cache.items[0].timestamp)
{
	CANWrapper_ErrorInfo error;
	error.error = CAN_WRAPPER_ERROR_TIMEOUT;
	error.msg = s_tx_cache.items[0].msg;

	TxCache_Erase(&s_tx_cache, 0);

	s_init_struct.error_callback(&error);
}

This is true whether we use kernel ticks or the previous Poll_Timeouts approach. The case where a race condition happens seems very rare (an ack is received at the same time that the message times out), but there could be other less-obvious or future concurrency issues so this might be worth looking into.

I'll look into semaphores, although an ISR can't block on a semaphore, so I'm not sure if there's a solution there. Maybe there's some way to make erasing and reading messages atomic. Or maybe the removal of messages could be deferred from the ISR to a task, in which case semaphores would solve it. This probably isn't a priority, but I figured I should take note of it.

And make it compile. There is still one error in advanced mode as CANWrapper_Transmit references s_init_struct.node_id which only exists in normal mode.
@ArnavG-it
Contributor Author

@Koloss0 Let me know how you want to deal with CANWrapper_Transmit between the standard and advanced APIs. The particular issue is that the sender is inferred from the node ID in standard mode, but advanced mode has no node ID. We could only let advanced mode use Transmit_Raw, or have two versions of Transmit, or something else, up to you. And thank you for the changes! It's looking good.

@Koloss0
Member

Koloss0 commented Jun 29, 2025

@ArnavG-it

And thank you for the changes! It's looking good.

Thanks for your help as well!

Let me know how you want to deal with CANWrapper_Transmit between the standard and advanced APIs.

I was thinking Transmit_Raw() would act as the advanced version of Transmit(). So SOTI and the IO board would just use the raw one for all transmissions.

This probably isn't a priority but I figured I should take note of it.

No, absolutely! This was at the back of my mind as well as I was working on it. Race conditions are incredibly common in threaded environments so we need to stay vigilant. Thanks for already doing research on it!

@ArnavG-it
Contributor Author

ArnavG-it commented Jun 29, 2025

@Koloss0 Here are some notes about the accuracy of ticks if you're curious.

CAN messages are considered timed out after 3.6 milliseconds. The job of Error_Handler_Thread is to check the transmitted messages cache to see if any messages have timed out and fire the user's error handling callback. It does this by delaying until the next message's timeout has occurred and then checking whether the message is still in the cache. There are two cases to consider:

  1. The error handler wakes up before the message actually times out. This could incorrectly report a message that gets a late acknowledgement, even if the chance of that is small.

  2. The error handler delays too long. The issue here is that the timed-out message is reported late. I'm not sure how much of a problem that would be.

Assuming case 1 should be avoided at all costs, the error handler should delay an extra tick. Then there's a tradeoff between minimizing the extra delay time in case 2 and the overhead from a higher tick frequency.

Say a timer is started at tick 0. Here are two idealized examples:

  1. 1 ms = 1 tick (1 kHz, this is the default and what CDH's kernel is set to). A message with a timestamp of 1 will time out between ticks 4 and 6. This is because the message was sent between 1 and 2 ms on the timer, and will time out between 4.6 and 5.6 ms. The error handler should then delay until tick 6 to avoid case 1. It may wake up late, close to 7 ms on the timer, so it will have reported the message 0.4 to 2.4 ms late.

  2. 1 ms = 10 ticks (10 kHz). A message with a timestamp of 1 (sent between 0.1 and 0.2 ms on the timer) will time out between ticks 37 and 38 (in between 3.7 and 3.8 ms). The error handler will delay until tick 38, and may not wake up until right before tick 39, so it will have reported the message 0 to 0.2 ms late.

Hopefully that's all correct and makes sense. Essentially, timeouts will always be reported late in order to not be reported early, and we can just control how late that might be. I haven't looked into the exact overhead yet, but let me know what you think about that extra delay time. We could always go back to the timer if it's a problem.

@Koloss0
Member

Koloss0 commented Jun 30, 2025

@ArnavG-it Yeah, the extra delay shouldn't be a problem. None of these values are set in stone, it's all stuff that can be re-evaluated at a later date if we need to.

The only issue I'm spotting though, is that it's possible to have two messages with the same timestamp. That could lead to an unexpected edge case in the error handler thread. To fix it, there should be a loop to remove and handle items with matching timestamps until none are left.

@Koloss0
Member

Koloss0 commented Jun 30, 2025

Actually, it appears it would have been fine. I thought osDelayUntil might not properly handle past timestamps, which may lead to extremely long delays, but it seems to correctly distinguish it from overflows.

@ArnavG-it
Contributor Author

Actually, it appears it would have been fine. I thought osDelayUntil might not properly handle past timestamps, which may lead to extremely long delays, but it seems to correctly distinguish it from overflows.

Yeah, a benefit is we don't have to handle overflows. Thanks for double checking though. I do have to fix the timeout calculation in that thread, by the way, so ignore the fact that it makes no sense currently.

@Koloss0
Member

Koloss0 commented Jun 30, 2025

I do have to fix the timeout calculation in that thread, by the way, so ignore the fact that it makes no sense currently.

Wouldn't it work? The data type is a uint32_t, the same as the RTOS tick counter. So if it overflows, it should behave the same as the tick counter overflowing.

@ArnavG-it
Contributor Author

Wouldn't it work? The data type is a uint32_t, the same as the RTOS tick counter. So if it overflows, it should behave the same as the tick counter overflowing.

I was just accidentally adding the timeout in microseconds to a tick value. The above commit should have fixed that if I did it correctly.

Koloss0 added 3 commits June 30, 2025 15:47
Changes:
* Deleted CAN queue and error queue code.
* Now defines timeouts in terms of milliseconds for simplicity.
Slightly easier to comprehend.
You used the term "standard" and I thought that was a better name.
@Koloss0
Member

Koloss0 commented Jun 30, 2025

I think this is good enough for this PR. Regarding the TODOs,

@Koloss0 Koloss0 merged commit 0d5ad82 into main Jun 30, 2025
@Koloss0 Koloss0 deleted the refactor/53 branch June 30, 2025 22:24

Successfully merging this pull request may close these issues.

Improve Readability of Timeout Detection Code
