
Refactor: Redesign API for Greater Control As Needed by SOTI & Testbed #54

Merged
Koloss0 merged 21 commits into main from refactor/53 on Jun 30, 2025

Conversation

@ArnavG-it
Contributor

@ArnavG-it ArnavG-it commented Jun 3, 2025

Multiple API changes to support the test interfaces and improve timing:

  • Replaced Polling with RTOS

    • Every subsystem using the CAN Wrapper will have to enable RTOS and move their superloop code into the default task (or multiple tasks if they choose).
    • The CAN Wrapper manages its own tasks and queues once initialized.
    • Callbacks for message processing and error handling occur in a task context, not ISRs.
  • Support for multiple CAN lines

    • Subsystems will use CANWrapper_CAN_Start to initialize their CAN peripheral, and pass the hcan pointer to the CAN Wrapper's functions.
    • The testbed (and potentially SOTI) are able to use multiple CAN peripherals.
  • Replaced Modes with Standard and Advanced APIs

    • Subsystems will use the standard API.
    • SOTI and the testbed's IO board can use the advanced API to access Tx and Rx callbacks as well as transmit custom messages. The user's node ID isn't tracked in the advanced API.
  • Replaced Hardware Timer with RTOS Ticks*

    • Users don't need to enable a hardware timer to use the CAN Wrapper.

Other notes:

  • Subsystems and SOTI must define the symbol STM32L4. The IO board must define STM32F7.
  • Added a body_size field to the CANMessage type.
  • Removed the CDH-specific message task.

TODO:

  • *Look into the RTOS tick resolution and/or overhead of running ticks at a high frequency to see if they're reliable for the 3.6 ms message timeout.
  • Look into improvements for the Tx Cache, given this race condition and other possible problems.
  • Remove unneeded references and definitions.
  • Update the wiki.

The changes are being tested on this testbed branch.

Closes #24, #52.
#53 should probably be closed as "not planned".

Koloss0 added 2 commits June 7, 2025 12:28
* Cleaned up and renamed some things.
* Filter messages by recipient ID by default, disabling with
  CWM_DISABLE_FILTERING.
* Replaced "RTOS Mode" with CWM_DEFINE_RTOS_TASK symbol.
@Koloss0
Member

Koloss0 commented Jun 7, 2025

Made some edits. However, there are still some compiler errors from the calls to CANWrapper_Transmit_Ack() missing CAN handles. I haven't had the chance to figure out a proper solution yet.

@ArnavG-it
Contributor Author

ArnavG-it commented Jun 7, 2025

Thanks for the changes! If you do add the CAN handle as an argument for CANWrapper_Poll_CAN_Queue, you could also add back the if (userHandle == msgHandle) to wrap the message processing which was previously in the reception ISR. If it's even needed.

@Koloss0
Member

Koloss0 commented Jun 12, 2025

Hey, sorry for disappearing. I've actually been contemplating this line specifically.

As you can see, it transmits ACKs for each message before processing it. Which means that if any foreseeable command we implement takes longer than the 3.6 millisecond timeout, we could be at risk of failing to acknowledge messages in time.

Now, 3.6 milliseconds is a looong time for an 80 MHz CPU, but delays could happen. Take for example:

  • Faulty loops
  • Any kind of polling from hardware, such as ADCs
  • Transmitting to CAN (especially repeatedly)
  • Long ISRs causing delays

In fact, this one ISR in Payload might just be the evidence I needed to feel worried about not using an RTOS. It polls 32 individual ADCs in succession while sending 32 CAN messages. Just one CAN message would theoretically take 0.224 milliseconds with our current baud rate. And 0.224 x 32 = 7.168 ms, well over the timeout. And this is hooked up to a timer so it can happen at any time.

The funny thing is I actually wrote this code myself, and I'm only now realizing how bad of an idea that is. But even if we took it out of the ISR and implemented our own pseudo-scheduling, it would still need to get executed, and we'd probably have to sprinkle calls for sending ACKs all over the code to ensure delays don't get in the way. At that point we're basically re-inventing an RTOS, and there's still no guarantee that we caught everything.

Also, this ISR is for telemetry reports, which is something that every subsystem will have to implement.

So, the more I think about this the more convinced I am that all subsystems should probably use an RTOS, and we should probably require an RTOS for managing ACKs. It's the only way to ensure with confidence that ACKs are returned in a timely manner.

@ArnavG-it
Contributor Author

@Koloss0 That's a good point, thanks for laying it out. Mind explaining why ACKs can't be transmitted in the Rx ISR? Is it just to minimize the time spent in it? Either way, I see what you mean and it does seem like scheduling/RTOS has a lot of benefits.

@Koloss0
Member

Koloss0 commented Jun 12, 2025

It's to minimize the time spent in it, yes. The general advice seems to be that keeping ISRs as short as possible is good practice, so they don't interfere with other polling/IO operations. Also, I don't know how likely this is, but technically the ISR could fire again while it's in the middle of transmitting an ACK, which would cause the second ACK to not get transmitted. So we're keeping it as short as possible by just loading messages into a queue for later processing.

Switched implementation to be RTOS-only based on the rationale from this
comment:

#54 (comment)
@Koloss0
Member

Koloss0 commented Jun 13, 2025

@ArnavG-it Okay, I made a new commit that addresses the points we raised. For simplicity, TUK will now only support RTOS. I think it's the right way to go.

I tried to just get my ideas down for how the logic would work and didn't bother getting it to compile. Could you take over from here for me and get it working? It would be much appreciated.

@ArnavG-it
Contributor Author

Awesome, I'll definitely take a look tomorrow.

Also added some task attributes and an osMessageQueueNew wrapper.
@ArnavG-it
Contributor Author

@Koloss0
Member

Koloss0 commented Jun 14, 2025

Amazing, thank-you!

For the creation of tasks and queues, initially I was going to let the user create them with the settings they want and pass them into the module.

CDH uses this screen for managing all their tasks and queues, so they could use it like normal and pass them to the module:

[image: CDH's task and queue configuration screen]

However, when I think about it, it's possible we may want to add even more queues to TUK (take the error queue, or maybe txCache, for example). If that's the case, we might be introducing a lot of queues that the user isn't really in a position to choose good settings for. And if we ever want to change them later, it would be slow and inefficient to manually go through each subsystem ourselves.

So maybe we should hide the creation of those things in the CANWrapper_Init() function. Then we have control over it right here in this repository. The user would just have to make sure it's between osKernelInitialize() and osKernelStart().

So, I would say:

  • Don't even have the queues in the CANWrapper_InitTypedef
  • The osMessageQueueNew() and osThreadNew() calls can also just go in CANWrapper_Init().

@ArnavG-it
Contributor Author

That makes sense. I was thinking about some similar stuff, which is why I didn't use the .ioc for the IO board example. Doing it all in TUK directly sounds good. Let me know if you're working on that or want me to.

This means CANWrapper_Init will have to be called after osKernelInitialize in the subsystem code.
@Koloss0
Member

Koloss0 commented Jun 27, 2025

@ArnavG-it Hey again, wanted to update you. I've been a bit busy but I'm now taking this up again. I'm trying to find a solution that isn't too complex but gives us the control we need for doing all these low-level functions in the IO board and SOTI. It's taken a while but I think I'm getting somewhere.

Also, fun fact:

technically the ISR could fire again while it's in the middle of transmitting an ACK, which would cause the second ACK to not get transmitted.

This is incorrect. I found this out by looking into the Cortex-M4, the processor used in STM32s. If an interrupt occurs with the same priority as the one already being handled, it is deferred until the current handler completes and then serviced immediately, a mechanism known as tail-chaining. A higher-priority interrupt, however, can pre-empt the current one. Very interesting.

@ArnavG-it
Contributor Author

Thanks for the update. I've been a bit busy too but I'm ready to work on this once you think of the next steps.

Adds a separate API for advanced use cases. This maintains the
simplicity of the old API while enabling greater control with the
advanced API.

Other changes:
* Added significantly more documentation in the code.
* Added rx_callback & tx_callback in the advanced API.
* Renamed a few things.
* Added an 'Error Handler' thread.
* Replaced ErrorQueue ADT with an RTOS queue.
* Added CANWrapper_Transmit_Raw() in the advanced API.
* Added `body_size` field to CANMessage struct: This is a minor change
  which doesn't affect much. The benefits are that it feels just a bit
  more elegant in the code (entirely subjective), and it gives advanced
  users the possibility of transmitting illegally sized message bodies
  for testing purposes (if that is ever desired). It could also be
  convenient for normal users who want to copy the body contents to a
  buffer.

**DOES NOT COMPILE:** The commit was getting big the more I changed
things, so I pre-emptively created this commit. However, I still want
to make a few changes.

Those changes will be (potentially):
* Removing hcan from CANWrapper_InitTypeDef and using CAN_Start()
  in both APIs. This will slightly reduce the need for some #ifdefs.
* Making it compile.
* Looking into `HAL_CAN_TxMailbox0CompleteCallback`: This is an ISR
  which would allow us to execute code whenever a message completes
  transmission. This would be ideal for tx_callback, because
  currently it only gets called when a TX request is made, not when the
  message is sent.
@Koloss0
Member

Koloss0 commented Jun 28, 2025

@ArnavG-it Alright, I pushed the current state of what I was working on. I have to stop for a few hours, but I think it's coming along very well. I wrote down some things I was going to work on next. Feel free to pick up where I left off with the first two bullets if you have time. It should be fairly straightforward and would introduce you to the changes. The third bullet should probably be a separate issue. And all good if you can't!

Also footnote: For the first bullet, I didn't explicitly mention it but along with removing hcan from the init struct, I was also going to add hcan back to the callbacks and transmit function, thereby basically making support for multiple CAN peripherals part of the standard API. It probably won't get used by anyone except us, but it would reduce the #ifdefs a bit.

@ArnavG-it
Contributor Author

ArnavG-it commented Jun 28, 2025

This looks like it will block permanently if s_error_queue ever becomes empty:

void Error_Handler_Thread(void *argument)
{
	CANWrapper_ErrorInfo item;

	// Infinite loop
	while (1)
	{
		Poll_Timeouts();

		// Wait for the next item in the queue.
		if (osMessageQueueGet(s_error_queue, &item, NULL, osWaitForever) == osOK)
		{
			// Let the user handle it.
			s_init_struct.error_callback(&item);
		}
	}
}

We could essentially set a polling rate by doing this:

osMessageQueueGet(s_error_queue, &item, NULL, 10) // Timeout in ms

That's pretty simple and probably fine, but there might be a fun solution. Rather than using the hardware timer, we could use the RTOS kernel ticks for the message timestamps. Then the error handler task could use osDelayUntil() to wait for the next message's timeout, and fire the callback if the message hasn't been removed from the Tx cache. It would also block when the cache is empty and would have to be woken up by CANWrapper_Transmit_Raw when an item gets added, which should just be a simple thread flag.

This would get rid of the hardware timer (unless it's used elsewhere), the error queue, and the overflow handling in Poll_Timeouts, which was #24. I haven't thought about it too much so I might be missing something, but it could simplify some stuff if it works. Also, the kernel ticks won't be as precise as the timer, if that matters.

@Koloss0
Member

Koloss0 commented Jun 28, 2025

Ah, good catch!

It would be worth looking into how accurate the ticks would likely be. The decision for using a timer I believe came from:

  1. The need for a solution that accommodated both RTOS and non-RTOS subsystems (which no longer applies).
  2. Ensuring high enough accuracy.

But if the accuracy is sufficient, then that would be a much simpler and more robust approach.

@Koloss0 Koloss0 changed the title from "Refactor: Remove Incoming Message Pipeline" to "Refactor: Redesign API for Greater Control As Needed by SOTI & Testbed" on Jun 28, 2025
Also removes the error queue. Still have to remove the htim peripheral references.

Article about tick resolution:
https://www.freertos.org/Documentation/02-Kernel/05-RTOS-implementation-tutorial/02-Building-blocks/11-Tick-Resolution
@ArnavG-it
Contributor Author

ArnavG-it commented Jun 29, 2025

The Tx Cache is a bit problematic with RTOS.

The HAL_CAN_RxFifo0MsgPendingCallback ISR can erase from it:

if ((rx_behaviour & RX_CLEAR_TX_STORE) && item.msg.is_ack)
{
	// Clear the corresponding message in the TxCache if it exists.
	int index = TxCache_Find(&s_tx_cache, msg);
	if (index >= 0)
	{
		TxCache_Erase(&s_tx_cache, index);
	}
}

And Error_Handler_Thread needs to read/erase from it:

if (timestamp == s_tx_cache.items[0].timestamp)
{
	CANWrapper_ErrorInfo error;
	error.error = CAN_WRAPPER_ERROR_TIMEOUT;
	error.msg = s_tx_cache.items[0].msg;

	TxCache_Erase(&s_tx_cache, 0);

	s_init_struct.error_callback(&error);
}

This is true whether we use kernel ticks or the previous Poll_Timeouts approach. The case where a race condition happens seems very rare (an ack is received at the same time that the message times out), but there could be other less-obvious or future concurrency issues so this might be worth looking into.

I'll look into semaphores, although an ISR can't block on a semaphore, so I'm not sure if there's a solution there. Maybe there's some way to make erasing and reading messages atomic. Or maybe the removal of messages could be deferred from the ISR to a task, in which case semaphores would solve it. This probably isn't a priority, but I figured I should take note of it.

And make it compile. There is still one error in advanced mode as CANWrapper_Transmit references s_init_struct.node_id which only exists in normal mode.
@ArnavG-it
Contributor Author

@Koloss0 Let me know how you want to deal with CANWrapper_Transmit between the standard and advanced APIs. The particular issue is that the sender is inferred from the node ID in standard mode, but advanced mode has no node ID. We could only let advanced mode use Transmit_Raw, or have two versions of Transmit, or something else, up to you. And thank you for the changes! It's looking good.

@Koloss0
Member

Koloss0 commented Jun 29, 2025

@ArnavG-it

And thank you for the changes! It's looking good.

Thanks for your help as well!

Let me know how you want to deal with CANWrapper_Transmit between the standard and advanced APIs.

I was thinking Transmit_Raw() would act as the advanced version of Transmit(). So SOTI and the IO board would just use the raw one for all transmissions.

This probably isn't a priority but I figured I should take note of it.

No, absolutely! This was at the back of my mind as well as I was working on it. Race conditions are incredibly common in threaded environments so we need to stay vigilant. Thanks for already doing research on it!

@ArnavG-it
Contributor Author

ArnavG-it commented Jun 29, 2025

@Koloss0 Here are some notes about the accuracy of ticks if you're curious.

CAN messages are considered timed out after 3.6 milliseconds. The job of Error_Handler_Thread is to check the transmitted messages cache to see if any messages have timed out and fire the user's error handling callback. It does this by delaying until the next message's timeout has occurred and then checking whether the message is still in the cache. There are two cases to consider:

  1. The error handler wakes up before the message actually times out. This could incorrectly report a message that gets a late acknowledgement, even if the chance of that is small.

  2. The error handler delays too long. The issue here is that the timed-out message is reported late. I'm not sure how much of a problem that would be.

Assuming case 1 should be avoided at all costs, the error handler should delay an extra tick. Then there's a tradeoff between minimizing the extra delay time in case 2 and the overhead from a higher tick frequency.

Say a timer is started at tick 0. Here are two idealized examples:

  1. 1 ms = 1 tick (1 kHz, this is the default and what CDH's kernel is set to). A message with a timestamp of 1 will time out between ticks 4 and 6. This is because the message was sent between 1 and 2 ms on the timer, and will time out between 4.6 and 5.6 ms. The error handler should then delay until tick 6 to avoid case 1. It may wake up late, close to 7 ms on the timer, so it will have reported the message 0.4 to 2.4 ms late.

  2. 1 ms = 10 ticks (10 kHz). A message with a timestamp of 1 (sent between 0.1 and 0.2 ms on the timer) will time out between ticks 37 and 38 (in between 3.7 and 3.8 ms). The error handler will delay until tick 38, and may not wake up until right before tick 39, so it will have reported the message 0 to 0.2 ms late.

Hopefully that's all correct and makes sense. Essentially, timeouts will always be reported late in order to not be reported early, and we can just control how late that might be. I haven't looked into the exact overhead yet, but let me know what you think about that extra delay time. We could always go back to the timer if it's a problem.

@Koloss0
Member

Koloss0 commented Jun 30, 2025

@ArnavG-it Yeah, the extra delay shouldn't be a problem. None of these values are set in stone, it's all stuff that can be re-evaluated at a later date if we need to.

The only issue I'm spotting though, is that it's possible to have two messages with the same timestamp. That could lead to an unexpected edge case in the error handler thread. To fix it, there should be a loop to remove and handle items with matching timestamps until none are left.

@Koloss0
Member

Koloss0 commented Jun 30, 2025

Actually, it appears it would have been fine. I thought osDelayUntil might not properly handle past timestamps, which may lead to extremely long delays, but it seems to correctly distinguish it from overflows.

@ArnavG-it
Contributor Author

Actually, it appears it would have been fine. I thought osDelayUntil might not properly handle past timestamps, which may lead to extremely long delays, but it seems to correctly distinguish it from overflows.

Yeah, a benefit is we don't have to handle overflows. Thanks for double checking though. I do have to fix the timeout calculation in that thread, by the way, so ignore the fact that it makes no sense currently.

@Koloss0
Member

Koloss0 commented Jun 30, 2025

I do have to fix the timeout calculation in that thread, by the way, so ignore the fact that it makes no sense currently.

Wouldn't it work? The data type is a uint32_t, the same as the RTOS tick counter. So if it overflows, it should behave the same as the tick counter overflowing.

@ArnavG-it
Contributor Author

Wouldn't it work? The data type is a uint32_t, the same as the RTOS tick counter. So if it overflows, it should behave the same as the tick counter overflowing.

I was just accidentally adding the timeout in microseconds to a tick value. The above commit should have fixed that if I did it correctly.

Koloss0 added 3 commits June 30, 2025 15:47
Changes:
* Deleted CAN queue and error queue code.
* Now defines timeouts in terms of milliseconds for simplicity.
Slightly easier to comprehend.
You used the term "standard" and I thought that was a better name.
@Koloss0
Member

Koloss0 commented Jun 30, 2025

I think this is good enough for this PR. Regarding the TODOs,

@Koloss0 Koloss0 merged commit 0d5ad82 into main Jun 30, 2025
@Koloss0 Koloss0 deleted the refactor/53 branch June 30, 2025 22:24

Successfully merging this pull request may close these issues.

Improve Readability of Timeout Detection Code
