Skip to content

Conversation

@jukkar
Copy link
Member

@jukkar jukkar commented Oct 22, 2024

The sock_obj_core_dealloc() was not called if close() is called instead of zsock_close(). This happens if POSIX API is enabled.

Fix this by calling zvfs_close() from zsock_close() and then pass the socket number to zsock_close_ctx() so that the cleanup can be done properly.

Reported-by: Andreas Ålgård [email protected]

Fixes #80997

@jukkar
Copy link
Member Author

jukkar commented Oct 22, 2024

Using the ZVFS_MODE_IFSOCK mode flag for calling close2(void *, int) instead of close(int) is a bit problematic as that flag could be set for example in offloaded drivers (there are total 23 places in upstream alone setting this mode flag). Only couple of places out of those calls need to change to close2() where the socket descriptor is needed.
Also using this flag to determine whether we call close2 vs close is error prone because then out-of-tree drivers using this flag would not work any more without changes. So I think we cannot change the semantics of that flag here just like that.
If we could add new values for the mode list, then we could use that value to determine whether to call close2 vs close.

@jukkar jukkar force-pushed the fix/socket-close-cleanup branch from 3484b2f to 335aa81 Compare October 23, 2024 12:00
@jukkar
Copy link
Member Author

jukkar commented Oct 23, 2024

  • Introduced a new mode flag for the needed close2 call. We cannot really use the IFSOCK flag to determine if we need to call close2 as that could affect out of tree drivers (typically offloaded network drivers need to set these).

@jukkar jukkar requested a review from rlubos October 23, 2024 12:02
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, this seems like it should be unnecessary

Copy link
Member Author

@jukkar jukkar Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the extra flag, but this can be super confusing to a future developer looking at this code.
What I mean by this is that some of the networking code is now using close(void*) callback variant and some (currently sockets_inet.c, sockets_tls.c and sockets_packet.c) are using the close2(void*, int) callback variant in order to fix the resource leak. In both cases the fdtable will call the close2(void*, int) if IFSOCK is set but the 2nd parameter is ignored most of the cases (I would have guessed compiler to give a warning in this case). For a future self looking this code in couple of years, this might probably look like a bug rather than a feature 😕

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there will be a fair bit of refactorization here between now and a couple of years from now, mainly because FS also needs to get a bump in the near term for PSE52.

So consumers of close() can be upgraded to close2(), and the older version can be deprecated. Additionally, that would be a good time to introduce something like a k_off_t type and deprecate the function signatures that use ssize_t and off_t.

@jukkar jukkar force-pushed the fix/socket-close-cleanup branch from 335aa81 to 8cc4e8c Compare October 28, 2024 08:19
The sock_obj_core_dealloc() was not called if close() is called
instead of zsock_close(). This happens if POSIX API is enabled.

Fix this by calling zvfs_close() from zsock_close() and then
pass the socket number to zsock_close_ctx() so that the cleanup
can be done properly.

Reported-by: Andreas Ålgård <[email protected]>
Signed-off-by: Jukka Rissanen <[email protected]>
@jukkar jukkar force-pushed the fix/socket-close-cleanup branch from 8cc4e8c to fe7b1f9 Compare October 28, 2024 14:13
@jukkar jukkar requested a review from cfriedt October 28, 2024 14:26
@jukkar
Copy link
Member Author

jukkar commented Oct 28, 2024

@icsys-aal I tested the code in this PR and did not spot any leaks, but if you have time, please try the PR to verify that your use case works too.

@cfriedt
Copy link
Member

cfriedt commented Oct 28, 2024

LGTM - thanks for this @jukkar

@jukkar jukkar added this to the v4.0.0 milestone Oct 29, 2024
@jukkar jukkar added the bug The issue is a bug, or the PR is fixing a bug label Nov 1, 2024
@icsys-aal
Copy link
Contributor

icsys-aal commented Nov 4, 2024

@icsys-aal I tested the code in this PR and did not spot any leaks, but if you have time, please try the PR to verify that your use case works too.

Hello, I apologize for not getting back to you sooner. Your fix looks good, however in our use case we are using a strange offloaded device, which seems to return a nonzero return code on offload_put, which means this never runs and the leak persists. Should we call sock_obj_core_dealloc regardless of the return value?

@jukkar
Copy link
Member Author

jukkar commented Nov 5, 2024

Your fix looks good, however in our use case we are using a strange offloaded device

@icsys-aal what offloaded driver is that? Perhaps there is a bug in it.

@icsys-aal
Copy link
Contributor

icsys-aal commented Nov 5, 2024

@icsys-aal what offloaded driver is that? Perhaps there is a bug in it.

The driver in question is the winc1500 driver.

@jukkar
Copy link
Member Author

jukkar commented Nov 6, 2024

however in our use case we are using a strange offloaded device, which seems to return a nonzero return code on offload_put, which means this never runs and the leak persists. Should we call sock_obj_core_dealloc regardless of the return value?

I briefly looked the Atmel HAL and its socket close implementation in
https://github.com/zephyrproject-rtos/hal_atmel/blob/56d60ebc909ad065bf6554cee73487969857614b/asf/common/components/wifi/winc1500/socket/source/socket.c#L908
and followed the call flow to hif_send() here
https://github.com/zephyrproject-rtos/hal_atmel/blob/56d60ebc909ad065bf6554cee73487969857614b/asf/common/components/wifi/winc1500/driver/source/m2m_hif.c#L273
According to the documentation there, the function should return 0 for successful operation. In the offloaded put function in

static int winc1500_put(struct net_context *context)

we just return the value from the HAL.
I think we should do the cleanup in Zephyr socket implementation only if offloaded/native code is also ok with that (so the net_context_put returns 0). The offloaded driver could decide to return 0 if the HAL is giving some bogus value if we want to fix this properly.

@jukkar
Copy link
Member Author

jukkar commented Nov 8, 2024

@mmahadevan108 / @dkalowsk Could we get this merged to 4.0?

@mmahadevan108
Copy link
Contributor

@jukkar , waiting for approval from the PR assignees @nashif , @andyross

@mmahadevan108 mmahadevan108 merged commit 04d8b7c into zephyrproject-rtos:main Nov 8, 2024
28 checks passed
@jukkar jukkar deleted the fix/socket-close-cleanup branch January 14, 2025 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: Base OS Base OS Library (lib/os) area: Networking area: Sockets Networking sockets bug The issue is a bug, or the PR is fixing a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The Posix API socket close() does not do cleanup properly

8 participants