TCTI fails with “Resource temporarily unavailable” (EAGAIN) errors #2949

@danieltrick

We have observed the following problem:

When using the mssim TCTI, “Resource temporarily unavailable” errors occur regularly. Often 2 out of 3 runs will fail!

For example, this command:

tpm2_load -T 'mssim:host=192.168.178.47,port=2323' -C 0x81000006 -P 12345 -u ecc.pub -r ecc.priv -c ecc.ctx

fails with the following error message:

WARNING:tcti:src/util-io/io.c:66:read_all() read on fd 3 failed with errno 11: Resource temporarily unavailable
ERROR:esys:src/tss2-esys/api/Esys_ContextSave.c:251:Esys_ContextSave_Finish() Received a non-TPM Error
ERROR:esys:src/tss2-esys/api/Esys_ContextSave.c:92:Esys_ContextSave() Esys Finish ErrorCode (0x000a000a)
ERROR: Esys_ContextSave(0xA000A) - tcti:IO failure

Note that “Resource temporarily unavailable” is the message text for the EAGAIN error (errno 11 on Linux).


I think the reason this can happen is the way tcti_mssim_receive() is currently implemented: it first calls poll() on the network socket until it becomes "ready for reading", and once that has happened, it attempts to recv() the full response message. The read is wrapped in the socket_recv_buf() function, which just calls the read_all() function.

There are, to my understanding, at least two ways how this can go wrong:

  • If poll() signals that the network socket is "ready for reading", it means that some bytes can be read now, but it does not guarantee that the full message is available yet. Nonetheless, the subsequent read_all() always attempts to read the full message by repeatedly calling recv(). If the rest of the message has not arrived yet, read_all() fails with an EAGAIN error instead of blocking and waiting, because the socket was opened in O_NONBLOCK mode. And that is, I suppose, precisely what we are seeing.

  • At least on the Linux platform, the poll() and select() functions may produce a so-called "spurious readiness notification". This means that a socket may be reported as "ready for reading" although a subsequent read() would still block, because no data is actually available. In O_NONBLOCK mode, recv() or read() will fail with EAGAIN in this situation.

    For reference, please see the "BUGS" sections at:


At the core of the problem is that the TEMP_RETRY macro does not currently handle the EAGAIN (and EWOULDBLOCK) errors, at least on the Linux platform. It appears there is some handling on FreeBSD already 🤔
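
The first failure mode can be reproduced outside of tpm2-tss. The following is only a minimal sketch: naive_read_all() and demo_partial_message() are hypothetical names, not the actual code from io.c, and a local socketpair() stands in for the simulator connection. It makes the socket non-blocking, delivers only 4 of 10 expected bytes, waits for poll() to report readiness, and then tries to read the full message while retrying only on EINTR:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a read_all()-style helper: keeps calling recv()
 * until len bytes have arrived and retries only on EINTR. */
static ssize_t naive_read_all(int fd, unsigned char *buf, size_t len)
{
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t n = recv(fd, buf + done, len - done, 0);
        if (n < 0 && errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        if (n < 0)
            return -1;              /* EAGAIN/EWOULDBLOCK ends up here */
        if (n == 0)
            break;                  /* peer closed the connection */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Returns the errno seen by naive_read_all() when only 4 of the 10 expected
 * bytes have arrived, 0 if the read succeeds, or -1 if the setup fails. */
static int demo_partial_message(void)
{
    int sv[2];
    unsigned char buf[10];
    int err = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    int flags = fcntl(sv[0], F_GETFL);
    if (flags >= 0 && fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) == 0 &&
        write(sv[1], "abcd", 4) == 4) {
        struct pollfd pfd = { .fd = sv[0], .events = POLLIN };
        if (poll(&pfd, 1, 1000) == 1) {     /* socket reported as readable */
            ssize_t n = naive_read_all(sv[0], buf, sizeof buf);
            err = (n < 0) ? errno : 0;      /* save errno before close() */
        }
    }
    close(sv[0]);
    close(sv[1]);
    return err;
}
```

Calling demo_partial_message() from a small main() should yield EAGAIN on Linux: the first recv() returns the 4 available bytes, and the second one fails because nothing more has arrived yet.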

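The following patch contains a simple workaround that has fixed the “Resource temporarily unavailable” problem for us. It makes TEMP_RETRY also retry on EAGAIN and EWOULDBLOCK, sleeping 100 µs between attempts and giving up after at most 32767 attempts (about 3.3 seconds of sleeping in the worst case):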

diff --git a/src/util-io/io.h b/src/util-io/io.h
index 595177d3..dc9a35fa 100644
--- a/src/util-io/io.h
+++ b/src/util-io/io.h
@@ -44,11 +44,12 @@ typedef SSIZE_T ssize_t;
     dest =__ret; }
 #else
 #define TEMP_RETRY(dest, exp) \
-{   int __ret; \
+{   int __ret, __err = 0; \
     do { \
+        if (__err > 0) usleep(100U); \
         __ret = exp; \
-    } while (__ret == SOCKET_ERROR && errno == EINTR); \
-    ((dest)) =__ret; }
+    } while ((__ret == SOCKET_ERROR) && (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK) && (++__err < 32767)); \
+    ((dest)) = __ret; }
 #endif
 
 #ifdef __cplusplus

I think the preferable solution would be to go back to polling whenever it turns out that no data or insufficient data is available for reading, while keeping the partial message that has already been read. But that would probably require some more significant changes.
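
To illustrate the idea, here is a rough sketch of such a receive loop. It is not a proposed patch: recv_full() and demo_two_chunks() are hypothetical names, the timeout handling is an assumption (it applies to each wait, not to the whole message), and the real TCTI would need to integrate this with its own state handling. The loop keeps the bytes it has already received and returns to poll() whenever recv() reports EAGAIN or EWOULDBLOCK:

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: reads exactly len bytes from a non-blocking socket.
 * Returns the number of bytes read (short only if the peer closed), or -1
 * with errno set. timeout_ms limits each individual wait in poll(). */
static ssize_t recv_full(int fd, unsigned char *buf, size_t len, int timeout_ms)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = recv(fd, buf + done, len - done, 0);
        if (n > 0) {
            done += (size_t)n;      /* keep the partial message */
            continue;
        }
        if (n == 0)
            break;                  /* peer closed the connection */
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;              /* real I/O error */

        /* Nothing to read right now: go back to polling. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc;
        do {
            rc = poll(&pfd, 1, timeout_ms);
        } while (rc < 0 && errno == EINTR);
        if (rc < 0)
            return -1;
        if (rc == 0) {
            errno = ETIMEDOUT;      /* no data within timeout_ms */
            return -1;
        }
    }
    return (ssize_t)done;
}

/* Demo: a child process sends a 10-byte message in two chunks with a pause
 * in between. Returns 0 if the parent receives the full message. */
static int demo_two_chunks(void)
{
    int sv[2];
    unsigned char buf[10];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    int flags = fcntl(sv[0], F_GETFL);
    pid_t pid = -1;
    if (flags >= 0 && fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) == 0)
        pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {                 /* child: the "simulator" side */
        close(sv[0]);
        int ok = write(sv[1], "abcd", 4) == 4;
        poll(NULL, 0, 50);          /* sleep for 50 ms */
        ok = ok && write(sv[1], "efghij", 6) == 6;
        _exit(ok ? 0 : 1);
    }
    close(sv[1]);
    ssize_t n = recv_full(sv[0], buf, sizeof buf, 2000);
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return (n == 10 && memcmp(buf, "abcdefghij", 10) == 0) ? 0 : -1;
}
```

Compared with sleeping inside TEMP_RETRY, this waits only as long as needed and also covers spurious readiness notifications, since a failed recv() simply leads back to poll().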

Regards.
