Skip to content

Commit 684cf34

Browse files
committed
Restart the TCP/IP stack on crash.
The TCP/IP stack currently has a lot of state and cannot recover gracefully in case of failure. This commit addresses this limitation. We introduce a custom error handler for the TCP/IP stack. At a high-level, this error handler terminates all user threads currently present in the TCP/IP compartment and resets the IP thread to its entry point `ip_thread_entry`. It does so by setting all the locks of the compartment for destruction and notifying all futex waiters. Then it frees all heap allocations and resets globals. After all the state has been cleaned up, it restarts the network stack.
1 parent 136787e commit 684cf34

15 files changed

+1244
-259
lines changed

README.md

Lines changed: 34 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -30,8 +30,8 @@ The new code in this repository is around 3% of the size of the large components
3030
Compartmentalisation
3131
--------------------
3232

33-
The initial implementation has five compartments.
34-
Three are mostly existing third-party code with thin wrappers:
33+
The initial implementation has six compartments.
34+
Four are mostly existing third-party code with thin wrappers:
3535

3636
- The TCP/IP stack is in a compartment.
3737
The FreeRTOS code was originally written on the assumption of a single security domain and separating it would require refactoring that would be hard to keep up to date.
@@ -75,14 +75,17 @@ graph TD
7575
classDef ThirdParty fill: #e44
7676
```
7777

78-
The TCP/IP stack has a lot of state and cannot yet recover gracefully in case of failure, but failure is at least isolated in the caller.
79-
In contrast, the TLS compartment is almost completely stateless.
80-
This gives strong flow isolation properties: Even if an attacker compromises the TLS compartment by sending malicious data over one connection that triggers a bug in BearSSL (unlikely), it is extraordinarily difficult for them to interfere with any other TLS connection.
78+
The TCP/IP stack is a large compartment with a lot of state.
79+
It is fault-tolerant: when an error is triggered (CHERI spatial or temporal safety fault, assertion), the compartment is automatically reset to a pristine state and restarted.
80+
We expand on this capability [below](#automatic-restart-of-the-tcpip-stack).
81+
82+
Unlike the TCP/IP stack, the TLS compartment is almost completely stateless.
83+
This makes resetting the compartment trivial, and gives strong flow isolation properties: Even if an attacker compromises the TLS compartment by sending malicious data over one connection that triggers a bug in BearSSL (unlikely), it is extraordinarily difficult for them to interfere with any other TLS connection.
8184

8285
Similarly, the firewall is controlled by the Network API compartment.
8386
The TCP/IP stack has no access to the control-plane interface for the compartment.
8487
A compromise that gets arbitrary-code execution in the network stack cannot open new firewall holes (to join a DDoS botnet such as [Mirai](https://en.wikipedia.org/wiki/Mirai_(malware)), for example).
85-
Note that there are currently some technical limitations to this, see the note below.
88+
Note that there are currently some technical limitations to this, see the discussion below.
8689
The worst it can do to rest of the system is provide malicious data, but a system using TLS will have HMACs on received messages and so this is no worse than a malicious packet being injected from the network.
8790

8891
All of this is on top of the spatial and temporal safety properties that the CHERIoT platform provides at a base level.
@@ -224,3 +227,28 @@ Now that you know that the SNTP compartment is the only one that can send and re
224227

225228
If you've modified the SNTP compartment to point to your NTP service and use its authentication credentials, then this should be different.
226229
This can all be part of your firmware's auditing policy.
230+
231+
Automatic restart of the TCP/IP stack
232+
-------------------------------------
233+
234+
We designed the TCP/IP stack to automatically and transparently restart on failure (e.g., a CHERI fault or an assertion).
235+
The restart procedure broadly works like this (simplified for didactic reasons):
236+
237+
1. The error handler of the TCP/IP compartment is triggered and starts the reset procedure.
238+
2. It first sets a flag to prevent any new thread from entering the compartment.
239+
3. Then, it sets all synchronization primitives of the compartment (locks, futexes) for destruction. This wakes up sleeping threads present in the compartment, and prevents them from blocking again.
240+
4. Then, it waits for all threads present in the compartment (apart from the FreeRTOS network thread) to exit, either through normal control-flow, or by crashing.
241+
5. Finally, it frees all the memory of the compartment, resets all global state, and calls the start function of the network stack, which restarts the TCP/IP stack into a pristine working state.
242+
6. After the reset, any further call to the socket API (apart from `network_socket_close`) with an old socket from the previous instance of the network stack will be detected and failed with an `-ENOTCONN` code. This pushes callers to close the sockets and create new ones with the new instance of the TCP/IP stack.
243+
244+
The implementation details of the reset slightly deviate from this description.
245+
See the technical documentation in `tcpip_error_handler.h` for a full perspective.
246+
247+
Note that the current implementation of the automatic reset makes a few assumptions:
248+
249+
- The TCP/IP stack cannot currently recover from a crash due to a stack overflow in the TCP/IP compartment. This is due to a limitation of the implementation of the switcher, which cannot trigger the error handler on stack overflow. This limitation should be addressed soon.
250+
- A small set of globals called 'reset-critical' outlive resets and/or are necessary for the reset. We assume that this data has not been corrupted. This data is correspondingly annotated in the source code.
251+
- The control-flow of threads in the compartment has not been altered.
252+
253+
These assumptions leave some attack surface to malicious actors.
254+
We are working on improvements to remove or weaken them.

lib/firewall/firewall.cc

Lines changed: 65 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
// Copyright SCI Semiconductor and CHERIoT Contributors.
22
// SPDX-License-Identifier: MIT
33

4-
#include "firewall.h"
4+
#include "firewall.hh"
55
#include <atomic>
66
#include <compartment-macros.h>
77
#include <debug.hh>
@@ -69,6 +69,11 @@ namespace
6969

7070
std::atomic<uint32_t> barrier;
7171

72+
/**
73+
* This is used to synchronize with the TCP/IP stack during resets.
74+
*/
75+
std::atomic<uint8_t> *tcpipRestartState = nullptr;
76+
7277
auto &lazy_network_interface()
7378
{
7479
static EthernetDevice interface;
@@ -240,6 +245,13 @@ namespace
240245
return table;
241246
}
242247

248+
void clear(IPProtocolNumber protocol)
249+
{
250+
auto guardedTable = permitted_endpoints(protocol);
251+
auto &[g, table] = guardedTable;
252+
table.clear();
253+
}
254+
243255
void remove_endpoint(IPProtocolNumber protocol,
244256
Address endpoint,
245257
uint16_t localPort,
@@ -492,6 +504,16 @@ namespace
492504

493505
bool packet_filter_ingress(const uint8_t *data, size_t length)
494506
{
507+
uint32_t stateSnapshot = tcpipRestartState->load();
508+
if (stateSnapshot != 0 &&
509+
((stateSnapshot & RestartStateDriverKickedBit) == 0))
510+
{
511+
// We are in a reset and the driver has not yet been
512+
// restarted.
513+
Debug::log("Dropping packet due to network stack restart.");
514+
return false;
515+
}
516+
495517
// Not a valid Ethernet frame (64 bytes including four-byte FCS, which
496518
// is stripped by this point).
497519
if (length < 60)
@@ -527,26 +549,6 @@ namespace
527549

528550
} // namespace
529551

530-
bool ethernet_driver_start()
531-
{
532-
// Protect against double entry. If the barrier state is 0, no
533-
// initialisation has happened and we should proceed. If it's 1, we're in
534-
// the middle of initialisation, if it's 2 then initialisation is done. In
535-
// any non-zero case, we should not try to do anything.
536-
uint32_t expected = 0;
537-
if (!barrier.compare_exchange_strong(expected, 1))
538-
{
539-
return false;
540-
}
541-
Debug::log("Initialising network interface");
542-
auto &ethernet = lazy_network_interface();
543-
ethernet.mac_address_set();
544-
// Poke the barrier and make the driver thread start.
545-
barrier = 2;
546-
barrier.notify_one();
547-
return true;
548-
}
549-
550552
bool ethernet_send_frame(uint8_t *frame, size_t length)
551553
{
552554
// We do not check the frame and length here, because we do not use it
@@ -748,3 +750,45 @@ void firewall_remove_udpipv6_remote_endpoint(uint8_t *remoteAddress,
748750
IPProtocolNumber::UDP, *copy, localPort, remotePort);
749751
}
750752
}
753+
754+
bool ethernet_driver_start(std::atomic<uint8_t> *state)
755+
{
756+
if (tcpipRestartState == nullptr)
757+
{
758+
if (!CHERI::check_pointer<CHERI::PermissionSet{
759+
CHERI::Permission::Load, CHERI::Permission::Global}>(
760+
state, sizeof(*state)))
761+
{
762+
Debug::log("Invalid TCP/IP state pointer {}", state);
763+
return false;
764+
}
765+
tcpipRestartState = state;
766+
}
767+
if (tcpipRestartState->load() != 0)
768+
{
769+
// This is a restart, no need to actually reset the driver.
770+
// Instead, just remove all firewall entries.
771+
Debug::log("Network stack restart: clearing all entries.");
772+
EndpointsTable<IPv6Address>::instance().clear(IPProtocolNumber::UDP);
773+
EndpointsTable<IPv6Address>::instance().clear(IPProtocolNumber::TCP);
774+
EndpointsTable<uint32_t>::instance().clear(IPProtocolNumber::UDP);
775+
EndpointsTable<uint32_t>::instance().clear(IPProtocolNumber::TCP);
776+
return true;
777+
}
778+
// Protect against double entry. If the barrier state is 0, no
779+
// initialisation has happened and we should proceed. If it's 1, we're in
780+
// the middle of initialisation, if it's 2 then initialisation is done. In
781+
// any non-zero case, we should not try to do anything.
782+
uint32_t expected = 0;
783+
if (!barrier.compare_exchange_strong(expected, 1))
784+
{
785+
return false;
786+
}
787+
Debug::log("Initialising network interface");
788+
auto &ethernet = lazy_network_interface();
789+
ethernet.mac_address_set();
790+
// Poke the barrier and make the driver thread start.
791+
barrier = 2;
792+
barrier.notify_one();
793+
return true;
794+
}

lib/firewall/firewall.h renamed to lib/firewall/firewall.hh

Lines changed: 22 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,8 @@
11
// Copyright SCI Semiconductor and CHERIoT Contributors.
22
// SPDX-License-Identifier: MIT
33

4+
#pragma once
5+
#include <atomic>
46
#include <compartment.h>
57

68
/**
@@ -11,17 +13,31 @@ bool __cheri_compartment("Firewall")
1113
ethernet_send_frame(uint8_t *packet, size_t length);
1214

1315
/**
14-
* Start the Firewall driver. This returns true if the driver is successfully
15-
* started, false otherwise. This should fail only if the driver is already
16-
* initialised.
16+
* Start the Firewall driver.
17+
*
18+
* `state` should point to the reset state of the TCP/IP stack.
19+
*
20+
* This returns true if the driver is successfully started, false otherwise.
21+
* This should fail only if the driver is already initialised (outside of a
22+
* reset), or if `state` is invalid.
23+
*/
24+
bool __cheri_compartment("Firewall")
25+
ethernet_driver_start(std::atomic<uint8_t> *state);
26+
27+
/**
28+
* Bit of the `state` atomic variable passed to `ethernet_driver_start` which
29+
* indicates that the driver may send packets to the network stack.
30+
*
31+
* This must match the `DriverKicked` bit of enum `RestartState` (see
32+
* `tcpip-internal.h`). This is statically asserted in the TCP/IP stack.
1733
*/
18-
bool __cheri_compartment("Firewall") ethernet_driver_start(void);
34+
static constexpr uint32_t RestartStateDriverKickedBit = 0x4;
1935

2036
/**
2137
* Query the link status of the Firewall driver. This returns true if the link
2238
* is up, false otherwise.
2339
*/
24-
bool __cheri_compartment("Firewall") ethernet_link_is_up(void);
40+
bool __cheri_compartment("Firewall") ethernet_link_is_up();
2541

2642
/**
2743
* Receive a frame from the Firewall device via the on-device firewall.
@@ -42,7 +58,7 @@ void __cheri_compartment("Firewall") firewall_dns_server_ip_set(uint32_t ip);
4258
* This should be called only by the NetAPI compartment.
4359
*/
4460
void __cheri_compartment("Firewall")
45-
firewall_permit_dns(bool dnsIsPermitted __if_cxx(= true));
61+
firewall_permit_dns(bool dnsIsPermitted = true);
4662

4763
/**
4864
* Open a hole in the firewall for TCP packets to and from the given endpoint.

lib/netapi/NetAPI.cc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
// Copyright SCI Semiconductor and CHERIoT Contributors.
22
// SPDX-License-Identifier: MIT
33

4-
#include "../firewall/firewall.h"
4+
#include "../firewall/firewall.hh"
55
#include "../tcpip/network-internal.h"
66
#include <NetAPI.h>
77

lib/tcpip/BufferManagement.cc

Lines changed: 12 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,7 @@
33

44
#include "FreeRTOS.h"
55
#include "FreeRTOS_IP.h"
6+
#include <BufferManagement.hh>
67
#include <NetworkBufferManagement.h>
78
#include <cstdlib>
89
#include <debug.hh>
@@ -12,13 +13,6 @@ using Debug = ConditionalDebug<false, "Buffer management">;
1213

1314
using CHERI::Capability;
1415

15-
// Use a separate allocator quota for the buffer manager (false by default).
16-
// The buffer manager is responsible for allocating network buffers, which
17-
// differs from the other types of allocations the TCP/IP stack performs. It
18-
// may thus make sense to give it its own quota. We may want to expose this as
19-
// a build system configuration option at some point.
20-
#define USE_DEDICATED_BUFFERMANAGER_POOL false
21-
2216
#if USE_DEDICATED_BUFFERMANAGER_POOL
2317
// Size of the buffer manager allocator quota.
2418
# define BM_MALLOC_QUOTA \
@@ -29,6 +23,14 @@ DECLARE_AND_DEFINE_ALLOCATOR_CAPABILITY(BufferManagementMallocQuota,
2923
BM_MALLOC_QUOTA);
3024
# define BM_MALLOC_CAPABILITY \
3125
STATIC_SEALED_VALUE(BufferManagementMallocQuota)
26+
27+
/**
28+
* Free the buffer manager's memory.
29+
*/
30+
void free_buffer_manager_memory()
31+
{
32+
heap_free_all(BM_MALLOC_CAPABILITY);
33+
}
3234
#else
3335
# define BM_MALLOC_CAPABILITY MALLOC_CAPABILITY
3436
#endif
@@ -132,7 +134,9 @@ void vReleaseNetworkBufferAndDescriptor(
132134

133135
// Failure is not supposed to happen unless we have a bug here
134136
// or in FreeRTOS.
135-
Debug::Assert(ret == 0, "Failed to free network buffer or descriptor.");
137+
Debug::Assert(ret == 0,
138+
"Failed to free network buffer or descriptor (error {}).",
139+
ret);
136140
}
137141
}
138142

lib/tcpip/BufferManagement.hh

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
// Copyright SCI Semiconductor and CHERIoT Contributors.
2+
// SPDX-License-Identifier: MIT
3+
4+
#pragma once
5+
6+
/**
7+
* Internal helpers and configuration settings for use in the TCP/IP buffer
8+
* manager. These should be called or used only from/in the TCP/IP compartment.
9+
*/
10+
11+
#ifndef USE_DEDICATED_BUFFERMANAGER_POOL
12+
// Use a separate allocator quota for the buffer manager (false by default).
13+
// The buffer manager is responsible for allocating network buffers, which
14+
// differs from the other types of allocations the TCP/IP stack performs. It
15+
// may thus make sense to give it its own quota. We may want to expose this as
16+
// a build system configuration option at some point.
17+
# define USE_DEDICATED_BUFFERMANAGER_POOL false
18+
#endif
19+
20+
#if USE_DEDICATED_BUFFERMANAGER_POOL
21+
extern void free_buffer_manager_memory();
22+
#else
23+
/**
24+
* Free the buffer manager's memory. Since the buffer manager does not have a
25+
* dedicated quota, this is a noop.
26+
*/
27+
static inline void free_buffer_manager_memory()
28+
{
29+
return;
30+
}
31+
#endif

lib/tcpip/FreeRTOSIPConfig.h

Lines changed: 29 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,28 @@
3232
#define configMAX_PRIORITIES 1
3333
#define configMINIMAL_STACK_SIZE 128
3434

35+
/**
36+
* FreeRTOS+TCP uses descriptors to track information about the packets in a
37+
* window. A pool of descriptors is allocated when the first TCP connection is
38+
* made. This configuration entry sets the size of the pool.
39+
*
40+
* In the worst case, each TCP connection will have an Rx and Tx window of at
41+
* most 8 segments (according to the FreeRTOS+TCP documentation), requiring 16
42+
* descriptors in total per TCP connection. This configuration entry should
43+
* thus be set to `(expected maximum number of TCP connections) x 16`.
44+
*
45+
* If we assume at most 6 TCP connections, this results in 96 segments.
46+
*
47+
* Each descriptor is 104 bytes, i.e., set to 96 this results in a memory
48+
* footprint of 9,984 bytes.
49+
*
50+
* Note that the pool of descriptors is reallocated at every network stack
51+
* reset. Setting this number to a large value (e.g., 256, the FreeRTOS+TCP
52+
* default) is known to cause reset failures due to fragmentation preventing us
53+
* from reallocating the pool.
54+
*/
55+
#define ipconfigTCP_WIN_SEG_COUNT 96
56+
3557
/**
3658
* Enable counting semaphore functionality in the build, as this is necessary
3759
* to build FreeRTOS+TCP.
@@ -63,7 +85,6 @@
6385

6486
#define ipconfigNUM_NETWORK_BUFFER_DESCRIPTORS 8
6587

66-
6788
#define ipconfigTCP_RX_BUFFER_LENGTH ( 1280 )
6889
#define ipconfigTCP_TX_BUFFER_LENGTH ( 1280 )
6990

@@ -97,3 +118,10 @@
97118

98119
#define ipconfigALLOW_SOCKET_SEND_WITHOUT_BIND 0
99120
#define ipconfigUSE_NETWORK_EVENT_HOOK 1
121+
122+
// Necessary to have FreeRTOS+TCP invoke our own `ipFOREVER` function. As of
123+
// FreeRTOS+TCP 3.1.0 this only enables external `ipFOREVER`. It would be nice
124+
// to upstream a patch to have this variable renamed into something specific to
125+
// `ipFOREVER` to ensure that we do not enable additional undesired options in
126+
// future releases.
127+
#define TEST

0 commit comments

Comments
 (0)