Commit 666e633 — mhl-bafoucret authored and committed
Add networking guide docs (elastic#138119)
1 parent 40cc956

docs/internal/DistributedArchitectureGuide.md (+221 −2 lines changed)

@@ -12,6 +12,227 @@ A guide to the general Elasticsearch components can be found [here](https://gith

# Networking

Every Elasticsearch node maintains various networking clients and servers,
protocols, and synchronous/asynchronous handling. Our public docs cover
user-facing settings and some internal aspects; see
[Network Settings](https://www.elastic.co/docs/reference/elasticsearch/configuration-reference/networking-settings).

## HTTP Server

The HTTP server is the single entry point for all external clients (excluding
cross-cluster communication). Management, ingestion, search, and all other
external operations pass through it.

Elasticsearch speaks HTTP/1.1 and supports features such as TLS, chunked
transfer encoding, content compression, and pipelining. While it attempts to
be HTTP spec compliant, Elasticsearch is not a webserver. ES supports `GET`
requests with a payload (though some old proxies may drop the body) and
`POST` for clients unable to send a `GET`-with-body. Responses cannot be
cached by middleboxes.


There is no connection limit, but there is a limit on payload size. The
default maximum payload is 100MB after compression. This is a deliberately
generous ceiling, not a target: clients should stay well below it. See the
`HttpTransportSettings` class.

Basic security features, including authentication (authc), authorization
(authz), and Transport Layer Security (TLS), are available in the free tier
and implemented in separate x-pack modules.

The HTTP server provides two options for content processing: full aggregation
and incremental processing. Aggregation is preferable for small messages
whose format does not lend itself to incremental parsing (e.g., JSON). Its
drawback is memory: buffers are held until all bytes are received, so many
concurrent incomplete requests can lead to unbounded memory growth and
potential OOMs. Incremental processing of large delimited content, such as
bulk indexing, handles the payload in byte chunks; it gives better control
over memory usage but is more complicated for application code.


Incremental bulk indexing includes a back-pressure feature; see
`org.elasticsearch.index.IndexingPressure`. When memory pressure grows high
(`LOW_WATERMARK`), reading bytes from TCP sockets is paused for some
connections, allowing only a few to proceed until the pressure is resolved.
When memory grows too high (`HIGH_WATERMARK`), bulk items are rejected with
HTTP 429. This mechanism protects against unbounded memory usage and
`OutOfMemory` errors (OOMs).

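
The watermark behaviour described above can be sketched as a simple decision
function. This is an illustration only: the class, method, and parameter names
below are made up, and the real logic lives in
`org.elasticsearch.index.IndexingPressure` and the HTTP transport.

```java
// Illustrative sketch of low/high watermark back-pressure; not ES code.
public class WatermarkSketch {
    public enum Action { PROCEED, PAUSE_READS, REJECT_429 }

    public static Action onIndexingBytes(long inFlightBytes, long lowWatermark, long highWatermark) {
        if (inFlightBytes >= highWatermark) {
            return Action.REJECT_429;   // too high: reject bulk items with HTTP 429
        }
        if (inFlightBytes >= lowWatermark) {
            return Action.PAUSE_READS;  // high: stop reading from some TCP sockets
        }
        return Action.PROCEED;          // normal operation
    }
}
```

The key design point is that pausing socket reads pushes the pressure back to
the TCP layer, so clients slow down before the node has to reject work.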
ES supports multiple `Content-Type`s for the payload, as implementations of
the `MediaType` interface. A common implementation is `XContentType`, which
covers CBOR, JSON, SMILE, YAML, and their versioned variants. X-pack
extensions add PLAIN_TEXT, CSV, and others. Classes that implement
`ToXContent` and friends can be serialized and sent over HTTP.

HTTP routing is based on the combination of method and URI. For example, the
`RestCreateIndexAction` handler uses `("PUT", "/{index}")`, where curly braces
indicate path variables. `RestBulkAction` specifies a list of routes:

```java
@Override
public List<Route> routes() {
    return List.of(
        new Route(POST, "/_bulk"),
        new Route(PUT, "/_bulk"),
        new Route(POST, "/{index}/_bulk"),
        new Route(PUT, "/{index}/_bulk")
    );
}
```

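
To illustrate how method-plus-path routing with `{var}` path variables works in
principle, here is a minimal standalone matcher. This is a toy sketch, not ES's
actual `RestController` logic, and the class name is invented.

```java
// Toy sketch of path-pattern matching with {var} path variables.
import java.util.LinkedHashMap;
import java.util.Map;

public class ToyRouter {
    /** Returns the extracted path variables if the pattern matches, or null otherwise. */
    public static Map<String, String> match(String pattern, String path) {
        String[] pat = pattern.split("/");
        String[] seg = path.split("/");
        if (pat.length != seg.length) return null;
        Map<String, String> vars = new LinkedHashMap<>();
        for (int i = 0; i < pat.length; i++) {
            if (pat[i].startsWith("{") && pat[i].endsWith("}")) {
                // a {var} segment matches anything and captures the value
                vars.put(pat[i].substring(1, pat[i].length() - 1), seg[i]);
            } else if (pat[i].equals(seg[i]) == false) {
                return null; // literal segment mismatch
            }
        }
        return vars;
    }
}
```

For example, matching `/{index}/_bulk` against `/logs/_bulk` yields
`index=logs`, which the handler can then read as a path parameter.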
Every REST handler must be declared in the `initRestHandlers` method of the
`ActionModule` class. Plugins implementing `ActionPlugin` can extend the list
of handlers by overriding `getRestHandlers`. Every REST handler should extend
`BaseRestHandler`.

The REST handler's job is to parse and validate the HTTP request and to
construct a typed version of it, often a Transport request. When security is
enabled, the HTTP layer handles authentication (based on headers) and the
Transport layer handles authorization.

From the point of view of the Java classes involved, the request handling
flow is:

```
(if security enabled) Security.getHttpServerTransportWithHeadersValidator
  -> Netty4HttpServerTransport
  -> AbstractHttpServerTransport
  -> RestController
  -> BaseRestHandler
  -> Rest{Some}Action
```

`Netty4HttpServerTransport` is the sole implementation of
`AbstractHttpServerTransport`, living in the `transport-netty4` module. The
`x-pack/security` module injects TLS and a headers validator.

## Transport

Transport is the term for node-to-node communication, which uses a custom
TCP-based binary protocol. Every node acts as both a client and a server.
Node-to-node communication almost never uses HTTP (the exception is
reindex-from-remote).

`Netty4Transport` is the sole implementation of TCP transport, initializing
both the Transport client and server. The `x-pack/security` plugin provides
a secure version, `SecurityNetty4Transport`, with TLS and authentication.

A `Connection` between nodes is a pool of `Channel`s, where each channel is a
non-blocking TCP connection (in Java NIO terminology). Once a cluster is
discovered, a `Connection` (pool of `Channel`s) is opened to every other node,
and every other node opens a `Connection` back. This results in two
`Connection`s between any two nodes (A→B and B→A). A node sends requests only
on the `Connection` it opened (acting as a client). The default pool holds
around 13 `Channel`s, divided into sub-pools for different purposes (e.g.,
ping, node-state, bulks). The pool structure is defined in the
`ConnectionProfile` class.

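
The sub-pool idea can be illustrated with a toy sketch. The class name,
channel numbering, and purpose names below are invented for illustration; see
`ConnectionProfile` for the real structure.

```java
// Toy sketch: a connection as a pool of channels split into sub-pools by
// purpose, with round-robin selection inside a sub-pool. Not ES code.
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class ToyConnectionProfile {
    // e.g. channel 0 for ping, channels 1-2 for bulk, etc.
    private final Map<String, List<Integer>> subPools;
    private final AtomicInteger counter = new AtomicInteger();

    public ToyConnectionProfile(Map<String, List<Integer>> subPools) {
        this.subPools = subPools;
    }

    /** Picks a channel for the given purpose, round-robin within its sub-pool. */
    public int channelFor(String purpose) {
        List<Integer> pool = subPools.get(purpose);
        return pool.get(Math.floorMod(counter.getAndIncrement(), pool.size()));
    }
}
```

Partitioning channels by purpose keeps light control traffic (pings, cluster
state) from queueing behind large bulk payloads on the same socket.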
ES never behaves incorrectly (e.g., loses data) in the face of network
outages, but it may become unavailable unless the network is stable. Stable
connectivity between nodes is assumed, though in practice connectivity issues
remain a constant challenge.

Request timeouts are discouraged: Transport requests are guaranteed to
eventually receive a response, even without a timeout. `SO_KEEPALIVE` helps
detect and tear down dead connections. When a connection closes with an
error, the entire pool is closed, outstanding requests fail, and the pool is
reconnected.

There are no retries at the Transport layer itself; the application layer
decides when and how to retry (e.g., via `RetryableAction` or
`TransportMasterNodeAction`). In the future the Transport framework might
support retries (#95100).

Transport can multiplex requests and responses on a single `Channel`, but it
cannot multiplex parts of messages: each transport message must be written in
full before the next can be sent. Proper application-layer sizing/chunking of
messages is recommended to ensure fairness of delivery across multiple
senders. A Transport message cannot be larger than 30% of the heap
(`org.elasticsearch.transport.TcpTransport#THIRTY_PER_HEAP_SIZE`) or 2GB
(because `org.elasticsearch.transport.Header#networkMessageSize` is an `int`).

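
The effective ceiling follows from combining the two limits above. The method
below is a quick arithmetic sketch, not ES code:

```java
// Sketch of the effective transport message size ceiling: the smaller of
// 30% of the heap and Integer.MAX_VALUE (~2GB), since the wire header
// stores the message size as an int. Illustrative only.
public class MessageSizeCap {
    public static long effectiveCap(long maxHeapBytes) {
        long thirtyPercentOfHeap = (long) (maxHeapBytes * 0.3);
        return Math.min(thirtyPercentOfHeap, Integer.MAX_VALUE);
    }
}
```

On a 4GiB heap the 30% limit (~1.2GiB) binds; only above roughly 7GiB of heap
does the 2GB `int` limit take over.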
The `TransportMessage` family tree includes various types (node-to-node,
broadcast, master-node acknowledged) to ensure correct dispatch and response
handling, for example when a message must be accepted on all nodes.

## Other networking stacks

Snapshotting to remote repositories involves different networking clients and
SDKs. For example, the AWS SDK comes with an Apache or Netty HTTP client,
Azure with the Netty-based Project Reactor, and GCP uses the default Java
HTTP client. The underlying clients may be reused between repositories, with
varying levels of control over networking settings.

Other features also use HTTP clients: SAML/JWT metadata reloading, the
Watcher HTTP action, reindex, and ML-related features such as inference.

## Sync/Async IO and threading

ES handles a mix of I/O operations (disk, HTTP server, Transport
client/server, repositories), resulting in a combination of synchronous and
asynchronous styles. Asynchronous IO utilizes a small set of threads running
small tasks, minimizing context switches. Synchronous IO uses many threads
and relies on the OS scheduler. ES typically runs with 100+ threads, where
async and sync threads compete for resources.

## Netty

Netty is a networking framework/toolkit used extensively for both the HTTP
and Transport layers, providing foundational building blocks for networking
applications.

### Event-Loop (Transport-Thread)

Netty is an async IO framework: it runs with a few threads. An event-loop is
a thread that processes events for one or many `Channel`s (TCP connections).
Every `Channel` has exactly one, unchanging event-loop, eliminating the need
to synchronize events within that `Channel`. A single, CPU-bound transport
thread pool (e.g., 4 threads for 4 cores) serves all HTTP and Transport
servers and clients, handling potentially hundreds or thousands of
connections.


Because each event-loop thread serves many connections, it is critical not to
block these threads for long. Fork any blocking operation or heavy
computation to another thread pool. Forking, however, comes with overhead: do
not fork simple requests that can be served from memory and complete within
milliseconds.

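
The rule of thumb above can be sketched as follows. `GENERIC_POOL` is a
hypothetical stand-in for an ES thread pool such as the generic pool; the
class is an illustration, not ES code.

```java
// Sketch: run cheap work inline on the event loop, fork expensive or
// blocking work to another executor. Illustrative only.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ForkingSketch {
    // Daemon threads so this sketch does not prevent JVM exit.
    static final ExecutorService GENERIC_POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    public static void handle(Runnable task, boolean expensive) {
        if (expensive) {
            GENERIC_POOL.execute(task); // fork: do not block the transport thread
        } else {
            task.run(); // cheap (sub-millisecond): forking overhead is not worth it
        }
    }
}
```

The trade-off is latency versus fairness: forking adds a queue hop and a
context switch, but keeps one slow request from stalling every connection on
the same event loop.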
Transport threads are monitored by `ThreadWatchdog`. A warning is logged if a
single task runs longer than 5 seconds. Slowness can be caused by blocking
calls, GC pauses, or CPU starvation from other thread pools.

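
The watchdog idea reduces to timing each task and flagging it past a
threshold, roughly as below. This is an illustrative sketch with an invented
class name; `ThreadWatchdog` is the real implementation.

```java
// Sketch of watchdog-style monitoring: time each task and flag it when it
// exceeds a threshold (ES's ThreadWatchdog warns at 5 seconds). The caller
// would log a warning when this returns true. Illustrative only.
public class WatchdogSketch {
    public static boolean exceededThreshold(Runnable task, long thresholdMillis) {
        long startNanos = System.nanoTime();
        task.run();
        long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000;
        return elapsedMillis > thresholdMillis;
    }
}
```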
### ByteBuf - byte buffers and reference counting

Netty's controlled memory allocation provides a performance edge by managing
and reusing pools of byte buffers (e.g., pools of 1MiB chunks sliced into
16KiB pages). Some pages might be idle while still taking up heap space, and
they show up in heap dumps.

Netty reads socket bytes into direct buffers, and ES copies them into pooled
byte buffers (`CopyBytesSocketChannel`). The application is responsible for
retaining (increasing the ref-count) and releasing (decreasing the ref-count)
pooled buffers.

Reference counting introduces two primary problems:

1. Use after release (free): accessing a buffer after it has been explicitly
   released.
2. Never release (leak): failing to release a buffer, leading to memory leaks.

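
Both failure modes can be demonstrated with a tiny standalone ref-counted
object. This is a toy analogue of the retain/release contract of Netty's
pooled `ByteBuf`, not Netty's API.

```java
// Toy reference-counted buffer illustrating use-after-release and leaks.
public class RefCountedBuffer {
    private int refCnt = 1; // allocation hands the caller one reference

    public synchronized void retain() {
        if (refCnt == 0) throw new IllegalStateException("use after release");
        refCnt++;
    }

    /** Returns true when the last reference is dropped and memory is reclaimed. */
    public synchronized boolean release() {
        if (refCnt == 0) throw new IllegalStateException("already released");
        refCnt--;
        return refCnt == 0;
    }

    public synchronized boolean leaked() {
        return refCnt > 0; // still held: a leak if no one will ever release it
    }
}
```

Every `retain()` must eventually be paired with a `release()`; once the count
hits zero, any further access is a use-after-free.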
The compiler does not help detect these issues; they require careful testing.
Netty's `LeakDetector` at the paranoid level is enabled by default in all ES
tests.

### Async methods return futures

Every asynchronous operation in Netty returns a future. It is easy to forget
to check the result, since a call like the following appears to succeed
whether or not the write eventually completes:

```java
ctx.write(message);
```

Instead, check the result of the async operation:

```java
ctx.write(message).addListener(future -> {
    if (future.isSuccess() == false) {
        // handle the failure, e.g. log it and close the channel
    }
});
```

### ThreadPool

(We have many thread pools, what and why)

@@ -28,8 +249,6 @@ See the [Javadocs for `ActionListener`](https://github.com/elastic/elasticsearch

### Performance

- ### Netty
-
(long running actions should be forked off of the Netty thread. Keep short operations to avoid forking costs)

### Work Queues