Attempt to set our own soft limits #26
jesterhodl wants to merge 2 commits into OCEAN-xyz:master
Conversation
Force-pushed db8a38f to e924352
src/datum_stratum.c
Outdated
```diff
 } else if (app->max_clients > rlimit.rlim_cur) {
-	DLOG_WARN("*** NOTE *** Max Stratum clients (%llu) exceeds open file soft limit (Soft: %llu / Hard: %llu)", (unsigned long long)app->max_clients, (unsigned long long)rlimit.rlim_cur, (unsigned long long)rlimit.rlim_max);
+	DLOG_WARN("*** NOTE *** Max Stratum clients (%llu) exceeds soft open file limit (Soft: %llu / Hard: %llu)", (unsigned long long)app->max_clients, (unsigned long long)rlimit.rlim_cur, (unsigned long long)rlimit.rlim_max);
 	DLOG_WARN("*** NOTE *** You should increase the soft open file limit to prevent issues as you approach max clients!");
```
"open file soft limit" sounded better IMO
src/datum_stratum.c
Outdated
```diff
-	DLOG_WARN("*** NOTE *** Max Stratum clients (%llu) exceeds open file soft limit (Soft: %llu / Hard: %llu)", (unsigned long long)app->max_clients, (unsigned long long)rlimit.rlim_cur, (unsigned long long)rlimit.rlim_max);
 	DLOG_WARN("*** NOTE *** You should increase the soft open file limit to prevent issues as you approach max clients!");
+	DLOG_WARN("*** NOTE *** Max Stratum clients (%llu) exceeds soft open file limit (Soft: %llu / Hard: %llu)", (unsigned long long)app->max_clients, (unsigned long long)rlimit.rlim_cur, (unsigned long long)rlimit.rlim_max);
+	DLOG_WARN("*** NOTE *** Attempting to increase the soft limit...");
```
I don't think we need this extra line, considering we log if we succeed or fail a moment later
src/datum_stratum.c
Outdated
```diff
 	DLOG_WARN("*** NOTE *** Attempting to increase the soft limit...");
+
+	// Attempt to increase the soft limit to match the hard limit or max_clients, whichever is smaller
+	struct rlimit new_rlimit = rlimit;
```
Let's just modify rlimit in place?
src/datum_stratum.c
Outdated
```diff
+	// Attempt to increase the soft limit to match the hard limit or max_clients, whichever is smaller
+	struct rlimit new_rlimit = rlimit;
+	new_rlimit.rlim_cur = (app->max_clients < rlimit.rlim_max) ? app->max_clients : rlimit.rlim_max;
```
If rlim_max is too low, we would have errored above and never gotten here...
But we should probably refactor this to increase the soft limit even if the hard limit is too low.
src/datum_stratum.c
Outdated
```diff
+	// Attempt to increase the soft limit to match the hard limit or max_clients, whichever is smaller
+	struct rlimit new_rlimit = rlimit;
+	new_rlimit.rlim_cur = (app->max_clients < rlimit.rlim_max) ? app->max_clients : rlimit.rlim_max;
```
Also, I suspect we actually need more than max_clients!
Right, I have 3 client connections and /proc shows 35 fds (that's /dev/pts, the log, some pipes, bitcoind connections (not sure why I have 2), the pool host, clients, the API listener, and a bunch of [eventpoll] entries). Any advice on how much to account for this stuff?
```
0u CHR 136,3 0t0 6 /dev/pts/3
1u CHR 136,3 0t0 6 /dev/pts/3
2u CHR 136,3 0t0 6 /dev/pts/3
3w REG 259,2 4481639 7999203 /var/log/datum_gateway/dg.log
4u a_inode 0,15 0 49 [eventpoll:5]
5u IPv4 17508026 0t0 TCP *:23335 (LISTEN)
6u IPv4 21423289 0t0 TCP redacted.lan:45608->137.61.196.35.bc.googleusercontent.com:28915 (ESTABLISHED)
7u a_inode 0,15 0 49 [eventpoll:6]
8r FIFO 0,14 0t0 17507241 pipe
9r FIFO 0,14 0t0 17511024 pipe
10w FIFO 0,14 0t0 17507241 pipe
11w FIFO 0,14 0t0 17511024 pipe
12u IPv6 22178442 0t0 TCP ip6-localhost:51958->ip6-localhost:8332 (ESTABLISHED)
13u IPv6 19072439 0t0 TCP ip6-localhost:38598->ip6-localhost:8332 (ESTABLISHED)
14u IPv4 17508031 0t0 TCP *:23334 (LISTEN)
15u a_inode 0,15 0 49 [eventpoll:14]
16u IPv4 21251850 0t0 TCP redacted.lan:23334->redacted.lan:46678 (ESTABLISHED)
17u a_inode 0,15 0 49 [eventpoll:16]
18u IPv4 21251851 0t0 TCP redacted.lan:23334->redacted.lan:58921 (ESTABLISHED)
19u a_inode 0,15 0 49 [eventpoll:18]
20u IPv4 21251852 0t0 TCP redacted.lan:23334->redacted.lan:55522 (ESTABLISHED)
21u a_inode 0,15 0 49 [eventpoll:20]
23u a_inode 0,15 0 49 [eventpoll]
24u a_inode 0,15 0 49 [eventpoll]
25u a_inode 0,15 0 49 [eventpoll]
26u a_inode 0,15 0 49 [eventpoll]
27u a_inode 0,15 0 49 [eventpoll]
28u a_inode 0,15 0 49 [eventpoll]
29u a_inode 0,15 0 49 [eventpoll]
30u a_inode 0,15 0 49 [eventpoll]
31u a_inode 0,15 0 49 [eventpoll]
32u a_inode 0,15 0 49 [eventpoll]
33u a_inode 0,15 0 49 [eventpoll]
34u a_inode 0,15 0 49 [eventpoll]
35u a_inode 0,15 0 49 [eventpoll]
```
That's a good start, but from there, you probably need to identify in the code what each one is used for. eg, I bet there's an eventpoll per socket thread...
My 35 fds, with max_clients set to 16 and 3 clients connected, break down as:
- 3 used connections: 3 * 2 = 6 fds (socket + epoll)
- 13 unused connections: 13 fds (just the epoll)
- "overhead": 16 fds

16 + 6 + 13 = 35
This "overhead" is:
- 3 for stdin,stdout,stderr
- 1 for log
- 2 for datum listener + eventpoll
- 2 for pool connection + eventpoll (if using pool)
- 4 pipes
- 2 sockets to bitcoind
- 2 api listener + eventpoll
3+1+2+2+4+2+2 = 16
Naturally, a submit-block JSON will open an fd, so +1 and we arrive at overhead = 17, so:

```c
int overhead = 17;
required_fds = 2 * app->max_clients + overhead;
```
Now, this is too tight. I might've missed something and your IPv6 work might change things.
I would go for a #define DATUM_FD_OVERHEAD 32 and use that.
Thoughts?
I can do a commit consisting of all of the suggested changes for preview.
Are you sure we have an epoll per connection? I would expect it to be per thread and shared across all connections on that thread.
Pretty sure, yeah, judging from the calculations. I'll do more research, educate myself on epoll and how it's coded in Datum, and compare lsof results with /proc/&lt;pid&gt;/fd (and fdinfo).
I came up with:
- DATUM thread: 1 epoll, ? DNS + 1 connection + 1*n curl
- per threadpool thread: 1 epoll
- per client: 1 for stratum
- stratum listener thread: 1 extra to reject a client + 1 epoll + up to 2 listening
- stdio: 3
- cookie file: 1
- log file: 1
- RPC notify fallback: n curl
- submit block: 1 for file, n curl, n curl for dedicated thread
- API: 1 listener, 1? epoll, up to 128 client connections, 1 reject, 1 to write config
- each curl: 1 pipe, 1 socket
I'm not sure how to explain the 13 eventpoll fds at the end of your list, though.
I know where the 13 are coming from. When you launch Datum, it will open the expected number of FDs.
The results I got were from an instance which had been running for a long time and had clients connect and disconnect.
Observation:
- Launch Datum, observe the expected number of fds
- Make some telnet/nc connections to port 23334; it will start opening fds up to max_clients and an equivalent number of eventpolls. I have max_clients at 16, so 16 - 3 miners = 13
- When those fds are closed, the eventpolls remain open and are reused for subsequent connections.
- Those lingering eventpolls are reused by datum connections (23334) as well as web/api (23335), but if you have, say, 13 eventpolls and make 50 web/api connections, it will use the 13 eventpolls but no more. Those two types of connections work differently.
This is based on observations and opening connections to ports and observing lsof.
I think it would be good to keep each category separated out for future reference. So something like:

```c
static const unsigned int fds_for_datum_thread = 1 /* epoll */ + 1 /* socket */ + (1 * fds_per_curl) + fds_for_getaddrinfo;
```

And since we don't necessarily control libcurl or getaddrinfo, we should probably be generous with the fd assumptions there.
Following our chat, I pushed a function which estimates the number of file descriptors with a bit of contingency.
It looks at: stdio, pipes, the log, the sv1 listener + epoll, clients + the epolls of each thread, GBT and the fallback, and the Datum connection and API + epolls.
Force-pushed 5ce1d72 to 6681bc2
1. attempt to set our own soft limits
2. align wording: "hard open file limit" and "soft open file limit"

define and use DATUM_API_CONNECTION_LIMIT in datum_api.c and .h
Force-pushed c978392 to 9ac91ec
> I'm wondering: should I rework it, should we scrap it, or is it OK?

Looks okay, I just don't know how to test it.
We basically check if max_clients is higher than the soft limit and, if so, set the soft limit to max_clients, or to the hard limit if max_clients is higher still.
It looks like this in action (when I doubled the defaults):
Now that I've put up this PR: is setting the limit to max_clients sufficient? Surely not, because there are other files open.
This of course doesn't invalidate the PR, because it does what it is supposed to, but I suppose we should leave more room.