## Description
Jira Link: DB-20480
## Summary
On macOS, `yb-tserver` fails with `Too many open files` (`EMFILE`) errors when managing more than ~2,700 tablets (e.g. 2 databases, each with ~1,500 hash-sharded tables plus secondary indexes). This happens even when `ulimit -n`, `kern.maxfilesperproc`, and `kern.maxfiles` are all set well above the failing threshold.

The root cause is that RocksDB's `NewSequentialFile` and `NewLogger` use `fopen()`, which on macOS/FreeBSD is capped at ~32,767 simultaneous `FILE*` streams because the `_file` field in the `FILE` struct is a `short`.
## Root Cause
Apple's libc (derived from FreeBSD) stores the file descriptor inside the `FILE` struct as a `short` (source). There is an unconditional guard in `fopen.c`:

```c
/*
 * File descriptors are a full int, but _file is only a short.
 * If we get a valid file descriptor that is greater than
 * SHRT_MAX, then the fd will get sign-extended into an
 * invalid file descriptor. Handle this case by failing the open.
 */
if (f > SHRT_MAX) {
	fp->_flags = 0;			/* release */
	_close(f);
	errno = EMFILE;
	return (NULL);
}
```

This check applies even when `_DARWIN_UNLIMITED_STREAMS` is defined; that flag only changes stream allocation counts via `__sfp()`, not the fd range check. The same limit exists in `fdopen()`.
The raw `open()` syscall has no such limit.
The two `fopen()` call sites in RocksDB that hit this are:

- `src/yb/rocksdb/util/env_posix.cc`: `NewSequentialFile` (reads OPTIONS, CURRENT, and SST metadata during tablet bootstrap)
- `src/yb/rocksdb/util/env_posix.cc`: `NewLogger` (creates per-RocksDB LOG files)
## Steps to Reproduce
Requires macOS. Ensure system limits are raised first:

```sh
sudo sysctl -w kern.maxvnodes=1048576
sudo launchctl limit maxfiles 1048576 unlimited
ulimit -n 1048576
```

Start a single-node cluster with 1 tablet per table:
```sh
bin/yugabyted start \
  --base_dir /tmp/yb-emfile-repro \
  --listen 127.0.0.1 \
  --tserver_flags "ysql_num_shards_per_tserver=1,yb_num_shards_per_tserver=1,tablet_replicas_per_core_limit=0,tablet_replicas_per_gig_limit=0" \
  --master_flags "ysql_num_shards_per_tserver=1,yb_num_shards_per_tserver=1,replication_factor=1,tablet_replicas_per_core_limit=0,tablet_replicas_per_gig_limit=0"
```

Create 2 databases with 1,500 tables and 1 secondary index each:
```sh
YSQLSH="bin/ysqlsh -h 127.0.0.1 -U yugabyte"
for db in testdb1 testdb2; do
  $YSQLSH -d yugabyte -c "CREATE DATABASE $db;"
  for batch in $(seq 0 14); do
    SQL=""
    for i in $(seq $((batch*100+1)) $((batch*100+100))); do
      SQL="${SQL}CREATE TABLE t${i}(id INT PRIMARY KEY, val TEXT);"
      SQL="${SQL}CREATE INDEX t${i}_val_idx ON t${i}(val);"
    done
    $YSQLSH -d $db -c "$SQL"
  done
done
```

During creation of testdb2, the tserver log will show errors like:
```
E tablet.cc:1224] Failed to open a RocksDB database:
IO error (env_posix.cc:592): .../CURRENT (num opened files 32770): Too many open files (system error 24)
```
Table creation will stall or fail with `Timed out waiting for table creation`.
## Suggested Fix
Replace `fopen()` / `FILE*` with raw POSIX `open()` / `read()` / `write()` in `NewSequentialFile` and `NewLogger`, and update `PosixSequentialFile` and `PosixLogger` accordingly. Both classes already store the raw fd internally (via `fileno()`), so this is a minimal change. The same pattern is already used by `PosixRandomAccessFile` and `PosixWritableFile`, which call `open()` / `pread()` / `write()` directly.
Linux is unaffected: glibc's `FILE` struct uses a full `int` for the fd field.
## Environment
- macOS Sequoia 15.x (Darwin 25.3.0), Apple Silicon
- YugabyteDB built from source (master branch)
- Single-node cluster, 1 tablet per table
## Issue Type
kind/enhancement
Warning: Please confirm that this issue does not contain any sensitive information

- [x] I confirm this issue does not contain any sensitive information.