Description
The buggy behavior
macOS (arm64)
Running the following code produces an error:
% php -dmemory_limit=-1 -r 'file_get_contents("big");'
PHP Notice: file_get_contents(): Read of 4694832713 bytes failed with errno=22 Invalid argument in Command line code on line 1
Notice: file_get_contents(): Read of 4694832713 bytes failed with errno=22 Invalid argument in Command line code on line 1

On macOS the function returns a 0-byte string, as verified with gettype(file_get_contents(...)) and strlen(file_get_contents(...)). The file is almost 5GB in size:
% php -r 'echo filesize("big") . "\n";'
4694824521
Note the size reported in the notice: it is exactly 8,192 bytes past the file size. That looked suspicious at first, but it is probably unrelated; see below.
Comparing to Linux (x86_64)
On a fully updated Debian 13.0 the result is different:
$ php -dmemory_limit=-1 -r 'echo gettype(file_get_contents("big")) . "\n";'
string
$ php -dmemory_limit=-1 -r 'echo strlen(file_get_contents("big")) . "\n";'
4694824521

PHP Versions
macOS installed via Homebrew:
% php -v
PHP 8.4.7 (cli) (built: May 6 2025 12:31:58) (NTS)
Copyright (c) The PHP Group
Built by Shivam Mathur
Zend Engine v4.4.7, Copyright (c) Zend Technologies
with Zend OPcache v8.4.7, Copyright (c), by Zend Technologies
Linux:
$ php -v
PHP 8.4.7 (cli) (built: May 9 2025 07:02:39) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.4.7, Copyright (c) Zend Technologies
with Zend OPcache v8.4.7, Copyright (c), by Zend Technologies
Operating System
% sw_vers
ProductName: macOS
ProductVersion: 15.5
BuildVersion: 24F74
% uname -m
arm64
Looking for the culprit
How does it fail?
While I am not a C developer, nor do I have great familiarity with the Zend Engine codebase, I tried to take a crack at this. The error seems to come from php_stdiop_read():
php-src/main/streams/plain_wrapper.c, lines 446 to 448 at 359bb63:

    if (!(stream->flags & PHP_STREAM_FLAG_SUPPRESS_ERRORS)) {
        php_error_docref(NULL, E_NOTICE, "Read of %zu bytes failed with errno=%d %s", count, errno, strerror(errno));
    }
Initially I suspected the 4GB size, or the size in the error being off by 8K from the real file size, but that does not seem to be the case. In fact, any read of 2GB or more will fail:
php > echo strlen(file_get_contents('big', length: 2 * 1024 * 1024 * 1024 - 1));
2147483647
php > echo strlen(file_get_contents('big', length: 2 * 1024 * 1024 * 1024));
PHP Notice: file_get_contents(): Read of 2147483648 bytes failed with errno=22 Invalid argument in php shell code on line 1
Notice: file_get_contents(): Read of 2147483648 bytes failed with errno=22 Invalid argument in php shell code on line 1
0

file_get_contents() fails only for regular files, regardless of the underlying filesystem (tested on regular APFS & an HFS+ ramdisk):
php > echo strlen(file_get_contents('/dev/zero', length: 5 * 1024 * 1024 * 1024));
5368709120
The issue seems to be isolated to file_get_contents() only. My initial hunch of reads in chunks larger than SSIZE_MAX also led nowhere, as a single fread() is able to read the whole file:
php > echo strlen(fread(fopen('big', 'r'), filesize('big')));
4694824521
php > var_dump(stream_copy_to_stream(fopen('big','r'), fopen('dst','w')));
int(4694824521)
The issue is also not related to an old bug 69824 of mine with variables >2GB, as on modern PHP versions creating a 5GB string (i.e. larger than the file) isn't a problem.
I also couldn't replicate it using PHP code that doesn't use file_get_contents().
Why does it fail?
If I'm reading the file_get_contents() implementation for files correctly, it will call _php_stream_copy_to_mem(), which then calls the universal _php_stream_read(), which in turn calls stream->ops->read() on the stream. For a plain file I believe that handler is php_stdiop_read().
I suspected that read(2) is being called with the full $length as passed to file_get_contents(). This points to the behavior of read(2) differing between Darwin and Linux.
I wrote a quick C reproducer and tested:
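The original reproducer wasn't attached to this report; a minimal sketch that produces output of the shape shown below could look like this (the function name try_read is mine, not from the report):

```c
/* Sketch of a reproducer: issue a single read(2) for `count` bytes
 * from `path` and report either the byte count read or the errno. */
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns bytes read, or -1 with errno set. */
static ssize_t try_read(const char *path, size_t count)
{
    printf("Trying to get %zu from %s\n", count, path);
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }
    printf("File \"%s\" opened, allocating memory...\n", path);

    char *buf = malloc(count);
    if (!buf) { close(fd); perror("malloc"); return -1; }
    printf("Memory allocated, attempting read...\n");

    /* One read(2) with the full size: Linux clamps the request,
     * Darwin/BSD reject anything > INT_MAX with EINVAL. */
    ssize_t got = read(fd, buf, count);
    if (got < 0) {
        printf("!! read() failed - errno=%d err=%s\n", errno, strerror(errno));
    } else {
        printf("Did read %zd bytes ($req-$actual=%zd)\n", got, (ssize_t)count - got);
    }
    free(buf);
    close(fd);
    return got;
}
```

Calling try_read("big", 2147483648ULL) and try_read("big", 2147483647ULL) yields the outputs below.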
### macOS
Platform SSIZE_MAX=9223372036854775807
Platform INT_MAX=2147483647
=================================
Trying to get 2147483648 from big
File "big" opened, allocating memory...
Memory allocated, attempting read...
!! read() failed - errno=22 err=Invalid argument
=================================
Trying to get 2147483647 from big
File "big" opened, allocating memory...
Memory allocated, attempting read...
Did read 2147483647 bytes ($req-$actual=0)
### Linux
Platform SSIZE_MAX=9223372036854775807
Platform INT_MAX=2147483647
=================================
Trying to get 2147483648 from big
File "big" opened, allocating memory...
Memory allocated, attempting read...
Did read 2147479552 bytes ($req-$actual=4096)
=================================
Trying to get 2147483647 from big
File "big" opened, allocating memory...
Memory allocated, attempting read...
Did read 2147479552 bytes ($req-$actual=4095)
Linux accepts an arbitrarily large size to read(2) and simply returns the maximum amount possible per call (the 2GB-4K figure is Linux's internal MAX_RW_COUNT, i.e. INT_MAX rounded down to a page boundary), which lets the stream logic handle stitching. Darwin/XNU and BSD kernels instead immediately return EINVAL if the requested size is larger than INT_MAX.
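The stitching that the stream layer performs on top of short reads can be sketched as a plain read loop; clamping each request to INT_MAX is what makes the same loop safe on Darwin/BSD too. This is illustrative code, not PHP's actual implementation:

```c
#include <limits.h>
#include <unistd.h>
#include <sys/types.h>

/* Illustrative read loop: request at most INT_MAX per read(2) call
 * and stitch short reads together until `count` bytes or EOF. */
static ssize_t read_fully(int fd, char *buf, size_t count)
{
    size_t total = 0;
    while (total < count) {
        size_t chunk = count - total;
        if (chunk > (size_t)INT_MAX) {
            chunk = (size_t)INT_MAX;  /* Darwin/BSD reject larger requests */
        }
        ssize_t got = read(fd, buf + total, chunk);
        if (got < 0) {
            return -1;                /* real I/O error */
        }
        if (got == 0) {
            break;                    /* EOF before count bytes */
        }
        total += (size_t)got;
    }
    return (ssize_t)total;
}
```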
The same problem also affects file_put_contents() for the same reasons.
Possible fix?
This behavior appears to be known, as stream_set_chunk_size() errors out if the requested chunk size is > INT_MAX on all platforms. Moreover, while debugging I came full circle: php_stdiop_read() does clamp the max chunk/buffer to INT_MAX, but only on Windows.
I think adding the clamping for macOS and BSD, in addition to Windows, is the simplest solution - PR provided.
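The shape of the clamp, extracted as a standalone helper for illustration (the actual patch in php_stdiop_read() will look different; the name clamp_read_count is mine):

```c
#include <limits.h>
#include <stddef.h>

/* Illustrative clamp, mirroring what php_stdiop_read() already does on
 * Windows: never pass more than INT_MAX bytes to a single read(2). */
static size_t clamp_read_count(size_t count)
{
    return count > (size_t)INT_MAX ? (size_t)INT_MAX : count;
}
```

A short read after clamping is harmless: the stream layer already loops and issues follow-up reads for the remainder.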
Affected versions
The issue only appears if the stream read buffer is set > INT_MAX, which in the case of file_get_contents() bisects to commit 6beee1a from #8547, first landed in PHP 8.2.
Knowing this, I found it isn't a problem with just file_get_contents() but also fread(), as stream_set_read_buffer() doesn't guard against this:
php > $f = fopen("big", "r"); stream_set_read_buffer($f, 0); fread($f, 2147483648);
PHP Notice: fread(): Read of 2147483648 bytes failed with errno=22 Invalid argument in php shell code on line 1
Notice: fread(): Read of 2147483648 bytes failed with errno=22 Invalid argument in php shell code on line 1
However, I don't think this needs to be guarded even for DX, as this is a user shooting themselves in the foot. After the patch the code above will instead fail with Notice: fread(): Read of 2147483648 bytes failed with errno=9 Bad file descriptor.
Dataset
The exact file I encountered the problem with is available from Cornell University. You can get it directly via curl -L -o ~/Downloads/arxiv.zip https://www.kaggle.com/api/v1/datasets/download/Cornell-University/arxiv. However, after some digging I found it's not about this exact file; a sparse file created with truncate -s 4694824521 big reproduces it too.