-
-
Notifications
You must be signed in to change notification settings - Fork 33k
gh-83714: Use statx on Linux 4.11 and later in os.stat #136334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The libc statx wrapper is required at build time, but weakly linked, and posix_do_stat will fall back to calling stat if statx is not available at runtime due to an old libc or an old kernel.
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
Please update the title to use the gh issue number. |
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some docs suggestions / CPython sphinx has special tags for a couple of the things; haven't looked at main implementation a lot yet
!buildbot .Android. |
🤖 New build scheduled with the buildbot fleet by @gpshead for commit b0e8276 🤖 Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F136334%2Fmerge The command will test the builders whose names match following regular expression: The builders matched are:
|
FYI, we recently added Android to the GitHub Actions CI, so there's no longer much need to invoke the buildbots manually. The exception is if a PR has something which might be specific to Android aarch64, because GitHub Actions can't run the tests on that platform yet, although it will still perform a complete build. |
Looking at the failures, the Android buildbots didn't run the most up-to-date code from this PR. Looking at the "Build Properties" tab, the "revision" is "b0e827681f45588ce67cd0924b3c320bb64351ff" (the latest), but "got_revision" is "a387390693c4f4ff793717a91a5137b7f608f9e4", a merge commit attributed to me but that I don't remember making of d53650a (the commit just before I added the configure check) with main. |
Both the buildbots and GitHub Actions don't actually test the head of the PR branch, but a speculative commit showing what you would get if you merged the PR into main. Because of a merge conflict, it's not possible to merge the PR into main at the moment. It looks like GitHub Actions deals with this by not running any tests at all, while the buildbots test the last commit before the conflict. |
I dislike this approach: modifying Python os.stat() to use statx. I would prefer to expose C statx() as Python os.statx() instead. |
… raw kwarg We no longer return None for invalid attributes for the benefit of drop-in replacement of os.stat when not all basic stats are required or sync=False is acceptable. Such code was already accepting faked values from os.stat, so can also accept faked values from os.statx. Remove the raw kwarg as it would have no effect.
These members were removed from os.stat_result in a previous commit.
I've marked this ready for review. @vstinner I'd also appreciate a review from you; GitHub isn't showing me the interface to formally request one. This PR has changed quite a bit from the first commit, and as it did so, I wrote quite a few words in the issue #83714 that may be informative (probably best to read from the bottom up). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed the pieces I have reasonable knowledge of; don't have enough experience with macros and defining new structs to go in more detail through those pieces. For first contributions to CPython really impressive in getting things to the CPython style (docs, C code style, etc)
High level:
- This PR is large for CPython; Smaller chunks will make it easier to review and land; My thought would be adding one part adding os.statx; second updating os.stat to use statx when available
- The
os.stat
change will need (on newer + older kernels). Just adding statx doesn't need as much performance validationpyperf
microbenchmark that shows stat didn't get slower- Validate that Python startup time does not change significantly (I believe github.com/python/pyperformance has two benchmarks around that)
statx
will probably make sense to add topathlib.Path
for easy access likestat
is (but that should probably be another PR with the same issue)
With the thoroughness you're doing definitely feels like will get to the confidence to change os.stat
to point to os.statx
. Again; really impressive for a first CPython contributions.
Doc/library/os.rst
Outdated
|
||
.. function:: statx(path, mask, *, dir_fd=None, follow_symlinks=True, sync=None) | ||
|
||
Get the status of a file or file descriptor by performing a :c:func:`statx` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be able to make the missing c func / struct pieces stop erroring by adding to:
Line 121 in 4e00e25
nitpick_ignore = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Get the status of a file or file descriptor by performing a :c:func:`statx` | |
Get the status of a file or file descriptor by performing a :c:func:`!statx` |
This will also work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running make html
and make check
as directed by the devguide, I don't see any warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually retrieved (which may differ from *mask*). | ||
|
||
The optional parameter *sync* controls the freshness of the returned | ||
information. ``sync=True`` requests that the kernel return up-to-date |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should document the value of None
which is the default; may make sense to just make it always True/False for now
Lib/test/test_os.py
Outdated
self.assertAlmostEqual(floaty, nanosecondy, delta=2) | ||
|
||
# Ensure both birthtime and birthtime_ns roughly agree, if present | ||
time_attributes = 'st_atime st_mtime st_ctime'.split() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why not just make a three element list here? (I see the old code did this, but seems reasonable to just update)
def check_statx_attributes(self, fname): | ||
maximal_mask = 0 | ||
for name in dir(os): | ||
if name.startswith('STATX_'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feels like there should be a better way to group / gather these (maybe have them all live as constants in a Python class that acts as a namespace?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar constants for other os
functions are also dumped in the module namespace. That doesn't mean we can't do something different, but we'd be setting somewhat of a precedent.
(It would be nice to have an IntFlag
enum class that can report which members are valid given a bitmask, but adding an import dependency to os
doesn't seem like a good idea.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fairly strongly against "find everything that starts with STATX_
in os" as that will just get slower as os
gets larger.
Can be just a manually maintained list for now; would be nice to do something more namespace like (but also don't want to add a new CPython standard here).
for sync in (False, True): | ||
with self.subTest(sync=sync): | ||
os.statx(self.fname, os.STATX_BASIC_STATS, sync=sync) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will need more test cases for different ways of constructing/using statx (path_fd, dir_fd, PathLike)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the dir_fd tests go in here in test_os.py or in TestPosixDirFd in test_posix.py?
Should all the statx tests be in test_posix.py? The comment at the top of test_os.py says it's for "a few functions which have been determined to be more portable than they had been thought to be", and statx is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in test_os
works; don't need to test that "do the dir_fd helpers use handle every case right", want to just make sure "if someone deleted the dir_fd handling code; a test would fail"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_posix
seems to be around https://docs.python.org/3/library/posix.html; I don't have experience / knowledge around that so not sure
int flags = AT_NO_AUTOMOUNT; | ||
flags |= follow_symlinks ? 0 : AT_SYMLINK_NOFOLLOW; | ||
|
||
Py_BEGIN_ALLOW_THREADS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't see a reason not to start allowing threads a bit earlier; aren't touching the interpreter in the lines just before this
typedef struct { | ||
PyObject_HEAD | ||
struct statx stx; | ||
#if STATX_RESULT_CACHE_SLOTS > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this always > 0? (the constant 17)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's defined this way to allow reviewers and future maintainers to more easily experiment with not having a cache (because member arrays can't be empty in C). On reflection, it doesn't help much, because the no-cache configuration would change most of the getset defs to members for faster access, and after that, commenting out the member is trivial.
int result; | ||
/* Future bits may refer to members beyond the current size of struct | ||
statx, so we need to mask them off to prevent memory corruption. */ | ||
mask &= _Py_STATX_KNOWN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this error if the mask was changed by this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, maybe. We'd only adjust the mask for Python code that specified the mask numerically instead of using the os.STATX_*
constants. That's probably only code that wants to use members Python 3.16 knows about but Python 3.15 doesn't. os.statx_result
doesn't provide Python with an interface to access members the interpreter doesn't know about, so while code could use the numeric value of STATX_FOO
, accessing stx_foo
would raise AttributeError
. And that's how current code copes with os.stat_result
's members varying across versions. But I'm open to arguments that failing fast is better and version-crossing code should instead change their mask based on the constants available.
#endif /* HAVE_EVENTFD && EFD_CLOEXEC */ | ||
|
||
#ifdef HAVE_STATX | ||
if (PyModule_AddIntMacro(m, STATX_TYPE)) return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Any reason not to alphabetize this list?
I like having either a comment why not alphabetical or to alphabetize for easier later code reading "is this member in the list" when trying to add more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order matches the definitions in /usr/include/linux/stat.h
. When new bits are defined, they're added at the end (there are no gaps in the definitions). Alphabetizing would make things harder, not easier.
I don't recommend this. |
👍 / agree with you, meant to say |
glibc doesn't remember that statx previously failed with ENOSYS, so on old kernels, glibc's emulation of statx via stat has the overhead of a failed syscall. Only call the statx wrapper under the same conditions that we make os.statx available. statx_works was always a global C variable, but now its scope is the file instead of just posix_do_stat. Whether statx works is a process-global property, so using a global C variable is appropriate. Previously access to statx_works was unsynchronized, following the pattern of dup3_works. Now we rely on posixmodule_exec happening-before any thread can call os.stat.
Also two minor fixes to statx tests in test_os.py.
Some benchmark numbers comparing
The 'PR nobtime' row replaces the last I don't have a VM running a pre-statx kernel handy, but I did find the The following numbers are from under
We're maybe slightly slower, and this time not due to an extra float and int, but if you look at the histograms in the gist, there are more outliers here. It's hard to conclude much. pyperformance startup benchmark, main vs this PR:
main vs PR nobtime:
PR nobtime vs PR:
This PR doesn't change any of the code in fileutils.c to use statx, so if we're slower, it's because of Python main vs this PR running under
|
As I wrote previously, I would prefer to leave I support the idea of adding a new I suggest creating a new PR which only adds a new |
Hmm. In 2024 a user asked for an update on the issue and you replied that ntninja's PR needed to be updated. That PR changed
That's a non sequitur. The reason I am pushing so hard for
And this PR does in fact add a |
Victor and I synced on the above at our ongoing core team sprint yesterday, I'm in agreement: Lets start with only While that could be done in this PR you may find it cleaner for review purposes to make a new PR focused on just that.
I really appreciate that you're focusing on the performance of this. I'm interested to see what you find! Long term, we envision there becoming a higher level API than the os module as a path stat interface. It could use os.stat, os.statx, and other APIs under the hood. But os.stat itself should not become such an API. I realize this is proposing that we ultimately create a new higher level API for stat. But things like We should track a new higher level API with its own issue. That may wind up becoming a PEP-style discussion as it'd be a new high level API - to be determined. |
|
Pathlib updates for stat attributes are currently being discussed -- https://discuss.python.org/t/expose-file-size-mode-etc-from-pathlib-path-info/103828 questions/criticism/ideas welcome :). |
I've posted some questions in gh-83714. For now, this PR should be considered earnest effort (in the sense of earnest money). It will change based on the answers to the questions in the issue. It needs tests for the new features and the fallback that are not obvious to implement. I need guidance as to when it should strictly conform to PEP 7 and when it should match the surrounding code style (especially with regards to brace placement). But it should demonstrate I'm willing and able.
The libc statx wrapper is required at build time, but weakly linked. There is no configure test; we just look for
STATX_BASIC_STATS
being defined bysys/stat.h
.posix_do_stat
will fall back to calling stat if statx is not available at runtime due to an old libc or an old kernel. The fallback variablestatx_works
is based ondup3_works
, for the thread-safety of which see @ericsnowcurrently's comment. The values are (to me) nonobvious: -1 means statx has never failed, 0 means its first failure was ENOSYS and 1 means its first failure was something other than ENOSYS.This PR is not directly based on gh-19125, but was greatly informed by it, and I'd be happy to give Co-authored-by and/or blurb credit to @ntninja.
statx(2)
system call on Linux for extendedos.stat
information #83714📚 Documentation preview 📚: https://cpython-previews--136334.org.readthedocs.build/