Skip to content

Conversation

@aisk
Copy link
Contributor

@aisk aisk commented Apr 29, 2025

This change will introduce a performance regression:

import os
import timeit
f = open("a.py")
print(timeit.timeit("os.lseek(f.fileno(), 1, os.SEEK_CUR)", number=100000000, globals=globals()))
f.close()

Before the change:

PS C:\Users\xxxxx\Source\cpython> .\python.bat a.py
Running Release|x64 interpreter...
93.8409629999951

After the change:

PS C:\Users\xxxxx\Source\cpython> .\python.bat a.py
Running Release|x64 interpreter...
123.18093929998577

However I think it's acceptable because we added a check in the implementation and os.lseek usually won't been called too many times in the real world.

@aisk aisk marked this pull request as ready for review April 29, 2025 16:01
@aisk aisk changed the title gh-86768: implement os.lseek with SetFilePointer on Windows gh-86768: check if fd is seekable in os.lseek on Windows May 10, 2025
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be slightly simpler if set result to -1 initially.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. 👍

@morotti
Copy link
Contributor

morotti commented Jun 27, 2025

hello,

is it the function that is called when doing file.seek() on a file with open("myfile.zip", "r") as f

if yes, I am somewhat concerned about the performance impact of adding a syscall on every call. is it really needed?
the PR doesn't explain why the check is needed? what happened before when it wasn't explicitly checked whether the file is seekable and it was not seekable?

seek is very heavily used by applications and it is performance sensitive.
for example when you install a package with pip install, the python package is merely a zip file to extract.
the zip file can contain thousands of files, the extraction is doing multiple seek+read operations for each file to extract, to locate (multiple) headers and the content.
I wonder if these calls to seek are going to os.lseek?
https://github.com/python/cpython/blob/3.13/Lib/zipfile/__init__.py

@serhiy-storchaka
Copy link
Member

For example, you can write a ZIP file to not seekable file. If the output file is seekable, ZipFile seeks back to the local file header to write the compressed and uncompressed file sizes after writing the file data. If the output file is not seekable, ZipFile marks them as "undefined" in the local file header and use other way to write this metainformation. If seekable() falsely returns True and seek() does not have effect for pipe, ZipFile will produce corrupted ZIP file.

It is unfortunate that this change will add 0.3 microseconds for each seek() (this is 0.3 milliseconds for a thousand of files), but we have no other way.

@serhiy-storchaka
Copy link
Member

@zooba, could you please take a look? I believe this change should be merged, despite its performance impact.

@zooba
Copy link
Member

zooba commented Sep 3, 2025

Why not change seekable instead? That's what the original bug asked for, and it shouldn't occur in tight loops.

My general rule is that I don't like predicting what errors the OS should return. It's weird that it doesn't produce an error here, but I guess it breaks a lot of apps to introduce it (which should also warn us against introducing an error...). But if we have to do extra work for a bool result like seekable, that's fine by me.

@aisk
Copy link
Contributor Author

aisk commented Sep 3, 2025

Why not change seekable instead? That's what the original bug asked for, and it shouldn't occur in tight loops.

The seekable method for open(os.pipe()[1], 'w') will call lseek on FileIO under the hood, so change os.lseek will also change seekable:

PyObject *pos = portable_lseek(self, NULL, SEEK_CUR, false);

Changing the seekable method for FileIO to check if it's a pipe on Windows is also a solution, but I think the os module is more like a compatibility layer for Windows to emulate POSIX, so changing os.lseek to let it behave more like Unix or Linux is a better way?

@zooba
Copy link
Member

zooba commented Sep 3, 2025

I think the os module is more like a compatibility layer for Windows to emulate POSIX, so changing os.lseek to let it behave more like Unix or Linux is a better way?

Fair, but in that case, I'd prefer we check specifically for a pipe and raise, rather than "anything that isn't disk". It's less disruptive to add new exceptions for specific cases than for negative cases.

@serhiy-storchaka
Copy link
Member

Why not change seekable instead? That's what the original bug asked for, and it shouldn't occur in tight loops.

These are the LBYL and EAFP sides of the same problem. Some code uses FileIO.seekable(), other code uses FileIO.seek() and os.lseek(). BTW, we need to update also the FileIO methods, the C implementation does not call os.lseek().

I'd prefer we check specifically for a pipe and raise, rather than "anything that isn't disk".

Does this work for "CON", "NUL" and "\\.\pipe\xxx"?

@zooba
Copy link
Member

zooba commented Sep 15, 2025

Does this work for "CON", "NUL" and "\\.\pipe\xxx"?

Maybe? I think NUL claims to be a character device, which may get mapped down to pipe at some point, or might be another case that's worth detecting explicitly.

My point is that we should make a list of things that are definitely not seekable and handle those, rather than making a list of things that definitely are seekable and assume that all others should be an error.

@serhiy-storchaka
Copy link
Member

According to the GetFileType() documentation, the only returned values are FILE_TYPE_CHAR, FILE_TYPE_DISK, FILE_TYPE_PIPE and FILE_TYPE_UNKNOWN (which also denotes error). FILE_TYPE_DISK is seekable, FILE_TYPE_PIPE is not seekable.

On Linux, /dev/null, /dev/full, /dev/zero and /dev/urandom are seekable, but seeking is a no-op. /dev/tty and /dev/pts/N are non-seekable. So maybe treat FILE_TYPE_CHAR as seekable?

And what is the correct behavior for FILE_TYPE_UNKNOWN?

@zooba
Copy link
Member

zooba commented Sep 26, 2025

So maybe treat FILE_TYPE_CHAR as seekable?

Seems reasonable.

And what is the correct behavior for FILE_TYPE_UNKNOWN?

I think allow it to work, so that cases where seeking work aren't blocked. Cases where seeking is an error can return an error if they want, but if they want it to be a no-op like pipes, then they can do that too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants