Documentation
Consider the following reproducer:
import io
import sys
import tarfile

with open("/dev/urandom", "rb") as f:
    data = io.BytesIO(f.read(3849))
size = len(data.getbuffer())

if sys.argv[1] == "stream":
    kwargs = {
        "mode": "w|xz",
        "compresslevel": 9,
    }
else:
    kwargs = {
        "mode": "w:xz",
        "preset": 9,
    }

with tarfile.open("test.tar.xz", format=tarfile.GNU_FORMAT, **kwargs) as tarf:
    for x in range(50000):
        data.seek(0)
        tinfo = tarfile.TarInfo(f"{x}.txt")
        tinfo.size = size
        tarf.addfile(tinfo, data)
It is supposed to simulate a simplified version of adding lots of small files to an xz-compressed archive. Consider the timings:
$ time python3.13 test.py normal
real 0m10,316s
user 0m9,716s
sys 0m0,568s
$ time python3.13 test.py stream
real 0m9,115s
user 0m8,999s
sys 0m0,090s
The stream mode (w|xz) is noticeably faster than the regular mode (w:xz) here. For example, when the problem was reported to pycargoebuild, I found that repacking the uv-0.6.17 crates takes roughly 3 min 35 s in regular mode and 3 min 15 s in stream mode.
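For anyone who wants to repeat the comparison without touching the disk, here is a rough in-memory sketch. The helper names `build_archive` and `count_members` are made up for illustration, and the file count is much smaller than in the reproducer above. It deliberately passes no `preset` or `compresslevel`, so both modes fall back to the defaults; the absolute numbers will vary by machine and Python version.

```python
import io
import os
import tarfile
import time


def build_archive(mode, n_files=200, size=3849):
    """Write n_files random members to an in-memory archive.

    Returns (elapsed_seconds, archive_bytes). Hypothetical helper.
    """
    payload = io.BytesIO(os.urandom(size))
    out = io.BytesIO()
    start = time.perf_counter()
    with tarfile.open(fileobj=out, mode=mode, format=tarfile.GNU_FORMAT) as tarf:
        for x in range(n_files):
            payload.seek(0)
            tinfo = tarfile.TarInfo(f"{x}.txt")
            tinfo.size = size
            tarf.addfile(tinfo, payload)
    elapsed = time.perf_counter() - start
    return elapsed, out.getvalue()


def count_members(archive_bytes):
    """Read the archive back in stream mode and count its members."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r|xz") as tarf:
        return sum(1 for _ in tarf)


results = {mode: build_archive(mode) for mode in ("w:xz", "w|xz")}
for mode, (elapsed, archive) in results.items():
    print(f"{mode}: {elapsed:.3f} s, {len(archive)} bytes")
```

Printing the archive size next to the time helps check that the two modes are doing comparable compression work.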
I presume the differences are by design, but I think it would be useful to document them more clearly. Currently, the documentation indicates that:
For special purposes, there is a second format for mode: filemode|[compression]. tarfile.open() will return a TarFile object that processes its data as a stream of blocks. No random seeking will be done on the file. If given, fileobj may be any object that has a read() or write() method (depending on the mode) that works with bytes. bufsize specifies the blocksize and defaults to 20 * 512 bytes. Use this variant in combination with e.g. sys.stdin.buffer, a socket file object or a tape device. […]
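As a small illustration of the "any object that has a write() method" part of that paragraph, the sketch below writes an xz-compressed archive to a sink that cannot seek at all. `WriteOnlySink` is a hypothetical class made up for this example; it stands in for something like a socket or pipe.

```python
import io
import tarfile


class WriteOnlySink:
    """Hypothetical file-like object exposing only write(); no seek() or tell()."""

    def __init__(self):
        self.chunks = []

    def write(self, b):
        self.chunks.append(bytes(b))
        return len(b)


payload = b"hello, stream mode\n"
sink = WriteOnlySink()

# Stream mode never seeks, so a write()-only object is enough.
with tarfile.open(fileobj=sink, mode="w|xz") as tarf:
    tinfo = tarfile.TarInfo("hello.txt")
    tinfo.size = len(payload)
    tarf.addfile(tinfo, io.BytesIO(payload))

archive = b"".join(sink.chunks)

# Read the result back, again as a stream, to confirm it is a valid archive.
with tarfile.open(fileobj=io.BytesIO(archive), mode="r|xz") as tarf:
    names = [member.name for member in tarf]

print(names)
```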
This suggests that you'd only use the stream mode in special cases, in particular when the underlying file doesn't provide for random access. However, this experiment seems to suggest that the stream mode is faster in general, and particularly when dealing with lots of files. Therefore, I think the documentation could be updated to indicate that the stream mode is faster when adding lots of files, or perhaps that it should be preferred in general unless random access is actually necessary.