Improve tarfile streaming mode to handle very large archives #139960

@za3k

Description

Bug report

Bug description:

I am trying to use tarfile to write very large archives, and the process is being killed by the OOM killer. I would expect (and want) to be able to use it in streaming mode, writing an unlimited number of files, like the standard tar command-line utility supports by default.
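For reference, tarfile already has a pipe mode, "w|", for writing to a non-seekable stream; as far as I can tell the member list still grows there, so it does not avoid this problem. A minimal sketch of that mode (writing one member to stdout):

import io
import sys
import tarfile

# "w|" streams the archive to a non-seekable file object (stdout here),
# but addfile() still appends every TarInfo to self.members, so memory
# still grows with the number of members.
with tarfile.open(fileobj=sys.stdout.buffer, mode="w|") as t:
    info = tarfile.TarInfo(name="hello.txt")
    info.size = 5
    t.addfile(info, io.BytesIO(b"hello"))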

Reproduction case below.

import gc
import io
import os
import psutil
import tarfile

if __name__ == "__main__":
    t = tarfile.open("a.tar", mode="w")  # mode "w" writes an uncompressed archive
    for i in range(1, 100_000_000):
        if i % 10_000 == 0:
            # Periodically report resident memory so the growth is visible
            gc.collect()
            process = psutil.Process(os.getpid())
            mem_info = process.memory_info()
            mem = mem_info.rss
            print(f"Iteration {i}, memory usage: {mem}")

        bs = (" " * 1000 + str(i)).encode('utf8')
        with io.BytesIO(bs) as file:
            tarinfo = tarfile.TarInfo(name=f"cool_files/{i}.txt")
            tarinfo.size = len(bs)
            t.addfile(tarinfo, file)

The memory usage increases without bound because of this line in addfile(), which appends every TarInfo to the TarFile's in-memory members list:

self.members.append(tarinfo)

I'm not sure what the use case is for this line. In write-only mode it does not seem useful; maybe it matters for mixed read/write? In general, though, it does not seem correct to assume that all the TarInfo objects fit in memory.
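Since a pure write never reads the member list back, one could imagine a write-only subclass that drops entries as they are written. A minimal sketch (the StreamingTarFile name is hypothetical, not part of tarfile), assuming getmembers()/getnames() are never needed:

import tarfile

class StreamingTarFile(tarfile.TarFile):
    # Hypothetical write-only subclass: let the parent write the member,
    # then drop the TarInfo it appended so self.members stays empty.
    # Safe only when the archive is never read back through this object.
    def addfile(self, tarinfo, fileobj=None):
        result = super().addfile(tarinfo, fileobj)
        self.members.clear()
        return result

Opened as StreamingTarFile("a.tar", mode="w"), the repro above should run with roughly flat memory usage.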

Edit: As a workaround, I'm setting t.members=[] manually.
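Applied to the repro, the workaround looks like this (trimmed-down loop, without the memory reporting):

import io
import tarfile

t = tarfile.open("a.tar", mode="w")
for i in range(1, 100_000_000):
    bs = (" " * 1000 + str(i)).encode('utf8')
    with io.BytesIO(bs) as file:
        tarinfo = tarfile.TarInfo(name=f"cool_files/{i}.txt")
        tarinfo.size = len(bs)
        t.addfile(tarinfo, file)
    t.members = []  # workaround: drop the TarInfo that addfile() just appended
t.close()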

CPython versions tested on:

3.13

Operating systems tested on:

Linux

Metadata

    Labels

    stdlib (Standard Library: Python modules in the Lib/ directory)
    type-feature (A feature request or enhancement)
