Seek-point spacing #68

@leanhdung1994

I have a Python file:

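from concurrent.futures import ProcessPoolExecutor
import tarfile, rapidgzip

# Defined at module level so worker processes can see them under any
# multiprocessing start method (spawn re-imports this module).
rapidgzipDir = ...
inTarDir = ...

def processNdjson(ndjsonName):
    with rapidgzip.open(inTarDir) as myZip:
        # Load the prebuilt seek-point index before any seeking happens.
        myZip.import_index(rapidgzipDir)
        with tarfile.open(fileobj=myZip, mode="r:*") as f:
            member = f.getmember(ndjsonName)
            dataFile = f.extractfile(member)
            for oneLine in dataFile:
                pass  # process oneLine here

if __name__ == "__main__":
    nCore = 5
    ndjsonNames = ["name1.ndjson", "name2.ndjson"]

    with ProcessPoolExecutor(nCore) as pool:
        # list() consumes the lazy iterator so worker exceptions surface here.
        results = list(pool.map(processNdjson, ndjsonNames))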

In the code above:

  • inTarDir is the path to a .tar.gz file that contains multiple (fewer than 1000) .ndjson files of approximately equal size. My use case is the HTML dump of Wikipedia.
  • rapidgzipDir is the path to the prebuilt index file used by rapidgzip, which allows fast random access. rapidgzip is a drop-in replacement for Python's built-in gzip.GzipFile.
  • Each .ndjson file is processed sequentially from start to end. Once the reader has reached the start of a member, reading it seems to be pure sequential decompression and should not be affected by checkpoint spacing.
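
To illustrate the last point, here is a minimal sketch using only the standard library (plain gzip instead of rapidgzip). The helpers build_tar_gz and member_span and the tiny sample payloads are hypothetical and exist only for this illustration. It shows that a tar member occupies one contiguous byte range of the decompressed stream, so a reader needs a single seek to the start of the member followed by a purely sequential read.

```python
import gzip
import io
import tarfile


def build_tar_gz(members):
    """Return the bytes of a .tar.gz archive holding the given {name: bytes} members."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def member_span(archive_bytes, name):
    """Return (offset, size) of a member's data within the decompressed tar stream."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        member = tar.getmember(name)
        return member.offset_data, member.size


# Hypothetical sample archive with two small NDJSON members.
archive = build_tar_gz({
    "name1.ndjson": b'{"id": 1}\n{"id": 2}\n',
    "name2.ndjson": b'{"id": 3}\n',
})

offset, size = member_span(archive, "name2.ndjson")

# The member's data is exactly the slice [offset, offset + size) of the
# decompressed stream: one seek, then a sequential read.
raw_tar = gzip.decompress(archive)
lines = raw_tar[offset:offset + size].splitlines()
print(offset, size, lines)
```

With an index, the only place where checkpoint spacing matters in this access pattern is the initial seek to the member's offset. Everything after that is read in order.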

As such, the index file for my .tar.gz file does not need many checkpoints. Does rapidgzip allow configuring the checkpoint spacing of the index file? This is possible in indexed_gzip.
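
As a rough illustration of why the spacing matters for index size, the sketch below estimates how many checkpoints a fixed spacing produces. The function seek_point_count, the 100 GiB stream size, and the candidate spacings are all assumptions chosen for the example; they are not rapidgzip defaults or measurements of a real dump, and the actual units and layout of the index depend on the library.

```python
import math

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def seek_point_count(stream_size, spacing):
    """Approximate number of checkpoints for a stream indexed at a fixed spacing.

    Both arguments must be in the same unit (bytes here).
    """
    return math.ceil(stream_size / spacing)


# Hypothetical stream size, for illustration only.
stream_size = 100 * GIB

counts = {
    spacing: seek_point_count(stream_size, spacing)
    for spacing in (1 * MIB, 32 * MIB, 1 * GIB)
}

for spacing, count in counts.items():
    print(f"spacing {spacing // MIB:>5} MiB -> about {count} checkpoints")
```

The trade-off is that a seek may have to decode and discard up to one spacing's worth of data before reaching the target offset. For fewer than 1000 members that are each read once from the start, a spacing on the order of the member size keeps roughly one checkpoint per member.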

Thank you for your elaboration.

Labels: enhancement (New feature or request)
