Seek-point spacing #68

I have a Python file:

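````python
from concurrent.futures import ProcessPoolExecutor
import tarfile

import rapidgzip

# Defined at module level so that worker processes can see them,
# including under the "spawn" start method (Windows, macOS).
rapidgzipDir = ...  # path to the pre-built rapidgzip index file
inTarDir = ...      # path to the .tar.gz archive

def processNdjson(ndjsonName):
    with rapidgzip.open(inTarDir) as myZip:
        myZip.import_index(rapidgzipDir)
        with tarfile.open(fileobj=myZip, mode="r:*") as f:
            member = f.getmember(ndjsonName)
            dataFile = f.extractfile(member)
            for oneLine in dataFile:
                pass  # process oneLine here

if __name__ == "__main__":
    nCore = 5
    ndjsonNames = ["name1.ndjson", "name2.ndjson"]

    with ProcessPoolExecutor(nCore) as pool:
        # list() consumes the iterator so that worker exceptions surface here
        results = list(pool.map(processNdjson, ndjsonNames))
````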

Above,

- `inTarDir` is the path to a .tar.gz file that contains multiple (fewer than 1,000) .ndjson files of approximately *equal* size. My use case is the [HTML dump](https://enterprise.wikimedia.com/api/) of Wikipedia.
- `rapidgzipDir` is the path to the pre-built index file used by [`rapidgzip`](https://github.com/mxmlnkn/rapidgzip/tree/main/python/rapidgzip). The index allows fast random access, and `rapidgzip` is a drop-in replacement for the built-in Python `gzip.GzipFile`.
- Each .ndjson file is processed *sequentially*. Reading an NDJSON file appears to be purely sequential decompression once the start of the file has been reached, so it should not be affected by checkpoint spacing.
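
To make the access pattern concrete, here is a minimal, standard-library-only sketch. It uses `gzip.GzipFile` instead of `rapidgzip` and a tiny in-memory archive, so the helper names `build_demo_archive` and `read_member_lines` and the sample data are made up for illustration. It only shows the shape of the reads: the tar headers are scanned to find the member, and after that every read moves forward.

```python
import gzip
import io
import tarfile


def build_demo_archive(files):
    """Return the bytes of a small in-memory .tar.gz holding {name: bytes} files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_member_lines(archive_bytes, member_name):
    """Locate one member, then read it line by line from start to end."""
    with gzip.GzipFile(fileobj=io.BytesIO(archive_bytes)) as stream:
        with tarfile.open(fileobj=stream, mode="r:") as archive:
            # getmember() scans the tar headers, which seeks across the archive.
            member = archive.getmember(member_name)
            # extractfile() returns a reader positioned at the member's first byte.
            data_file = archive.extractfile(member)
            # From here on, the reads only move forward through the member.
            return [line for line in data_file]


demo = build_demo_archive({
    "name1.ndjson": b'{"id": 1}\n{"id": 2}\n',
    "name2.ndjson": b'{"id": 3}\n',
})
lines = read_member_lines(demo, "name2.ndjson")
```

In this sketch the only non-sequential steps are the header scan and the jump to the member's start; those are the steps where an index would matter.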

As such, the index file for my .tar.gz file does not need many checkpoints. Does `rapidgzip` allow configuring the checkpoint spacing of the index file? This is possible in [indexed_gzip](https://github.com/pauldmccarthy/indexed_gzip).
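
For a rough sense of why the spacing matters, here is a back-of-the-envelope sketch. It assumes a zran-style index in which every checkpoint stores up to one 32 KiB DEFLATE window; whether `rapidgzip` stores or compresses its windows this way is an assumption on my side, and `estimate_index` and the 100 GiB figure are hypothetical.

```python
WINDOW_BYTES = 32 * 1024  # maximum DEFLATE back-reference window


def estimate_index(uncompressed_bytes, spacing_bytes, window_bytes=WINDOW_BYTES):
    """Return (checkpoint_count, upper_bound_on_stored_window_bytes)."""
    if spacing_bytes <= 0:
        raise ValueError("spacing_bytes must be positive")
    # Ceiling division: one checkpoint per started spacing interval.
    checkpoints = -(-uncompressed_bytes // spacing_bytes)
    return checkpoints, checkpoints * window_bytes


archive_size = 100 * 1024**3  # hypothetical 100 GiB of uncompressed data
for spacing_mib in (1, 4, 64):
    count, window_total = estimate_index(archive_size, spacing_mib * 1024**2)
    print(f"{spacing_mib:>3} MiB spacing: {count:>7} checkpoints, "
          f"up to {window_total / 1024**2:.0f} MiB of windows")
```

Under these assumptions the index size scales inversely with the spacing, while the extra decompression needed to reach the start of a member is bounded by roughly one spacing interval.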

Thank you for your elaboration.
