Description
I have a Python file:
from concurrent.futures import ProcessPoolExecutor
import tarfile

import rapidgzip


def processNdjson(ndjsonName):
    # inTarDir and rapidgzipDir are globals set in the __main__ block below,
    # so worker processes only see them when they are forked from the parent.
    with rapidgzip.open(inTarDir) as myZip:
        myZip.import_index(rapidgzipDir)
        with tarfile.open(fileobj=myZip, mode="r:*") as f:
            member = f.getmember(ndjsonName)
            dataFile = f.extractfile(member)
            for oneLine in dataFile:
                pass  # process oneLine here


if __name__ == "__main__":
    rapidgzipDir = ...
    inTarDir = ...
    nCore = 5
    ndjsonNames = ["name1.ndjson", "name2.ndjson"]
    with ProcessPoolExecutor(nCore) as pool:
        results = pool.map(processNdjson, ndjsonNames)

Above,
- `inTarDir` is the path to a .tar.gz file that contains multiple (fewer than 1000) .ndjson files of approximately equal size. My use case is the HTML dump of Wikipedia.
- `rapidgzipDir` is the path to the pre-built index file to be used by `rapidgzip`. `rapidgzip` allows fast random access and is a drop-in replacement for the built-in Python `gzip.GzipFile`.
- Each .ndjson file will be processed sequentially. It seems that reading each .ndjson file is purely sequential decompression and is therefore not affected by checkpoint spacing.
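
To make the access pattern concrete, here is a minimal sketch using only the standard library. It builds a tiny in-memory .tar.gz (the member contents and the helper names `build_demo_targz` and `read_member_lines` are made up for illustration) and then reads one member the same way as the code above: one jump to the member's data offset, followed by a purely sequential read.

```python
import io
import tarfile


def build_demo_targz(members):
    """Build a small in-memory .tar.gz from a {name: bytes} mapping (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def read_member_lines(archive, name):
    """Return (offset of the member's data in the tar stream, list of its lines)."""
    with tarfile.open(fileobj=archive, mode="r:*") as tar:
        member = tar.getmember(name)
        data_file = tar.extractfile(member)
        # After the single seek to member.offset_data, iteration is sequential.
        lines = list(data_file)
        return member.offset_data, lines


demo = build_demo_targz({
    "name1.ndjson": b'{"id": 1}\n{"id": 2}\n',
    "name2.ndjson": b'{"id": 3}\n',
})
offset, lines = read_member_lines(demo, "name2.ndjson")
print("data offset in the uncompressed tar stream:", offset)
print("lines:", lines)
```

With a real archive, the only place where the index should matter is the initial jump to `member.offset_data`; everything after that is a forward read.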
As such, the index file for my .tar.gz file does not need many checkpoints. I would like to ask whether rapidgzip allows configuring the checkpoint spacing of the index file. This is possible in indexed_gzip.
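
A rough way to see why coarser spacing is attractive: the sketch below estimates how many checkpoints an index would hold for a given spacing, and an upper bound on the window data they carry. It assumes each checkpoint stores one full 32 KiB DEFLATE window, which is only an approximation, since real index formats may compress or trim windows. The function name `estimate_index` and the 100 GiB stream size are hypothetical, not taken from rapidgzip or indexed_gzip.

```python
import math

# DEFLATE back-references reach at most 32 KiB, so a checkpoint needs
# at most this much preceding data to resume decompression.
DEFLATE_WINDOW_BYTES = 32 * 1024


def estimate_index(stream_bytes, spacing_bytes):
    """Return (checkpoint count, upper bound on window bytes) for one stream.

    stream_bytes and spacing_bytes must be measured in the same stream
    (compressed or uncompressed), whichever the library's spacing refers to.
    """
    if spacing_bytes <= 0:
        raise ValueError("spacing_bytes must be positive")
    checkpoints = math.ceil(stream_bytes / spacing_bytes)
    return checkpoints, checkpoints * DEFLATE_WINDOW_BYTES


# Hypothetical 100 GiB stream, compared across three spacings.
for spacing_mib in (1, 16, 64):
    count, window_bytes = estimate_index(100 * 1024**3, spacing_mib * 1024**2)
    print(f"spacing {spacing_mib:>2} MiB: {count} checkpoints, "
          f"up to {window_bytes / 1024**2:.0f} MiB of windows")
```

With fewer than 1000 members that are each read front to back, only about one usable checkpoint per member start is needed, which is far fewer than a fine-grained index provides.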
Thank you for your elaboration.