Problem
When metadata is generated and cached, the max_files config option is only applied during generation, not when loading from cache. This means if you generate metadata with all files but later want to run a quick test with fewer files, you're stuck with the full dataset.
Current behavior:
- processes filter: Works on cached metadata
- max_files filter: Only works during generation
Example
# First run: generate metadata with all files
config["datasets"]["max_files"] = None
config["general"]["run_metadata_generation"] = True
# ... generates metadata/fileset.json with 1000 files per dataset
# Second run: want to test with just 10 files
config["datasets"]["max_files"] = 10
config["general"]["run_metadata_generation"] = False # use cached
# ... but workitems still contain all 1000 files!
Expected behavior
When loading cached metadata, max_files should filter the workitems to only include the first N files per dataset, matching what would happen if metadata was regenerated with that setting.
Suggested approach
Add a filter_by_max_files() utility (similar to existing filter_by_process() in utils/filters.py) and apply it either:
- In DatasetMetadataManager._load_existing_metadata()
- Or in run_processor_workflow() alongside the existing processes filtering
This would allow quick iteration on cached metadata without regenerating.
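A minimal sketch of what such a filter might look like, to make the proposal concrete. It assumes workitems are objects exposing `dataset` and `filename` attributes and that one file may be split across several workitems. The `WorkItem` dataclass here is a hypothetical stand-in, not the project's actual type, and the real signature should follow whatever `filter_by_process()` already uses.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class WorkItem:
    """Hypothetical stand-in for the project's workitem type (assumption)."""
    dataset: str
    filename: str
    entrystart: int = 0
    entrystop: int = 0


def filter_by_max_files(
    workitems: Iterable[WorkItem],
    max_files: Optional[int],
) -> list[WorkItem]:
    """Keep only workitems from the first `max_files` files of each dataset.

    "First" means first seen in the input order. `max_files=None` disables
    the filter, mirroring the config default shown in the example above.
    """
    items = list(workitems)
    if max_files is None:
        return items
    if max_files < 0:
        raise ValueError("max_files must be None or a non-negative integer")

    # dataset name -> filenames already accepted for that dataset
    accepted: dict[str, set[str]] = {}
    kept: list[WorkItem] = []
    for item in items:
        files = accepted.setdefault(item.dataset, set())
        if item.filename in files:
            # Another chunk of a file we already decided to keep.
            kept.append(item)
        elif len(files) < max_files:
            # A new file, and the dataset is still under its limit.
            files.add(item.filename)
            kept.append(item)
    return kept
```

With this shape, the call site could be as small as `workitems = filter_by_max_files(workitems, config["datasets"]["max_files"])` right after the cached metadata is loaded. Counting distinct filenames rather than workitems keeps all chunks of a selected file together, which should match what regenerating with the same `max_files` would produce, provided the cached order matches the generation order.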