
feat: Support max_files filtering on cached metadata #35

@MoAly98


Problem

When metadata is generated and cached, the max_files config option is only applied during generation, not when loading from cache. This means if you generate metadata with all files but later want to run a quick test with fewer files, you're stuck with the full dataset.

Current behavior:

  • processes filter: Works on cached metadata
  • max_files filter: Only works during generation

Example

# First run: generate metadata with all files
config["datasets"]["max_files"] = None
config["general"]["run_metadata_generation"] = True
# ... generates metadata/fileset.json with 1000 files per dataset

# Second run: want to test with just 10 files
config["datasets"]["max_files"] = 10
config["general"]["run_metadata_generation"] = False  # use cached
# ... but workitems still contain all 1000 files!

Expected behavior

When loading cached metadata, max_files should filter the workitems down to the first N files per dataset, matching what would happen if the metadata were regenerated with that setting.

Suggested approach

Add a filter_by_max_files() utility (similar to the existing filter_by_process() in utils/filters.py) and apply it either:

  • In DatasetMetadataManager._load_existing_metadata()
  • Or in run_processor_workflow() alongside the existing processes filtering

This would allow quick iteration on cached metadata without regenerating it.
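
A minimal sketch of what such a utility could look like. The name filter_by_max_files matches the suggestion above, but the fileset layout is an assumption: dataset names mapping to entries with a "files" container, either a list of paths or a coffea-style {path: treename} dict. The repository's actual workitem structure may differ.

from itertools import islice

def filter_by_max_files(fileset, max_files):
    """Keep only the first max_files files per dataset.

    max_files=None disables filtering, matching the generation-time
    behavior. Assumes each dataset entry stores its files under a
    "files" key (list of paths or {path: treename} mapping).
    """
    if max_files is None:
        return fileset
    filtered = {}
    for dataset, info in fileset.items():
        entry = dict(info)  # shallow copy so the on-disk cache is untouched
        files = info["files"]
        if isinstance(files, dict):
            entry["files"] = dict(islice(files.items(), max_files))
        else:
            entry["files"] = list(files)[:max_files]
        filtered[dataset] = entry
    return filtered

# e.g. applied right after loading the cached metadata:
# fileset = filter_by_max_files(fileset, config["datasets"]["max_files"])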
