
Bacdive transform progress bar misleadingly references YAML files #476

@turbomam


Summary

The bacdive transform progress bar displays "Processing BacDive file: 24376.yaml", which is misleading: it is actually processing records from a single monolithic JSON file (bacdive_strains.json), not individual YAML files.

Current Behavior

Processing BacDive file: 24376.yaml:  25%| | 24376/99393 [17:01<59:57, 20.85it/s]

This suggests it's reading 99,393 separate YAML files, which is confusing when trying to understand the data pipeline.

Actual Data Flow

  1. The transform reads a single file: data/raw/bacdive_strains.json (748 MB)
  2. This JSON contains ~99k-112k strain records as array elements
  3. Each array element is processed as one "item" in the progress bar
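
The data flow above can be sketched roughly as follows. This is a minimal illustration, not the code in bacdive.py: the function name iter_strain_records is hypothetical, the file is assumed to have a top-level JSON array, and indices are assumed to start at 0.

```python
import json
from pathlib import Path
from typing import Any, Iterator, Tuple


def iter_strain_records(json_path: Path) -> Iterator[Tuple[int, Any]]:
    """Yield (index, record) pairs from one JSON file holding an array of records.

    Hypothetical helper: the real transform may load and iterate differently.
    """
    with json_path.open(encoding="utf-8") as handle:
        # The whole file is parsed in one call; no per-strain files are read.
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array of strain records")
    # Each array element corresponds to one "item" in the progress bar.
    for index, record in enumerate(records):
        yield index, record
```

Under these assumptions, the index shown in the progress bar is an array position within one file, not the name of a file on disk.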

Code Location

bacdive.py line 1980:

progress.set_description(f"Processing BacDive file: {str(index)}.yaml")

This appears to be a leftover from a debugging feature (lines 996-1003, commented out) that could optionally dump each strain to individual YAML files for inspection.

Suggested Fix

Change line 1980 to something more accurate, for example (key here stands for the strain identifier; the existing code uses index):

progress.set_description(f"Processing BacDive strain: {key}")

Or simply:

progress.set_description("Processing BacDive strains")

This would make the progress output clearer:

Processing BacDive strains:  25%| | 24376/99393 [17:01<59:57, 20.85it/s]
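
A minimal sketch of the simpler variant, assuming the description is set once before the loop. The names process_strains and ProgressLike are hypothetical and not taken from bacdive.py; the progress argument only needs set_description and update methods, which a tqdm progress bar provides.

```python
from typing import Any, Iterable, Protocol


class ProgressLike(Protocol):
    """The two progress-bar methods this sketch relies on (tqdm offers both)."""

    def set_description(self, desc: str) -> None: ...

    def update(self, n: int = 1) -> None: ...


def process_strains(records: Iterable[Any], progress: ProgressLike) -> int:
    """Process each strain record and advance the progress bar by one per record."""
    # A static label: nothing in it suggests per-strain YAML files.
    progress.set_description("Processing BacDive strains")
    count = 0
    for _record in records:
        # The real per-strain transform work would happen here.
        count += 1
        progress.update(1)
    return count
```

With tqdm this could be called as process_strains(records, tqdm(total=len(records))). Setting the description once, rather than on every iteration, also avoids redrawing the label for each record.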

Impact

Low: cosmetic/UX issue only. It does not affect the correctness of the transform.
