Description
Summary
The BacDive transform progress bar displays "Processing BacDive file: 24376.yaml", which is misleading: it is actually processing records from a single monolithic JSON file (bacdive_strains.json), not individual YAML files.
Current Behavior
Processing BacDive file: 24376.yaml: 25%| | 24376/99393 [17:01<59:57, 20.85it/s]
This suggests it's reading 99,393 separate YAML files, which is confusing when trying to understand the data pipeline.
Actual Data Flow
- The transform reads a single file: data/raw/bacdive_strains.json (748 MB)
- This JSON contains ~99k-112k strain records as array elements
- Each array element is processed as one "item" in the progress bar
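To pin down the record count (the range above is approximate, while the progress bar reports a total of 99393), the top-level elements can be counted directly. This is a minimal sketch, not code from bacdive.py: the helper name count_strain_records is hypothetical, and it assumes the file parses as a single top-level JSON container.

```python
import json
from pathlib import Path


def count_strain_records(path: Path) -> int:
    """Return the number of top-level records in a JSON file.

    Hypothetical helper for illustration; not part of bacdive.py.
    """
    # json.load reads the whole document into memory, which is simple
    # but heavy for a file of several hundred megabytes.
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    # len() counts the elements of a top-level array (or the keys of an object).
    return len(data)
```

Calling count_strain_records(Path("data/raw/bacdive_strains.json")) should match the total shown in the progress bar if each element is one progress item.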
Code Location
bacdive.py line 1980:
progress.set_description(f"Processing BacDive file: {str(index)}.yaml")
This appears to be a leftover from a debugging feature (lines 996-1003, commented out) that could optionally dump each strain to individual YAML files for inspection.
Suggested Fix
Change line 1980 to something more accurate:
progress.set_description(f"Processing BacDive strain: {key}")
Or simply:
progress.set_description("Processing BacDive strains")
This would make the progress output clearer:
Processing BacDive strains: 25%| | 24376/99393 [17:01<59:57, 20.85it/s]
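A minimal sketch of the static-label variant follows. It is not the actual loop in bacdive.py: process_strains and ProgressLike are hypothetical names, and the per-strain work is elided. It assumes only that the progress object exposes set_description and update, as a tqdm bar does.

```python
from typing import Any, Iterable, Protocol


class ProgressLike(Protocol):
    """The subset of the tqdm interface this sketch relies on."""

    def set_description(self, desc: str) -> None: ...

    def update(self, n: int = 1) -> None: ...


def process_strains(records: Iterable[Any], progress: ProgressLike) -> int:
    """Process each strain record and return how many were handled."""
    # Set the label once, before the loop, so it never implies
    # that each item is a separate file.
    progress.set_description("Processing BacDive strains")
    processed = 0
    for _record in records:
        # The per-strain transform would go here (elided in this sketch).
        processed += 1
        progress.update(1)
    return processed
```

With tqdm installed, a bar created as tqdm(total=len(records)) could be passed as progress. Setting the description once also avoids reformatting the label on every iteration.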
Impact
Low - cosmetic/UX issue only. Does not affect correctness of the transform.