@realmarcin

Summary

Modified the Bakta transform to process multiple dataset subdirectories (e.g., pfas_bakta, cmm_bakta) under data/raw/bakta/, allowing separate genome annotation datasets to be transformed and merged independently.

Changes

1. constants.py (line 38)

  • Changed BAKTA_RAW_DIR from "pfas_bakta/bakta" to "bakta"
  • Now points to top-level bakta directory to support multiple datasets

2. bakta.py

  • Modified run() method to scan for dataset subdirectories
  • Processes each dataset (e.g., pfas_bakta, cmm_bakta) separately
  • Outputs to data/transformed/bakta/[dataset_name]/ for each
  • Uses input_base_dir instead of constant for better testability
  • Added dataset_name parameter to write_output() method
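The per-dataset scan described above can be sketched roughly as follows. This is a minimal illustration and not the code in bakta.py: the helper names `find_datasets` and `output_dir_for` are hypothetical, and the sketch assumes each dataset directory holds a `bakta/` subfolder containing `SAMN*` sample directories, as in the directory structure shown in this PR.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def find_datasets(input_base_dir: Path) -> Iterator[Tuple[str, List[Path]]]:
    """Yield (dataset_name, sample_dirs) for each dataset under input_base_dir.

    Hypothetical helper: assumes the layout <dataset_name>/bakta/SAMN*/.
    """
    for dataset_dir in sorted(p for p in input_base_dir.iterdir() if p.is_dir()):
        bakta_dir = dataset_dir / "bakta"
        if not bakta_dir.is_dir():
            # Skip directories that do not follow the expected layout.
            continue
        samples = sorted(p for p in bakta_dir.glob("SAMN*") if p.is_dir())
        yield dataset_dir.name, samples


def output_dir_for(output_base_dir: Path, dataset_name: str) -> Path:
    """Create and return the per-dataset output directory."""
    out = output_base_dir / dataset_name
    out.mkdir(parents=True, exist_ok=True)
    return out
```

Taking the base directory as an argument, rather than reading a constant inside the function, is what lets a test point the scan at a temporary directory.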

3. merge.yaml (lines 122-123)

  • Updated filename patterns to use wildcards: bakta/*/nodes.tsv
  • Automatically merges all bakta dataset subdirectories

4. test_bakta.py

  • Updated test file path to match new directory structure
  • Now uses: test_dataset/bakta/SAMN_test/SAMN_test.bakta.tsv

5. tests/resources/bakta/

  • Reorganized test files into new directory structure
  • Created test_dataset/bakta/SAMN_test/ hierarchy

Directory Structure

Input:

data/raw/bakta/
├── pfas_bakta/
│   └── bakta/
│       └── SAMN*/
└── cmm_bakta/
    └── bakta/
        └── SAMN*/

Output:

data/transformed/bakta/
├── pfas_bakta/
│   ├── nodes.tsv
│   └── edges.tsv
└── cmm_bakta/
    ├── nodes.tsv
    └── edges.tsv

Merge:

  • All subdirectories merged via wildcard pattern: data/transformed/bakta/*/nodes.tsv
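To check which files a pattern of this shape would pick up, something like the following can help. It is only a sketch under standard glob semantics, where `*` matches exactly one path component. How the merge step itself resolves wildcards is not shown in this PR, and `expand_merge_pattern` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List


def expand_merge_pattern(transformed_dir: Path, filename: str) -> List[Path]:
    """List files matching bakta/*/<filename> under transformed_dir.

    The single '*' matches one directory level, so only files placed
    directly inside a dataset subdirectory are returned.
    """
    return sorted(transformed_dir.glob(f"bakta/*/{filename}"))
```

For example, `expand_merge_pattern(Path("data/transformed"), "nodes.tsv")` would list one `nodes.tsv` per dataset directory, if the layout matches the output structure above.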

Testing

  • All 14 bakta unit tests passing ✅
  • Test directory structure updated to match new format
  • Verified with existing pfas_bakta and cmm_bakta datasets

Impact

  • Breaking change: Existing Bakta raw data must be moved from data/raw/pfas_bakta/ to data/raw/bakta/pfas_bakta/ before the transform is run again
  • Benefit: Allows processing multiple bakta datasets (PFAS, CMM, etc.) in a single pipeline run
  • Merge: Automatically combines all bakta datasets into the final knowledge graph
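For the breaking change above, the move could be scripted along these lines. This is a hedged sketch and is not part of the PR: `migrate_dataset` is a hypothetical helper, and it assumes the old dataset directory sits directly under the raw data directory.

```python
import shutil
from pathlib import Path


def migrate_dataset(raw_dir: Path, dataset_name: str) -> Path:
    """Move raw_dir/<dataset_name> to raw_dir/bakta/<dataset_name>.

    Refuses to overwrite an existing destination.
    """
    src = raw_dir / dataset_name
    dst = raw_dir / "bakta" / dataset_name
    if not src.is_dir():
        raise FileNotFoundError(f"No dataset directory at {src}")
    if dst.exists():
        raise FileExistsError(f"Refusing to overwrite {dst}")
    # Create raw_dir/bakta/ if needed, then move the whole dataset tree.
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    return dst
```

Calling `migrate_dataset(Path("data/raw"), "pfas_bakta")` would relocate the existing PFAS dataset to the location the updated transform scans.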

🤖 Generated with Claude Code

@realmarcin realmarcin merged commit 3355fcf into master Jan 6, 2026
0 of 3 checks passed
@realmarcin realmarcin deleted the bakta branch January 6, 2026 08:09