[DRAFT] Add-dvc-pipeline #202
base: main
Conversation
Pull Request Overview
This PR introduces a DVC pipeline configuration while refactoring the dataset download and configuration loading logic. Key changes include updating package dependencies, moving dataset download functionality into its own module, and switching configuration loading from YAML to OmegaConf.
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| unix-requirements.txt | Updated dependencies and added new packages required for the DVC pipeline and other functionality. |
| tests/test_dataset_functions.py | Updated import paths to reflect the extraction of download logic to a separate module. |
| src/main.py | Refactored configuration loading and adjusted dataset download and training logic to use OmegaConf. |
| src/download_dataset.py | Introduced a new module for handling dataset downloading and extraction with improved error handling. |
| src/data_preparation.py | Removed redundant download functionality while retaining load_dataset for dataset loading. |
| dvc.yaml | Added the DVC pipeline stages for data collection and model training. |
Truncated `src/config.yaml` diff excerpt:

```yaml
dataset:
model:
training:
  step_size: 4
  oversample:
  # undersample:
  # curriculum_learning:
  # class_weights: [2.028603482052949,
  early_stopping:
  device: "gpu"
```
I sorted requirements and unix-requirements.
This does not download the data if the directory already exists; the directory must be removed to restart the download.
This can now be run separately to just download the data. I want to extract all steps into separate files for clarity and to be able to wrap everything in a DVC pipeline.
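The skip-if-exists behavior described above can be sketched as follows. This is a hypothetical simplification: the real `download_data` in `src/download_dataset.py` takes species folders and config paths and fetches archives from Hugging Face, but the guard logic is the same idea.

```python
import tempfile
from pathlib import Path


def download_data(target_dir: Path) -> bool:
    """Skip the download entirely when the target directory already exists.

    Returns True if a (mock) download ran, False if it was skipped.
    To force a re-download, the directory must be removed first.
    """
    if target_dir.exists():
        return False  # directory present: nothing to do
    target_dir.mkdir(parents=True)
    # ... download and extract archives into target_dir here ...
    return True


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    print(download_data(data_dir))  # True: first run creates the dir and downloads
    print(download_data(data_dir))  # False: directory exists, download skipped
```

Returning early on an existing directory is what makes the step safe to wire into a DVC stage: re-running the pipeline does not re-fetch data that is already on disk.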
Sorted lines
Moved tests to their corresponding files, now that the data download has been extracted into a new file.
Pull Request Overview
Adds a DVC stage for data download, centralizes configuration loading with OmegaConf, and separates dataset downloading into its own module to avoid redundant downloads.
- Introduce `dvc.yaml` with a `data_collection` stage
- Refactor `src/download_dataset.py` for dataset download and extraction
- Update `src/main.py` to use OmegaConf and the new download module; adjust tests accordingly
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| unix-requirements.txt | Updated dependencies to include DVC and OmegaConf-related packages |
| tests/test_download_dataset.py | Adjusted imports and markers to target the new download module |
| tests/test_data_preparation.py | Added tests for load_dataset in data_preparation.py |
| src/main.py | Switched from yaml to OmegaConf, wired in new download script |
| src/download_dataset.py | New module for HF download and zip extraction |
| src/data_preparation.py | Cleaned up data preparation, kept only dataset-loading logic |
| dvc.yaml | Added a data_collection stage; commented out preprocessing stage |
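A `data_collection` stage in `dvc.yaml` could look roughly like the sketch below. The command, dependencies, and output paths here are assumptions for illustration, not the PR's actual contents:

```yaml
stages:
  data_collection:
    cmd: python src/download_dataset.py
    deps:
      - src/download_dataset.py
      - src/config.yaml
    outs:
      - data/
  # preprocessing stage commented out in the PR, to be enabled later
```

Declaring the download script and config as `deps` and the data directory as `outs` lets DVC skip the stage when nothing has changed, complementing the skip-if-exists guard in the download code.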
Comments suppressed due to low confidence (4)
tests/test_data_preparation.py:43
- The dict comprehension is split incorrectly and will cause a syntax error. Collapse the `for idx, species in enumerate(species_folders)` clause onto the same line as the expression:
  `exp_label_map = {merged_labels[species]: idx for idx, species in enumerate(species_folders)}`
src/data_preparation.py:6
- [nitpick] The `main_dir` parameter is used as a `Path` or string, not a `Dict`. Consider updating the type annotation to `Union[Path, str]` for clarity:
  `def load_dataset(main_dir: Dict, species_folders: Dict, splits: Optional[List[str]] = None):`
src/download_dataset.py:47
- [nitpick] The loop variable `filename` actually represents the species key, not a file name. Renaming it to `species` or `species_key` would improve readability:
  `for filename in species_folders:`
src/main.py:141
- The `name=run_name` argument was dropped from `wandb.init`, so runs default to autogenerated names. Add `name=run_name` back to ensure consistency with the WandbLogger:
  `wandb.init(project=wandb_project)`
```python
config = OmegaConf.load("src/config.yaml")

download_data(config.dataset.species_folders,
              config.dataset.main_subfolders, config.dataset.folder)
```
Copilot AI commented on May 22, 2025
Top-level calls to `OmegaConf.load` and `download_data` cause the module to execute on import. Wrap these in an `if __name__ == "__main__":` guard to prevent unintended side effects when importing.
Suggested change:

```python
if __name__ == "__main__":
    config = OmegaConf.load("src/config.yaml")
    download_data(config.dataset.species_folders,
                  config.dataset.main_subfolders, config.dataset.folder)
```
```python
extracted_files = Path(extract_dir).iterdir()
print("Extracted files:")
for extracted_file in list(extracted_files)[:5]:
    print(f"- {extracted_file.stem}")
if len(list(extracted_files)) > 5:
    print(f"... and {len(list(extracted_files)) - 5} more files")
```
Copilot AI commented on May 22, 2025
Converting `iterdir()` to a list twice will consume the generator on the first pass, so the second `len(list(extracted_files))` is always zero. Instead, call `files = list(Path(extract_dir).iterdir())` once and reuse `files` to report counts accurately.
Suggested change:

```python
extracted_files = list(Path(extract_dir).iterdir())
print("Extracted files:")
for extracted_file in extracted_files[:5]:
    print(f"- {extracted_file.stem}")
if len(extracted_files) > 5:
    print(f"... and {len(extracted_files) - 5} more files")
```
Force-pushed from 944cedc to 35b6a52
Commits:
- Add dvc pipeline
- Use OmegaConf to load config
- Extract data download to download_dataset.py and do not download if dir exists