[DRAFT] Add-dvc-pipeline #202
base: main
Conversation
Pull Request Overview
This PR introduces a DVC pipeline configuration while refactoring the dataset download and configuration loading logic. Key changes include updating package dependencies, moving dataset download functionality into its own module, and switching configuration loading from YAML to OmegaConf.
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| unix-requirements.txt | Updated dependencies and added new packages required for the DVC pipeline and other functionality. |
| tests/test_dataset_functions.py | Updated import paths to reflect the extraction of download logic to a separate module. |
| src/main.py | Refactored configuration loading and adjusted dataset download and training logic to use OmegaConf. |
| src/download_dataset.py | Introduced a new module for handling dataset downloading and extraction with improved error handling. |
| src/data_preparation.py | Removed redundant download functionality while retaining load_dataset for dataset loading. |
| dvc.yaml | Added the DVC pipeline stages for data collection and model training. |
Truncated `src/config.yaml` diff excerpt:

```yaml
dataset:
model:
training:
  step_size: 4
  oversample:
  # undersample:
  # curriculum_learning:
  # class_weights: [2.028603482052949,
  early_stopping:
  device: "gpu"
```
I sorted requirements and unix-requirements.
This does not download the data if the directory already exists; the directory must be removed to restart the download.
This can now be run separately to just download the data. I want to extract all steps into separate files for clarity and to be able to wrap everything in a DVC pipeline.
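The skip-if-exists behavior described above can be sketched as follows. This is a hypothetical simplification: the real `download_data` in `src/download_dataset.py` takes species folders and config paths and fetches archives from Hugging Face, but the guard logic is the same idea.

```python
import tempfile
from pathlib import Path


def download_data(target_dir: Path) -> bool:
    """Skip the download entirely when the target directory already exists.

    Returns True if a (mock) download ran, False if it was skipped.
    To force a re-download, the directory must be removed first.
    """
    if target_dir.exists():
        return False  # directory present: nothing to do
    target_dir.mkdir(parents=True)
    # ... download and extract archives into target_dir here ...
    return True


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    print(download_data(data_dir))  # True: first run creates the dir and downloads
    print(download_data(data_dir))  # False: directory exists, download skipped
```

Returning early on an existing directory is what makes the step safe to wire into a DVC stage: re-running the pipeline does not re-fetch data that is already on disk.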
Sorted lines
Moved tests to their corresponding files, now that the data download has been extracted into a new file.
Pull Request Overview
Adds a DVC stage for data download, centralizes configuration loading with OmegaConf, and separates dataset downloading into its own module to avoid redundant downloads.
- Introduce `dvc.yaml` with a `data_collection` stage
- Refactor `src/download_dataset.py` for dataset download and extraction
- Update `src/main.py` to use OmegaConf and the new download module; adjust tests accordingly
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| unix-requirements.txt | Updated dependencies to include DVC and OmegaConf-related packages |
| tests/test_download_dataset.py | Adjusted imports and markers to target the new download module |
| tests/test_data_preparation.py | Added tests for load_dataset in data_preparation.py |
| src/main.py | Switched from yaml to OmegaConf, wired in new download script |
| src/download_dataset.py | New module for HF download and zip extraction |
| src/data_preparation.py | Cleaned up data preparation, kept only dataset-loading logic |
| dvc.yaml | Added a data_collection stage; commented out preprocessing stage |
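A `data_collection` stage in `dvc.yaml` could look roughly like the sketch below. The command, dependencies, and output paths here are assumptions for illustration, not the PR's actual contents:

```yaml
stages:
  data_collection:
    cmd: python src/download_dataset.py
    deps:
      - src/download_dataset.py
      - src/config.yaml
    outs:
      - data/
  # preprocessing stage commented out in the PR, to be enabled later
```

Declaring the download script and config as `deps` and the data directory as `outs` lets DVC skip the stage when nothing has changed, complementing the skip-if-exists guard in the download code.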
Comments suppressed due to low confidence (4)
tests/test_data_preparation.py:43
- The dict comprehension is split incorrectly and will cause a syntax error. Collapse the `for idx, species in enumerate(species_folders)` clause onto the same line as the expression:
  `exp_label_map = {merged_labels[species]: idx for idx, species in enumerate(species_folders)}`
src/data_preparation.py:6
- [nitpick] The `main_dir` parameter is used as a `Path` or string, not a `Dict`. Consider updating the type annotation to `Union[Path, str]` for clarity:
  `def load_dataset(main_dir: Dict, species_folders: Dict, splits: Optional[List[str]] = None):`
src/download_dataset.py:47
- [nitpick] The loop variable `filename` actually represents the species key, not a file name. Renaming it to `species` or `species_key` would improve readability:
  `for filename in species_folders:`
src/main.py:141
- The `name=run_name` argument was dropped from `wandb.init`, so runs default to autogenerated names. Add `name=run_name` back to ensure consistency with the WandbLogger:
  `wandb.init(project=wandb_project)`
```python
config = OmegaConf.load("src/config.yaml")

download_data(config.dataset.species_folders,
              config.dataset.main_subfolders, config.dataset.folder)
```
Copilot AI commented on May 22, 2025
Top-level calls to `OmegaConf.load` and `download_data` cause the module to execute on import. Wrap these in an `if __name__ == "__main__":` guard to prevent unintended side effects when importing.
Suggested change:

```python
if __name__ == "__main__":
    config = OmegaConf.load("src/config.yaml")
    download_data(config.dataset.species_folders,
                  config.dataset.main_subfolders, config.dataset.folder)
```
```python
extracted_files = Path(extract_dir).iterdir()
print("Extracted files:")
for extracted_file in list(extracted_files)[:5]:
    print(f"- {extracted_file.stem}")
if len(list(extracted_files)) > 5:
    print(f"... and {len(list(extracted_files)) - 5} more files")
```
Copilot AI commented on May 22, 2025
Converting `iterdir()` to a list twice will consume the generator on the first pass, so the second `len(list(extracted_files))` is always zero. Instead, call `files = list(Path(extract_dir).iterdir())` once and reuse `files` to report counts accurately.
Suggested change:

```python
extracted_files = list(Path(extract_dir).iterdir())
print("Extracted files:")
for extracted_file in extracted_files[:5]:
    print(f"- {extracted_file.stem}")
if len(extracted_files) > 5:
    print(f"... and {len(extracted_files) - 5} more files")
```
Force-pushed from 944cedc to 35b6a52
Commits:
- Add dvc pipeline
- Use OmegaConf to load config
- Extract data download to download_dataset.py and do not download if dir exists