
Conversation


@rojberr rojberr commented May 22, 2025

Add dvc pipeline

Use OmegaConf to load config

Extract the data download into download_dataset.py and skip the download when the target directory already exists

@GHOST-Science-Club GHOST-Science-Club deleted a comment from github-actions bot May 22, 2025
@rojberr rojberr requested a review from Copilot May 22, 2025 16:22

Copilot AI left a comment


Pull Request Overview

This PR introduces a DVC pipeline configuration while refactoring the dataset download and configuration loading logic. Key changes include updating package dependencies, moving dataset download functionality into its own module, and switching configuration loading from YAML to OmegaConf.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Summary per file:
• unix-requirements.txt: Updated dependencies and added new packages required for the DVC pipeline and other functionalities.
• tests/test_dataset_functions.py: Updated import paths to reflect the extraction of download logic to a separate module.
• src/main.py: Refactored configuration loading and adjusted dataset download and training logic to use OmegaConf.
• src/download_dataset.py: Introduced a new module for handling dataset downloading and extraction with improved error handling.
• src/data_preparation.py: Removed redundant download functionality while retaining load_dataset for dataset loading.
• dvc.yaml: Added the DVC pipeline stages for data collection and model training.
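For readers unfamiliar with DVC, a pipeline of the kind this PR introduces might be wired roughly as follows. The stage names, commands, and paths below are assumptions inferred from the file summary, not the PR's actual dvc.yaml:

```yaml
stages:
  data_collection:
    cmd: python src/download_dataset.py   # downloads and extracts the species zips
    deps:
      - src/download_dataset.py
      - src/config.yaml
    outs:
      - src/data                          # DVC-tracked dataset directory
  training:
    cmd: python src/main.py
    deps:
      - src/main.py
      - src/data
```

With such a file, `dvc repro` re-runs a stage only when its declared deps change, which complements the skip-if-exists logic in the download module.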

@github-actions

dataset:
  folder: "src/data"
  species_folders:
    Abies_alba: "data/imagery-Abies_alba.zip"
    Abies_nordmanniana: "data/imagery-Abies_nordmanniana.zip"
    Castanea_sativa: "data/imagery-Castanea_sativa.zip"
    Fagus_sylvatica: "data/imagery-Fagus_sylvatica.zip"
    Larix_decidua: "data/imagery-Larix_decidua.zip"
    Picea_abies: "data/imagery-Picea_abies.zip"
    Pinus_halepensis: "data/imagery-Pinus_halepensis.zip"
    Pinus_nigra: "data/imagery-Pinus_nigra.zip"
    Pinus_nigra_laricio: "data/imagery-Pinus_nigra_laricio.zip"
    Pinus_pinaster: "data/imagery-Pinus_pinaster.zip"
    Pinus_sylvestris: "data/imagery-Pinus_sylvestris.zip"
    Pseudotsuga_menziesii: "data/imagery-Pseudotsuga_menziesii.zip"
    Quercus_ilex: "data/imagery-Quercus_ilex.zip"
    Quercus_petraea: "data/imagery-Quercus_petraea.zip"
    Quercus_pubescens: "data/imagery-Quercus_pubescens.zip"
    Quercus_robur: "data/imagery-Quercus_robur.zip"
    Quercus_rubra: "data/imagery-Quercus_rubra.zip"
    Robinia_pseudoacacia: "data/imagery-Robinia_pseudoacacia.zip"
  main_subfolders:
    aerial_imagery: "imagery/"
    lidar: "lidar/"

model:
  name: "fine_grained" # currently supporting resnet18, vit and inception_v3

training:
  batch_size: 32
  learning_rate: 0.0001
  max_epochs: 100
  freeze: true
  weight_decay: 0.0001

  step_size: 4
  gamma: 0.1

  oversample:
    oversample_factor: 4
    oversample_threshold: 1000

  # undersample:
  #   target_size: 530

  # curriculum_learning:
  #   initial_ratio: 2
  #   step_size: 1
  #   class_order: [10, 11, 5, 7, 9, 1, 12, 0, 2, 3, 6, 4, 8] # Based on decreasing IoU

  # class_weights: [2.028603482052949,
  #                 1.9149570077824503,
  #                 2.3698711832307096,
  #                 2.7918140711618267,
  #                 8.404431999123624,
  #                 1.4891439907690158,
  #                 2.8278190246173205,
  #                 1.559603179364982,
  #                 8.968666793195208,
  #                 1.750924051756126,
  #                 1.4114322619818822,
  #                 1.4826886210799306,
  #                 2.025711256102825] # Weights calculated using log2((1/IoU)+1)
  
  dataloader:
    auto: true
    num_workers: 0
    pin_memory: false
    persistent_workers: false

  early_stopping:
    apply: true
    monitor: "val_loss"
    patience: 3
    mode: "min"

device: "gpu"

Collaborator Author

@rojberr rojberr May 22, 2025


I sorted the requirements and unix-requirements.txt files.

Collaborator Author


This skips the download when the directory already exists, so you need to remove the directory to restart it.

The download can now be run separately. I want to extract all the steps into separate files to gain clarity and be able to wrap everything in a DVC pipeline.

Collaborator Author


Sorted lines

Collaborator Author


Moved the tests into their corresponding files now that the download logic has been extracted into its own module.

@rojberr rojberr requested a review from Copilot May 22, 2025 17:15

Copilot AI left a comment


Pull Request Overview

Adds a DVC stage for data download, centralizes configuration loading with OmegaConf, and separates dataset downloading into its own module to avoid redundant downloads.

  • Introduce dvc.yaml with a data_collection stage
  • Refactor src/download_dataset.py for dataset download and extraction
  • Update src/main.py to use OmegaConf and new download module; adjust tests accordingly

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Summary per file:
• unix-requirements.txt: Updated dependencies to include DVC and OmegaConf-related packages
• tests/test_download_dataset.py: Adjusted imports and markers to target the new download module
• tests/test_data_preparation.py: Added tests for load_dataset in data_preparation.py
• src/main.py: Switched from yaml to OmegaConf, wired in new download script
• src/download_dataset.py: New module for HF download and zip extraction
• src/data_preparation.py: Cleaned up data preparation, kept only dataset-loading logic
• dvc.yaml: Added a data_collection stage; commented out preprocessing stage
Comments suppressed due to low confidence (4)

tests/test_data_preparation.py:43

  • The dict comprehension is split incorrectly and will cause a syntax error. Collapse the for idx, species in enumerate(species_folders) clause onto the same line as the expression.
    exp_label_map = {merged_labels[species]: idx for idx, species in enumerate(species_folders)}

src/data_preparation.py:6

  • [nitpick] The main_dir parameter is used as a Path or string, not a Dict. Consider updating the type annotation to Union[Path, str] for clarity.
def load_dataset(main_dir: Dict, species_folders: Dict, splits: Optional[List[str]] = None):

src/download_dataset.py:47

  • [nitpick] The loop variable filename actually represents the species key, not a file name. Renaming it to species or species_key would improve readability.
for filename in species_folders:

src/main.py:141

  • The name=run_name argument was dropped from wandb.init, so runs default to autogenerated names. Add name=run_name back to ensure consistency with the WandbLogger.
    wandb.init(project=wandb_project)

Comment on lines +75 to +78
config = OmegaConf.load("src/config.yaml")

download_data(config.dataset.species_folders,
              config.dataset.main_subfolders, config.dataset.folder)

Copilot AI May 22, 2025


Top-level calls to OmegaConf.load and download_data cause the module to execute on import. Wrap these in an if __name__ == "__main__": guard to prevent unintended side effects when importing.

Suggested change
config = OmegaConf.load("src/config.yaml")
download_data(config.dataset.species_folders,
              config.dataset.main_subfolders, config.dataset.folder)
if __name__ == "__main__":
    config = OmegaConf.load("src/config.yaml")
    download_data(config.dataset.species_folders,
                  config.dataset.main_subfolders, config.dataset.folder)

Comment on lines +12 to +17
extracted_files = Path(extract_dir).iterdir()
print("Extracted files:")
for extracted_file in list(extracted_files)[:5]:
    print(f"- {extracted_file.stem}")
if len(list(extracted_files)) > 5:
    print(f"... and {len(list(extracted_files)) - 5} more files")

Copilot AI May 22, 2025


Converting iterdir() to a list twice will consume the generator on first pass. Instead, call files = list(Path(extract_dir).iterdir()) once and reuse files to report counts accurately.

Suggested change
extracted_files = Path(extract_dir).iterdir()
print("Extracted files:")
for extracted_file in list(extracted_files)[:5]:
    print(f"- {extracted_file.stem}")
if len(list(extracted_files)) > 5:
    print(f"... and {len(list(extracted_files)) - 5} more files")
extracted_files = list(Path(extract_dir).iterdir())
print("Extracted files:")
for extracted_file in extracted_files[:5]:
    print(f"- {extracted_file.stem}")
if len(extracted_files) > 5:
    print(f"... and {len(extracted_files) - 5} more files")

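The generator-exhaustion pitfall flagged above is easy to demonstrate in isolation with the standard library alone:

```python
from pathlib import Path
import tempfile

# Create a throwaway directory with two files in it.
tmp = Path(tempfile.mkdtemp())
for name in ("a.txt", "b.txt"):
    (tmp / name).touch()

entries = tmp.iterdir()       # Path.iterdir() returns a one-shot generator
first_pass = list(entries)    # consumes every entry
second_pass = list(entries)   # generator is now exhausted

print(len(first_pass))   # 2
print(len(second_pass))  # 0 - the second list() sees nothing
```

This is why the original code's `len(list(extracted_files))` check always sees zero entries after the loop has consumed the generator; materializing the listing once, as in the suggested change, fixes the count.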
@rojberr rojberr force-pushed the Add-dvc-pipeline branch from 3733280 to 243dd8a May 22, 2025 17:58
@rojberr rojberr force-pushed the Add-dvc-pipeline branch 3 times, most recently from 944cedc to 35b6a52 May 22, 2025 18:18
@rojberr rojberr force-pushed the Add-dvc-pipeline branch from 35b6a52 to 79e6cc2 May 22, 2025 18:22
@rojberr rojberr marked this pull request as draft June 2, 2025 13:56
@rojberr rojberr changed the title Add-dvc-pipeline [DRAFT] Add-dvc-pipeline Jun 2, 2025