Unnecessary data download triggered in `register_data` when using non-bio `loader_type`

`bug`, `data-loading`

Hi Popper Team,

I noticed a small issue in the `Popper.register_data` method. Currently, the check for the existence of the `bio_database` directory and the subsequent call to `download_all_data` happens *before* the `loader_type` parameter is evaluated.

```python
# In popper/popper.py -> Popper.register_data
# ...
self.data_path = data_path
# This check runs regardless of loader_type
if not os.path.exists(os.path.join(data_path, 'bio_database')):
    print('It will take a few minutes to download the data for the first time...')
    self.download_all_data()
else:
    print('Data already exists, loading...')

if loader_type == 'bio':
    # ... load bio data ...
elif loader_type == 'custom':
    # ... load custom data ...
# ... etc ...
```

This means that even if a user specifies `loader_type='custom'` or `loader_type='discovery_bench'`, the code still attempts to download the large biological dataset if the `bio_database` directory isn't found in the provided `data_path`. This can lead to unnecessary downloads and delays for users who only intend to use their own local data. The download of the biological dataset should only be triggered if the `loader_type` is set to `'bio'` or `'bio_selected'` (or any other type that explicitly requires this specific dataset).

**Suggested Fix:**

Moving the existence check and the `download_all_data()` call *inside* the conditional blocks for the relevant `loader_type`s would resolve this. For example:

```python
# In popper/popper.py -> Popper.register_data (Proposed Change)
# ...
self.data_path = data_path
# Download check removed from here

if loader_type == 'bio':
    # Check and download only if loader_type is 'bio'
    if not os.path.exists(os.path.join(data_path, 'bio_database')):
        print('It will take a few minutes to download the data for the first time...')
        self.download_all_data()
    else:
        print('Bio data already exists, loading...')
    self.data_loader = ExperimentalDataLoader(...)
elif loader_type == 'bio_selected':
     # Similar check for 'bio_selected'
    if not os.path.exists(os.path.join(data_path, 'bio_database')):
         # ... download ...
    # ...
elif loader_type == 'custom':
    # No download check needed here
    print('Loading custom data...')
    self.data_loader = CustomDataLoader(data_folder=data_path)
# ... etc ...
```

This change would ensure the download only happens when necessary based on the user's specified data loading strategy.

Thanks for considering this improvement!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unnecessary data download triggered in `register_data` when using non-bio `loader_type` #6

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Unnecessary data download triggered in register_data when using non-bio loader_type #6

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Unnecessary data download triggered in `register_data` when using non-bio `loader_type` #6