Skip to content

Unnecessary data download triggered in register_data when using non-bio loader_type #6

@jinyoungkim927

Description

@jinyoungkim927

bug, data-loading

Hi Popper Team,

I noticed a small issue in the Popper.register_data method. Currently, the check for the existence of the bio_database directory and the subsequent call to download_all_data happens before the loader_type parameter is evaluated.

# In popper/popper.py -> Popper.register_data
# ...
self.data_path = data_path
# This check runs regardless of loader_type
if not os.path.exists(os.path.join(data_path, 'bio_database')):
    print('It will take a few minutes to download the data for the first time...')
    self.download_all_data()
else:
    print('Data already exists, loading...')

if loader_type == 'bio':
    # ... load bio data ...
elif loader_type == 'custom':
    # ... load custom data ...
# ... etc ...

This means that even if a user specifies loader_type='custom' or loader_type='discovery_bench', the code still attempts to download the large biological dataset if the bio_database directory isn't found in the provided data_path. This can lead to unnecessary downloads and delays for users who only intend to use their own local data. The download of the biological dataset should only be triggered if the loader_type is set to 'bio' or 'bio_selected' (or any other type that explicitly requires this specific dataset).

Suggested Fix:

Moving the existence check and the download_all_data() call inside the conditional blocks for the relevant loader_types would resolve this. For example:

# In popper/popper.py -> Popper.register_data (Proposed Change)
# ...
self.data_path = data_path
# Download check removed from here

if loader_type == 'bio':
    # Check and download only if loader_type is 'bio'
    if not os.path.exists(os.path.join(data_path, 'bio_database')):
        print('It will take a few minutes to download the data for the first time...')
        self.download_all_data()
    else:
        print('Bio data already exists, loading...')
    self.data_loader = ExperimentalDataLoader(...)
elif loader_type == 'bio_selected':
     # Similar check for 'bio_selected'
    if not os.path.exists(os.path.join(data_path, 'bio_database')):
         # ... download ...
    # ...
elif loader_type == 'custom':
    # No download check needed here
    print('Loading custom data...')
    self.data_loader = CustomDataLoader(data_folder=data_path)
# ... etc ...

This change would ensure the download only happens when necessary based on the user's specified data loading strategy.

Thanks for considering this improvement!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions