-
Notifications
You must be signed in to change notification settings - Fork 27
Description
bug, data-loading
Hi Popper Team,
I noticed a small issue in the Popper.register_data method. Currently, the check for the existence of the bio_database directory and the subsequent call to download_all_data happens before the loader_type parameter is evaluated.
# In popper/popper.py -> Popper.register_data
# ...
self.data_path = data_path
# This check runs regardless of loader_type
if not os.path.exists(os.path.join(data_path, 'bio_database')):
print('It will take a few minutes to download the data for the first time...')
self.download_all_data()
else:
print('Data already exists, loading...')
if loader_type == 'bio':
# ... load bio data ...
elif loader_type == 'custom':
# ... load custom data ...
# ... etc ...This means that even if a user specifies loader_type='custom' or loader_type='discovery_bench', the code still attempts to download the large biological dataset if the bio_database directory isn't found in the provided data_path. This can lead to unnecessary downloads and delays for users who only intend to use their own local data. The download of the biological dataset should only be triggered if the loader_type is set to 'bio' or 'bio_selected' (or any other type that explicitly requires this specific dataset).
Suggested Fix:
Moving the existence check and the download_all_data() call inside the conditional blocks for the relevant loader_types would resolve this. For example:
# In popper/popper.py -> Popper.register_data (Proposed Change)
# ...
self.data_path = data_path
# Download check removed from here
if loader_type == 'bio':
# Check and download only if loader_type is 'bio'
if not os.path.exists(os.path.join(data_path, 'bio_database')):
print('It will take a few minutes to download the data for the first time...')
self.download_all_data()
else:
print('Bio data already exists, loading...')
self.data_loader = ExperimentalDataLoader(...)
elif loader_type == 'bio_selected':
# Similar check for 'bio_selected'
if not os.path.exists(os.path.join(data_path, 'bio_database')):
# ... download ...
# ...
elif loader_type == 'custom':
# No download check needed here
print('Loading custom data...')
self.data_loader = CustomDataLoader(data_folder=data_path)
# ... etc ...This change would ensure the download only happens when necessary based on the user's specified data loading strategy.
Thanks for considering this improvement!