Implement a Prefect flow to combine a list of zarr files and convert to NetCDF #2

@beatfactor

Prefect 3.0 Flow: Convert Zarr Files to NetCDF4 and Store in Blob Storage

Implement a Prefect flow that processes and concatenates xarray datasets stored as Zarr stores in a specified Azure Blob Storage container. The flow will:

  1. Identify individual file IDs from the container directory structure.
  2. Open both normal and denoised Zarr files using existing functions.
  3. Retrieve additional metadata from the database for each file ID.
  4. Concatenate the datasets while preserving metadata.
  5. Convert the concatenated dataset to NetCDF4 format.
  6. Optionally create echograms of the concatenated dataset (both normal and denoised) and compute MVBS and NASC.
  7. Store the NetCDF4 file in an output Blob Storage container and generate an access link.

The flow will be served as a deployment with the following parameters:

load_and_process_files.serve(
    name='convert-to-netcdf',
    parameters={
        'cruise_id': 'example_cruise',
        'load_from_blobstorage': True,
        'get_list_from_db': False,
        'start_datetime': None,
        'end_datetime': None,
        'source_container': 'input-zarr-container',
        'save_to_blobstorage': True,
        'output_container': 'output-netcdf-container',
        'save_to_directory': False,
        'output_directory': '',
        'plot_echograms': False,
        'compute_nasc': False,
        'compute_mvbs': False,
        'chunks_ping_time': 500,
        'chunks_range_sample': 500,
        'batch_size': BATCH_SIZE
    }
)
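Several of these parameters depend on each other (for example, `save_to_blobstorage` only makes sense with an `output_container`). A minimal sketch of a pre-flight check is shown below. The function name `validate_parameters` and the specific rules are assumptions for illustration, not part of the existing code base; the actual flow may enforce different constraints.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def validate_parameters(params: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper: the rules below are assumptions about how the
    flags relate to each other, not documented behaviour of the flow.
    """
    problems: List[str] = []

    if not params.get("cruise_id"):
        problems.append("cruise_id is required")

    # Each boolean switch needs its companion location to be set.
    if params.get("load_from_blobstorage") and not params.get("source_container"):
        problems.append("source_container is required when load_from_blobstorage is True")
    if params.get("save_to_blobstorage") and not params.get("output_container"):
        problems.append("output_container is required when save_to_blobstorage is True")
    if params.get("save_to_directory") and not params.get("output_directory"):
        problems.append("output_directory is required when save_to_directory is True")

    # An optional time window must not be inverted.
    start: Optional[datetime] = params.get("start_datetime")
    end: Optional[datetime] = params.get("end_datetime")
    if start is not None and end is not None and start > end:
        problems.append("start_datetime must not be after end_datetime")

    # Chunk sizes must be positive when given.
    for key in ("chunks_ping_time", "chunks_range_sample"):
        value = params.get(key)
        if value is not None and value <= 0:
            problems.append(f"{key} must be a positive integer")

    return problems
```

Calling this at the top of the flow and raising on a non-empty result would fail fast before any blob storage or database access.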

Workflow Steps

1. Retrieve List of File IDs from Container

  • List all folders under {cruise_id}/
  • Extract {individual_file_id} from folder names.
  • Check that both {individual_file_id}.zarr and {individual_file_id_denoised}.zarr are present for each file ID.
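The listing step above can be sketched on plain blob names, independent of the Azure SDK. This sketch assumes a layout of `{cruise_id}/{file_id}/{file_id}.zarr/...` and a `_denoised` suffix for the denoised store; both are assumptions, as are the helper names `find_zarr_stores` and `complete_file_ids`. Adjust them to the real container structure.

```python
from typing import Dict, Iterable, List, Set

DENOISED_SUFFIX = "_denoised"  # assumption: naming of the denoised store


def find_zarr_stores(blob_names: Iterable[str], cruise_id: str) -> Dict[str, Set[str]]:
    """Map each file ID to the kinds of Zarr store found for it.

    Assumed layout (hypothetical): {cruise_id}/{file_id}/{file_id}.zarr/...
    and {cruise_id}/{file_id}/{file_id}_denoised.zarr/...
    """
    prefix = f"{cruise_id}/"
    found: Dict[str, Set[str]] = {}
    for name in blob_names:
        if not name.startswith(prefix):
            continue  # blob belongs to another cruise
        parts = name[len(prefix):].split("/")
        if len(parts) < 2:
            continue  # not inside a per-file folder
        file_id, store = parts[0], parts[1]
        if store == f"{file_id}.zarr":
            found.setdefault(file_id, set()).add("normal")
        elif store == f"{file_id}{DENOISED_SUFFIX}.zarr":
            found.setdefault(file_id, set()).add("denoised")
    return found


def complete_file_ids(stores: Dict[str, Set[str]]) -> List[str]:
    """Return the sorted file IDs that have both a normal and a denoised store."""
    return sorted(fid for fid, kinds in stores.items() if kinds == {"normal", "denoised"})
```

File IDs missing one of the two stores can then be logged or skipped, depending on how strict the flow should be.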

2. Retrieve Metadata from Database

  • Extend FileSegmentService to fetch metadata for each file, including:
    • location
    • file_name
    • id
    • location_data
    • file_freqs
    • file_start_time
    • file_end_time
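One way to carry these fields through the flow is a small typed record. The sketch below is illustrative only: the class name `FileMetadata`, the field types, and the helper `metadata_from_row` are assumptions, and the real `FileSegmentService` may return a different structure.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class FileMetadata:
    """Per-file metadata listed in the issue; field types are assumptions."""
    id: int
    file_name: str
    location: Optional[str]
    location_data: Any
    file_freqs: Optional[str]
    file_start_time: Optional[datetime]
    file_end_time: Optional[datetime]


def metadata_from_row(row: Mapping[str, Any]) -> FileMetadata:
    """Build a FileMetadata from a database row given as a mapping.

    Raises KeyError naming the missing columns, so a schema mismatch
    surfaces early instead of as an AttributeError later in the flow.
    """
    names = [f.name for f in fields(FileMetadata)]
    missing = [n for n in names if n not in row]
    if missing:
        raise KeyError(f"missing metadata columns: {', '.join(missing)}")
    return FileMetadata(**{n: row[n] for n in names})
```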

3. Load Zarr Datasets

  • Use open_zarr_store() to lazily load both normal and denoised datasets.

4. Concatenate Zarr Datasets

  • Call concatenate_zarr_files() to merge all datasets while keeping metadata.
  • Ensure datasets are rechunked appropriately.
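Before concatenating, it may help to order the inputs chronologically and split them into groups of `batch_size`. The sketch below assumes that concatenation happens along ping time and that `batch_size` limits how many stores are combined at once; neither is stated in the issue, and the helper names are hypothetical.

```python
from datetime import datetime
from typing import Iterator, List, Sequence, Tuple


def order_by_start_time(files: Sequence[Tuple[str, datetime]]) -> List[str]:
    """Return file IDs sorted by their start time (earliest first).

    Each input item is a (file_id, file_start_time) pair.
    """
    return [file_id for file_id, _ in sorted(files, key=lambda item: item[1])]


def batched(ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size IDs, preserving order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])
```

Each batch could then be opened lazily, concatenated, and rechunked with the `chunks_ping_time` and `chunks_range_sample` values before the partial results are combined.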

5. Convert to NetCDF4

  • Use save_dataset_to_netcdf() to convert the dataset.

6. Upload to Output Container

  • Store the NetCDF4 file in output_container.
  • Generate an access link via generate_container_access_url().
