VS-1779 clean up parquet files immediatement #9338
Conversation
Pull request overview
Adds an option to automatically clean up Parquet artifacts produced during the VariantStore import workflow, aiming to reduce storage usage after a successful load.
Changes:
- Introduces `delete_parquet_files_after_loading` (default `true`) to control Parquet cleanup post-load.
- Adds a `DeleteParquetFiles` WDL task and wires it into `GvsImportGenomes`.
- Updates Dockstore workflow branch filters.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| scripts/variantstore/wdl/GvsImportGenomes.wdl | Adds a new workflow input and a cleanup task intended to delete staged Parquet files after loading. |
| scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl | Adds a stray comment line near imports. |
| .dockstore.yml | Adds a branch filter entry for the new branch name. |
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
    runtime {
      docker: cloud_sdk_docker
      memory: "3 GB"
      disks: "local-disk 500 HDD"
The task allocates 500 GB of disk space ("local-disk 500 HDD"), which seems excessive for a task that only lists and deletes GCS objects. The script only creates small text files locally (parquet_dirs.txt). Consider reducing the disk allocation to something more reasonable like 10-20 GB, consistent with similar tasks like ConfigureParquetLifecycle which uses 10 GB.
Suggested change:
-      disks: "local-disk 500 HDD"
+      disks: "local-disk 10 HDD"
    # List the contents of the vet and ref_ranges directories for deletion later
    echo "Listing directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/ for deletion..."
    gcloud storage ls ~{"--billing-project " + billing_project_id} \
The command uses double spacing between "ls" and the billing_project_id flag. While this may work, it's inconsistent with line 1256 which has single spacing. Consider using consistent spacing for better code style.
Suggested change:
-    gcloud storage ls  ~{"--billing-project " + billing_project_id} \
+    gcloud storage ls ~{"--billing-project " + billing_project_id} \
    gcloud storage ls ~{"--billing-project " + billing_project_id} \
        "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt || true
    # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
    echo "Deleting objects in vet and ref_ranges directories..."
    while IFS= read -r gcs_path; do
      if [ -n "$gcs_path" ]; then
        echo "Deleting objects in: $gcs_path"
        gcloud storage rm ~{"--billing-project " + billing_project_id} \
          "${gcs_path}**" --recursive || echo "Warning: Failed to delete $gcs_path"
The script lists directories using "gcloud storage ls", which outputs directory paths ending with "/", but then attempts to delete "${gcs_path}**" recursively. Because gcs_path retains its trailing slash, the resulting pattern may not match the intended files. Additionally, the "|| true" on line 1249 will suppress any listing errors, potentially leading to silent failures if the directories don't exist. Consider handling the case where directories might not exist more explicitly.
Suggested change:
-    gcloud storage ls ~{"--billing-project " + billing_project_id} \
-        "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt || true
-    # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
-    echo "Deleting objects in vet and ref_ranges directories..."
-    while IFS= read -r gcs_path; do
-      if [ -n "$gcs_path" ]; then
-        echo "Deleting objects in: $gcs_path"
-        gcloud storage rm ~{"--billing-project " + billing_project_id} \
-          "${gcs_path}**" --recursive || echo "Warning: Failed to delete $gcs_path"
+    if ! gcloud storage ls ~{"\"--billing-project " + billing_project_id + "\""} \
+        "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt; then
+      echo "Warning: Failed to list parquet directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/. They may not exist."
+      # Ensure parquet_dirs.txt exists but is empty so the deletion loop is a no-op
+      : > parquet_dirs.txt
+    fi
+    # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
+    echo "Deleting objects in vet and ref_ranges directories..."
+    while IFS= read -r gcs_path; do
+      if [ -n "$gcs_path" ]; then
+        # Normalize to remove any trailing slash before appending '/**'
+        gcs_prefix="${gcs_path%/}"
+        echo "Deleting objects in: $gcs_prefix"
+        gcloud storage rm ~{"\"--billing-project " + billing_project_id + "\""} \
+          "${gcs_prefix}/**" --recursive || echo "Warning: Failed to delete $gcs_path"
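The trailing-slash normalization at the heart of this suggestion is plain shell and can be exercised locally. A minimal sketch, using a made-up bucket path standing in for a line of `gcloud storage ls` output:

```shell
# Hypothetical directory path as `gcloud storage ls` would print it,
# i.e. with a trailing slash.
gcs_path="gs://example-bucket/vet/vet_001/"

# Strip any trailing slash before appending the '/**' wildcard, so the
# pattern becomes '.../vet_001/**' rather than '.../vet_001//**'.
gcs_prefix="${gcs_path%/}"
pattern="${gcs_prefix}/**"

echo "$pattern"    # → gs://example-bucket/vet/vet_001/**
```

The `${var%/}` expansion removes at most one trailing slash, which matches the single slash the listing appends to directory entries.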
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
    # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
    echo "Deleting objects in vet and ref_ranges directories..."
    while IFS= read -r gcs_path; do
      if [ -n "$gcs_path" ]; then
        echo "Deleting objects in: $gcs_path"
        gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
      fi
    done < parquet_dirs.txt

    echo "✓ Completed deletion of vet and ref_ranges parquet files"
The deletion command at line 1256 will fail if gcloud storage rm encounters an error (e.g., permission denied, network issues), which is the desired behavior. However, consider adding error handling or at least a check that verifies some files were actually found and deleted. Currently, if the ls command on line 1248 returns no results (even due to an access error suppressed by || true), the task will silently succeed without deleting anything. Consider adding validation that parquet_dirs.txt is not empty before proceeding with deletion, or logging a warning if no files are found.
Suggested change:
-    # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
-    echo "Deleting objects in vet and ref_ranges directories..."
-    while IFS= read -r gcs_path; do
-      if [ -n "$gcs_path" ]; then
-        echo "Deleting objects in: $gcs_path"
-        gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
-      fi
-    done < parquet_dirs.txt
-    echo "✓ Completed deletion of vet and ref_ranges parquet files"
+    # Validate that we found some directories before attempting deletion
+    if [ ! -s parquet_dirs.txt ]; then
+      echo "WARNING: No vet or ref_ranges directories found to delete. This may indicate that no parquet files were generated or there was an access issue listing the bucket." >&2
+    else
+      # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
+      echo "Deleting objects in vet and ref_ranges directories..."
+      deleted_any=false
+      while IFS= read -r gcs_path; do
+        if [ -n "$gcs_path" ]; then
+          deleted_any=true
+          echo "Deleting objects in: $gcs_path"
+          gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
+        fi
+      done < parquet_dirs.txt
+      if [ "$deleted_any" = true ]; then
+        echo "✓ Completed deletion of vet and ref_ranges parquet files"
+      else
+        echo "WARNING: parquet_dirs.txt contained no valid GCS paths; no objects were deleted." >&2
+      fi
+    fi
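The non-empty-file check this suggestion relies on is POSIX `test -s`, which is true only for files with size greater than zero. A self-contained sketch with stand-in data (no GCS access involved):

```shell
# Stand-in for the listing output; the file name and paths are illustrative.
: > parquet_dirs.txt                      # empty file, as when ls finds nothing

if [ ! -s parquet_dirs.txt ]; then
  listing_status="empty"                  # '-s' is false for zero-length files
else
  listing_status="has-paths"
fi

printf 'gs://example-bucket/vet/a/\n' > parquet_dirs.txt
if [ -s parquet_dirs.txt ]; then
  listing_status2="has-paths"             # '-s' is true once a path is written
fi

echo "$listing_status $listing_status2"   # → empty has-paths
```

Note that `-s` distinguishes "exists but empty" from "has content"; a missing file also fails `-s`, so the check covers both failure shapes of the earlier `ls`.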
GitHub Actions tests reported job failures from Actions build 22689373565
    # List the contents of the vet and ref_ranges directories for subsequent deletion in the loop below
    echo "Listing directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/ for deletion..."
    gcloud storage ls ~{"--billing-project " + billing_project_id} \
        "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt
Why is this ls necessary? Couldn't we just

    gcloud storage rm ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}"'/*' --recursive
Beyond the simplification, this would also have the advantage of clearing out the sample_chromosome_ploidy data that the code is currently not dealing with, plus the header data that may appear someday.
Likely because mass deletion commands in GCP regularly short-circuit and throw errors when trying to delete too many individual items. When I tried to manually clean up after running the 175k exome callset, attempting to delete the entire root of the directories failed repeatedly. Even trying to delete just the vet or ref data as a whole failed; only removing each subdirectory worked reliably, sadly.
As discussed in standup, I am following the suggested pattern that @koncheto-broad had suggested in the ticket. Happy to simplify if that is acceptable to all.
I'm surprised there were issues with a recursive delete of 100Ks of objects as I've successfully deleted millions of objects without issue 🤷
Co-authored-by: Miguel Covarrubias <mcovarr@users.noreply.github.com>
…ParquetFilesImmediatement
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
    # # List the contents of the vet and ref_ranges directories for subsequent deletion in the loop below
    # echo "Listing directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/ for deletion..."
    # gcloud storage ls ~{"--billing-project " + billing_project_id} \
    #     "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt
    #
    # # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
    # echo "Deleting Parquet files..."
    # while IFS= read -r gcs_path; do
    #   if [ -n "$gcs_path" ]; then
    #     echo "Deleting objects in: $gcs_path"
    #     gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
    #   fi
    # done < parquet_dirs.txt
There is a large block of commented-out shell logic left in the task command. Please remove the dead/commented code (or convert it into the active implementation) to keep the task readable and avoid confusion about the intended deletion strategy.
Suggested change:
-    # # List the contents of the vet and ref_ranges directories for subsequent deletion in the loop below
-    # echo "Listing directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/ for deletion..."
-    # gcloud storage ls ~{"--billing-project " + billing_project_id} \
-    #     "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt
-    #
-    # # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
-    # echo "Deleting Parquet files..."
-    # while IFS= read -r gcs_path; do
-    #   if [ -n "$gcs_path" ]; then
-    #     echo "Deleting objects in: $gcs_path"
-    #     gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
-    #   fi
-    # done < parquet_dirs.txt
I agree this shouldn't remain commented out; up to you whether it's enabled or not 🙂
.dockstore.yml (Outdated)
      - master
      - ah_var_store
      - gg_VS-1794_ParquetRemovalStrategy
      - gg_VS-1779_CleanUpParquetFilesImmediatement
Spelling: branch name gg_VS-1779_CleanUpParquetFilesImmediatement contains a typo (“Immediatement”). If possible, rename/update to “Immediately” to keep branch filters consistent and avoid future confusion.
Suggested change:
-      - gg_VS-1779_CleanUpParquetFilesImmediatement
+      - gg_VS-1779_CleanUpParquetFilesImmediately
Mais non mon robot, on parle français ici ! ("No no, my robot, we speak French here!")
    gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"

    # # List the contents of the vet and ref_ranges directories for subsequent deletion in the loop below
    # echo "Listing directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/ for deletion..."
    # gcloud storage ls ~{"--billing-project " + billing_project_id} \
    #     "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt
    #
    # # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
    # echo "Deleting Parquet files..."
    # while IFS= read -r gcs_path; do
    #   if [ -n "$gcs_path" ]; then
    #     echo "Deleting objects in: $gcs_path"
    #     gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
    #   fi
    # done < parquet_dirs.txt
DeleteParquetFiles deletes everything under output_gcs_dir via gcloud storage rm --recursive ... "${OUTPUT_GCS_DIR}/", not just the Parquet objects. Since parquet_output_gcs_dir is user-provided, a mis-specified prefix could recursively delete unrelated data. Consider deleting only the discovered .parquet URLs (e.g., pass DiscoverParquetFiles.all_files_list into this task and delete those), or restrict deletion to the known Parquet subpaths (vet/, ref_ranges/, sample_chromosome_ploidy/) instead of the entire prefix.
Suggested change:
-    gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"
-    # # List the contents of the vet and ref_ranges directories for subsequent deletion in the loop below
-    # echo "Listing directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/ for deletion..."
-    # gcloud storage ls ~{"--billing-project " + billing_project_id} \
-    #     "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt
-    #
-    # # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
-    # echo "Deleting Parquet files..."
-    # while IFS= read -r gcs_path; do
-    #   if [ -n "$gcs_path" ]; then
-    #     echo "Deleting objects in: $gcs_path"
-    #     gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
-    #   fi
-    # done < parquet_dirs.txt
+    # Delete only known Parquet subdirectories under the output GCS directory,
+    # instead of recursively deleting everything under OUTPUT_GCS_DIR.
+    for SUBDIR in vet ref_ranges sample_chromosome_ploidy; do
+      TARGET_DIR="${OUTPUT_GCS_DIR}/${SUBDIR}/"
+      echo "Checking for Parquet directory: ${TARGET_DIR}"
+      if gcloud storage ls ~{"\"--billing-project " + billing_project_id + "\""} "${TARGET_DIR}" >/dev/null 2>&1; then
+        echo "Deleting objects under: ${TARGET_DIR}"
+        gcloud storage rm --recursive ~{"\"--billing-project " + billing_project_id + "\""} "${TARGET_DIR}"
+      else
+        echo "No objects found under: ${TARGET_DIR}, skipping."
+      fi
+    done
    # Normalize GCS path by removing any trailing slash
    OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')

    gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"
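As an aside, the `sed`-based normalization quoted above can also be done with pure parameter expansion, avoiding a pipeline and a subprocess. A sketch with an illustrative value standing in for the WDL `output_gcs_dir` input:

```shell
output_gcs_dir="gs://example-bucket/parquet-staging/"   # illustrative value

# Equivalent to: OUTPUT_GCS_DIR=$(echo "$output_gcs_dir" | sed 's/\/$//')
# '%/' removes at most one trailing slash; paths without one pass through unchanged.
OUTPUT_GCS_DIR="${output_gcs_dir%/}"

echo "$OUTPUT_GCS_DIR"    # → gs://example-bucket/parquet-staging
```

Both forms leave a slash-free prefix that is safe to interpolate into `"${OUTPUT_GCS_DIR}/..."` patterns.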
To address some of Copilot's specificity concerns:
Suggested change:
-    gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"
+    gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"'**/*.parquet'
    # # List the contents of the vet and ref_ranges directories for subsequent deletion in the loop below
    # echo "Listing directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/ for deletion..."
    # gcloud storage ls ~{"--billing-project " + billing_project_id} \
    #     "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt
    #
    # # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
    # echo "Deleting Parquet files..."
    # while IFS= read -r gcs_path; do
    #   if [ -n "$gcs_path" ]; then
    #     echo "Deleting objects in: $gcs_path"
    #     gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
    #   fi
    # done < parquet_dirs.txt
I agree this shouldn't remain commented out; up to you whether it's enabled or not 🙂
    OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')

    if [ "~{use_alternate_delete_strategy}" = "false" ]; then
      gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"**/*.parquet
Going to need some single quotes or escapes here, or else the globbing is likely to be a problem.
Suggested change:
-      gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"**/*.parquet
+      gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"'**/*.parquet'
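The quoting concern can be demonstrated without touching GCS. A sketch using a plain `*.parquet` glob and local stand-in files (the same principle applies to the `**/*.parquet` pattern in the task):

```shell
cd "$(mktemp -d)"
touch chr1.parquet chr2.parquet    # illustrative local files

# Unquoted: the shell expands the glob against the local working
# directory before the command (e.g. gcloud) ever sees it.
unquoted=$(echo *.parquet)

# Single-quoted: the pattern reaches the command literally, so gcloud
# can match it against object names on the server side instead.
quoted=$(echo '*.parquet')

echo "unquoted=$unquoted"    # → unquoted=chr1.parquet chr2.parquet
echo "quoted=$quoted"        # → quoted=*.parquet
```

This is why the suggested fix wraps only the wildcard portion in single quotes while leaving `"${OUTPUT_GCS_DIR}/"` double-quoted so the variable still expands.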
This PR adds functionality to (by default) delete all of the parquet files generated during import.
Passing run here