VS-1779 clean up parquet files immediatement #9338

Merged: gbggrant merged 26 commits into VS-1736 from gg_VS-1779_CleanUpParquetFilesImmediatement on Mar 10, 2026

gbggrant (Collaborator) commented Feb 26, 2026

This PR adds functionality to (by default) delete all of the parquet files generated during import.

Passing run here

Copilot AI left a comment

Pull request overview

Adds an option to automatically clean up Parquet artifacts produced during the VariantStore import workflow, aiming to reduce storage usage after a successful load.

Changes:

  • Introduces delete_parquet_files_after_loading (default true) to control Parquet cleanup post-load.
  • Adds a DeleteParquetFiles WDL task and wires it into GvsImportGenomes.
  • Updates Dockstore workflow branch filters.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File | Description
scripts/variantstore/wdl/GvsImportGenomes.wdl | Adds a new workflow input and a cleanup task intended to delete staged Parquet files after loading.
scripts/variantstore/wdl/GvsBulkIngestGenomes.wdl | Adds a stray comment line near imports.
.dockstore.yml | Adds a branch filter entry for the new branch name.

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.


runtime {
docker: cloud_sdk_docker
memory: "3 GB"
disks: "local-disk 500 HDD"

Copilot AI, Feb 26, 2026

The task allocates 500 GB of disk space ("local-disk 500 HDD"), which seems excessive for a task that only lists and deletes GCS objects. The script only creates small text files locally (parquet_dirs.txt). Consider reducing the disk allocation to something more reasonable like 10-20 GB, consistent with similar tasks like ConfigureParquetLifecycle which uses 10 GB.

Suggested change
- disks: "local-disk 500 HDD"
+ disks: "local-disk 10 HDD"

# List the contents of the vet and ref_ranges directories for deletion later
echo "Listing directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/ for deletion..."
gcloud storage ls ~{"--billing-project " + billing_project_id} \

Copilot AI, Feb 26, 2026

The command uses double spacing between "ls" and the billing_project_id flag. While this may work, it's inconsistent with line 1256 which has single spacing. Consider using consistent spacing for better code style.

Suggested change
- gcloud storage ls  ~{"--billing-project " + billing_project_id} \
+ gcloud storage ls ~{"--billing-project " + billing_project_id} \

Comment on lines +1248 to +1257
gcloud storage ls ~{"--billing-project " + billing_project_id} \
  "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt || true

# Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
echo "Deleting objects in vet and ref_ranges directories..."
while IFS= read -r gcs_path; do
  if [ -n "$gcs_path" ]; then
    echo "Deleting objects in: $gcs_path"
    gcloud storage rm ~{"--billing-project " + billing_project_id} \
      "${gcs_path}**" --recursive || echo "Warning: Failed to delete $gcs_path"

Copilot AI, Feb 26, 2026

The script lists directories using "gcloud storage ls", which outputs directory paths, but then attempts to delete "${gcs_path}**" recursively. Since gcs_path will contain directory paths ending with "/", the resulting pattern "${gcs_path}**" may not match the intended files. Additionally, the "|| true" on line 1249 suppresses any listing errors, which could lead to silent failures if the directories don't exist. Consider handling the case where the directories might not exist more explicitly.

Suggested change
- gcloud storage ls ~{"--billing-project " + billing_project_id} \
-   "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt || true
- # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
- echo "Deleting objects in vet and ref_ranges directories..."
- while IFS= read -r gcs_path; do
-   if [ -n "$gcs_path" ]; then
-     echo "Deleting objects in: $gcs_path"
-     gcloud storage rm ~{"--billing-project " + billing_project_id} \
-       "${gcs_path}**" --recursive || echo "Warning: Failed to delete $gcs_path"
+ if ! gcloud storage ls ~{"\"--billing-project " + billing_project_id + "\"" } \
+     "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt; then
+   echo "Warning: Failed to list parquet directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/. They may not exist."
+   # Ensure parquet_dirs.txt exists but is empty so the deletion loop is a no-op
+   : > parquet_dirs.txt
+ fi
+ # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
+ echo "Deleting objects in vet and ref_ranges directories..."
+ while IFS= read -r gcs_path; do
+   if [ -n "$gcs_path" ]; then
+     # Normalize to remove any trailing slash before appending '/**'
+     gcs_prefix="${gcs_path%/}"
+     echo "Deleting objects in: $gcs_prefix"
+     gcloud storage rm ~{"\"--billing-project " + billing_project_id + "\"" } \
+       "${gcs_prefix}/**" --recursive || echo "Warning: Failed to delete $gcs_path"
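
The trailing-slash normalization in this suggestion can be tried on its own, without touching GCS. The following is a minimal sketch, assuming a hypothetical helper name (normalize_gcs_prefix) and a made-up bucket path; it only shows how the `${var%/}` parameter expansion behaves, not how the task is actually implemented.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: strip at most one trailing slash from a prefix
# using POSIX parameter expansion (no external sed call needed).
normalize_gcs_prefix() {
  printf '%s\n' "${1%/}"
}

# gs://example-bucket/... is a placeholder path, not one from the PR.
with_slash=$(normalize_gcs_prefix "gs://example-bucket/parquet/vet/")
without_slash=$(normalize_gcs_prefix "gs://example-bucket/parquet/vet")

# Both inputs normalize to the same prefix, so appending '/**' is unambiguous.
pattern="${with_slash}/**"
printf '%s\n' "$with_slash" "$without_slash" "$pattern"
```

Because both forms of the input produce the same prefix, the wildcard that follows never ends up with a doubled or missing slash.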

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.


Comment on lines +1251 to +1261
# Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
echo "Deleting objects in vet and ref_ranges directories..."
while IFS= read -r gcs_path; do
  if [ -n "$gcs_path" ]; then
    echo "Deleting objects in: $gcs_path"
    gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
  fi
done < parquet_dirs.txt

echo "✓ Completed deletion of vet and ref_ranges parquet files"

Copilot AI, Feb 26, 2026

The deletion command at line 1256 will fail if gcloud storage rm encounters an error (e.g., permission denied, network issues), which is the desired behavior. However, consider adding error handling or at least a check that verifies some files were actually found and deleted. Currently, if the ls command on line 1248 returns no results (even due to an access error suppressed by || true), the task will silently succeed without deleting anything. Consider adding validation that parquet_dirs.txt is not empty before proceeding with deletion, or logging a warning if no files are found.

Suggested change
- # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
- echo "Deleting objects in vet and ref_ranges directories..."
- while IFS= read -r gcs_path; do
-   if [ -n "$gcs_path" ]; then
-     echo "Deleting objects in: $gcs_path"
-     gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
-   fi
- done < parquet_dirs.txt
- echo "✓ Completed deletion of vet and ref_ranges parquet files"
+ # Validate that we found some directories before attempting deletion
+ if [ ! -s parquet_dirs.txt ]; then
+   echo "WARNING: No vet or ref_ranges directories found to delete. This may indicate that no parquet files were generated or there was an access issue listing the bucket." >&2
+ else
+   # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
+   echo "Deleting objects in vet and ref_ranges directories..."
+   deleted_any=false
+   while IFS= read -r gcs_path; do
+     if [ -n "$gcs_path" ]; then
+       deleted_any=true
+       echo "Deleting objects in: $gcs_path"
+       gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
+     fi
+   done < parquet_dirs.txt
+   if [ "$deleted_any" = true ]; then
+     echo "✓ Completed deletion of vet and ref_ranges parquet files"
+   else
+     echo "WARNING: parquet_dirs.txt contained no valid GCS paths; no objects were deleted." >&2
+   fi
+ fi

gbggrant marked this pull request as ready for review on March 4, 2026, 21:09
gatk-bot commented Mar 4, 2026

GitHub Actions tests reported job failures from Actions build 22689373565.
Failures in the following jobs:

Test Type | JDK | Job ID | Logs
conda | 17.0.6+10 | 22689373565.3 | logs

Comment on lines +1301 to +1304
# List the contents of the vet and ref_ranges directories for subsequent deletion in the loop below
echo "Listing directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/ for deletion..."
gcloud storage ls ~{"--billing-project " + billing_project_id} \
"${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt

Collaborator commented

Why is this ls necessary? Couldn't we just run:

gcloud storage rm ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}"'/*' --recursive

Beyond the simplification, this would also have the advantage of clearing out the sample_chromosome_ploidy data that the code is currently not dealing with, plus the header data that may appear someday.

Collaborator commented

Likely because mass deletion commands in GCP regularly short-circuit and throw errors when asked to delete too many individual objects. When I tried to clean up manually after running the 175k exome callset, attempting to delete the entire root of the directories failed repeatedly. Even deleting just the vet or ref data as a whole failed. Sadly, only removing each subdirectory individually worked reliably.

Collaborator (Author) commented

As discussed in standup, I am following the pattern that @koncheto-broad suggested in the ticket. Happy to simplify if that is acceptable to all.

Collaborator commented

I'm surprised there were issues with a recursive delete of 100Ks of objects as I've successfully deleted millions of objects without issue 🤷
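
The per-subdirectory strategy discussed in this thread can be sketched as a loop over a list of prefixes. This is an illustration only, not the task's code: DELETE_CMD, delete_listed_prefixes, and the gs://example-bucket paths are hypothetical names introduced here, and the default command merely echoes what it would delete. In the real task the command would presumably be `gcloud storage rm --recursive` with the billing-project flag.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical hook: the command run once per prefix. Defaults to a dry run.
DELETE_CMD=${DELETE_CMD:-echo would-delete}

# Hypothetical helper: delete each non-empty prefix listed in a file,
# warning (rather than aborting) when an individual deletion fails.
delete_listed_prefixes() {
  prefix_list=$1
  failures=0
  while IFS= read -r gcs_path; do
    # Skip blank lines, as the loop in the PR does.
    [ -n "$gcs_path" ] || continue
    # DELETE_CMD is left unquoted on purpose so it splits into command + args.
    if ! $DELETE_CMD "$gcs_path"; then
      echo "Warning: failed to delete $gcs_path" >&2
      failures=$((failures + 1))
    fi
  done < "$prefix_list"
  # Succeed only if every deletion succeeded.
  [ "$failures" -eq 0 ]
}

# Placeholder prefixes standing in for the output of a listing step.
list_file=$(mktemp)
printf '%s\n' \
  "gs://example-bucket/parquet/vet/000/" \
  "" \
  "gs://example-bucket/parquet/ref_ranges/000/" > "$list_file"

dry_run_output=$(delete_listed_prefixes "$list_file")
rm -f "$list_file"
printf '%s\n' "$dry_run_output"
```

Issuing one deletion per subdirectory keeps each request small, which is the property the thread reports as working reliably; whether a single recursive delete would also work at this scale is exactly what the reviewers disagree on.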

gbggrant and others added 3 commits March 5, 2026 10:40
Co-authored-by: Miguel Covarrubias <mcovarr@users.noreply.github.com>
Co-authored-by: Miguel Covarrubias <mcovarr@users.noreply.github.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


Comment on lines +1344 to +1357
# # List the contents of the vet and ref_ranges directories for subsequent deletion in the loop below
# echo "Listing directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/ for deletion..."
# gcloud storage ls ~{"--billing-project " + billing_project_id} \
# "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt
#
# # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
# echo "Deleting Parquet files..."
# while IFS= read -r gcs_path; do
# if [ -n "$gcs_path" ]; then
# echo "Deleting objects in: $gcs_path"
# gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
# fi
# done < parquet_dirs.txt

Copilot AI, Mar 9, 2026

There is a large block of commented-out shell logic left in the task command. Please remove the dead/commented code (or convert it into the active implementation) to keep the task readable and avoid confusion about the intended deletion strategy.

Suggested change
- # # List the contents of the vet and ref_ranges directories for subsequent deletion in the loop below
- # echo "Listing directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/ for deletion..."
- # gcloud storage ls ~{"--billing-project " + billing_project_id} \
- # "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt
- #
- # # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
- # echo "Deleting Parquet files..."
- # while IFS= read -r gcs_path; do
- # if [ -n "$gcs_path" ]; then
- # echo "Deleting objects in: $gcs_path"
- # gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
- # fi
- # done < parquet_dirs.txt

Collaborator commented

I agree this shouldn't remain commented out; up to you whether it's enabled or not 🙂

.dockstore.yml Outdated
- master
- ah_var_store
- gg_VS-1794_ParquetRemovalStrategy
- gg_VS-1779_CleanUpParquetFilesImmediatement

Copilot AI, Mar 9, 2026

Spelling: branch name gg_VS-1779_CleanUpParquetFilesImmediatement contains a typo (“Immediatement”). If possible, rename/update to “Immediately” to keep branch filters consistent and avoid future confusion.

Suggested change
- - gg_VS-1779_CleanUpParquetFilesImmediatement
+ - gg_VS-1779_CleanUpParquetFilesImmediately

Collaborator commented

But no, my robot, we speak French here!

Collaborator (Author) commented

Agreed.

Comment on lines +1342 to +1357
gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"

# # List the contents of the vet and ref_ranges directories for subsequent deletion in the loop below
# echo "Listing directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/ for deletion..."
# gcloud storage ls ~{"--billing-project " + billing_project_id} \
# "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt
#
# # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
# echo "Deleting Parquet files..."
# while IFS= read -r gcs_path; do
# if [ -n "$gcs_path" ]; then
# echo "Deleting objects in: $gcs_path"
# gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
# fi
# done < parquet_dirs.txt

Copilot AI, Mar 9, 2026

DeleteParquetFiles deletes everything under output_gcs_dir via gcloud storage rm --recursive ... "${OUTPUT_GCS_DIR}/", not just the Parquet objects. Since parquet_output_gcs_dir is user-provided, a mis-specified prefix could recursively delete unrelated data. Consider deleting only the discovered .parquet URLs (e.g., pass DiscoverParquetFiles.all_files_list into this task and delete those), or restrict deletion to the known Parquet subpaths (vet/, ref_ranges/, sample_chromosome_ploidy/) instead of the entire prefix.

Suggested change
- gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"
- # # List the contents of the vet and ref_ranges directories for subsequent deletion in the loop below
- # echo "Listing directories under ${OUTPUT_GCS_DIR}/vet/ and ${OUTPUT_GCS_DIR}/ref_ranges/ for deletion..."
- # gcloud storage ls ~{"--billing-project " + billing_project_id} \
- # "${OUTPUT_GCS_DIR}/vet/" "${OUTPUT_GCS_DIR}/ref_ranges/" > parquet_dirs.txt
- #
- # # Iterate over all Google Cloud paths in parquet_dirs.txt and delete all objects therein
- # echo "Deleting Parquet files..."
- # while IFS= read -r gcs_path; do
- # if [ -n "$gcs_path" ]; then
- # echo "Deleting objects in: $gcs_path"
- # gcloud storage rm ~{"--billing-project " + billing_project_id} "$gcs_path" --recursive
- # fi
- # done < parquet_dirs.txt
+ # Delete only known Parquet subdirectories under the output GCS directory,
+ # instead of recursively deleting everything under OUTPUT_GCS_DIR.
+ for SUBDIR in vet ref_ranges sample_chromosome_ploidy; do
+   TARGET_DIR="${OUTPUT_GCS_DIR}/${SUBDIR}/"
+   echo "Checking for Parquet directory: ${TARGET_DIR}"
+   if gcloud storage ls ~{"\"--billing-project " + billing_project_id + "\""} "${TARGET_DIR}" >/dev/null 2>&1; then
+     echo "Deleting objects under: ${TARGET_DIR}"
+     gcloud storage rm --recursive ~{"\"--billing-project " + billing_project_id + "\""} "${TARGET_DIR}"
+   else
+     echo "No objects found under: ${TARGET_DIR}, skipping."
+   fi
+ done

# Normalize GCS path by removing any trailing slash
OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')

gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"

Collaborator commented

To address some of Copilot's specificity concerns:

Suggested change
- gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"
+ gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"'**/*.parquet'

OUTPUT_GCS_DIR=$(echo ~{output_gcs_dir} | sed 's/\/$//')

if [ "~{use_alternate_delete_strategy}" = "false" ]; then
gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"**/*.parquet

Collaborator commented

This is going to need single quotes or escapes; otherwise shell globbing is likely to be a problem.

Suggested change
- gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"**/*.parquet
+ gcloud storage rm --recursive ~{"--billing-project " + billing_project_id} "${OUTPUT_GCS_DIR}/"'**/*.parquet'
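
The globbing concern can be reproduced locally with ordinary files. The sketch below is an assumption-laden illustration: it uses a temporary local directory in place of a GCS prefix and a hypothetical count_args helper in place of gcloud, simply to show what the shell passes to a command when the pattern is unquoted versus single-quoted.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-in for the real command: report how many arguments arrive.
count_args() { echo "$#"; }

# A local directory stands in for the GCS prefix so the shell has files to match.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
mkdir -p "$workdir/vet" "$workdir/ref_ranges"
touch "$workdir/vet/a.parquet" "$workdir/ref_ranges/b.parquet"

# Unquoted: the local shell expands the pattern before the command runs,
# so the command receives one argument per matching file.
unquoted_count=$(count_args "$workdir/"**/*.parquet)

# Single-quoted: the pattern reaches the command as one literal argument,
# leaving wildcard handling to the command itself.
quoted_count=$(count_args "$workdir/"'**/*.parquet')

echo "unquoted: $unquoted_count argument(s), quoted: $quoted_count argument(s)"
```

With a gs:// prefix the unquoted pattern would rarely match any local file, and bash by default passes an unmatched pattern through unchanged, so the unquoted form can appear to work. Options such as nullglob or failglob change that behavior, which is why quoting the wildcard is the safer choice.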

gbggrant merged commit 71a586d into VS-1736 on Mar 10, 2026
7 of 12 checks passed
gbggrant deleted the gg_VS-1779_CleanUpParquetFilesImmediatement branch on March 10, 2026, 12:15
5 participants