@ntnn19 ntnn19 commented Oct 25, 2025

Overview

This PR introduces an option to compress the inference output directory, which is useful for large-scale prediction workloads that generate extensive results.

Compression only occurs after a full inference run completes.
If only the data pipeline is executed, no compression takes place, even if the compression flag is set to true.
The default, --compress_output_dir=false, preserves existing behavior.
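
The gating rule described above can be sketched as a small predicate. This is a minimal illustration, not the PR's actual code: the function name `should_compress` and the parameter `inference_completed` are hypothetical, and only `compress_output_dir` comes from the description.

```python
def should_compress(inference_completed: bool, compress_output_dir: bool = False) -> bool:
    """Return True only if a full inference run finished AND compression was requested.

    Both names other than `compress_output_dir` are hypothetical and used
    purely for illustration.
    """
    return inference_completed and compress_output_dir


# Data-pipeline-only run: the flag is ignored, so nothing is compressed.
print(should_compress(inference_completed=False, compress_output_dir=True))  # False

# Full inference run with the flag set: the output directory is compressed.
print(should_compress(inference_completed=True, compress_output_dir=True))  # True

# Full inference run with the default flag value: existing behavior is kept.
print(should_compress(inference_completed=True))  # False
```

Keeping the rule in a single predicate makes it straightforward to check every row of the behavior table in a unit test.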

Motivation

Inference runs can generate large output directories, especially in high-throughput scenarios. This option helps users:

  • Reduce storage usage by compressing completed inference outputs.
  • Run data preprocessing (data_pipeline) without inference or compression when only staged preparation is needed.
  • Retain backward compatibility when compression is disabled (default setting).

Key Changes

  1. Added --compress_output_dir flag to the pipeline configuration (default: false).

  2. Behavioral logic:

    | Mode                            | Inference | force_output_dir | compress_output_dir | Result                                             |
    |---------------------------------|-----------|------------------|---------------------|----------------------------------------------------|
    | Default run                     | yes       | false            | false               | Normal uncompressed run                            |
    | Default run                     | yes       | false            | true                | Compression after inference                        |
    | Default run                     | yes       | true             | false               | Normal uncompressed run                            |
    | Default run                     | yes       | true             | true                | Compression after inference                        |
    | Separate pipeline → inference   | yes       | false            | false               | Normal uncompressed run                            |
    | Separate pipeline → inference   | yes       | false            | true                | Compression after inference                        |
    | Separate pipeline → inference   | yes       | true             | false               | Forced uncompressed run                            |
    | Separate pipeline → inference   | yes       | true             | true                | Compression after inference of the forced directory |
    | Data-only pipeline              | no        | false            | false               | Data pipeline runs; no compression                 |
    | Data-only pipeline              | no        | false            | true                | Data pipeline runs; no compression                 |
    | Data-only pipeline              | no        | true             | false               | Data pipeline runs; no compression                 |
    | Data-only pipeline              | no        | true             | true                | Data pipeline runs; no compression                 |
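
As a rough illustration of Key Change 1, a boolean flag that accepts the --compress_output_dir=false form and defaults to false could be declared as follows. This sketch uses the standard-library `argparse` only to stay self-contained; the project may well use a different flag library, and the helper names `parse_bool` and `build_parser` are hypothetical.

```python
import argparse


def parse_bool(value: str) -> bool:
    """Parse 'true'/'false' (case-insensitive), as in --flag=false style arguments."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical entry-point parser exposing only the new flag."""
    parser = argparse.ArgumentParser(description="Illustrative pipeline entry point.")
    parser.add_argument(
        "--compress_output_dir",
        type=parse_bool,
        default=False,  # Default keeps the existing, uncompressed behavior.
        help="Compress the output directory after a full inference run.",
    )
    return parser


# Omitting the flag keeps the default; passing =true opts in to compression.
print(build_parser().parse_args([]).compress_output_dir)
print(build_parser().parse_args(["--compress_output_dir=true"]).compress_output_dir)
```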

    Impact

    ✅ No breaking changes; default behavior unchanged

    ✅ Enables disk-saving via post-inference compression

    ✅ Explicitly avoids compression during data-only runs

    Checklist

    • New flag and logic documented
    • Tested with both true/false values in multiple configurations
    • Defaults preserved for existing workflows

    Example Output Directory Structure

    To illustrate the effect of the new --compress_output_dir flag, here is a snippet of the output directory tree generated during testing:

    tree results/
    results/
    ├── rule_DATA_PIPELINE
    │   ├── compress_false_inference_false_data_pipeline_true
    │   │   └── 2PV7
    │   │       └── 2PV7_data.json
    │   ├── compress_false_inference_true_data_pipeline_false
    │   │   └── 2PV7
    │   │       ├── 2PV7_confidences.json
    │   │       ├── 2PV7_data.json
    │   │       ├── 2PV7_model.cif
    │   │       ├── 2PV7_ranking_scores.csv
    │   │       ├── 2PV7_summary_confidences.json
    │   │       ├── seed-1_sample-0
    │   │       │   ├── 2PV7_seed-1_sample-0_confidences.json
    │   │       │   ├── 2PV7_seed-1_sample-0_model.cif
    │   │       │   └── 2PV7_seed-1_sample-0_summary_confidences.json
    │   │       ├── seed-1_sample-1
    │   │       │   ├── 2PV7_seed-1_sample-1_confidences.json
    │   │       │   ├── 2PV7_seed-1_sample-1_model.cif
    │   │       │   └── 2PV7_seed-1_sample-1_summary_confidences.json
    │   │       ├── seed-1_sample-2
    │   │       │   ├── 2PV7_seed-1_sample-2_confidences.json
    │   │       │   ├── 2PV7_seed-1_sample-2_model.cif
    │   │       │   └── 2PV7_seed-1_sample-2_summary_confidences.json
    │   │       ├── seed-1_sample-3
    │   │       │   ├── 2PV7_seed-1_sample-3_confidences.json
    │   │       │   ├── 2PV7_seed-1_sample-3_model.cif
    │   │       │   └── 2PV7_seed-1_sample-3_summary_confidences.json
    │   │       ├── seed-1_sample-4
    │   │       │   ├── 2PV7_seed-1_sample-4_confidences.json
    │   │       │   ├── 2PV7_seed-1_sample-4_model.cif
    │   │       │   └── 2PV7_seed-1_sample-4_summary_confidences.json
    │   │       └── TERMS_OF_USE.md
    │   ├── compress_true_inference_false_data_pipeline_true
    │   │   └── 2PV7
    │   │       └── 2PV7_data.json
    │   └── compress_true_inference_true_data_pipeline_false
    │       └── 2PV7.tar.gz
    ├── rule_DATA_PIPELINE_PLUS_INFERENCE
    │   ├── compress_false_inference_true_data_pipeline_true
    │   │   └── 2PV7
    │   │       ├── 2PV7_confidences.json
    │   │       ├── 2PV7_data.json
    │   │       ├── 2PV7_model.cif
    │   │       ├── 2PV7_ranking_scores.csv
    │   │       ├── 2PV7_summary_confidences.json
    │   │       ├── seed-1_sample-0
    │   │       │   ├── 2PV7_seed-1_sample-0_confidences.json
    │   │       │   ├── 2PV7_seed-1_sample-0_model.cif
    │   │       │   └── 2PV7_seed-1_sample-0_summary_confidences.json
    │   │       ├── seed-1_sample-1
    │   │       │   ├── 2PV7_seed-1_sample-1_confidences.json
    │   │       │   ├── 2PV7_seed-1_sample-1_model.cif
    │   │       │   └── 2PV7_seed-1_sample-1_summary_confidences.json
    │   │       ├── seed-1_sample-2
    │   │       │   ├── 2PV7_seed-1_sample-2_confidences.json
    │   │       │   ├── 2PV7_seed-1_sample-2_model.cif
    │   │       │   └── 2PV7_seed-1_sample-2_summary_confidences.json
    │   │       ├── seed-1_sample-3
    │   │       │   ├── 2PV7_seed-1_sample-3_confidences.json
    │   │       │   ├── 2PV7_seed-1_sample-3_model.cif
    │   │       │   └── 2PV7_seed-1_sample-3_summary_confidences.json
    │   │       ├── seed-1_sample-4
    │   │       │   ├── 2PV7_seed-1_sample-4_confidences.json
    │   │       │   ├── 2PV7_seed-1_sample-4_model.cif
    │   │       │   └── 2PV7_seed-1_sample-4_summary_confidences.json
    │   │       └── TERMS_OF_USE.md
    │   └── compress_true_inference_true_data_pipeline_true
    │       └── 2PV7.tar.gz
    └── rule_INFERENCE
        ├── compress_false_inference_true_data_pipeline_false
        │   └── 2PV7
        │       ├── 2PV7_confidences.json
        │       ├── 2PV7_data.json
        │       ├── 2PV7_model.cif
        │       ├── 2PV7_ranking_scores.csv
        │       ├── 2PV7_summary_confidences.json
        │       ├── seed-1_sample-0
        │       │   ├── 2PV7_seed-1_sample-0_confidences.json
        │       │   ├── 2PV7_seed-1_sample-0_model.cif
        │       │   └── 2PV7_seed-1_sample-0_summary_confidences.json
        │       ├── seed-1_sample-1
        │       │   ├── 2PV7_seed-1_sample-1_confidences.json
        │       │   ├── 2PV7_seed-1_sample-1_model.cif
        │       │   └── 2PV7_seed-1_sample-1_summary_confidences.json
        │       ├── seed-1_sample-2
        │       │   ├── 2PV7_seed-1_sample-2_confidences.json
        │       │   ├── 2PV7_seed-1_sample-2_model.cif
        │       │   └── 2PV7_seed-1_sample-2_summary_confidences.json
        │       ├── seed-1_sample-3
        │       │   ├── 2PV7_seed-1_sample-3_confidences.json
        │       │   ├── 2PV7_seed-1_sample-3_model.cif
        │       │   └── 2PV7_seed-1_sample-3_summary_confidences.json
        │       ├── seed-1_sample-4
        │       │   ├── 2PV7_seed-1_sample-4_confidences.json
        │       │   ├── 2PV7_seed-1_sample-4_model.cif
        │       │   └── 2PV7_seed-1_sample-4_summary_confidences.json
        │       └── TERMS_OF_USE.md
        └── compress_true_inference_true_data_pipeline_false
            └── 2PV7.tar.gz
    
    32 directories, 68 files
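
The tree above shows each compressed run leaving a single 2PV7.tar.gz in place of the 2PV7 directory. A minimal sketch of such a step, using only the Python standard library, might look like the following. The function name `compress_job_dir` is hypothetical, and removing the original directory after archiving is an assumption inferred from the tree, not something confirmed by the PR's code.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def compress_job_dir(job_dir: Path) -> Path:
    """Pack job_dir into a sibling <name>.tar.gz and remove the original directory."""
    archive_path = job_dir.with_name(job_dir.name + ".tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps archive paths relative to the job name (e.g. 2PV7/...).
        tar.add(job_dir, arcname=job_dir.name)
    # Assumption: the uncompressed tree is deleted once the archive is written.
    shutil.rmtree(job_dir)
    return archive_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Build a tiny stand-in for a finished job directory.
        job_dir = Path(tmp) / "2PV7"
        job_dir.mkdir()
        (job_dir / "2PV7_data.json").write_text("{}")

        archive = compress_job_dir(job_dir)

        # List the archive members without extracting them.
        with tarfile.open(archive, "r:gz") as tar:
            print(tar.getnames())
```

Writing the archive before deleting the directory means a failure during compression leaves the original results untouched.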

google-cla bot commented Oct 25, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.
