Skip to content
Merged
9 changes: 9 additions & 0 deletions components/clp-package-utils/clp_package_utils/general.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
CLP_DEFAULT_CREDENTIALS_FILE_PATH,
CLPConfig,
DB_COMPONENT_NAME,
QueryEngine,
QUEUE_COMPONENT_NAME,
REDIS_COMPONENT_NAME,
REDUCER_COMPONENT_NAME,
Expand Down Expand Up @@ -582,6 +583,14 @@ def validate_dataset_name(clp_table_prefix: str, dataset_name: str) -> None:
)


def validate_retention_config(clp_config: CLPConfig) -> None:
clp_query_engine = clp_config.package.query_engine
if is_retention_period_configured(clp_config) and clp_query_engine == QueryEngine.PRESTO:
raise ValueError(
f"Retention control is not supported with query_engine `{clp_query_engine}`"
)


def is_retention_period_configured(clp_config: CLPConfig) -> bool:
if clp_config.archive_output.retention_period is not None:
return True
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,7 @@
validate_redis_config,
validate_reducer_config,
validate_results_cache_config,
validate_retention_config,
validate_webui_config,
)

Expand Down Expand Up @@ -1251,6 +1252,12 @@ def main(argv):
GARBAGE_COLLECTOR_NAME,
):
validate_output_storage_config(clp_config)
if target in (
ALL_TARGET_NAME,
CONTROLLER_TARGET_NAME,
GARBAGE_COLLECTOR_NAME,
):
validate_retention_config(clp_config)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Retention validation correctly blocks PRESTO + retention at startup; consider narrowing scope to results-cache retention.

This will fail fast when PRESTO is configured with a retention period, matching the stated validation. If archive retention (archives/streams) is safe to run with PRESTO, you may want to restrict the check to results-cache retention only to avoid over-blocking legitimate configs.

If you want to narrow the check, update the validator in general.py (see my suggestion there) and keep this call site unchanged.

To double-check the target/component wiring on this branch (so GC gets included under CONTROLLER and ALL), you can run:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Confirm STORAGE_MANAGEMENT_COMPONENTS and GC membership"
rg -n -C3 '\bSTORAGE_MANAGEMENT_COMPONENTS\b|GARBAGE_COLLECTOR_NAME' --type=py

echo
echo "Confirm TARGET_TO_COMPONENTS includes STORAGE_MANAGEMENT_COMPONENTS for CONTROLLER and ALL"
rg -n -C3 --type=py 'TARGET_TO_COMPONENTS'
rg -n -C3 --type=py 'CONTROLLER_TARGET_NAME.*STORAGE_MANAGEMENT_COMPONENTS'
rg -n -C3 --type=py 'ALL_TARGET_NAME.*STORAGE_MANAGEMENT_COMPONENTS'

Length of output: 27689


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Search for validate_retention_config definition and references"
rg -n --type=py 'def validate_retention_config|validate_retention_config\(' -C3 || true

echo
echo "Show definition (if found)"
DEF_FILE=$(rg -n --type=py 'def validate_retention_config' -S -o | awk -F: '{print $1; exit}' || true)
if [ -n "$DEF_FILE" ]; then
  echo "Definition file: $DEF_FILE"
  sed -n '1,240p' "$DEF_FILE"
else
  echo "No direct definition found; show all matches instead:"
  rg -n --type=py 'validate_retention_config' -C5 || true
fi

Length of output: 8972


Narrow PRESTO + retention validation to results-cache retention (fix required)

validate_retention_config currently rejects startup whenever any retention is configured AND package.query_engine == QueryEngine.PRESTO — this will block PRESTO even if only archive retention is set. Restrict the check to results-cache retention to avoid over-blocking.

  • Files to change

    • components/clp-package-utils/clp_package_utils/general.py — function validate_retention_config (around line ~586). Current condition:
      • if is_retention_period_configured(clp_config) and clp_query_engine == QueryEngine.PRESTO: raise ValueError(...)
    • Call site (leave unchanged):
      • components/clp-package-utils/clp_package_utils/scripts/start_clp.py — call at lines ~1256-1261: validate_retention_config(clp_config)
  • Suggested fix (replace the broad test with a results-cache specific check):

    • Replace
      • if is_retention_period_configured(clp_config) and clp_query_engine == QueryEngine.PRESTO:
    • With
      • if clp_query_engine == QueryEngine.PRESTO and clp_config.results_cache.retention_period is not None:
    • Keep the existing error message or update it to mention results-cache explicitly.

I verified:

  • validate_retention_config is defined in components/clp-package-utils/clp_package_utils/general.py and uses is_retention_period_configured + QueryEngine.PRESTO.
  • start_clp.py calls validate_retention_config for ALL / CONTROLLER / GARBAGE_COLLECTOR targets (so the check is exercised when GC is included).
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/general.py around line ~586,
the validate_retention_config function currently blocks startup for any
configured retention when clp_query_engine == QueryEngine.PRESTO; narrow this to
only check results-cache retention by changing the condition to test
clp_query_engine == QueryEngine.PRESTO AND
clp_config.results_cache.retention_period is not None, and keep (or update to
mention results-cache) the existing ValueError message so only results-cache
retention prevents PRESTO from starting.

clp_config.validate_data_dir()
clp_config.validate_logs_dir()
Expand Down Expand Up @@ -1330,7 +1337,8 @@ def main(argv):

if WEBUI_COMPONENT_NAME in components_to_start:
start_webui(instance_id, clp_config, container_clp_config, mounts)
if target in (ALL_TARGET_NAME, GARBAGE_COLLECTOR_NAME):

if GARBAGE_COLLECTOR_NAME in components_to_start:
start_garbage_collector(instance_id, clp_config, container_clp_config, mounts)

except Exception as ex:
Expand Down
8 changes: 6 additions & 2 deletions components/clp-py-utils/clp_py_utils/clp_config.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,10 @@
RESULTS_CACHE_COMPONENT_NAME,
WEBUI_COMPONENT_NAME,
}
ALL_COMPONENTS = COMPRESSION_COMPONENTS | QUERY_COMPONENTS | UI_COMPONENTS
STORAGE_MANAGEMENT_COMPONENTS = {GARBAGE_COLLECTOR_NAME}
ALL_COMPONENTS = (
COMPRESSION_COMPONENTS | QUERY_COMPONENTS | UI_COMPONENTS | STORAGE_MANAGEMENT_COMPONENTS
)

# Target names
ALL_TARGET_NAME = ""
Expand All @@ -60,7 +63,8 @@
| {
COMPRESSION_SCHEDULER_COMPONENT_NAME,
QUERY_SCHEDULER_COMPONENT_NAME,
},
}
| STORAGE_MANAGEMENT_COMPONENTS,
}

QUERY_JOBS_TABLE_NAME = "query_jobs"
Expand Down