Commit c8fb4ae

chadcrum and claude authored
fix(ci): fix RHDH OCP Orchestrator Helm e2e job failures (#3929)
* fix(ci): add SSL support for external PostgreSQL in sonataflow database creation

  Signed-off-by: Chad Crum <ccrum@redhat.com>

* fix(ci): remove redundant hardcoded PGSSLMODE export

  PGSSLMODE was being set twice in the database creation and verification pods: hardcoded to 'require' via export, and also injected as an env var from the postgres-cred secret. The export overrode the secret value, making the secret-based env var misleading. Remove the hardcoded export so PGSSLMODE is sourced solely from the secret, consistent with all other connection parameters.

  Signed-off-by: Chad Crum <ccrum@redhat.com>
  Made-with: Cursor

* fix(ci): remove redundant verify_sonataflow_database function

  The verification function spun up a separate pod just to run `\l | grep sonataflow`, but its result was non-blocking: failures logged a warning and continued anyway. The creation function already reports success/failure, making this an unnecessary extra pod and additional CI time for no actionable outcome.

  Signed-off-by: Chad Crum <ccrum@redhat.com>
  Made-with: Cursor

* fix(ci): add timeout to sonataflow database job wait loop

  The loop waiting for the create-sonataflow-database job had no upper bound, so a silent helm install failure would spin indefinitely until the Prow timeout killed the entire CI job. Add a 5-minute cap (60 attempts x 5s) with a clear error message and namespace job listing to aid debugging.

  Signed-off-by: Chad Crum <ccrum@redhat.com>
  Made-with: Cursor

* style: fix prettier formatting in utils.sh

  Signed-off-by: Chad Crum <ccrum@redhat.com>

* fix(ci): parameterize PostgreSQL namespace in RBAC helm deploy

  The externalDBHost was hardcoded to the postgress-external-db namespace, but nightly Prow jobs create it as postgress-external-db-nightly. Use NAME_SPACE_POSTGRES_DB env var via --set override to inject the correct hostname at deploy time.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: Chad Crum <ccrum@redhat.com>

* fix(ci): fail fast when jobs-service rollout times out

  Previously a failed rollout just logged a warning and continued to deploy workflows, wasting CI time on guaranteed failures. Now rbac_deployment returns 1 so the job fails with a clear root cause.

  Signed-off-by: Chad Crum <ccrum@redhat.com>
  Made-with: Cursor

* docs(ci): document perform_helm_install pass-through args

  Signed-off-by: Chad Crum <ccrum@redhat.com>
  Made-with: Cursor

* fix(ci): clean up failed helm create-sonataflow-database job

  The helm chart's create-sonataflow-database job is expected to fail due to missing PGSSLMODE, and we replace it with a manual database creation. Delete the failed job afterwards so it doesn't linger in the namespace and show up in monitoring or alerts.

  Signed-off-by: Chad Crum <ccrum@redhat.com>
  Made-with: Cursor

* fix(ci): restore reactive datasource URL and SSL require mode

  The SonataFlowPlatform patch was accidentally changed: SSL mode was downgraded from 'require' to 'allow' (allowing silent plaintext fallback), the env var was renamed, and QUARKUS_DATASOURCE_REACTIVE_URL was dropped entirely. Restore the full reactive datasource URL with SSL connection params and set SSL mode back to 'require'. Also parameterize the postgres namespace instead of hardcoding it.

  Signed-off-by: Chad Crum <ccrum@redhat.com>
  Made-with: Cursor

* fix(ci): use exact job lookup instead of grep in wait loop

  Replace `oc get jobs | grep | wc -l` with a direct `oc get job/<name>` lookup. This avoids false positives from substring matches and is simpler to read.

  Signed-off-by: Chad Crum <ccrum@redhat.com>
  Made-with: Cursor

* fix(ci): use quoted heredoc with envsubst for pod YAML template

  Replace unquoted heredoc (requiring fragile backslash escaping of k8s env var references) with a quoted heredoc and selective envsubst for the namespace variable. This eliminates the risk of accidental shell expansion corrupting the pod spec.

  Signed-off-by: Chad Crum <ccrum@redhat.com>
  Made-with: Cursor

* fix(ci): pre-delete database creation pod to avoid AlreadyExists

  If a previous run was killed mid-execution, the create-sonataflow-db-manual pod may still exist. Delete it before applying the new pod spec to prevent AlreadyExists failures on retry.

  Signed-off-by: Chad Crum <ccrum@redhat.com>
  Made-with: Cursor

* refactor(ci): convert database creation pod to a k8s Job

  Replace the bare Pod + manual poll/cleanup loop with a proper Job resource. This gives us automatic retries (backoffLimit: 3), built-in TTL cleanup (ttlSecondsAfterFinished: 120), and replaces ~40 lines of hand-rolled polling with a single oc wait --for=condition=complete. Failure vs timeout is distinguished by inspecting job status.

  Signed-off-by: Chad Crum <ccrum@redhat.com>
  Made-with: Cursor

* style: fix prettier formatting in utils.sh

  Signed-off-by: Chad Crum <ccrum@redhat.com>
  Made-with: Cursor

* fix(ci): wait for PostgreSQL readiness instead of sleep 5

  The hardcoded sleep 5 was a race condition: postgres may not be ready when secrets are extracted or when sonataflow services connect. Use oc wait to block until the master pod is actually accepting connections.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: Chad Crum <ccrum@redhat.com>

* fix(ci): poll for PostgreSQL master pod before oc wait

  oc wait fails with "no matching resources found" when the Crunchy PGO operator hasn't created the master pod yet. Add a bounded poll loop (60s timeout) to wait for pod creation before calling oc wait.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: Chad Crum <ccrum@redhat.com>

* style: fix prettier formatting in utils.sh

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: Chad Crum <ccrum@redhat.com>

* fix(ci): increase PostgreSQL pod creation timeout to 300s

  60s was not enough for the Crunchy PGO operator to reconcile and create the master pod after namespace recreation. Increase to 300s to match the oc wait readiness timeout.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  Signed-off-by: Chad Crum <ccrum@redhat.com>

---------

Signed-off-by: Chad Crum <ccrum@redhat.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 3d7b45e commit c8fb4ae

1 file changed: +131 -10 lines changed

.ci/pipelines/utils.sh

Lines changed: 131 additions & 10 deletions
@@ -332,7 +332,21 @@ delete_namespace() {
 configure_external_postgres_db() {
   local project=$1
   oc apply -f "${DIR}/resources/postgres-db/postgres.yaml" --namespace="${NAME_SPACE_POSTGRES_DB}"
-  sleep 5
+
+  echo "Waiting for PostgreSQL master pod to be created..."
+  local max_wait=300
+  local elapsed=0
+  until oc get pod -l postgres-operator.crunchydata.com/role=master -n "${NAME_SPACE_POSTGRES_DB}" -o name 2> /dev/null | grep -q pod; do
+    elapsed=$((elapsed + 5))
+    if [[ $elapsed -ge $max_wait ]]; then
+      echo "ERROR: PostgreSQL master pod not created after ${max_wait}s"
+      return 1
+    fi
+    sleep 5
+  done
+  echo "Waiting for PostgreSQL cluster to be ready..."
+  oc wait --for=condition=Ready pod -l postgres-operator.crunchydata.com/role=master -n "${NAME_SPACE_POSTGRES_DB}" --timeout=180s
+
   oc get secret postgress-external-db-cluster-cert -n "${NAME_SPACE_POSTGRES_DB}" -o jsonpath='{.data.ca\.crt}' | base64 --decode > postgres-ca
   oc get secret postgress-external-db-cluster-cert -n "${NAME_SPACE_POSTGRES_DB}" -o jsonpath='{.data.tls\.crt}' | base64 --decode > postgres-tls-crt
   oc get secret postgress-external-db-cluster-cert -n "${NAME_SPACE_POSTGRES_DB}" -o jsonpath='{.data.tls\.key}' | base64 --decode > postgres-tsl-key
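The bounded wait in the hunk above follows a pattern that could be factored into a reusable helper. The following is a minimal sketch, assuming a hypothetical `poll_until` function that is not part of utils.sh. Like the loop in the diff, it counts only sleep time toward the limit, not the time the probed command itself takes.

```shell
# poll_until MAX_WAIT INTERVAL CMD [ARGS...]
# Hypothetical helper, not part of utils.sh: rerun CMD every INTERVAL seconds
# until it succeeds, giving up once MAX_WAIT seconds of waiting have accrued.
poll_until() {
  local max_wait=$1
  local interval=$2
  shift 2
  local elapsed=0
  until "$@"; do
    elapsed=$((elapsed + interval))
    if [ "$elapsed" -ge "$max_wait" ]; then
      echo "ERROR: condition not met after ${max_wait}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

With such a helper, the pod-creation wait might read `poll_until 300 5 pod_exists`, where `pod_exists` is a hypothetical wrapper around the `oc get pod ... | grep -q pod` pipeline.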
@@ -798,17 +812,97 @@ get_image_helm_set_params() {
 }
 
 # Helper function to perform helm install/upgrade
+# Additional args ($4+) are passed through to helm upgrade
 perform_helm_install() {
   local release_name=$1
   local namespace=$2
   local value_file=$3
+  shift 3
 
   # shellcheck disable=SC2046
   helm upgrade -i "${release_name}" -n "${namespace}" \
     "${HELM_CHART_URL}" --version "${CHART_VERSION}" \
     -f "${DIR}/value_files/${value_file}" \
     --set global.clusterRouterBase="${K8S_CLUSTER_ROUTER_BASE}" \
-    $(get_image_helm_set_params)
+    $(get_image_helm_set_params) \
+    "$@"
+}
+
+# Manually create sonataflow database with SSL support via a k8s Job.
+# Workaround: the helm chart's create-db job doesn't include PGSSLMODE env var.
+create_sonataflow_database_with_ssl() {
+  local namespace=$1
+  local job_name="create-sonataflow-db-manual"
+
+  echo "Manually creating sonataflow database with SSL support..."
+
+  oc delete job "${job_name}" -n "${namespace}" --ignore-not-found=true
+
+  # Quoted heredoc prevents shell expansion; envsubst selectively expands NAMESPACE only
+  NAMESPACE="${namespace}" envsubst '$NAMESPACE' << 'EOF' | oc apply -f -
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: create-sonataflow-db-manual
+  namespace: ${NAMESPACE}
+spec:
+  backoffLimit: 3
+  ttlSecondsAfterFinished: 120
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: psql
+        image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-16.3-1
+        command: ["sh", "-c"]
+        args:
+        - |
+          psql -h ${POSTGRES_HOST} -p ${POSTGRES_PORT} -U ${POSTGRES_USER} -d postgres -c 'CREATE DATABASE sonataflow;' && echo "Database created successfully" || echo "Database creation failed or database already exists"
+        env:
+        - name: POSTGRES_HOST
+          valueFrom:
+            secretKeyRef:
+              name: postgres-cred
+              key: POSTGRES_HOST
+        - name: POSTGRES_USER
+          valueFrom:
+            secretKeyRef:
+              name: postgres-cred
+              key: POSTGRES_USER
+        - name: POSTGRES_PORT
+          valueFrom:
+            secretKeyRef:
+              name: postgres-cred
+              key: POSTGRES_PORT
+        - name: PGPASSWORD
+          valueFrom:
+            secretKeyRef:
+              name: postgres-cred
+              key: POSTGRES_PASSWORD
+        - name: PGSSLMODE
+          valueFrom:
+            secretKeyRef:
+              name: postgres-cred
+              key: PGSSLMODE
+EOF
+
+  echo "Waiting for database creation job to complete..."
+  if ! oc wait --for=condition=complete job/"${job_name}" -n "${namespace}" --timeout=5m 2> /dev/null; then
+    local failed
+    failed=$(oc get job/"${job_name}" -n "${namespace}" -o jsonpath='{.status.failed}' 2> /dev/null)
+    if [[ "${failed}" -gt 0 ]]; then
+      echo "ERROR: Database creation job failed after ${failed} attempt(s)"
+    else
+      echo "ERROR: Database creation job timed out"
+    fi
+    oc logs job/"${job_name}" -n "${namespace}" 2> /dev/null || echo "Could not retrieve logs"
+    oc delete job "${job_name}" -n "${namespace}" --ignore-not-found=true
+    return 1
+  fi
+
+  echo "Database creation output:"
+  oc logs job/"${job_name}" -n "${namespace}" 2> /dev/null || echo "Could not retrieve logs"
+  echo "Manual database creation completed successfully"
 }
 
 base_deployment() {
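The quoted-heredoc technique used for the Job manifest can be seen in isolation. This is a minimal sketch with hypothetical function names and a two-field template made up for illustration. It contrasts a quoted delimiter, an unquoted delimiter, and a quoted delimiter combined with `envsubst` restricted to one variable. The last function assumes `envsubst` (from GNU gettext) is installed.

```shell
# Quoted delimiter ('EOF'): the body is passed through literally, so both
# placeholders survive untouched.
render_quoted() {
  cat << 'EOF'
host: ${POSTGRES_HOST} namespace: ${NAMESPACE}
EOF
}

# Unquoted delimiter (EOF): the calling shell expands every ${...} in the
# body before cat ever sees it.
render_unquoted() {
  cat << EOF
host: ${POSTGRES_HOST} namespace: ${NAMESPACE}
EOF
}

# Quoted delimiter plus envsubst with a variable list: only ${NAMESPACE}
# is substituted; ${POSTGRES_HOST} stays literal for a later consumer.
render_selective() {
  NAMESPACE="$1" envsubst '$NAMESPACE' << 'EOF'
host: ${POSTGRES_HOST} namespace: ${NAMESPACE}
EOF
}
```

In the Job manifest the references left literal are later expanded by the container's `sh -c`, which is why they must not be expanded by the CI script's own shell.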
@@ -834,18 +928,45 @@ rbac_deployment() {
   local rbac_rhdh_base_url="https://${RELEASE_NAME_RBAC}-developer-hub-${NAME_SPACE_RBAC}.${K8S_CLUSTER_ROUTER_BASE}"
   apply_yaml_files "${DIR}" "${NAME_SPACE_RBAC}" "${rbac_rhdh_base_url}"
   echo "Deploying image from repository: ${QUAY_REPO}, TAG_NAME: ${TAG_NAME}, in NAME_SPACE: ${RELEASE_NAME_RBAC}"
-  perform_helm_install "${RELEASE_NAME_RBAC}" "${NAME_SPACE_RBAC}" "${HELM_CHART_RBAC_VALUE_FILE_NAME}"
-
-  # NOTE: This is a workaround to allow the sonataflow platform to connect to the external postgres db using ssl.
-  until [[ $(oc get jobs -n "${NAME_SPACE_RBAC}" 2> /dev/null | grep "${RELEASE_NAME_RBAC}-create-sonataflow-database" | wc -l) -eq 1 ]]; do
-    echo "Waiting for sf db creation job to be created. Retrying in 5 seconds..."
+  perform_helm_install "${RELEASE_NAME_RBAC}" "${NAME_SPACE_RBAC}" "${HELM_CHART_RBAC_VALUE_FILE_NAME}" \
+    --set "orchestrator.sonataflowPlatform.externalDBHost=postgress-external-db-primary.${NAME_SPACE_POSTGRES_DB}.svc.cluster.local"
+
+  # NOTE: The helm chart's create-sonataflow-database job will fail because it doesn't include PGSSLMODE env var.
+  # We wait for the job to be created (indicating helm install is progressing), then manually create the database with SSL.
+  local max_attempts=60
+  local attempt=0
+  until oc get job/"${RELEASE_NAME_RBAC}-create-sonataflow-database" -n "${NAME_SPACE_RBAC}" &> /dev/null; do
+    attempt=$((attempt + 1))
+    if [[ $attempt -ge $max_attempts ]]; then
+      echo "ERROR: Timed out after $((max_attempts * 5))s waiting for ${RELEASE_NAME_RBAC}-create-sonataflow-database job to be created."
+      echo "Helm install may have failed. Current jobs in namespace:"
+      oc get jobs -n "${NAME_SPACE_RBAC}" 2> /dev/null || echo " (unable to list jobs)"
+      return 1
+    fi
+    echo "Waiting for sf db creation job to be created. Retrying in 5 seconds... (attempt ${attempt}/${max_attempts})"
     sleep 5
   done
-  oc wait --for=condition=complete job/"${RELEASE_NAME_RBAC}-create-sonataflow-database" -n "${NAME_SPACE_RBAC}" --timeout=3m
+
+  # Don't wait for the helm job to complete - it will fail due to missing SSL configuration
+  # Instead, manually create the database with proper SSL support
+  create_sonataflow_database_with_ssl "${NAME_SPACE_RBAC}"
+
+  # Clean up the failed helm chart job so it doesn't pollute monitoring/alerts
+  oc delete job "${RELEASE_NAME_RBAC}-create-sonataflow-database" -n "${NAME_SPACE_RBAC}" --ignore-not-found=true
+
+  # Patch the sonataflow platform to configure SSL for the jobs service
+  echo "Patching SonataFlowPlatform with SSL configuration..."
   oc -n "${NAME_SPACE_RBAC}" patch sfp sonataflow-platform --type=merge \
-    -p '{"spec":{"services":{"jobService":{"podTemplate":{"container":{"env":[{"name":"QUARKUS_DATASOURCE_REACTIVE_URL","value":"postgresql://postgress-external-db-primary.postgress-external-db.svc.cluster.local:5432/sonataflow?search_path=jobs-service&sslmode=require&ssl=true&trustAll=true"},{"name":"QUARKUS_DATASOURCE_REACTIVE_SSL_MODE","value":"require"},{"name":"QUARKUS_DATASOURCE_REACTIVE_TRUST_ALL","value":"true"}]}}}}}}'
+    -p '{"spec":{"services":{"jobService":{"podTemplate":{"container":{"env":[{"name":"QUARKUS_DATASOURCE_REACTIVE_URL","value":"postgresql://postgress-external-db-primary.'"${NAME_SPACE_POSTGRES_DB}"'.svc.cluster.local:5432/sonataflow?search_path=jobs-service&sslmode=require&ssl=true&trustAll=true"},{"name":"QUARKUS_DATASOURCE_REACTIVE_SSL_MODE","value":"require"},{"name":"QUARKUS_DATASOURCE_REACTIVE_TRUST_ALL","value":"true"}]}}}}}}'
   oc rollout restart deployment/sonataflow-platform-jobs-service -n "${NAME_SPACE_RBAC}"
 
+  # Wait for jobs-service to be ready before deploying workflows
+  echo "Waiting for jobs-service to be ready..."
+  if ! oc rollout status deployment/sonataflow-platform-jobs-service -n "${NAME_SPACE_RBAC}" --timeout=3m; then
+    echo "ERROR: jobs-service rollout did not complete in time. Cannot deploy workflows without it."
+    return 1
+  fi
+
   # initiate orchestrator workflows deployment
   deploy_orchestrator_workflows "${NAME_SPACE_RBAC}"
 }
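The `--set` override in this hunk relies on `perform_helm_install` forwarding its extra arguments via `shift 3` and `"$@"`. The following is a minimal sketch of that forwarding, assuming a hypothetical `build_helm_args` stand-in that prints one argument per line instead of invoking helm, so the effect of the quoting is visible.

```shell
# Hypothetical stand-in for perform_helm_install: prints each argument it
# would hand to helm on its own line rather than running helm.
build_helm_args() {
  local release_name=$1
  local namespace=$2
  local value_file=$3
  # Drop the three named parameters; "$@" now holds only the extras.
  shift 3
  # Quoted "$@" forwards each extra argument as one word, spaces included.
  printf '%s\n' upgrade -i "${release_name}" -n "${namespace}" -f "${value_file}" "$@"
}
```

For example, `build_helm_args rel ns values.yaml --set 'a=b c'` prints `a=b c` as a single final line, whereas an unquoted `$@` would split it into two words. Note that `shift 3` assumes the caller always supplies at least three arguments.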
@@ -1254,7 +1375,7 @@ deploy_orchestrator_workflows() {
   oc apply -f "${WORKFLOW_MANIFESTS}"
 
   helm repo add orchestrator-workflows https://rhdhorchestrator.io/serverless-workflows
-  helm install greeting orchestrator-workflows/greeting -n "$namespace"
+  helm install greeting orchestrator-workflows/greeting -n "$namespace" --wait --timeout=5m
 
   until [[ $(oc get sf -n "$namespace" --no-headers 2> /dev/null | wc -l) -eq 2 ]]; do
     echo "No sf resources found. Retrying in 5 seconds..."
