ingester and querier pods are getting OOMKilled on stone-prod-p02 #10283
base: main
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: olegbet
Branch updated from af9039e to 1e8d373.
🤖 Pipeline Failure Analysis
Category: Infrastructure
Pipeline failed due to a DNS resolution issue preventing the Prow job from connecting to the OpenShift cluster API server.
📋 Technical Details
Immediate Cause: The …
Contributing Factors: The …
Impact: The inability to resolve the cluster's API server hostname prevented the Prow job from collecting essential audit logs and diagnostic data. This fundamental infrastructure failure also led to the premature termination of the e2e test execution, blocking the successful completion of the job.
🔍 Evidence
appstudio-e2e-tests/gather-audit-logs
Category: …
Logs: …
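Several of these analyses blame DNS resolution of the cluster API hostname. As a hedged illustration, and not something the job actually runs, a pre-flight check like the one below separates "the name does not resolve" from later connection or TLS errors. The default port 6443 (the usual Kubernetes API server port) is an assumption, and any hostname you pass in is your own.

```python
import socket


def can_resolve(hostname: str, port: int = 6443) -> bool:
    """Return True if the system resolver returns at least one address.

    Port 6443 is an assumption (the usual Kubernetes API server port); it
    only shapes the returned socket addresses, not whether the name resolves.
    """
    try:
        # getaddrinfo consults the host's resolver configuration
        # (for example /etc/hosts and the configured DNS servers).
        return len(socket.getaddrinfo(hostname, port)) > 0
    except socket.gaierror:
        # gaierror covers NXDOMAIN, "temporary failure in name resolution", etc.
        return False
```

Calling this with the cluster's API hostname before the first oc or kubectl invocation would indicate whether a step failed at name resolution or further along the connection.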
Change it in all clusters. You can use something like: … to copy the config to all clusters, then remove the file from the base and empty-base folders.
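The reviewer's exact command was not preserved. The sketch below is a minimal Python version of the same idea. It first copies the values file into each cluster directory, and only then deletes it from base and empty-base. The directory layout and the cluster directories passed in are hypothetical; adjust them to the repository's real overlays.

```python
import shutil
from pathlib import Path


def copy_to_clusters(src: Path, cluster_dirs: list[Path]) -> list[Path]:
    """Copy `src` into every directory in `cluster_dirs`; return the new paths.

    The cluster directories are hypothetical inputs, not the repo's real layout.
    """
    copied = []
    for d in cluster_dirs:
        d.mkdir(parents=True, exist_ok=True)
        dest = d / src.name
        shutil.copy2(src, dest)  # preserves file metadata as well as content
        copied.append(dest)
    return copied


def remove_from(dirs: list[Path], filename: str) -> None:
    """Delete `filename` from each directory in `dirs` if it is present."""
    for d in dirs:
        (d / filename).unlink(missing_ok=True)
```

Copy before removing, so the source file still exists when it is read. If a kustomization.yaml in base or empty-base still references the removed file, kustomize build will likely fail, so those references have to move with the file.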
/lgtm
components/vector-kubearchive-log-collector/development/loki-helm-dev-values.yaml
🤖 Pipeline Failure Analysis
Category: Timeout
The Red Hat AppStudio end-to-end tests timed out, preventing the successful completion of the Prow job.
📋 Technical Details
Immediate Cause: The …
Contributing Factors: Analysis of the cluster state reveals several Argo CD Applications and ApplicationSets in an "OutOfSync" or "Missing" state. This indicates potential configuration drift or instability within the deployed applications, which could lead to increased resource consumption or delays during the execution of end-to-end tests that rely on these services. Specific examples include …
Impact: The timeout of the end-to-end tests directly blocked the progression of the Prow job, preventing any subsequent steps from executing and leading to an overall job failure. This hinders the verification of the Red Hat AppStudio infrastructure deployment.
🔍 Evidence
appstudio-e2e-tests/redhat-appstudio-e2e
Category: …
Logs: …
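One way to see which Argo CD Applications are OutOfSync or Missing when a run times out is to filter the JSON from kubectl get applications.argoproj.io -A -o json. The sketch below reads the status.sync.status and status.health.status fields of Argo CD Application resources. ApplicationSets report their state mainly through status conditions and are not covered. This is an illustration, not the tooling the analysis bot used.

```python
def unhealthy_apps(app_list: dict) -> list[tuple[str, str, str]]:
    """List Argo CD Applications that are not both Synced and Healthy.

    `app_list` is the parsed JSON of
    `kubectl get applications.argoproj.io -A -o json` (a List with "items").
    Returns (namespace/name, sync status, health status) tuples.
    """
    problems = []
    for app in app_list.get("items", []):
        meta = app.get("metadata", {})
        status = app.get("status", {})
        # Fields that are absent (e.g. on a brand-new Application) become "Unknown".
        sync = status.get("sync", {}).get("status", "Unknown")
        health = status.get("health", {}).get("status", "Unknown")
        if sync != "Synced" or health != "Healthy":
            name = f"{meta.get('namespace', '')}/{meta.get('name', '')}"
            problems.append((name, sync, health))
    return problems
```

Feeding it the output of json.load(sys.stdin) in a small wrapper script gives a compact list of the offending Applications, which is easier to attach to a failed run than the full resource dump.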
Branch updated from 1e8d373 to 8f5b5fd.
New changes are detected. LGTM label has been removed.
🤖 Pipeline Failure Analysis
Category: Timeout
The Prow job …
📋 Technical Details
Immediate Cause: The …
Contributing Factors: While the direct cause is a timeout, the …
Impact: The timeout prevented the successful execution and completion of the …
🔍 Evidence
appstudio-e2e-tests/redhat-appstudio-e2e
Category: …
Logs: …
🤖 Pipeline Failure Analysis
Category: Timeout
The …
📋 Technical Details
Immediate Cause: The …
Contributing Factors: Several factors within the cluster likely contributed to the extended execution time: …
Impact: The prolonged synchronization and deployment times, caused by the Argo CD and Tekton issues, prevented the E2E tests from executing and completing within the Prow job's timeout limit. This blocked the successful validation of the infrastructure deployment.
🔍 Evidence
appstudio-e2e-tests/redhat-appstudio-e2e
Category: …
Logs: …
Branch updated from 4af92a4 to f597d37.
🤖 Pipeline Failure Analysis
Category: Infrastructure
The Prow job failed due to infrastructure issues causing DNS resolution and network connectivity failures to the OpenShift API server, preventing the execution of e2e tests.
📋 Technical Details
Immediate Cause: The …
Contributing Factors: The …
Impact: The DNS resolution and network connectivity failures prevented the successful collection of diagnostic logs and cluster information by the …
🔍 Evidence
appstudio-e2e-tests/gather-audit-logs
Category: …
Logs: …
🤖 Pipeline Failure Analysis
Category: Infrastructure
The pipeline failed due to persistent DNS resolution errors preventing essential diagnostic and test steps from connecting to the OpenShift API server, indicating a critical infrastructure network configuration issue.
📋 Technical Details
Immediate Cause: Multiple infrastructure steps (…) …
Contributing Factors: The …
Impact: The inability to resolve the cluster API endpoint prevented the successful execution of critical diagnostic data collection steps, such as …
🔍 Evidence
appstudio-e2e-tests/gather-audit-logs
Category: …
Logs: …
🤖 Pipeline Failure Analysis
Category: Timeout
Pipeline failed due to a timeout in the e2e tests, likely caused by widespread cluster instability and resource reconciliation issues indicated by numerous Argo CD ApplicationSet failures.
📋 Technical Details
Immediate Cause: The …
Contributing Factors: The …
Impact: The timeout prevented the completion of the end-to-end tests for this Prow job. The underlying cluster instability, indicated by the numerous failing ApplicationSets and Tekton configurations, suggests that even if the tests had not timed out, they might have failed due to the non-operational state of critical components. This failure blocks the validation of infrastructure deployments and prevents merging of changes that rely on these tests.
🔍 Evidence
appstudio-e2e-tests/redhat-appstudio-e2e
Category: …
Logs: …
🤖 Pipeline Failure Analysis
Category: Timeout
The e2e tests timed out because ArgoCD applications failed to synchronize, likely due to degraded deployment health and unsynced ApplicationSets.
📋 Technical Details
Immediate Cause: The …
Contributing Factors: Analysis of the cluster state reveals that several ArgoCD ApplicationSets ('application-api', 'build-service', 'crossplane-control-plane') are in an …
Impact: The failure of ArgoCD to synchronize its applications and ApplicationSets directly prevented the successful deployment and configuration of the necessary resources for the e2e tests to execute. This resulted in the test step exceeding its timeout limit and ultimately causing the job to fail.
🔍 Evidence
appstudio-e2e-tests/redhat-appstudio-e2e
Category: …
Analysis powered by prow-failure-analysis | Build: …
🤖 Pipeline Failure Analysis
Category: Infrastructure
The AppStudio E2E tests failed due to persistent DNS resolution errors preventing the job from connecting to the OpenShift API server, leading to the failure of diagnostic collection steps and eventual job termination.
📋 Technical Details
Immediate Cause: The immediate cause of the failure was the inability of various Prow job steps (…) …
Contributing Factors: The …
Impact: The DNS resolution failures prevented essential diagnostic information from being collected by the …
🔍 Evidence
appstudio-e2e-tests/gather-audit-logs
Category: …
Logs: …
🤖 Pipeline Failure Analysis
Category: Timeout
The end-to-end tests for AppStudio failed due to exceeding the allocated execution time, preventing the successful validation of the deployment.
📋 Technical Details
Immediate Cause: The …
Contributing Factors: Several environmental issues were observed, including multiple ArgoCD ApplicationSets being in an 'OutOfSync' state and the 'build-service' application showing a 'Degraded' health status. Additionally, there are indications of TektonAddon and TektonConfig not being fully ready, which could impact test execution or the environment the tests are running against. The exact reason for the prolonged test execution is not definitively identified but could stem from these underlying cluster configuration or resource issues leading to slow test progress or infinite loops.
Impact: The timeout failure prevented the completion of the end-to-end test suite. This means the current deployment has not been validated, and any potential issues introduced by the pull request remain undetected, blocking the merge of the change.
🔍 Evidence
appstudio-e2e-tests/redhat-appstudio-e2e
Category: …
Logs: …
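This analysis points at TektonAddon and TektonConfig not being fully ready. A bounded readiness wait fails fast with a specific error instead of letting the whole e2e step run into the Prow timeout. The sketch assumes these resources expose a Knative-style Ready condition in status.conditions, as Tekton operator resources generally do. The fetch callable is a hypothetical hook, for example a wrapper around kubectl get tektonconfig <name> -o json. From the shell, kubectl wait --for=condition=Ready with --timeout serves the same purpose.

```python
import time
from typing import Callable


def is_ready(resource: dict) -> bool:
    """True if status.conditions contains type=Ready with status "True"."""
    for cond in resource.get("status", {}).get("conditions", []) or []:
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False  # no Ready condition reported yet


def wait_until_ready(fetch: Callable[[], dict], timeout_s: float,
                     interval_s: float = 5.0) -> bool:
    """Poll fetch() until the resource is Ready or timeout_s elapses.

    Returns False on timeout so the caller can fail with a specific error
    rather than hitting the outer job timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready(fetch()):
            return True
        if time.monotonic() >= deadline:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval_s, max(0.0, deadline - time.monotonic())))
```

Returning a boolean, rather than raising, leaves the decision to the caller. It can dump the unhealthy resource's conditions before aborting, which makes the failure analysis concrete.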
Branch updated from b737389 to 1010bec.
🤖 Pipeline Failure Analysis
Category: Infrastructure
The pipeline failed due to a DNS resolution failure preventing essential infrastructure gathering steps from connecting to the OpenShift API server.
📋 Technical Details
Immediate Cause: Multiple infrastructure gathering steps, including …
Contributing Factors: The …
Impact: The DNS resolution failure in critical infrastructure gathering steps prevented the pipeline from collecting necessary diagnostic information and likely contributed to the eventual timeout of the main e2e test execution. This blocked the successful completion of the end-to-end test suite for the AppStudio infrastructure.
🔍 Evidence
appstudio-e2e-tests/gather-audit-logs
Category: …
Logs: …
🤖 Pipeline Failure Analysis
Category: Infrastructure
The Prow job failed because multiple infrastructure steps could not resolve the Kubernetes API server's DNS name, preventing connectivity to the cluster.
📋 Technical Details
Immediate Cause: Several infrastructure-related steps, including …
Contributing Factors: The …
Impact: The inability to resolve the Kubernetes API server's DNS name prevented the E2E test job from performing essential data collection and cluster interaction tasks. This fundamental connectivity issue blocked the successful execution of the …
🔍 Evidence
appstudio-e2e-tests/gather-audit-logs
Category: …
Logs: …
🤖 Pipeline Failure Analysis
Category: Infrastructure
The pipeline failed due to a persistent DNS resolution failure preventing access to the OpenShift API server, which blocked the execution of end-to-end tests.
📋 Technical Details
Immediate Cause: Multiple steps within the …
Contributing Factors: The …
Impact: The DNS resolution failure prevented the collection of critical diagnostic information via …
🔍 Evidence
appstudio-e2e-tests/gather-audit-logs
Category: …
Logs: …
🤖 Pipeline Failure Analysis
Category: Infrastructure
The Prow job failed due to an infrastructure issue where the test environment could not resolve the DNS hostname for the cluster API endpoint, preventing essential diagnostic data collection and subsequent test execution.
📋 Technical Details
Immediate Cause: Multiple infrastructure-related steps (…) …
Contributing Factors: The inability to resolve the cluster API endpoint's DNS name suggests a network configuration problem within the test environment, specifically related to DNS resolution services or network accessibility to the DNS server. Some logs also show …
Impact: The failure to resolve the cluster API endpoint prevented the …
🔍 Evidence
appstudio-e2e-tests/gather-audit-logs
Category: …
Logs: …
🤖 Pipeline Failure Analysis
Category: Infrastructure
The E2E test pipeline failed to bootstrap the cluster due to a configuration error during the kustomize build of the …
📋 Technical Details
Immediate Cause: The …
Contributing Factors: The …
Impact: This infrastructure failure directly blocked the E2E test execution. The inability to successfully build and apply the necessary Kubernetes manifests for the …
🔍 Evidence
appstudio-e2e-tests/redhat-appstudio-e2e
Category: …
Logs: …
Signed-off-by: obetsun <[email protected]> Assisted-by: Claude rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Signed-off-by: obetsun <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
…tefulSet Signed-off-by: obetsun <[email protected]> rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED
Branch updated from f19f83b to 57bafde.
🤖 Pipeline Failure Analysis
Category: Infrastructure
The end-to-end tests failed because Argo CD applications could not synchronize due to a persistent volume reclaim policy configuration error, preventing cluster bootstrapping.
📋 Technical Details
Immediate Cause: The Argo CD applications …
Contributing Factors: The failure in Argo CD synchronization directly impacted the bootstrapping process for the end-to-end tests. The following error message suggests an issue with webhook communication, potentially related to the Tekton operator or its dependencies, which is a consequence of the initial synchronization failure:
    error calculating server side diff: serverSideDiff error: error running server side apply in dryrun mode for resource Namespace/konflux-policies: Internal error occurred: failed calling webhook "namespace.operator.tekton.dev": failed to call webhook: Post "https://tekton-operator-proxy-webhook.openshift-pipelines.svc:443/namespace-validation?timeout=10s": no endpoints available for service "tekton-operator-proxy-webhook"
Impact: The inability of Argo CD to synchronize critical applications prevented the cluster from being properly provisioned and configured for the end-to-end tests. As a result, the cluster bootstrapping process failed after reaching the maximum number of attempts, leading to the overall failure of the …
🔍 Evidence
appstudio-e2e-tests/redhat-appstudio-e2e
Category: …
Logs: …
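The webhook error "no endpoints available for service tekton-operator-proxy-webhook" typically means the Service had no ready backing pods when Argo CD ran its server-side dry run. A small check on the core/v1 Endpoints object can confirm this before a sync is retried. The object can be fetched with kubectl get endpoints tekton-operator-proxy-webhook -n openshift-pipelines -o json. The helper below is a hypothetical illustration; only the service and namespace names come from the log.

```python
def ready_endpoint_count(endpoints: dict) -> int:
    """Count ready addresses in a core/v1 Endpoints object (parsed JSON).

    Pods that exist but are not ready are listed under notReadyAddresses and
    are deliberately not counted: the Service cannot route webhook calls to
    them, which is what produces "no endpoints available".
    """
    total = 0
    # "subsets" is omitted entirely when the Service has no endpoints at all.
    for subset in endpoints.get("subsets") or []:
        total += len(subset.get("addresses") or [])
    return total
```

A count of zero while the webhook pod exists usually points at a readiness problem in that pod, rather than at Argo CD itself.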
@olegbet: The following test failed, say
Signed-off-by: obetsun [email protected]
rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED