TiKV Fails to Connect to PD: "validate PD endpoints failed" in Self-Managed Kubernetes Deployment #66524
Replies: 3 comments
Original author: TiDB-AI-Assistant

Certainly! Here's a comprehensive, actionable troubleshooting guide for TiKV "validate PD endpoints failed" errors in a self-managed TiDB cluster on Kubernetes, particularly taking into account issues around leftover PVC data, internal PD metadata mismatches, and TiDB Operator v1.6.1 nuances.

1. Root Cause Analysis

Background: "validate PD endpoints failed" in TiKV

This error occurs when TiKV attempts to join or reconnect to the PD (Placement Driver) service but encounters endpoint verification failures. The most common root causes of this issue in Kubernetes-managed TiDB clusters are:

a) Stale PVC (Persistent Volume Claim) Data
b) PD Metadata and Cluster Lifecycle
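One way to confirm a cluster ID mismatch between stale TiKV data and the current PD cluster is to query PD's HTTP API and compare the result against what TiKV logs at startup. A hedged sketch, assuming the pod names from the "basic" example cluster in the `default` namespace (adjust to your deployment); the commands require access to a running cluster:

```shell
# Query PD's HTTP API for the current cluster metadata (including the
# cluster ID). Pod/namespace names assume the "basic" example cluster.
kubectl exec -n default basic-pd-0 -- \
  curl -s http://127.0.0.1:2379/pd/api/v1/cluster

# Compare the "id" field above with the cluster ID TiKV reports at startup.
# A mismatch indicates stale data on the TiKV PVC from a previous cluster.
kubectl logs -n default basic-tikv-0 | grep -i "cluster id"
```

If the IDs differ, TiKV is carrying state from an earlier cluster incarnation and its data volume must be wiped before it can join the new PD cluster.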
2. Step-by-Step Workflow for a Clean, Effective Cluster Reset

To avoid cluster ID or endpoint issues, always follow a thorough cleanup and reinitialization process when tearing down or redeploying a TiDB cluster. Here is a safe, best-practice workflow:

a) Scale Down and Delete the TiDB Cluster
b) Remove Stateful Data: PVCs and PVs

Critical step: make sure to delete all PVCs related to the cluster, especially those for PD and TiKV:
c) Remove the Namespace (if not used for other clusters)
d) Re-deploy the TiDB Cluster
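The lettered steps above can be sketched as a single teardown-and-redeploy sequence. This is a sketch, not a definitive procedure: it assumes a cluster named `basic` in the `default` namespace, managed by TiDB Operator (which labels its resources with `app.kubernetes.io/instance=<cluster-name>`), and a local manifest file named `tidb-cluster.yaml` — adjust all of these to match your deployment:

```shell
# a) Delete the TidbCluster object; TiDB Operator tears down the pods.
kubectl delete tidbcluster basic -n default

# b) Critical: delete the PVCs so PD/TiKV cannot pick up a stale cluster ID.
kubectl delete pvc -n default -l app.kubernetes.io/instance=basic
# If the underlying PVs use a Retain reclaim policy, delete them explicitly:
# kubectl delete pv <pv-name>

# c) Optionally remove the namespace if no other workloads use it
# (never delete the "default" namespace; shown for a dedicated one):
# kubectl delete namespace tidb-cluster

# d) Re-deploy the cluster from a clean state.
kubectl apply -f tidb-cluster.yaml -n default
```

Deleting the PVCs in step (b) is the step most often skipped; without it, the new PD cluster generates a fresh cluster ID while TiKV's old volume still references the previous one.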
Reference: Official guide on cluster destroy and cleanup

3. Version-Specific Tips, Debugging, and Prevention for v1.6.1

Key Recommendations for TiDB Operator v1.6.1
Endpoint Validation Debugging Steps
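As a concrete starting point for endpoint validation debugging, check DNS resolution and PD reachability from inside the TiKV pod. A sketch assuming the "basic" example names; note that the TiKV image may not ship `curl` or `getent`, in which case run the same checks from a throwaway debug pod on the same network:

```shell
# Resolve the PD service name exactly as TiKV would (names assume the
# "basic" example cluster in the default namespace).
kubectl exec -n default basic-tikv-0 -- getent hosts basic-pd

# Hit PD's health and members endpoints through the same service DNS name.
kubectl exec -n default basic-tikv-0 -- curl -s http://basic-pd:2379/pd/api/v1/health
kubectl exec -n default basic-tikv-0 -- curl -s http://basic-pd:2379/pd/api/v1/members
```

If DNS resolves and both endpoints answer but TiKV still fails validation, the problem is usually not network reachability but metadata (cluster ID) carried over on the TiKV volume.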
Cluster Reinstall/Reset Tips
Operator/PD-specific nuances
Summary Table — Troubleshooting and Clean Reset
Further Reading and References
Conclusion: Preventing Future Endpoint Mismatches
By following the above analysis and workflow, you will ensure that all TiDB/TiKV/PD components initialize cleanly, preventing endpoint validation errors rooted in stale data.
Original author: Hazel

Hi @hamid, could you try adding this env variable to the TiKV component of your TiDB cluster and see if it helps resolve the issue?

env:
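The specific variable was not preserved in this export (see the linked issue for the details), but for reference, per-component environment variables are set under `spec.tikv.env` in the TidbCluster manifest. A sketch with a hypothetical placeholder name:

```yaml
# Fragment of a TidbCluster manifest showing where TiKV env vars go.
spec:
  tikv:
    env:
      # EXAMPLE_VARIABLE is a placeholder; substitute the variable
      # recommended in the linked tidb-operator issue.
      - name: EXAMPLE_VARIABLE
        value: "example"
```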
For your reference: tikv start with error: connect to pd failed, but check network is ok. · Issue #5372 · pingcap/tidb-operator · GitHub
Original author: hamid

Dear @Hazel, thank you for your help.
Original author: hamid
Original time: 2025-05-28 07:51:13.61695
Original ID/Slug: 10497 / tikv-fails-to-connect-to-pd-validate-pd-endpoints-failed-in-self-managed-kubernetes-deployment
Hi everyone,
I’m deploying a self-managed TiDB cluster on Kubernetes (offline, no NetworkPolicy) using the basic cluster example from PingCAP’s GitHub repo, with TiDB Operator version v1.6.1.
I’m encountering an issue where TiKV cannot connect to the PD endpoint.
TiKV log snippet:

```
[2025/05/28 07:09:46.898 +00:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=http://basic-pd:2379] [thread_id=1]
[2025/05/28 07:09:46.899 +00:00] [ERROR] [util.rs:634] ["validate PD endpoints failed"] [err="Other(\"[components/pd_client/src/util.rs:634]: …"]
```
PD log shows it's advertising the client URL as:

```
advertise-client-urls="[http://basic-pd-0.basic-pd-peer.default.svc:2379]"
```
I've already tried updating the PD config to:

```
advertise-client-urls = "http://basic-pd:2379"
```

and confirmed via logs:

```
advertise-client-urls="[http://basic-pd:2379]"
```
But the error persists.
I’ve also checked:
Why does TiKV still fail with validate PD endpoints failed even though PD is reachable and advertising the correct URL?
Could leftover PVC data or internal PD metadata still be causing an endpoint mismatch?
What’s the recommended workflow to cleanly reset a TiDB cluster deployment in Kubernetes to avoid these kinds of issues?
Any advice, known bugs in v1.6.1, or debugging steps I might have missed would be greatly appreciated.