diff --git a/src/docs.json b/src/docs.json
index 5170840788..9c1d6dfdf0 100644
--- a/src/docs.json
+++ b/src/docs.json
@@ -1228,6 +1228,12 @@
           "langsmith/deploy-standalone-server"
         ]
       },
+      {
+        "group": "Troubleshooting",
+        "pages": [
+          "langsmith/diagnostics-self-hosted"
+        ]
+      },
       {
         "group": "App development",
         "pages": [
diff --git a/src/langsmith/diagnostics-self-hosted.mdx b/src/langsmith/diagnostics-self-hosted.mdx
new file mode 100644
index 0000000000..5440c93bc1
--- /dev/null
+++ b/src/langsmith/diagnostics-self-hosted.mdx
@@ -0,0 +1,184 @@
+---
+title: Diagnostic steps for self-hosted deployments
+sidebarTitle: Diagnostics
+---
+
+This page provides diagnostic steps to help you troubleshoot issues with self-hosted [LangSmith Deployment](/langsmith/deployments) before reaching out to support. Follow these steps systematically to identify and resolve common deployment issues.
+
+
+If you complete these diagnostic steps and still need assistance, refer to [Support](#support) at the end of this guide for information on what to gather before reaching out.
+
+
+## Prerequisites
+
+Before beginning the diagnostic steps, ensure you have:
+
+- `kubectl` access to your Kubernetes cluster.
+- Appropriate permissions to view pods, deployments, and services.
+- Familiarity with your Helm chart configuration.
+
+## Step 1. Understand your deployment
+
+After installing the Helm chart(s), verify what was deployed and understand the baseline state of your system. This helps you recognize what normal operation looks like and identify deviations when issues occur.
+
+### List deployed resources
+
+Run the following commands to view all deployed resources.
+
+List all Kubernetes deployments:
+
+```bash
+kubectl get deployments
+```
+
+List all pods:
+
+```bash
+kubectl get pods
+```
+
+List all services:
+
+```bash
+kubectl get services
+```
+
+List all `lgps` resources (only relevant after creating an [Agent Server](/langsmith/agent-server)):
+
+```bash
+kubectl get lgps
+```
+
+### Key deployed components
+
+Your deployment includes the following core components:
+
+- **`langsmith-frontend`**: The LangSmith frontend UI where you create Agent Server deployments. This app makes API calls to `langsmith-host-backend`. Part of the [control plane](/langsmith/control-plane).
+- **`langsmith-host-backend`**: The LangSmith Deployment [control plane](/langsmith/control-plane) that receives requests from `langsmith-frontend` and persists deployment requests to the control plane Postgres database.
+- **`langsmith-listener`**: Part of the LangSmith Deployment [data plane](/langsmith/data-plane). Polls `langsmith-host-backend` via HTTP API for deployments to create, update, or delete. Enqueues tasks for worker processes to handle.
+- **`langsmith-redis`**: The [Redis](/langsmith/data-plane#redis) instance serving as the task queue for `langsmith-listener`. The listener enqueues tasks here and workers pull tasks from this queue.
+- **`langsmith-operator`**: The `lgps` Kubernetes operator that reconciles underlying Kubernetes resources for `lgps` resources. Part of the data plane infrastructure.
+
+
+This list is not exhaustive. Additional components may be present in your deployment depending on your configuration. For a complete overview, refer to [LangSmith Deployment components](/langsmith/components).
+
+
+### Review application logs
+
+Tail the logs of each pod to understand baseline behavior:
+
+```bash
+kubectl logs -f <pod-name>
+```
+
+#### What to look for in logs
+
+First, enable `DEBUG` level logs (see Step 2 below). Then look for these log lines:
+
+- **`langsmith-listener`**: `Reconciling projects...` (appears every 10 seconds)
+- **`langsmith-operator`**: `Starting reconciliation` (appears periodically)
+
+In a healthy deployment, you should not see any errors. All logs should appear normal and routine.
+
+### Next steps
+
+Once you understand the baseline state, create a deployment from the LangSmith frontend. This process will create an `lgps` Kubernetes resource that you can monitor.
+
+## Step 2. Enable debug logging
+
+When troubleshooting issues, the first step is typically to enable debug-level logging to gather more detailed information about what's happening in your system.
+
+### For control plane or data plane deployments
+
+If you are experiencing issues with a control plane deployment (for example, `langsmith-host-backend`) or a data plane deployment (for example, `langsmith-listener`), reinstall the Helm chart with the `LOG_LEVEL=DEBUG` environment variable. Add the following to your `values.yaml` file:
+
+```yaml
+extraEnv:
+  - name: LOG_LEVEL
+    value: DEBUG
+```
+
+### For Agent Server deployments
+
+If the issue is with an individual Agent Server deployment:
+
+1. Navigate to the **Deployments** tab in the [LangSmith UI](https://smith.langchain.com).
+1. On a deployment's view, select **+ New Revision**.
+1. Under **Environment Variables**, add `LOG_LEVEL` with the value `DEBUG`.
+
+
+You can also find debug logs in the UI: on a deployment's view, click **Server Logs** and select **Debug** from the **Log level: Info** dropdown.
+
+
+### For widespread issues
+
+If you are unsure where the issue originates, enable `DEBUG` logging everywhere (control plane, data plane, and all Agent Server deployments).
+
+### Interpret debug logs
+
+Once you have reviewed logs from a period when the system was working correctly and know what normal behavior looks like, look for the following problem indicators:
+
+- Exceptions or stack traces.
+- Error messages (the word `"ERROR"` or red-colored text).
+- Unusual patterns that differ from normal operation.
+
+Based on the errors you find:
+
+- **Infrastructure bug**: If you suspect a bug in the platform infrastructure, raise the issue with LangChain.
+- **Configuration issue**: If you suspect a configuration problem, raise the issue with the person who ran [`helm install`](/langsmith/kubernetes).
+- **User code bug**: If you suspect a bug in user code (for example, the LangGraph OSS graph implementation), raise the issue with the person who created the [`langgraph.json`](/langsmith/application-structure#configuration-file) file.
+
+## Step 3. Describe deployments and pods
+
+Describing Kubernetes resources reveals error events and statuses that may not appear in application logs. These errors are typically caused by configuration or infrastructure issues rather than application code bugs. Describing resources also shows their configuration (such as environment variables), which is helpful for debugging.
+
+Run the following commands to describe your resources.
+
+Describe a Kubernetes deployment:
+
+```bash
+kubectl describe deployment <deployment-name>
+```
+
+Describe a Kubernetes pod:
+
+```bash
+kubectl describe pod <pod-name>
+```
+
+Describe an `lgps` resource (only relevant after creating an Agent Server):
+
+```bash
+kubectl describe lgps <lgps-name>
+```
+
+### Interpret results
+
+Review the `Events:` section of the output and verify that everything is normal. Common issues that appear include:
+
+- Failed liveness or readiness probes
+- Image pull errors
+- Resource constraints (CPU, memory)
+- Volume mount issues
+- Configuration errors
+
+Make sure there are no error events and that all events indicate healthy operation.
+
+## Additional resources
+
+For more troubleshooting information, refer to:
+
+- [Troubleshooting](/langsmith/troubleshooting): General troubleshooting guide with solutions to common issues.
+- [Architectural overview](/langsmith/architectural-overview): Details on system architecture and component interactions.
+- [Self-hosted documentation](/langsmith/self-hosted)
+
+## Support
+
+If you have followed these diagnostic steps and still need assistance, gather the following information before contacting support:
+
+- Output from the [diagnostic steps](#step-1-understand-your-deployment).
+- Your Helm chart configuration.
+- Relevant error messages and logs.
+- Description of what you were trying to do when the issue occurred.
+
+Having this information ready will help the [support](mailto:support@langchain.dev) team diagnose and resolve your issue more quickly.
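The Support checklist asks for output from the diagnostic steps, so one way to gather it is a small helper that runs the `kubectl get` and `kubectl describe` commands from Steps 1 and 3 and saves each result to a file. The following is a minimal sketch, not part of the LangSmith tooling: the function name `collect_diagnostics`, the `KUBECTL` override variable, and the output file names are all assumptions made for illustration. It assumes `kubectl` already points at the right cluster and namespace.

```bash
# Hypothetical helper: bundle the output of the Step 1 and Step 3 commands.
# KUBECTL can be overridden, for example KUBECTL="kubectl -n my-namespace".
KUBECTL="${KUBECTL:-kubectl}"

collect_diagnostics() {
  # Output directory; the default name is an arbitrary choice.
  out_dir="${1:-langsmith-diagnostics}"
  mkdir -p "$out_dir" || return 1

  # Step 1 listings. `lgps` only exists after an Agent Server is created,
  # so errors are captured in the file instead of stopping the run.
  for resource in deployments pods services lgps; do
    $KUBECTL get "$resource" > "$out_dir/get-$resource.txt" 2>&1 || true
  done

  # Step 3 descriptions, which include the Events: section.
  # Without a resource name, kubectl describes every resource of that type.
  for resource in deployments pods lgps; do
    $KUBECTL describe "$resource" > "$out_dir/describe-$resource.txt" 2>&1 || true
  done

  echo "Diagnostics written to $out_dir"
}
```

Calling `collect_diagnostics ./support-bundle` writes one file per command into `./support-bundle`. Pod logs and the Helm chart configuration are left out on purpose, because they depend on pod and release names that vary per installation and may contain secrets that should be reviewed before sharing.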