diff --git a/src/docs.json b/src/docs.json index 5170840788..671ea1073f 100644 --- a/src/docs.json +++ b/src/docs.json @@ -1344,6 +1344,13 @@ "langsmith/cloud" ] }, + { + "group": "Self-hosted cloud architecture", + "pages": [ + "langsmith/aws-self-hosted", + "langsmith/azure-self-hosted" + ] + }, { "group": "Hybrid", "pages": [ diff --git a/src/langsmith/aws-self-hosted.mdx b/src/langsmith/aws-self-hosted.mdx new file mode 100644 index 0000000000..6c3c478d90 --- /dev/null +++ b/src/langsmith/aws-self-hosted.mdx @@ -0,0 +1,108 @@ +--- +title: Self-hosted on AWS +sidebarTitle: AWS +icon: "aws" +--- + +When running LangSmith on [Amazon Web Services (AWS)](https://aws.amazon.com/), you can set up in either [full self-hosted](/langsmith/self-hosted) or [hybrid](/langsmith/hybrid) mode. Full self-hosted mode deploys a complete LangSmith platform with observability functionality as well as the option to create agent deployments. Hybrid mode entails just the infrastructure to run LangSmith-managed agents in a data plane within your cloud, while our SaaS provides the control plane and observability functionality. + +This page provides AWS-specific architecture patterns, service recommendations, and best practices for deploying and operating LangSmith on AWS. + + +LangChain provides Terraform modules specifically for AWS to help provision infrastructure for LangSmith. These modules can quickly set up EKS clusters, RDS, ElastiCache, S3, and networking resources. + +View the [AWS Terraform modules](https://github.com/langchain-ai/terraform/tree/main/modules/aws) for documentation and examples. + + +## Reference architecture + +We recommend leveraging AWS's managed services to provide a scalable, secure, and resilient platform. 
The following architecture applies to both self-hosted and hybrid deployments and aligns with the [AWS Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/): + +![Architecture diagram showing AWS relations to LangSmith services](/langsmith/images/aws-architecture-self-hosted.png) + +- **Ingress & networking**: Requests enter via [Amazon Application Load Balancer (ALB)](https://aws.amazon.com/elasticloadbalancing/application-load-balancer/) within your [VPC](https://aws.amazon.com/vpc/), secured using [AWS WAF](https://aws.amazon.com/waf/) and [IAM](https://aws.amazon.com/iam/)-based authentication. +- **Frontend & backend services:** Containers run on [Amazon EKS](https://aws.amazon.com/eks/), orchestrated behind the ALB. The frontend routes requests to other services within the cluster as necessary. +- **Storage & databases:** + - [Amazon RDS for PostgreSQL](https://aws.amazon.com/rds/postgresql/), optionally using [Aurora](https://aws.amazon.com/rds/aurora/): metadata, projects, users, and short-term and long-term memory for deployed agents. LangSmith supports Postgres version 14 or higher. + - [Amazon ElastiCache (Redis)](https://aws.amazon.com/elasticache/redis/): caching and job queues. ElastiCache must be in single instance mode, running open-source Redis version 5 or higher. + - ClickHouse + [Amazon EBS](https://aws.amazon.com/ebs/): analytics and trace storage. + - We recommend using an [externally managed ClickHouse solution](/langsmith/self-host-external-clickhouse) unless security or compliance reasons + prevent you from doing so. + - ClickHouse is not required for hybrid deployments. + - [Amazon S3](https://aws.amazon.com/s3/): object storage for trace artifacts and telemetry. + +- **LLM integration:** Optionally proxy requests to [Amazon Bedrock](https://aws.amazon.com/bedrock/) or [Amazon SageMaker](https://aws.amazon.com/sagemaker/) for LLM inference. 
+- **Monitoring & observability:** Integrate with [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) + + +## Compute options + +LangSmith supports multiple compute options depending on your requirements: + +| Compute option | Description | Suitable for | +|-----------------|-------------|--------------| +| **Elastic Kubernetes Service (preferred)** | Advanced scaling and multi-tenant support | Large enterprises | +| **EC2-based** | Full control, BYO-infra | Regulated or air-gapped environments | + +## AWS Well-Architected best practices + +This reference is designed to align with the six pillars of the AWS Well-Architected Framework: + +### Operational excellence + +- Automate deployments with IaC ([CloudFormation](https://aws.amazon.com/cloudformation/) / [Terraform](https://www.terraform.io/)). +- Use [AWS Systems Manager Parameter Store](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html) for configuration. +- Configure your LangSmith instance to [export telemetry data](/langsmith/export-backend) and continuously monitor via [CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html). +- The preferred method to manage [LangSmith deployments](/langsmith/deployments) is to create a CI process that builds [Agent Server](/langsmith/agent-server) images and pushes them to [ECR](https://aws.amazon.com/ecr/). Create a test deployment for pull requests before deploying a new revision to staging or production upon PR merge. + +### Security + +- Use [IAM](https://aws.amazon.com/iam/) roles with least-privilege policies. +- Enable encryption at rest ([RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.Encryption.html), [S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingEncryption.html), ClickHouse volumes) and in transit (TLS 1.2+). +- Integrate with [AWS Secrets Manager](https://aws.amazon.com/secrets-manager/) for credentials. 
+- Use [Amazon Cognito](https://aws.amazon.com/cognito/) as an IDP in conjunction with LangSmith's built-in authentication and authorization features to secure access to agents and their tools. + +### Reliability + +- Replicate the LangSmith [data plane](/langsmith/data-plane) across regions: Deploy identical data planes to Kubernetes clusters in different regions for LangSmith Deployment. Deploy [RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZSingleStandby.html) and [ECS](https://aws.amazon.com/ecs/) services across [Multi-AZ](https://aws.amazon.com/about-aws/global-infrastructure/regions_az/). +- Implement [auto-scaling](https://aws.amazon.com/autoscaling/) for backend workers. +- Use [Amazon Route 53](https://aws.amazon.com/route53/) health checks and failover policies. + +### Performance efficiency + +- Leverage [EC2](https://aws.amazon.com/ec2/) instances for optimized compute. +- Use [S3 Intelligent-Tiering](https://aws.amazon.com/s3/storage-classes/intelligent-tiering/) for infrequently accessed trace data. + +### Cost optimization + +- Right-size [EKS](https://aws.amazon.com/eks/) clusters using [Compute Savings Plans](https://aws.amazon.com/savingsplans/compute-pricing/). +- Monitor cost KPIs using [AWS Cost Explorer](https://aws.amazon.com/aws-cost-management/aws-cost-explorer/) dashboards. + +### Sustainability + +- Minimize idle workloads with on-demand compute. +- Store telemetry in low-latency, low-cost tiers. +- Enable auto-shutdown for non-prod environments. + +## Security and compliance + +LangSmith can be configured for: + +- [PrivateLink](https://aws.amazon.com/privatelink/)-only access (no public internet exposure, besides egress necessary for billing). +- [KMS](https://aws.amazon.com/kms/)-based encryption keys for S3, RDS, and EBS. +- Audit logging to [CloudWatch](https://aws.amazon.com/cloudwatch/) and [AWS CloudTrail](https://aws.amazon.com/cloudtrail/). 
+ +Customers can deploy in [GovCloud](https://aws.amazon.com/govcloud-us/), ISO, or HIPAA regions as needed. + +## Monitoring and evals + +Use LangSmith to: + +- Capture traces from LLM apps running on [Bedrock](https://aws.amazon.com/bedrock/) or [SageMaker](https://aws.amazon.com/sagemaker/). +- Evaluate model outputs via [LangSmith datasets](/langsmith/manage-datasets). +- Track latency, token usage, and success rates. + +Integrate with: + +- [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) dashboards. +- [OpenTelemetry](https://opentelemetry.io/) and [Prometheus](https://prometheus.io/) exporters. diff --git a/src/langsmith/azure-self-hosted.mdx b/src/langsmith/azure-self-hosted.mdx new file mode 100644 index 0000000000..9c14e20ea4 --- /dev/null +++ b/src/langsmith/azure-self-hosted.mdx @@ -0,0 +1,157 @@ +--- +title: Self-hosted on Azure +sidebarTitle: Azure +icon: "microsoft" +--- + +When running LangSmith on [Microsoft Azure](https://azure.microsoft.com/), you can set up in either [full self-hosted](/langsmith/self-hosted) or [hybrid](/langsmith/hybrid) mode. Full self-hosted mode deploys a complete LangSmith platform with observability functionality as well as the option to create agent deployments. Hybrid mode entails just the infrastructure to run LangSmith-managed agents in a data plane within your cloud, while our SaaS provides the control plane and observability functionality. + +This page provides Azure-specific architecture patterns, service recommendations, and best practices for deploying and operating LangSmith on Azure. + + +LangChain provides Terraform modules specifically for Azure to help provision infrastructure for LangSmith. These modules can quickly set up AKS clusters, Azure Database for PostgreSQL, Azure Managed Redis, Blob Storage, and networking resources. + +View the [Azure Terraform modules](https://github.com/langchain-ai/terraform/tree/main/modules/azure) for documentation and examples. 
+ + +## Reference architecture + +We recommend using Azure's managed services to provide a scalable, secure, and resilient platform. The following architecture applies to both self-hosted and hybrid deployments: + +![Architecture diagram showing Azure relations to LangSmith services](/langsmith/images/azure-architecture-self-hosted.png) + +- **Client interfaces**: Users interact with LangSmith via a web browser or the LangChain SDK. All traffic terminates at an [Azure Load Balancer](https://azure.microsoft.com/en-us/products/load-balancer/) and is routed to the frontend (NGINX) within the [AKS](https://azure.microsoft.com/en-us/products/kubernetes-service/) cluster before being routed to another service within the cluster if necessary. +- **Storage services**: The platform requires persistent storage for traces, metadata and caching. On Azure the recommended services are: + - **[Azure Database for PostgreSQL (Flexible Server)](https://azure.microsoft.com/en-us/products/postgresql/)** for transactional data (e.g., runs, projects). Azure's high-availability options provision a standby replica in another zone; data is synchronously committed to both primary and standby servers. LangSmith requires PostgreSQL version 14 or higher. + - **[Azure Managed Redis](https://azure.microsoft.com/en-us/products/managed-redis/)** for queues and caching. Best practices include storing small values and breaking large objects into multiple keys, using pipelining to maximize throughput and ensuring the client and server reside in the same region. You can also use [Azure Cache for Redis](https://azure.microsoft.com/en-us/products/cache), running in non-cluster mode. LangSmith requires open-source Redis version 5 or higher. + - **ClickHouse** for high-volume analytics of traces. We recommend using an [externally managed ClickHouse solution](/langsmith/self-host-external-clickhouse). 
If, for security or compliance reasons, that is not an option, deploy a ClickHouse cluster on AKS using the open-source operator. Ensure replication across [availability zones](https://learn.microsoft.com/en-us/azure/reliability/availability-zones-overview) for durability. ClickHouse is not required for a hybrid deployment. + - **[Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs/)** for large artifacts. Use redundant storage configurations such as read-access geo-redundant (RA-GRS) or read-access geo-zone-redundant (RA-GZRS) storage and design applications to read from the secondary region during an outage. + +## Compute and networking on Azure + +### Azure Kubernetes Service (AKS) + +[AKS](https://azure.microsoft.com/en-us/products/kubernetes-service/) is the recommended compute platform for production deployments. This section outlines the key considerations for planning your setup. + +#### Network model + +Use [Azure CNI](https://learn.microsoft.com/en-us/azure/aks/configure-azure-cni) networking for production clusters. This model integrates the cluster into an existing virtual network, assigns IP addresses to each pod and node, and allows direct connectivity to on-premises or other Azure services. Ensure the subnet has enough IPs for nodes and pods, avoid overlapping address ranges, and allocate additional IP space for scale-out events. + +#### Ingress and load balancing + +Use Kubernetes Ingress resources and controllers to distribute HTTP/HTTPS traffic. Ingress controllers operate at layer 7 and can route traffic based on URL paths and handle TLS termination. They reduce the number of public IP addresses compared to layer-4 load balancers. Use the [application routing add-on](https://learn.microsoft.com/en-us/azure/aks/app-routing) for managed NGINX ingress controllers integrated with [Azure DNS](https://azure.microsoft.com/en-us/products/dns/) and [Key Vault](https://azure.microsoft.com/en-us/products/key-vault/) for SSL certificates. 
+ +#### Web Application Firewall (WAF) + +For additional protection against attacks, deploy a [WAF](https://learn.microsoft.com/en-us/azure/web-application-firewall/overview) such as [Azure Application Gateway](https://azure.microsoft.com/en-us/products/application-gateway/). A WAF filters traffic using OWASP rules and can terminate TLS before the traffic reaches your AKS cluster. + +#### Network policies + +Apply [Kubernetes network policies](https://learn.microsoft.com/en-us/azure/aks/use-network-policies) to restrict pod-to-pod traffic and reduce the impact of compromised workloads. Enable network policy support when creating the cluster and design rules based on application connectivity. + +#### High availability + +Configure node pools across [availability zones](https://learn.microsoft.com/en-us/azure/reliability/availability-zones-overview) and use Pod Disruption Budgets (PDB) and multiple replicas for all deployments. Set pod resource requests and limits; the [AKS resource management best practices](https://learn.microsoft.com/en-us/azure/aks/developer-best-practices-resource-management) recommend setting CPU and memory limits to prevent pods from consuming all resources. Use [Cluster Autoscaler](https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler) and [Vertical Pod Autoscaler](https://learn.microsoft.com/en-us/azure/aks/vertical-pod-autoscaler) to scale node pools and adjust pod resources automatically. + +### Networking and identity + +#### Virtual network integration + +Deploy AKS into its own [virtual network](https://azure.microsoft.com/en-us/products/virtual-network/) and create separate subnets for the cluster, database, Redis, and storage endpoints. Use [Private Link](https://azure.microsoft.com/en-us/products/private-link/) and [service endpoints](https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-service-endpoints-overview) to keep traffic within your virtual network and avoid exposure to the public internet. 
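The subnet guidance above (separate subnets per service, no overlapping ranges, enough addresses for nodes and pods) can be sanity-checked before provisioning. The following is a minimal sketch using only Python's standard `ipaddress` module. The address plan, the helper names, and the sizing rule (one address per node plus one per potential pod, with one surge node for upgrades) are illustrative assumptions, as is the count of five reserved addresses per Azure subnet; verify these against the current Azure CNI documentation for your networking mode.

```python
import ipaddress
from itertools import combinations

# Assumption: Azure reserves five addresses in every subnet.
AZURE_RESERVED_PER_SUBNET = 5


def find_overlaps(subnets):
    """Return sorted name pairs whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


def usable_addresses(cidr, reserved=AZURE_RESERVED_PER_SUBNET):
    """Addresses left in a subnet after the platform's reserved ones."""
    return max(ipaddress.ip_network(cidr).num_addresses - reserved, 0)


def required_addresses(nodes, max_pods_per_node, surge_nodes=1):
    """Assumed sizing rule: one IP per node plus one per potential pod,
    including headroom for surge nodes created during upgrades."""
    return (nodes + surge_nodes) * (1 + max_pods_per_node)


# Hypothetical address plan for one virtual network.
PLAN = {
    "aks": "10.10.0.0/20",
    "postgres": "10.10.16.0/24",
    "redis": "10.10.17.0/24",
    "storage": "10.10.18.0/24",
}

if __name__ == "__main__":
    print("overlaps:", find_overlaps(PLAN))
    needed = required_addresses(nodes=50, max_pods_per_node=30)
    available = usable_addresses(PLAN["aks"])
    print(f"aks subnet: need {needed}, have {available}")
```

Running a check like this in CI alongside the Terraform plan is one way to catch an undersized or overlapping subnet before a scale-out event exposes it.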
+ +#### Authentication + +Integrate LangSmith with [Microsoft Entra ID](https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id) (Azure AD) for single sign-on. Use Azure AD OAuth2 for bearer tokens and assign roles to control access to the UI and API. + +## Storage and data services + +### Azure Database for PostgreSQL + +#### High availability + +Use [Flexible Server](https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/overview) with high-availability mode. Azure provisions a standby replica either within the same availability zone (zonal) or across zones (zone-redundant). Data is synchronously committed to both the primary and standby servers, ensuring that committed data is not lost. Zone-redundant configurations place the standby in a different zone to protect against zone outages but may add write latency. + +#### Backups and disaster recovery + +Enable [automatic backups](https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-backup-restore) and configure geo-redundant backup storage to protect against region-wide outages. For critical applications, create read replicas in a secondary region. + +#### Scaling + +Choose an appropriate SKU that matches your workload; Flexible Server allows scaling compute and storage independently. Monitor metrics and configure alerts through [Azure Monitor](https://azure.microsoft.com/en-us/products/monitor/). + +### Azure Managed Redis + +#### Persistence and redundancy + +Choose a tier that provides replication and persistence. Configure Redis persistence or data backup for durability. For high-availability, use [active geo-replication](https://learn.microsoft.com/en-us/azure/redis/how-to-active-geo-replication) or zone-redundant caches depending on the tier. + +### ClickHouse on Azure + +ClickHouse is used for analytical workloads (traces and feedback). 
If you cannot use an externally managed solution, deploy a ClickHouse cluster on AKS using Helm or the official operator. For resilience, replicate data across nodes and availability zones. Consider using [Azure Disks](https://azure.microsoft.com/en-us/products/storage/disks/) for local storage and mount them as StatefulSets. + +### Azure Blob Storage + +#### Redundancy + +Choose a redundancy configuration based on your recovery objectives. Use [read-access geo-redundant (RA-GRS) or geo-zone-redundant (RA-GZRS) storage](https://learn.microsoft.com/en-us/azure/storage/common/storage-redundancy) and design applications to switch reads to the secondary region during a primary region outage. + +#### Naming and partitioning + +Use naming conventions that improve load balancing across partitions and plan for the maximum number of concurrent clients. Stay within Azure's scalability and capacity targets and partition data across multiple storage accounts if necessary. + +#### Networking + +Access blob storage through [private endpoints](https://learn.microsoft.com/en-us/azure/storage/common/storage-private-endpoints) or by using SAS tokens and CORS rules to enable direct client access. + +## Security and access control + +### Azure Key Vault + +#### Separate vaults per application and environment + +Store secrets such as database connection strings and API keys in [Azure Key Vault](https://azure.microsoft.com/en-us/products/key-vault/). Use a dedicated vault for each application and environment (dev, test, prod) to limit the impact of a security breach. + +#### Access control + +Use the [RBAC permission model](https://learn.microsoft.com/en-us/azure/key-vault/general/rbac-guide) to assign roles at the vault scope and restrict access to required principals. Restrict network access using Private Link and firewalls. 
+ +#### Data protection and logging + +Enable [soft delete and purge protection](https://learn.microsoft.com/en-us/azure/key-vault/general/soft-delete-overview) to prevent accidental deletion. Turn on logging and configure alerts for Key Vault access events. + +### Network security + +#### Ingress isolation + +Expose only the frontend service through the ingress controller or WAF. Other services should be internal and communicate through cluster networking. + +#### RBAC and pod security + +Use [Kubernetes RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) to control who can deploy, modify, or read resources. Enable [pod security admission](https://kubernetes.io/docs/concepts/security/pod-security-admission/) to enforce baseline, restricted, or privileged profiles. + +#### Secrets management + +Mount secrets from Key Vault into pods using [CSI Secret Store](https://learn.microsoft.com/en-us/azure/aks/csi-secrets-store-driver). Avoid storing secrets in environment variables or configuration files. + +## Observability and monitoring + +Configure your LangSmith instance to [export telemetry data](/langsmith/export-backend) so you can use Azure's services to monitor it. + +### Azure Monitor + +Use [Azure Monitor](https://azure.microsoft.com/en-us/products/monitor/) for metrics, logs, and alerting. Proactive monitoring involves configuring alerts on key signals like node CPU/memory utilization, pod status, and service latency. Azure Monitor alerts notify you when predefined thresholds are exceeded. + +### Managed Prometheus and Grafana + +Enable [Azure Monitor managed Prometheus](https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/prometheus-metrics-overview) to collect Kubernetes metrics. Combine it with [Grafana dashboards](https://azure.microsoft.com/en-us/products/managed-grafana/) for visualization. Define service-level objectives (SLOs) and configure alerts accordingly. 
+ +### Container Insights + +Install [Container Insights](https://learn.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-overview) to capture logs and metrics from AKS nodes and pods. Use [Azure Log Analytics workspaces](https://learn.microsoft.com/en-us/azure/azure-monitor/logs/log-analytics-overview) to query and analyze logs. + +### Application logging + +Ensure LangSmith services emit logs to stdout/stderr and forward them via [Fluent Bit](https://fluentbit.io/) or the Azure Monitor agent. + +## Continuous integration +- The preferred method to manage [LangSmith deployments](/langsmith/deployments) is to create a CI process that builds [Agent Server](/langsmith/agent-server) images and pushes them to [Azure Container Registry](https://azure.microsoft.com/en-us/products/container-registry). Create a test deployment for pull requests before deploying a new revision to staging or production upon PR merge. diff --git a/src/langsmith/images/aws-architecture-self-hosted.png b/src/langsmith/images/aws-architecture-self-hosted.png new file mode 100644 index 0000000000..097960e999 Binary files /dev/null and b/src/langsmith/images/aws-architecture-self-hosted.png differ diff --git a/src/langsmith/images/azure-architecture-self-hosted.png b/src/langsmith/images/azure-architecture-self-hosted.png new file mode 100644 index 0000000000..4acc80212e Binary files /dev/null and b/src/langsmith/images/azure-architecture-self-hosted.png differ diff --git a/src/langsmith/kubernetes.mdx b/src/langsmith/kubernetes.mdx index 279570db70..fac5af32ad 100644 --- a/src/langsmith/kubernetes.mdx +++ b/src/langsmith/kubernetes.mdx @@ -20,20 +20,19 @@ For [agent deployment](/langsmith/deployments): To add deployment capabilities, LangChain has successfully tested LangSmith on the following Kubernetes distributions: - Google Kubernetes Engine (GKE) -- Amazon Elastic Kubernetes Service (EKS) -- Azure Kubernetes Service (AKS) +- Amazon Elastic Kubernetes Service (EKS): For 
architecture patterns and best practices, refer to [self-hosting on AWS](/langsmith/aws-self-hosted). +- Azure Kubernetes Service (AKS): For architecture patterns and best practices, refer to [self-hosting on Azure](/langsmith/azure-self-hosted). - OpenShift (4.14+) - Minikube and Kind (for development purposes) -LangChain has several Terraform modules the help in the provisioning of resources for LangSmith. You can find those in the LangChain [public Terraform repo](https://github.com/langchain-ai/terraform). +LangChain provides Terraform modules to help provision infrastructure for LangSmith. These modules can quickly set up Kubernetes clusters, storage, and networking for your deployment. -Supported cloud providers include: +Available modules: +- [AWS Terraform modules](https://github.com/langchain-ai/terraform/tree/main/modules/aws) +- [Azure Terraform modules](https://github.com/langchain-ai/terraform/tree/main/modules/azure) -- [AWS terraform modules](https://github.com/langchain-ai/terraform/tree/main/modules/aws) -- [Azure terraform modules](https://github.com/langchain-ai/terraform/tree/main/modules/azure) - -You can click on the links above to see the documentation for each module. These modules are designed to help you quickly set up the necessary infrastructure for LangSmith, including Kubernetes clusters, storage, and networking. +View the [full Terraform repository](https://github.com/langchain-ai/terraform) for documentation and additional resources. ## Prerequisites @@ -129,7 +128,7 @@ For more information, refer to the following setup guides for external services: 3. You can see a full list of configuration options in the `values.yaml` file in the Helm Chart repository here: [LangSmith Helm Chart](https://github.com/langchain-ai/helm/tree/main/charts/langsmith/values.yaml) - Only override the settings you need in `langsmith_config.yaml`; don’t copy the entire `values.yaml`. 
+ Only override the settings you need in `langsmith_config.yaml`; don’t copy the entire `values.yaml`. Keeping your config minimal ensures you continue to inherit new defaults and upgrades from the Helm chart.
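The advice to keep `langsmith_config.yaml` minimal can be illustrated with a small sketch. Assuming the chart defaults and your desired configuration are both loaded as nested dictionaries, the function below keeps only the values that differ from the defaults, which is what a minimal override file should contain. The keys shown (`config.hostname`, `config.logLevel`, `postgres.external.enabled`) are hypothetical placeholders, not confirmed keys of the LangSmith Helm chart.

```python
def minimal_overrides(defaults, desired):
    """Return only the entries of `desired` that differ from `defaults`.

    Nested dictionaries are compared recursively so that an unchanged
    sibling key is not copied into the override.
    """
    overrides = {}
    for key, value in desired.items():
        base = defaults.get(key)
        if isinstance(value, dict) and isinstance(base, dict):
            nested = minimal_overrides(base, value)
            if nested:
                overrides[key] = nested
        elif key not in defaults or value != base:
            overrides[key] = value
    return overrides


# Hypothetical chart defaults and desired settings (illustrative keys only).
DEFAULTS = {
    "config": {"hostname": "", "logLevel": "info"},
    "postgres": {"external": {"enabled": False}},
}
DESIRED = {
    "config": {"hostname": "langsmith.example.com", "logLevel": "info"},
    "postgres": {"external": {"enabled": True}},
}

if __name__ == "__main__":
    # Only the hostname and the external Postgres flag differ from the defaults.
    print(minimal_overrides(DEFAULTS, DESIRED))
```

Because unchanged keys such as `logLevel` are left out, a later chart release that changes that default is picked up automatically instead of being pinned by a stale copy.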