feat: add Helm chart and Dockerfile for deploying Latitude to Kubernetes#2434

Draft
geclos wants to merge 18 commits into main from feat/helmchart_2

Conversation


geclos commented Mar 14, 2026

Adds a production-ready Helm chart under charts/latitude/ that deploys all four services: web (TanStack Start SSR), api (Hono HTTP), ingest (telemetry ingestion), and workers (Redpanda background jobs).

Includes a multi-stage Dockerfile with per-service build targets (api, web, ingest, workers, migrations) and a health endpoint for the web app.

Key design decisions:

  • All four app services deployed with health probes, optional HPA/PDB/Ingress
  • Secret management supports both inline values (dev/test) and an external pre-existing Kubernetes Secret (production) via existingSecret
  • ConfigMap and Secret cover all LAT_* env vars including Kafka/Redpanda, object storage (S3), Weaviate, ClickHouse migration vars, and admin DB URL
  • Pre-install/pre-upgrade migration Job runs Postgres (drizzle-kit), ClickHouse (goose), and Weaviate migrations before app pods roll out
  • Workers deployment uses extended terminationGracePeriodSeconds for in-flight job completion
  • Pod annotations include config/secret checksums to trigger rolling restarts on configuration changes
  • VITE_LAT_* vars documented as build-time (client bundle) with SSR runtime fallback
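
The checksum-annotation pattern from the list above can be sketched as a deployment-template fragment (a sketch only; the actual template paths and file names in charts/latitude/ may differ):

```yaml
# Hypothetical Helm deployment template fragment: hashing the rendered
# ConfigMap/Secret into pod annotations forces a rolling restart whenever
# their contents change.
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        checksum/secret: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
```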

Made-with: Cursor

geclos force-pushed the feat/helmchart_2 branch from f9c9e7f to 4742308 on March 16, 2026 at 09:58

geclos commented Mar 16, 2026

@cursoragent review


cursor bot commented Mar 16, 2026

The GitHub token doesn't have write access to post comments. I'll share the review directly here instead.


PR Review: Helm Chart and Dockerfile for Kubernetes Deployment

Good foundation -- the chart structure, component labeling, checksum-based rollout triggers, existingSecret pattern, and optional HPA/PDB/Ingress are all solid choices. Below are the issues I found, ordered by severity.


Critical

1. Workers only handle SIGINT, not SIGTERM -- broken Kubernetes graceful shutdown
apps/workers/src/server.ts:97 registers a handler for SIGINT only. Kubernetes sends SIGTERM for pod termination. Without a SIGTERM handler, the workers container will be force-killed after terminationGracePeriodSeconds (60s), potentially losing in-flight jobs. The API app correctly handles both signals (lines 84-85 in apps/api/src/server.ts).
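
A minimal sketch of the fix, registering one shutdown routine for both signals (`closeWorkers` is a hypothetical stand-in for whatever drain/cleanup logic the workers app already runs on SIGINT):

```typescript
// Sketch only: `closeWorkers` stands in for the existing drain logic
// (finish in-flight jobs, close Kafka consumers, etc.).
async function closeWorkers(): Promise<void> {
  // drain in-flight jobs here
}

let shuttingDown = false;

async function shutdown(signal: NodeJS.Signals): Promise<void> {
  if (shuttingDown) return; // ignore repeated signals during drain
  shuttingDown = true;
  console.log(`received ${signal}, draining in-flight jobs`);
  await closeWorkers();
  process.exit(0);
}

// Kubernetes sends SIGTERM on pod termination; SIGINT covers local Ctrl-C.
for (const signal of ["SIGTERM", "SIGINT"] as const) {
  process.on(signal, () => void shutdown(signal));
}
```

With both handlers registered, the drain completes within terminationGracePeriodSeconds instead of the container being force-killed.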

2. Removing build scripts from shared packages may break the web Docker image
packages/observability/package.json, packages/ui/package.json, packages/utils/package.json had their "build": "tsc -p tsconfig.json" scripts removed. The Dockerfile's web and migrations stages run pnpm build, then sed rewrites "main": "src/index.ts" to "main": "dist/index.js", then deletes packages/*/src. With no build script, no dist/ is produced for these packages. If TanStack Start's SSR externalizes any of them, the web app will fail at runtime with module-not-found errors.

3. .env.example not updated for LAT_CLICKHOUSE_* env var rename
packages/platform/db-clickhouse/src/client.ts now reads LAT_CLICKHOUSE_URL, LAT_CLICKHOUSE_USER, LAT_CLICKHOUSE_PASSWORD, LAT_CLICKHOUSE_DB via parseEnv, but .env.example only has the un-prefixed CLICKHOUSE_* vars. Anyone setting up from .env.example gets MissingEnvValueError at startup. The LAT_CLICKHOUSE_* vars must be added to the "Latitude Application" section (keep the existing un-prefixed ones for Docker container init).

4. COPY --from=deps /app/apps/*/node_modules ./ glob is broken
Dockerfile:48 -- Docker's COPY --from does not reliably support shell-style globs against multi-stage paths. This either copies nothing or copies to the wrong location (flattened into the workdir root, not preserving the apps/<name>/node_modules/ structure). This likely explains the redundant second pnpm install on lines 51-52.

5. Copy-then-delete anti-pattern inflates images
Dockerfile:73, 93, 114, 136, 177 -- Every final stage does COPY --from=build /app ./ then RUN rm -rf .... Docker layers are additive; the COPY layer permanently contains all bytes. The rm adds whiteout entries but does not reclaim space. Use selective COPY --from=build for only the needed artifacts instead.
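
A sketch of the selective-copy alternative (stage names and artifact paths are assumed, not taken from the actual Dockerfile):

```dockerfile
# Sketch: copy only the artifacts the service needs; the unwanted files
# never enter a layer, so the image genuinely shrinks.
FROM base AS api
WORKDIR /app
COPY --from=build /app/apps/api/dist ./apps/api/dist
COPY --from=build /app/apps/api/package.json ./apps/api/package.json
COPY --from=build /app/node_modules ./node_modules
CMD ["node", "apps/api/dist/index.js"]
```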


Major

6. All containers run as root
No USER instruction exists in any runtime stage. This violates least privilege and most Kubernetes Pod Security Admission policies will reject these pods. Add a non-root user in the base stage and switch to it in each final stage.
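
A possible shape for the fix, assuming an Alpine-based image (Debian-based images would use groupadd/useradd; the user/group names and IDs here are suggestions):

```dockerfile
# In the base stage: create a dedicated non-root user.
RUN addgroup -S -g 1001 latitude && adduser -S -u 1001 -G latitude latitude

# In each final stage, after COPYing artifacts:
USER latitude
```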

7. LAT_WORKERS_HEALTH_PORT bypasses parseEnv and is missing from .env.example
apps/workers/src/server.ts:30 uses Number(process.env.LAT_WORKERS_HEALTH_PORT) || 9090 directly. Per AGENTS.md convention, all LAT_-prefixed vars must use parseEnv from @platform/env, and every new env var must be added to .env.example.

8. ServiceAccount is a Helm hook but deployments reference it as a normal resource
serviceaccount.yaml has helm.sh/hook: pre-install,pre-upgrade and hook-weight: "-10". Helm hooks are managed separately from the release. This creates lifecycle ordering risks -- the SA could be garbage-collected between upgrades. Either remove the hook annotations (let the SA be a normal release resource) or ensure hook-delete-policy is set appropriately.

9. Redis host/port are in the Secret but are not sensitive values
secret.yaml lines 14-15 put LAT_REDIS_HOST and LAT_REDIS_PORT in the Secret. These are connection parameters (e.g., redis.default.svc.cluster.local:6379), not credentials. They belong in the ConfigMap. This also means existingSecret documentation tells users to put LAT_REDIS_HOST and LAT_REDIS_PORT into their external Secret, which is awkward for production.

10. Migration job has no activeDeadlineSeconds
migrations-job.yaml -- If a migration hangs (e.g., waiting on a database lock), the job will never time out. Add activeDeadlineSeconds (e.g., 600) at the Job spec level.
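
The timeout goes at the Job spec level, for example (the 600s value is the suggestion above, not a measured migration duration):

```yaml
apiVersion: batch/v1
kind: Job
spec:
  activeDeadlineSeconds: 600  # fail the Job if migrations exceed 10 minutes
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
```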

11. All env vars (including secrets) are injected into every container
Every deployment uses both envFrom: configMapRef and envFrom: secretRef, meaning the web frontend receives database passwords, Stripe keys, etc. Consider per-service ConfigMaps/Secrets or at minimum document the trade-off.

12. Cache-busting packages/ copy defeats dependency caching
Dockerfile:31 -- COPY packages/ /tmp/packages-src/ copies all source code, tests, configs from every package. Any change to any file in packages/ invalidates the Docker cache at this layer, forcing a full pnpm install re-run. Use a more targeted approach to extract only package.json files.
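
One targeted approach is BuildKit's `COPY --parents` (labs syntax channel), which preserves directory structure while globbing, so only manifest changes invalidate the install cache (workspace file names assumed from a standard pnpm layout):

```dockerfile
# syntax=docker/dockerfile:1.7-labs
# Sketch: copy only dependency manifests before pnpm install, so edits to
# package source code do not bust the install layer's cache.
COPY --parents pnpm-lock.yaml pnpm-workspace.yaml package.json \
    packages/*/package.json apps/*/package.json ./
RUN pnpm install --frozen-lockfile
```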

13. No securityContext / podSecurityContext options in the chart
None of the deployments set runAsNonRoot, readOnlyRootFilesystem, or allowPrivilegeEscalation: false. There are no values for users to configure these. Significant gap for production hardening and PSA compliance.
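
A sketch of sensible values.yaml defaults (key names are conventional Helm-chart suggestions, not taken from this chart):

```yaml
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1001
  seccompProfile:
    type: RuntimeDefault
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
```

These defaults satisfy the Kubernetes Pod Security "restricted" profile; readOnlyRootFilesystem may additionally require an emptyDir mount for any path the apps write to.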

14. Missing .npmrc in deps stage
Dockerfile:23 -- The root .npmrc contains strict-peer-dependencies=false but is not copied before pnpm install. This may cause install failures or different resolution results.


Minor

15. target: "node20" in tsup configs but project requires Node 25
All three tsup.config.ts files specify target: "node20". Should be target: "node25".

16. curl=7.* version pin will break on base image update
Dockerfile:163 -- Remove the version constraint since curl is purged immediately after use.

17. Redundant test file deletes
The first find . -name "*.ts" ... -delete already removes all .ts files; subsequent .test.ts and .spec.ts finds match zero files.

18. Fragile sed rewrite misses exports field
@repo/ui uses an exports map with "import": "./src/index.ts". The sed pattern only rewrites "main", not "exports" entries.

19. import.meta.url guard comment is misleading
tsup/esbuild shims import.meta.url in CJS output, so the guard is always truthy. The code works correctly (inner existsSync handles missing files), but the comment is misleading.

20. ClickHouse migration vars pollute all pods
CLICKHOUSE_MIGRATION_URL, CLICKHOUSE_USER, etc. are injected into all services via the ConfigMap but are only needed by the migration job.

21. kind/ setup missing critical dependencies

  • Weaviate is not deployed but migration job runs wv:migrate -- will fail
  • Redpanda topics (domain-events, domain-events-dlq, span-ingestion) are not created -- workers/ingest will fail
  • Postgres latitude_app role is not created -- RLS won't work correctly

Suggestions

  • Add optional securityContext / podSecurityContext values with sensible defaults
  • Add startupProbe support for services with slow startup
  • Add nameOverride and fullnameOverride to values.yaml
  • Add a helm test connection pod
  • Consider format: ["esm"] in tsup configs to avoid CJS shim surface
  • Consider splitting migration job per database for independent failure handling
  • Add app.kubernetes.io/instance to the base latitude.labels template
  • Add memory-based HPA metric support alongside CPU

Overall this is a well-structured PR with good patterns. The critical items (SIGTERM handling, package build script removal, .env.example gap, and Docker layer issues) should be addressed before merge. The major items around security (root containers, security context) and Helm lifecycle (SA hook, migration timeout) are important for production readiness.
