feat: add Helm chart and Dockerfile for deploying Latitude to Kubernetes#2434

Draft
geclos wants to merge 18 commits into main from feat/helmchart_2

Conversation


geclos commented Mar 14, 2026

Adds a production-ready Helm chart under charts/latitude/ that deploys all four services: web (TanStack Start SSR), api (Hono HTTP), ingest (telemetry ingestion), and workers (Redpanda background jobs).

Includes a multi-stage Dockerfile with per-service build targets (api, web, ingest, workers, migrations) and a health endpoint for the web app.

Key design decisions:

  • All four app services deployed with health probes, optional HPA/PDB/Ingress
  • Secret management supports both inline values (dev/test) and an external pre-existing Kubernetes Secret (production) via existingSecret
  • ConfigMap and Secret cover all LAT_* env vars including Kafka/Redpanda, object storage (S3), Weaviate, ClickHouse migration vars, and admin DB URL
  • Pre-install/pre-upgrade migration Job runs Postgres (drizzle-kit), ClickHouse (goose), and Weaviate migrations before app pods roll out
  • Workers deployment uses extended terminationGracePeriodSeconds for in-flight job completion
  • Pod annotations include config/secret checksums to trigger rolling restarts on configuration changes
  • VITE_LAT_* vars documented as build-time (client bundle) with SSR runtime fallback
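
The checksum-annotation pattern from the list above can be sketched as a deployment-template fragment (a sketch only; the actual template paths and file names in charts/latitude/ may differ):

```yaml
# Hypothetical Helm deployment template fragment: hashing the rendered
# ConfigMap/Secret into pod annotations forces a rolling restart whenever
# their contents change.
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        checksum/secret: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
```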

Made-with: Cursor

geclos force-pushed the feat/helmchart_2 branch from f9c9e7f to 4742308 on March 16, 2026 at 09:58

geclos commented Mar 16, 2026

@cursoragent review


cursor bot commented Mar 16, 2026

The GitHub token doesn't have write access to post comments. I'll share the review directly here instead.


PR Review: Helm Chart and Dockerfile for Kubernetes Deployment

Good foundation -- the chart structure, component labeling, checksum-based rollout triggers, existingSecret pattern, and optional HPA/PDB/Ingress are all solid choices. Below are the issues I found, ordered by severity.


Critical

1. Workers only handle SIGINT, not SIGTERM -- broken Kubernetes graceful shutdown
apps/workers/src/server.ts:97 registers a handler for SIGINT only. Kubernetes sends SIGTERM for pod termination. Without a SIGTERM handler, the workers container will be force-killed after terminationGracePeriodSeconds (60s), potentially losing in-flight jobs. The API app correctly handles both signals (lines 84-85 in apps/api/src/server.ts).
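
A minimal sketch of the fix, registering one shutdown routine for both signals (`closeWorkers` is a hypothetical stand-in for whatever drain/cleanup logic the workers app already runs on SIGINT):

```typescript
// Sketch only: `closeWorkers` stands in for the existing drain logic
// (finish in-flight jobs, close Kafka consumers, etc.).
async function closeWorkers(): Promise<void> {
  // drain in-flight jobs here
}

let shuttingDown = false;

async function shutdown(signal: NodeJS.Signals): Promise<void> {
  if (shuttingDown) return; // ignore repeated signals during drain
  shuttingDown = true;
  console.log(`received ${signal}, draining in-flight jobs`);
  await closeWorkers();
  process.exit(0);
}

// Kubernetes sends SIGTERM on pod termination; SIGINT covers local Ctrl-C.
for (const signal of ["SIGTERM", "SIGINT"] as const) {
  process.on(signal, () => void shutdown(signal));
}
```

With both handlers registered, the drain completes within terminationGracePeriodSeconds instead of the container being force-killed.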

2. Removing build scripts from shared packages may break the web Docker image
packages/observability/package.json, packages/ui/package.json, packages/utils/package.json had their "build": "tsc -p tsconfig.json" scripts removed. The Dockerfile's web and migrations stages run pnpm build, then sed rewrites "main": "src/index.ts" to "main": "dist/index.js", then deletes packages/*/src. With no build script, no dist/ is produced for these packages. If TanStack Start's SSR externalizes any of them, the web app will fail at runtime with module-not-found errors.

3. .env.example not updated for LAT_CLICKHOUSE_* env var rename
packages/platform/db-clickhouse/src/client.ts now reads LAT_CLICKHOUSE_URL, LAT_CLICKHOUSE_USER, LAT_CLICKHOUSE_PASSWORD, LAT_CLICKHOUSE_DB via parseEnv, but .env.example only has the un-prefixed CLICKHOUSE_* vars. Anyone setting up from .env.example gets MissingEnvValueError at startup. The LAT_CLICKHOUSE_* vars must be added to the "Latitude Application" section (keep the existing un-prefixed ones for Docker container init).

4. COPY --from=deps /app/apps/*/node_modules ./ glob is broken
Dockerfile:48 -- Docker's COPY --from does not reliably support shell-style globs against multi-stage paths. This either copies nothing or copies to the wrong location (flattened into the workdir root, not preserving the apps/<name>/node_modules/ structure). This likely explains the redundant second pnpm install on lines 51-52.

5. Copy-then-delete anti-pattern inflates images
Dockerfile:73, 93, 114, 136, 177 -- Every final stage does COPY --from=build /app ./ then RUN rm -rf .... Docker layers are additive; the COPY layer permanently contains all bytes. The rm adds whiteout entries but does not reclaim space. Use selective COPY --from=build for only the needed artifacts instead.
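
A sketch of the selective-copy alternative (stage names and artifact paths are assumed, not taken from the actual Dockerfile):

```dockerfile
# Sketch: copy only the artifacts the service needs; the unwanted files
# never enter a layer, so the image genuinely shrinks.
FROM base AS api
WORKDIR /app
COPY --from=build /app/apps/api/dist ./apps/api/dist
COPY --from=build /app/apps/api/package.json ./apps/api/package.json
COPY --from=build /app/node_modules ./node_modules
CMD ["node", "apps/api/dist/index.js"]
```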


Major

6. All containers run as root
No USER instruction exists in any runtime stage. This violates least privilege and most Kubernetes Pod Security Admission policies will reject these pods. Add a non-root user in the base stage and switch to it in each final stage.
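
A possible shape for the fix, assuming an Alpine-based image (Debian-based images would use groupadd/useradd; the user/group names and IDs here are suggestions):

```dockerfile
# In the base stage: create a dedicated non-root user.
RUN addgroup -S -g 1001 latitude && adduser -S -u 1001 -G latitude latitude

# In each final stage, after COPYing artifacts:
USER latitude
```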

7. LAT_WORKERS_HEALTH_PORT bypasses parseEnv and is missing from .env.example
apps/workers/src/server.ts:30 uses Number(process.env.LAT_WORKERS_HEALTH_PORT) || 9090 directly. Per AGENTS.md convention, all LAT_-prefixed vars must use parseEnv from @platform/env, and every new env var must be added to .env.example.

8. ServiceAccount is a Helm hook but deployments reference it as a normal resource
serviceaccount.yaml has helm.sh/hook: pre-install,pre-upgrade and hook-weight: "-10". Helm hooks are managed separately from the release. This creates lifecycle ordering risks -- the SA could be garbage-collected between upgrades. Either remove the hook annotations (let the SA be a normal release resource) or ensure hook-delete-policy is set appropriately.

9. Redis host/port are in the Secret but are not sensitive values
secret.yaml lines 14-15 put LAT_REDIS_HOST and LAT_REDIS_PORT in the Secret. These are connection parameters (e.g., redis.default.svc.cluster.local:6379), not credentials. They belong in the ConfigMap. This also means existingSecret documentation tells users to put LAT_REDIS_HOST and LAT_REDIS_PORT into their external Secret, which is awkward for production.

10. Migration job has no activeDeadlineSeconds
migrations-job.yaml -- If a migration hangs (e.g., waiting on a database lock), the job will never time out. Add activeDeadlineSeconds (e.g., 600) at the Job spec level.
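
The timeout goes at the Job spec level, for example (the 600s value is the suggestion above, not a measured migration duration):

```yaml
apiVersion: batch/v1
kind: Job
spec:
  activeDeadlineSeconds: 600  # fail the Job if migrations exceed 10 minutes
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
```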

11. All env vars (including secrets) are injected into every container
Every deployment uses both envFrom: configMapRef and envFrom: secretRef, meaning the web frontend receives database passwords, Stripe keys, etc. Consider per-service ConfigMaps/Secrets or at minimum document the trade-off.

12. Cache-busting packages/ copy defeats dependency caching
Dockerfile:31 -- COPY packages/ /tmp/packages-src/ copies all source code, tests, configs from every package. Any change to any file in packages/ invalidates the Docker cache at this layer, forcing a full pnpm install re-run. Use a more targeted approach to extract only package.json files.
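
One targeted approach is BuildKit's `COPY --parents` (labs syntax channel), which preserves directory structure while globbing, so only manifest changes invalidate the install cache (workspace file names assumed from a standard pnpm layout):

```dockerfile
# syntax=docker/dockerfile:1.7-labs
# Sketch: copy only dependency manifests before pnpm install, so edits to
# package source code do not bust the install layer's cache.
COPY --parents pnpm-lock.yaml pnpm-workspace.yaml package.json \
    packages/*/package.json apps/*/package.json ./
RUN pnpm install --frozen-lockfile
```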

13. No securityContext / podSecurityContext options in the chart
None of the deployments set runAsNonRoot, readOnlyRootFilesystem, or allowPrivilegeEscalation: false. There are no values for users to configure these. Significant gap for production hardening and PSA compliance.
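
A sketch of sensible values.yaml defaults (key names are conventional Helm-chart suggestions, not taken from this chart):

```yaml
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 1001
  seccompProfile:
    type: RuntimeDefault
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
```

These defaults satisfy the Kubernetes Pod Security "restricted" profile; readOnlyRootFilesystem may additionally require an emptyDir mount for any path the apps write to.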

14. Missing .npmrc in deps stage
Dockerfile:23 -- The root .npmrc contains strict-peer-dependencies=false but is not copied before pnpm install. This may cause install failures or different resolution results.


Minor

15. target: "node20" in tsup configs but project requires Node 25
All three tsup.config.ts files specify target: "node20". Should be target: "node25".

16. curl=7.* version pin will break on base image update
Dockerfile:163 -- Remove the version constraint since curl is purged immediately after use.

17. Redundant test file deletes
The first find . -name "*.ts" ... -delete already removes all .ts files; subsequent .test.ts and .spec.ts finds match zero files.

18. Fragile sed rewrite misses exports field
@repo/ui uses an exports map with "import": "./src/index.ts". The sed pattern only rewrites "main", not "exports" entries.

19. import.meta.url guard comment is misleading
tsup/esbuild shims import.meta.url in CJS output, so the guard is always truthy. The code works correctly (inner existsSync handles missing files), but the comment is misleading.

20. ClickHouse migration vars pollute all pods
CLICKHOUSE_MIGRATION_URL, CLICKHOUSE_USER, etc. are injected into all services via the ConfigMap but are only needed by the migration job.

21. kind/ setup missing critical dependencies

  • Weaviate is not deployed but migration job runs wv:migrate -- will fail
  • Redpanda topics (domain-events, domain-events-dlq, span-ingestion) are not created -- workers/ingest will fail
  • Postgres latitude_app role is not created -- RLS won't work correctly

Suggestions

  • Add optional securityContext / podSecurityContext values with sensible defaults
  • Add startupProbe support for services with slow startup
  • Add nameOverride and fullnameOverride to values.yaml
  • Add a helm test connection pod
  • Consider format: ["esm"] in tsup configs to avoid CJS shim surface
  • Consider splitting migration job per database for independent failure handling
  • Add app.kubernetes.io/instance to the base latitude.labels template
  • Add memory-based HPA metric support alongside CPU

Overall this is a well-structured PR with good patterns. The critical items (SIGTERM handling, package build script removal, .env.example gap, and Docker layer issues) should be addressed before merge. The major items around security (root containers, security context) and Helm lifecycle (SA hook, migration timeout) are important for production readiness.
