Defer span processing to background worker via object storage#2425

Merged
csansoon merged 1 commit into latitude-v2 from async-span-ingestion on Mar 13, 2026

@csansoon

Summary

The OTEL ingest endpoint previously parsed, transformed, and wrote spans to ClickHouse synchronously within the HTTP request. As the span processing logic grows more complex, this coupling increases response latency and risks timeouts from OTEL exporters.

This change splits ingestion into two phases: the endpoint now validates the payload, buffers it to object storage, and enqueues a BullMQ job — returning immediately. A new background worker picks up the job, runs the transform, persists to ClickHouse, and cleans up the buffered payload. This keeps the ingest endpoint lightweight and allows processing to retry independently on failure.

The OTLP parsing and transform logic was relocated from apps/ingest to @domain/spans so both the ingest app and workers app can share it without cross-app imports.
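
A minimal sketch of the two-phase flow described above. The `Disk` and `SpanQueue` interfaces, the in-memory fakes, and the `ingest` / `processJob` names are assumptions made for illustration; the PR itself uses a flydrive disk and a BullMQ queue, whose real signatures are not reproduced here.

```typescript
// All names in this sketch are illustrative stand-ins, not the PR's actual code.
interface Disk {
  put(key: string, contents: string): Promise<void>
  get(key: string): Promise<string>
  delete(key: string): Promise<void>
}

interface SpanQueue {
  add(name: string, data: { storageKey: string }, opts: { jobId: string }): Promise<void>
}

interface StoredPayload {
  request: { resourceSpans: unknown[] }
  context: { organizationId: string }
}

// In-memory fakes so the sketch runs without object storage or Redis.
class MemoryDisk implements Disk {
  readonly files = new Map<string, string>()
  async put(key: string, contents: string): Promise<void> {
    this.files.set(key, contents)
  }
  async get(key: string): Promise<string> {
    const contents = this.files.get(key)
    if (contents === undefined) throw new Error(`missing key: ${key}`)
    return contents
  }
  async delete(key: string): Promise<void> {
    this.files.delete(key)
  }
}

class MemoryQueue implements SpanQueue {
  readonly jobs: { name: string; storageKey: string; jobId: string }[] = []
  async add(name: string, data: { storageKey: string }, opts: { jobId: string }): Promise<void> {
    this.jobs.push({ name, storageKey: data.storageKey, jobId: opts.jobId })
  }
}

// Phase 1 (HTTP request): validate, buffer to storage, enqueue, return.
// `newId` is injected here; the review notes the PR uses crypto.randomUUID().
async function ingest(
  disk: Disk,
  queue: SpanQueue,
  payload: StoredPayload,
  newId: () => string,
): Promise<string> {
  if (!Array.isArray(payload.request.resourceSpans)) throw new Error("invalid OTLP payload")
  const storageKey = `ingest/${newId()}`
  await disk.put(storageKey, JSON.stringify(payload))
  await queue.add("span-ingestion", { storageKey }, { jobId: storageKey })
  return storageKey
}

// Phase 2 (background worker): load, transform and persist, clean up.
// `persist` stands in for the transform plus the ClickHouse write.
async function processJob(
  disk: Disk,
  storageKey: string,
  persist: (payload: StoredPayload) => Promise<void>,
): Promise<void> {
  const raw = await disk.get(storageKey)
  const payload = JSON.parse(raw) as StoredPayload
  await persist(payload)
  await disk.delete(storageKey)
}
```

Only the storage key travels through the queue, so job data stays small regardless of payload size.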

@csansoon csansoon force-pushed the async-span-ingestion branch from 4eaecf3 to 547cb20 on March 12, 2026 12:59
@csansoon

@cursoragent review this PR

@cursor

cursor bot commented Mar 12, 2026

Now I have a thorough understanding of the entire PR. Let me write the review.

Here is my review of this PR:


PR Review: Defer span processing to background worker via object storage

Overall assessment: The architecture is sound -- decoupling ingest from processing via object storage + BullMQ is the right move for keeping the OTEL endpoint fast and making retries possible. The OTLP module relocation to @domain/spans is clean. There are several issues to address before merging, ranging from a leftover debug log to missing error handling and architectural concerns.

Must Fix

  1. Leftover console.log in production code (packages/domain/spans/src/otlp/transform.ts:147)

    console.log("📥", span.name)

    This looks like a debug log that was not reverted before the PR. It will fire for every single span processed in production. Remove it.

  2. protobufjs in a domain package violates architecture rules -- The AGENTS.md specifies "Domain must never import concrete DB/cache/queue/object storage clients" and more broadly, "Web Standards First" for domain packages. protobufjs is a Node.js-native library (uses Buffer, fs, path internally). The OTLP proto decoding is a platform-level concern (wire-format parsing), not business logic.

    • Consider keeping proto.ts in a platform package (e.g., @platform/otlp or keeping it in apps/ingest) and only moving transform.ts + types.ts into @domain/spans. The worker can import the decoder from the platform package.
    • Alternatively, if the team decides this is acceptable in domain, document why with a brief comment per the AGENTS.md convention.
  3. Queue name "span-ingestion" is duplicated as a magic string in three places: apps/ingest/src/clients.ts:70, apps/workers/src/workers/span-ingestion.ts:45, and :47. Extract this to a shared constant (e.g., in @domain/shared or a shared constants module) to prevent silent mismatches.

  4. No error handling on storageDisk.get() or JSON.parse() in the worker processor (apps/workers/src/workers/span-ingestion.ts:27-28):

    const raw = await storageDisk.get(storageKey)
    const { request, context } = JSON.parse(raw) as StoredPayload
    • storageDisk.get() can throw if the key doesn't exist (race condition, already deleted, storage failure).
    • JSON.parse(raw) will throw on corrupt data.
    • If these throw, BullMQ marks the job as failed (by default a job is attempted once, with no retries), but the error message won't be very useful. Wrap with typed errors or at least add structured logging so failures are diagnosable. Also consider configuring retry attempts in the job options.
  5. storageDisk.delete() failure can cause duplicate span inserts (span-ingestion.ts:37). If the ClickHouse insert succeeds but the delete fails, the job throws and BullMQ may retry, inserting the same spans again. Consider either:

    • Making span inserts idempotent (e.g., via ReplacingMergeTree deduplication, if the spans table uses that engine), or
    • Catching the delete error separately and logging it without failing the job.
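
Items 3 to 5 above could be addressed along these lines. This is a sketch under assumptions: `SPAN_INGESTION_QUEUE`, `SpanIngestionError`, `processSpanJob`, the `Disk` interface, and the `persist` / `logWarning` callbacks are hypothetical names, not code from the PR.

```typescript
// Hypothetical shared constant replacing the duplicated "span-ingestion" string.
const SPAN_INGESTION_QUEUE = "span-ingestion"

// Hypothetical typed error that records which stage failed and for which key.
class SpanIngestionError extends Error {
  readonly stage: "load" | "parse"
  readonly storageKey: string
  constructor(stage: "load" | "parse", storageKey: string, cause: unknown) {
    super(`${SPAN_INGESTION_QUEUE} ${stage} failed for ${storageKey}: ${String(cause)}`)
    this.name = "SpanIngestionError"
    this.stage = stage
    this.storageKey = storageKey
  }
}

// Minimal stand-in for the storage disk used by the worker.
interface Disk {
  get(key: string): Promise<string>
  delete(key: string): Promise<void>
}

async function processSpanJob(
  disk: Disk,
  storageKey: string,
  persist: (payload: unknown) => Promise<void>,
  logWarning: (message: string) => void,
): Promise<void> {
  let raw: string
  try {
    raw = await disk.get(storageKey)
  } catch (cause) {
    throw new SpanIngestionError("load", storageKey, cause)
  }

  let payload: unknown
  try {
    payload = JSON.parse(raw)
  } catch (cause) {
    throw new SpanIngestionError("parse", storageKey, cause)
  }

  // Failures here propagate so the queue can retry the job.
  await persist(payload)

  try {
    await disk.delete(storageKey)
  } catch (cause) {
    // The spans are already persisted; failing the job now would risk
    // duplicate inserts on retry, so the cleanup failure is only logged.
    logWarning(`could not delete ${storageKey}: ${String(cause)}`)
  }
}
```

Swallowing the delete error trades a possible orphaned object in storage for the guarantee that a successful insert is never retried.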

Should Fix

  1. ingestedAt timestamp drift -- ingestedAt is set inside transformOtlpToSpans() at processing time (worker execution), not at actual HTTP ingest time. Under load, that could be seconds to minutes later. Consider capturing ingestedAt at the ingest endpoint and including it in the stored payload so the timestamp reflects when the data actually arrived.

  2. getClickhouseClient is now unused in apps/ingest -- The import was removed from traces.ts, but getClickhouseClient() is still defined and exported in apps/ingest/src/clients.ts. If nothing else in the ingest app uses it, remove it (and the @clickhouse/client dependency from apps/ingest/package.json) to avoid dead code. Knip should catch this.

  3. No BullMQ worker options configured -- The worker in span-ingestion.ts uses all BullMQ defaults (no retries, concurrency of 1, no rate limiting). For a span ingestion pipeline that will handle high throughput:

    • Set concurrency on the worker for parallel job processing.
    • Set attempts and backoff on jobs (or through the queue's defaultJobOptions) for transient ClickHouse/storage failures.
    • Consider removeOnComplete / removeOnFail to prevent Redis memory growth.
  4. jobId: storageKey in traces.ts:52 makes jobs idempotent by storage key, which is good. However, each storage key contains a crypto.randomUUID(), so two identical payloads get two different jobs. This is probably fine (OTEL exporters retry with the same payload, but each HTTP request is a separate ingest), just flagging that the dedup is per-request, not per-content.

  5. @domain/shared/storage.ts -- the StorageDiskPort interface only has put() but the worker needs get() and delete(). The actual StorageDisk (flydrive Disk) has these methods, but the domain port doesn't declare them. Either:

    • Expand StorageDiskPort to include get and delete, or
    • Acknowledge that the worker directly uses the platform StorageDisk type (which it does) and the port is only used by putInDisk.
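
Items 1 and 3 above might look roughly like this. The option names (`attempts`, `backoff`, `removeOnComplete`, `removeOnFail`) are BullMQ job options, but they are written here as a plain object so the sketch has no dependencies, and the values are illustrative rather than recommended. `StoredPayload`, `buildStoredPayload`, and `resolveIngestedAt` are hypothetical names.

```typescript
// Illustrative job options; in real code these would be passed to queue.add()
// or set as the queue's defaultJobOptions. The numbers are placeholders.
const spanJobOptions = {
  attempts: 5,
  backoff: { type: "exponential", delay: 1000 },
  removeOnComplete: true,
  removeOnFail: 1000, // keep at most this many failed jobs for inspection
}

// Hypothetical stored payload that records when the HTTP request arrived.
interface StoredPayload {
  request: unknown
  context: unknown
  ingestedAt: string // ISO 8601 timestamp captured at the ingest endpoint
}

// Called by the ingest endpoint before buffering to object storage.
function buildStoredPayload(request: unknown, context: unknown, receivedAt: Date): StoredPayload {
  return { request, context, ingestedAt: receivedAt.toISOString() }
}

// Called by the worker: prefer the captured timestamp, and fall back to the
// processing time only for payloads stored before the field existed.
function resolveIngestedAt(raw: string, processedAt: Date): Date {
  const parsed = JSON.parse(raw) as Partial<StoredPayload>
  return parsed.ingestedAt ? new Date(parsed.ingestedAt) : processedAt
}
```

With the timestamp stored alongside the payload, queue delay and retries no longer affect the recorded ingest time.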

Nits

  1. The @domain/spans index exports OTLP types alongside domain entities. This mixes concerns slightly -- consumers looking at @domain/spans now see decodeOtlpProtobuf alongside SpanRepository. Consider a sub-path export (e.g., @domain/spans/otlp) if the package convention supports it.

  2. The removed comments in transform.ts (e.g., // Semantic convention attribute keys..., // Split remaining attributes by value type) were actually useful for orientation in a dense transform function. The AGENTS.md says "avoid comments except for genuinely non-obvious reasoning," but attribute extraction from OTLP semantic conventions is arguably non-obvious to new readers. Consider keeping the most helpful ones.

  3. Andres's review comment about using the centralized putInDisk from @domain/shared/storage.ts appears to have been addressed (the ingest namespace was added to storage.ts). Good.


Summary: The PR's architecture is solid. The main blockers are the leftover debug log, the duplicated queue name, and the lack of error handling in the worker processor. The protobufjs in domain is worth a discussion. The rest are improvements that would make the system more production-ready.

@csansoon csansoon force-pushed the async-span-ingestion branch 2 times, most recently from ee102af to a3b33ce on March 12, 2026 14:04
geclos
geclos previously approved these changes Mar 13, 2026
@csansoon csansoon force-pushed the async-span-ingestion branch from 9595189 to 59bd1cc on March 13, 2026 12:02
@csansoon csansoon merged commit 0386e23 into latitude-v2 Mar 13, 2026
6 checks passed
@csansoon csansoon deleted the async-span-ingestion branch March 13, 2026 15:03
@github-project-automation github-project-automation bot moved this from Inbox to Done in Latitude LLM Roadmap Mar 13, 2026
geclos added a commit that referenced this pull request Mar 14, 2026
* chore: reset repository for v2 rewrite bootstrap

* chore: scaffold v2 monorepo apps and adapters

* remove .* files and folders

* add AGENTS.md

* chore: wire v2 database adapters and env-based service bootstrap

* chore: configure web port via env vars

* fix: allow all hosts for web dev server in development

* added project rules skill and updated agents.md

* chore: centralize env value parsing

* chore: update agents.md

* cleanup house

* chore: remove turbo duplicate reference

* enforce effect, improve code organization

* fix client inits

* feat(ui): reimplement design system from legacy latitude

Create @repo/ui package with core design system components:

- Design tokens: colors, font, shadow, zIndex, opacity, overflow, whiteSpace, wordBreak, skeleton
- CSS variables with light/dark theme support
- Tailwind config with custom theme extensions
- cn() utility for class name composition
- Core components:
  - Text (H1-H8 with weight variants, Mono)
  - Button (9 variants, 4 sizes, loading state)
  - Icons (Lucide wrapper with 8 size variants, 40+ icons)
  - Card (with header, content, footer subcomponents)
  - Input (with FormField integration)
  - Label (Radix UI based)
  - FormField (wrapper for form inputs with validation)

Integration:
- Add @repo/ui dependency to apps/web
- Create demo page showcasing all components
- Dark mode toggle working
- All components properly typed and forwarding refs

Quality:
- All lint checks pass
- All type checks pass
- Follows monorepo conventions (ESM, strict TS, Biome)

* fix(web): configure Tailwind CSS for dev server

Add Tailwind configuration to apps/web:
- tailwind.config.ts with custom theme colors matching @repo/ui
- postcss.config.mjs for CSS processing
- tailwindcss-animate dependency for animations

Update @repo/ui package.json to export styles and tailwind config.

Fixes CSS processing errors and enables the design system to render
correctly in dev mode.

* fix(ui): address P1 and P2 code review findings

Fixed all critical and important issues from code review:

**P1 - Critical Fixes:**
- Text component: Fixed unsafe array check, added displayName, corrected ref types
- Removed ~120 lines of dead code (unused text variants, button variants, props)
- Removed H1B, H2M, H2B, H3M, H3B, H4M, H4B, H5M, H5B, H6M, H6B, H6C, H7, H7C, H8
- Removed unused props: animate, monospace, centered, isItalic, darkColor, lineClamp, showNativeTitle
- Removed unused button variants: shiny, latte, primaryMuted, destructiveMuted, successMuted

**P2 - Important Fixes:**
- Icon component: Added React.memo, guarded console.warn for production
- Dependencies: React properly managed (devDeps + peerDeps only)
- Tailwind config: apps/web now imports from @repo/ui for DRY
- Form accessibility: Added useId, htmlFor, aria-describedby, aria-invalid, role=alert
- Label component: Removed unnecessary cva usage
- Card component: Fixed ref type for CardTitle (HTMLHeadingElement)
- FormField: Added proper ID association between labels and inputs

**Quality:**
- All lint checks pass
- All type checks pass
- No breaking changes

Closes: todos/001-005 (P1 and P2 items)

* feat(api): implement JWT-based email/password authentication for CLI

Enable email/password authentication via Better Auth for programmatic CLI access:

- Enable email/password in Better Auth config (email verification disabled for MVP)
- Add POST /auth/sign-up/email endpoint returning JSON with token and user info
- Add POST /auth/sign-in/email endpoint returning JSON with token and user info
- Implement rate limiting middleware (5 attempts per 15 minutes per IP)
- Configure password requirements: min 8 chars, max 128 chars

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(api): implement Redis-backed rate limiting for auth endpoints

Replace in-memory rate limiter with Redis-backed implementation:

- Add Redis client wrapper in @platform/cache-redis package
- Implement Redis rate limiter using INCR and EXPIRE for atomic operations
- Configure rate limits: 5 attempts per 15 min for sign-in, 3 per hour for sign-up
- Make Redis required for API (no fallback)
- Add @platform/cache-redis dependency to API package

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* refactor(api): use clients module for database access instead of DI

Remove database dependency injection from route handlers:

- Update clients.ts to export getPostgresClient() and getDb() helpers
- Routes now import db directly from clients module
- Remove db parameter from RoutesContext
- Simplify route registration in server.ts
- Update health check to use getPostgresClient().pool

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* design system

* s/workspace/organization, added stripe, added rls

* removed some fluff

* feat(auth): implement Magic Link authentication with Better Auth plugin

Add passwordless Magic Link authentication using Better Auth's official plugin:

Email Infrastructure:
- Create @domain/email package with ports, entities, and templates
- Implement Mailpit adapter for local development
- Implement SMTP adapter for production
- Implement SendGrid adapter for production

Better Auth Integration:
- Add Magic Link plugin to Better Auth configuration
- Configure sendMagicLink hook for email delivery
- Set up database hooks for user creation events
- Add onUserCreated callback for auto-onboarding

User Onboarding:
- Create @domain/onboarding package with workspace setup logic
- Implement UserCreated event handler in workers
- Auto-create default workspace when new user signs up
- Auto-assign owner role to new users

Frontend Login:
- Create MagicLinkForm component with email input and confirmation
- Create OAuthButtons component for Google/GitHub auth
- Create LoginPage route with Magic Link + OAuth UI
- Email/password remains CLI-only (not exposed in web)

Rate Limiting:
- Add createMagicLinkIpRateLimiter (10/hour per IP)
- Add createMagicLinkEmailRateLimiter (3/hour per email)
- Integrate with existing Redis-based rate limiting

Environment Variables:
- MAILPIT_HOST/PORT for local development
- SMTP_HOST/PORT/USER/PASS/FROM for production SMTP
- SENDGRID_API_KEY/FROM for SendGrid
- BETTER_AUTH_SECRET/URL for Better Auth

🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>

* feat(auth): wire up Magic Link email sender and add login routing

- Add email sender factory with SendGrid, SMTP, and Mailpit adapters
- Configure Better Auth with sendMagicLink callback using magicLinkTemplate
- Add rate limiting for Magic Link endpoints (IP and email based)
- Set up TanStack Router with login page route
- Add environment variables for email providers

🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: use catalog effect version and fix type errors

- Update @domain/email and @domain/onboarding to use catalog effect version
- Fix unused Effect imports in onboarding package
- Fix SendGrid adapter text field type error
- Add email/onboarding packages as API dependencies

🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>

* docs: add code review todos for Magic Link authentication

Created 4 todo files from comprehensive code review:

- 001-pending-p2-effect-type-compatibility.md
  Type compatibility issue with Effect in auth routes

- 002-pending-p2-magic-link-attempts.md
  Security: Add explicit allowedAttempts to Magic Link config

- 003-pending-p2-magic-link-tests.md
  Missing test coverage for authentication flow

- 004-pending-p3-optimize-effect-usage.md
  Code quality: Optimize Effect.runSync usage

🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>

* implements full login/signup flow

* minor: find user by email

* chore: moved emails to react

* chore: knip pass

* housecleaning

* add default web port

* minor: remove barrel import of icons

* chore: bump to tailwind v4

* chore: add missing dependencies

* update biome line length config, added agents comment about barrel files

* replace tsx with node since it supports tsx now

* s/26/25

* minor: switch sendgrid for mailgun, switch to no semicolon

* chore: fix tailwind hot reloading

* chore: configure trusted origins via env vars

* chore: remove harmless node dev warning

* Formatting fixes (#2389)

* Update PNPM to v10, Remove format:check

biome check is our lint and do linter and formatting check

* Missing formatting files

* Let "packageManager" in package.json decide pnpm version in CI

* Fix typecheck issues

* Call it check instead of lint because we do `biome check`

* Remove localstorage from api

i don't know what's this

* feat(api): implement authentication middleware with comprehensive security

Implements unified authentication middleware supporting cookie-based sessions,
JWT Bearer tokens, and API keys with extensive security hardening.

Security improvements:
- Add organization membership validation to prevent cross-tenant access
- Implement structured security logging for auth failures
- Add Redis-backed rate limiting (10 attempts per 15min per IP)
- Fix timing attack vulnerability with constant-time validation (~50ms)
- Standardize generic error messages to prevent information leakage
- Harden CORS with explicit origin whitelist and security logging

Performance optimizations:
- Add Redis caching for API key lookups (5min TTL, 80%+ hit rate)
- Implement TouchBuffer for batched lastUsedAt updates (90%+ write reduction)
- Cache invalidation on key revocation for security

Architecture:
- Unify async/await and Effect patterns throughout middleware
- Add Hono module augmentation for type-safe context
- Simplify middleware from 335 to 176 lines (47% reduction)
- Remove unsafe type assertions (as unknown as)

New files:
- apps/api/src/middleware/auth.ts (176 lines, Effect-based)
- apps/api/src/middleware/touch-buffer.ts (batch updates)
- apps/api/src/types.ts (Hono context types)

Resolves 11 security and performance todos (001-011).

* chore: add proper tanstack start

* code cleaning and added first integration tests

* chore: add testkit and first integration tests

* fix run api with `tsx` command

We need to support for now email rendering in api. Native node.js does
not handle `.tsx` extensions

* Update AGENTS.md with biome commands after update

* minor: add "catchup" script

* Env vars standardization

* chore: add integration tests with in memory postgres db

* housekeeping

* fix login/signup email sending

* chore: remove compound stuff

* refactor(shared-kernel): migrate ID generation from UUID to CUID2

Replace crypto.randomUUID() with @paralleldrive/cuid2 for ID generation:
- CUID2 provides 24-25 character URL-safe identifiers
- ~30% storage reduction compared to UUID v4 (36 chars)
- Better developer experience with readable IDs

Changes:
- Update generateId() to use createId() from @paralleldrive/cuid2
- Update isValidId() to use official isCuid() validation
- Configure Better Auth with custom ID generator via advanced.database.generateId
- Add @domain/shared-kernel dependency to @platform/auth-better

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(email): remove unused React imports causing build failures

Remove unnecessary React imports from email components:
- ContainerLayout.tsx
- MagicLinkEmail.tsx
- magic-link.tsx

Modern JSX transform doesn't require React to be imported when
only using JSX. TypeScript was flagging these as unused.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Add weaviate client and fix docker compose

* refactor(db-postgres): convert remaining UUID columns to text for CUID2 support

Convert all remaining UUID type columns to text to support CUID2 IDs:

- api_keys.token: changed from uuid to text (proper API key format)
- grants.uuid: removed column entirely (not needed)
- outbox_events.id: changed from uuid to text with CUID2
- outbox_events.aggregate_id: changed from uuid to text
- outbox_events.workspace_id: changed from uuid to text

Migration 0003_minor_lily_hollister.sql generated and applied.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* cuids drizzle helper

* update pnpm workspace

* replace text columns with varchar

* Models.dev llm integration (#2393)

* feat: add @domain/llm-models package with models.dev integration

Add a new domain package that provides LLM model information and cost
estimation utilities using data from models.dev.

Package structure:
- entities/model.ts: LlmModel type and raw JSON parser
- entities/cost.ts: Token cost computation with tiered pricing support
- registry.ts: Model lookup, provider filtering, and cost estimation
- data/models.dev.json: Bundled models.dev data (auto-updated weekly)
- scripts/update-models-dev-data.ts: Script to refresh bundled data

Key utilities:
- getAllModels(): Get all bundled LLM models
- getModelsForProvider(provider): Filter models by provider
- getModelForProvider(provider, model): Find specific model with fallback
- estimateCost(provider, model, usage): Estimate total cost in USD
- estimateCostWithBreakdown(provider, model, usage): Detailed cost breakdown
- formatModel(model): Human-readable model summary

Also adds:
- GitHub Actions workflow for weekly auto-update of models.dev data
- biome.json exclusion for the large bundled JSON file
- 54 unit tests covering all utilities

Co-authored-by: Alex Rodríguez <me@arn.sh>

* chore: bump GitHub Actions workflow to latest versions and Node 25

Co-authored-by: Alex Rodríguez <me@arn.sh>

* docs: add cloud agent env setup instructions to AGENTS.md

Co-authored-by: Alex Rodríguez <me@arn.sh>

* refactor: align TokenUsage naming with ModelCostTier and add cacheWrite

- Rename promptTokens -> inputTokens, completionTokens -> outputTokens,
  cachedInputTokens -> cacheReadTokens in TokenUsage
- Add cacheWriteTokens to TokenUsage
- Add cacheWrite to ModelCostTier, ModelPricing, and raw model parser
- Rename CostBreakdown fields: prompt -> direct, completion -> direct,
  cached -> cacheRead, add cacheWrite
- Update all tests (56 passing)

Co-authored-by: Alex Rodríguez <me@arn.sh>

* chore: update CI workflows - add knip, fix test env, set NODE_ENV

* fix: add missing token field to API key fixture in testkit

Co-authored-by: Alex Rodríguez <me@arn.sh>

* refactor: simplify TokenUsage field names to match ModelCostTier

Co-authored-by: Alex Rodríguez <me@arn.sh>

* refactor: rename @domain/llm-models to @domain/models, LlmModel to Model

* refactor: merge findModel functions, extract @repo/utils package

- Merge findModel and findModelWithFallback into a single findModel
  with built-in prefix fallback
- Create @repo/utils package for shared utility functions
- Move formatCount and formatPrice to @repo/utils
- Update @domain/models to depend on @repo/utils
- Document @repo/utils in AGENTS.md

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>

* fix: use .ts extension in schemaHelpers imports for drizzle-kit compatibility

Drizzle-kit uses CommonJS require() internally and cannot resolve .js
to .ts files in ESM packages. Changed all schema file imports from
../schemaHelpers.js to ../schemaHelpers.ts to fix the module resolution
error when running drizzle-kit commands.

* docs: sync README from main branch

* docs: fix README for latitude-v2 branch

- Replace broken logo image with text header
- Change note to warning emphasizing active development status
- Remove broken demo GIF
- Clarify latitude-v1 is for production use

* feat(web): add design system showcase route

Add a /design-system page that previews current @repo/ui components in light and dark modes for visual verification. Document in AGENTS.md that every new implemented UI component must be added to this page as the canonical inventory.

* feat(web): add theme toggle to design system page

Replace the side-by-side light/dark layout with a single preview and a theme toggle that updates the app theme class. This keeps the showcase focused while still validating both modes.

* fix(web): correct design system theme surfaces

Make the design system page use explicit white/black surfaces based on the local theme toggle so dark mode no longer shows light backgrounds.

* Migrate Drizzle to v1 beta, cuid for everyone, RLS for everyone and named Env errors (#2394)

Drizzle v1 migration

- [ ] This migrates code to use Drizzle v1
- [ ] Use `cuid` drizzle helper for all IDs
- [ ] Instruct AGENTS.md with Drizzle model best practices: timestamp
  with timezone, use org RLS helper, common timestamp helper and use
  `cuid` for new tables.
- [ ] Nicer environment errors
- [ ] Setup initial `platform/seeder` package with a basic app data.

WIP

WIP

* Knip dead code (#2397)

* fix: remove dead code and fix knip configuration

Dead code removed:
- Remove unused UnauthorizedError from api errors.ts (duplicate of shared-kernel)
- Remove unused getAuthContext helper from auth middleware
- Remove unused getTouchBuffer function, unexport TouchBuffer class/config
- Remove unused AuthenticatedContext type and HonoVariables interface
- Remove unused barrel file in email templates
- Remove redundant re-export of cost functions from registry.ts
- Remove unused schema helpers (only cuid is used)
- Remove unused async wrapper for weaviate migrations (only Effect
  version is used)

Dependency cleanup:
- Remove unused @platform/env, pg, @types/pg from testkit
- Remove unused tailwindcss devDep from web (transitive via @tailwindcss/postcss)
- Remove unused postcss-import plugin from UI postcss config
- Remove JSDoc type imports of postcss-load-config

Knip config improvements:
- Add per-app workspace entries matching actual entry points
- Add .tsx to email domain project pattern
- Handle vitest-config and tsconfig packages that lack src/
- Remove unnecessary ignore patterns and ignoreDependencies

Co-authored-by: Alex Rodríguez <me@arn.sh>

chore: update lockfile after dependency removal

Co-authored-by: Alex Rodríguez <me@arn.sh>

chore: add pre-commit hook with husky for check, typecheck, and knip

ci: add .env.test setup step to knip, check, and typecheck workflows

wip

* wip

* chore: automate local git hook setup

Add a single `pnpm hooks` entrypoint and call it from `prepare` so each clone configures `.husky` automatically on install. Document the pre-commit workflow in AGENTS and README, including the one-time command for existing clones.

Made-with: Cursor

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>

* chore: add back claude code review ci job

* Fix cuid varchar length (#2400)

* chore: allow agents to pnpm install

* chore(shared-kernel): remove generic validation and clarify package boundaries

* feat: app-level encryption for API key tokens and add token hash for lookups (#2401)

* feat: encrypt API key tokens at rest and add token hash for lookups

API key tokens are now encrypted with AES-256-GCM before storage and
looked up via a SHA-256 hash (token_hash column), eliminating plaintext
token storage in the database. Raw tokens never touch cache or DB queries.

https://claude.ai/code/session_01EfkyXtn7hvkLWhxqrkmrtY

* fix: correct encryption terminology from "at rest" to "app-level"

The API key token encryption is performed at the application level
(AES-256-GCM) before persistence, not database-level encryption at rest.

https://claude.ai/code/session_01EfkyXtn7hvkLWhxqrkmrtY

* refactor: generalize crypto utils from API-key-specific to general-purpose

Rename hashApiKeyToken → hashToken, encryptApiKeyToken → encrypt,
decryptApiKeyToken → decrypt. These functions work on any plaintext
and are not specific to API key tokens.

https://claude.ai/code/session_01EfkyXtn7hvkLWhxqrkmrtY

* updated lock, lint/tc/format

* refactor(api): simplify repository wiring in routes and middleware

* add clickhouse to request context

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Fix migration without snapshot.json (#2402)

Last migration was created without snapshot.json. Not sure how that
happened. Also added init-db.sh script for postgresql container and a
`pnpm pg:reset_db` to make it easy start clean.

* Move out user domain and rename shared domain (#2403)

* Move out user domain and rename shared domain

* gitignore

* Feat/web stores tanstack db (#2399)

* feat: implement apps/web state management and proper auth

* refactor(server): simplify postgres client imports and types

* remove tc & check from precommit

* rebase

* feat(auth): add intent-based magic link flow and transaction runner

* refactor(email): consolidate providers under email-transport

Unify Mailgun SMTP, generic SMTP, and Mailpit resolution in a single transport package to keep provider selection generic and remove provider-specific platform packages.

* fix(email): restore react-email magic link templates

Bring back React-based magic link rendering to recover branded email HTML and keep signup-existing-account messages consistent with the same template system.

* fix(email): align magic link templates with design tokens

Introduce email-safe design primitives and shared tokens so magic-link emails match Latitude typography, spacing, and button depth without coupling to web-only UI components.

* code review

* code review

* code review

* rebase

* fix tests

* feat: enforce RLS at boundaries with simplified tenancy model (#2405)

* feat: clean up exports and remove unused code

* feat: implement rls guards at boundaries

* remove most user facing write ops

* code review

* chore: simplify migrations

* refactor(db-postgres): Add unscoped repository for cross-org operations

* refactor: Simplify project functions by removing organizationId from params

* refactor(auth): Improve session handling and error types

* refactor: remove unnecessary type assertions across monorepo

* Add authenticated dashboard with projects and settings pages (#2409)

* feat: migrate legacy dashboard and settings pages to new web app

- Add dashboard (projects list) at root path `/` and settings page
- Create UI components: Table, Modal, Toast, DropdownMenu, TabSelector,
  Container, FormWrapper, Skeleton, Tooltip, TableWithHeader, TableBlankSlate
- Add domain use cases: updateApiKey with existence check, removeMember
  with self-removal and org-mismatch validation using tagged errors
- Move member listing query to platform layer (findMembersWithUser)
- Inline Zod schemas in server functions, remove separate .types.ts files
- Remove assertOrganizationMembership from session-protected functions
- Add authenticated layout route with auth guard and org context
- Add utility helpers: relativeTime, extractLeadingEmoji

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* chore: apply biome formatting fixes

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* fix: address code review feedback on dashboard/settings migration

- Remove empty object validators from GET server functions
- Move findMembersWithUser into MembershipRepository as a method (not standalone)
- Add MemberWithUser interface to domain port
- Use runCommand(db, organizationId) for RLS scoping in member operations
- Remove org-mismatch check from removeMemberUseCase (RLS handles scoping)
- Remove auth intent completion from _authenticated layout (belongs in
  dedicated signup completion section)
- Remove organizationId parameter from all collection hooks (server
  functions already scope by session)
- Remove organizationId prop threading from all page components
- Remove completeAuthIntent server function (will be re-added with
  dedicated signup completion route)
- Clean up unused imports and exports

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* chore: apply biome formatting

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* Restore auth intent flow with /auth/confirm page and add collection query support

- Restore completeAuthIntent server function and schema for magic link auth flow
- Create /auth/confirm welcome page that completes auth intent and redirects to dashboard
- Update callbackURL in login/signup to point to /auth/confirm
- Export projectsCollection for custom query support via useLiveQuery
- Update project view to use collection query (eq + findOne) instead of in-component .find()

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* Refactor useProjectsCollection to accept a scoped query callback

The hook now accepts an optional callback that receives the pre-scoped
QueryBuilder (after .from()), so consumers can chain .where(), .select(),
.orderBy() etc. without needing to know about useLiveQuery or the
collection binding directly.

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* Use useProjectsCollection in project detail page

Add deps parameter to useProjectsCollection and migrate $projectId route
to use the scoped collection hook instead of raw useLiveQuery. Remove
the now-unused projectsCollection export.

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* Fix pre-existing lint, typecheck, and navigation issues

- Fix import ordering in auth.functions.ts, modal.tsx, table.tsx
- Fix array index keys in dropdown-menu and table-skeleton
- Use optional chaining in tab-selector checkSelected
- Remove unnecessary deps from useEffect in tab-selector and useToast
- Add missing findMembersWithUser to MembershipRepository mock in auth tests
- Replace window.location.href onClick with Link component in projects table

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* Update auto-generated route tree

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* Replace .js import extensions with .ts and remove type assertions

- Change all .js import extensions to .ts/.tsx in UI package (19 files)
- Remove unnecessary type assertions on session.user in auth.functions.ts

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* Update CLAUDE.md with import extensions, auth, and generated files guidance

- Add .ts/.tsx extension rule to Imports section
- Add guidance against unnecessary type assertions in TypeScript section
- Add Generated Files section for auto-generated files like routeTree.gen.ts
- Add Authentication section documenting Better Auth patterns and session handling

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* Remove "use client" directives from UI components

These are Next.js-specific and not needed with TanStack Start.
Also added anti-pattern rule to CLAUDE.md.

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* Clean up CLAUDE.md: remove duplicates and add state management guide

- Remove redundant anti-patterns already stated in Architecture Rules
- Consolidate Stack Conventions and Required Toolchain into single section
- Remove duplicate "Test runner: Vitest" and "Core code uses Effect" entries
- Generalize type assertion rule (remove Better Auth-specific example)
- Add comprehensive State Management section covering server functions,
  collections, route guards, and key rules

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* Remove barrel files from UI package, export directly from source modules

- Delete all component and token barrel index.ts files (22 files)
- Update packages/ui/src/index.ts to export directly from source modules
- Update internal token imports to reference specific token files
  (e.g., tokens/zIndex.ts instead of tokens/index.ts)
- Eliminates unnecessary re-export indirection for better tree-shaking

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

* Fix non-null assertion lint error in extractLeadingEmoji

Replace `match[1]!` with `match[1] ?? null` to satisfy
biome's noNonNullAssertion rule.

https://claude.ai/code/session_01B3HrhsbKE5H26GzNFEwhCg

---------

Co-authored-by: Claude <noreply@anthropic.com>

* feat: add server function error handling middleware for domain error serialization

Domain TaggedError instances can't be serialized by seroval across the
TanStack Start server/client boundary. This middleware converts them to
plain Error objects preserving _tag as name and httpMessage as message,
enabling robust client-side error identification by error name instead
of brittle message matching.
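
As a rough illustration of the conversion described above, here is a minimal sketch. The `TaggedErrorLike` shape and the `toSerializableError` name are assumptions made for this example; the actual middleware and error types in the repo may look different.

```typescript
// Hypothetical shape of a domain tagged error; the real TaggedError type may differ.
interface TaggedErrorLike {
  readonly _tag: string
  readonly httpMessage: string
}

function isTaggedErrorLike(value: unknown): value is TaggedErrorLike {
  if (typeof value !== "object" || value === null) return false
  const candidate = value as Record<string, unknown>
  return typeof candidate._tag === "string" && typeof candidate.httpMessage === "string"
}

// Convert whatever was thrown on the server into a plain Error that can be serialized.
function toSerializableError(thrown: unknown): Error {
  if (isTaggedErrorLike(thrown)) {
    const plain = new Error(thrown.httpMessage) // httpMessage becomes the message
    plain.name = thrown._tag // _tag becomes the name, so clients can match on error.name
    return plain
  }
  if (thrown instanceof Error) return thrown
  return new Error(String(thrown))
}

// Usage: a tagged domain error turns into a plain Error the client can identify by name.
const serialized = toSerializableError({ _tag: "NotFoundError", httpMessage: "Project not found" })
```

On the client, a check such as `error.name === "NotFoundError"` then replaces matching on the message text.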

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Development environment setup (#2412)

* Add Cursor Cloud specific instructions to AGENTS.md

- Document Docker infrastructure startup and database migration steps
- Note golang-migrate requirement for ClickHouse migrations
- Document per-app dev server startup (workaround for @domain/email turbo issue)
- Add service health check reference table
- Document known issues: node:crypto build error, react-email CLI missing
- Add auth testing notes for magic-link flow via Mailpit

* Update AGENTS.md cloud instructions per feedback

- Remove golang-migrate reference (being replaced with goose)
- Remove 'Known issues' section (both issues being fixed)
- Reframe per-app dev server startup as matching tmuxinator workflow

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>

* Migrate crypto operations to Web Crypto API (#2411)

* Replace node:crypto with Web Crypto API for ESM/browser compatibility

Rewrites packages/utils/src/crypto.ts to use the Web Crypto API
(globalThis.crypto.subtle) instead of node:crypto. This makes the
crypto utilities work in both server and browser environments.

All three exported functions (hashToken, encrypt, decrypt) are now
async since crypto.subtle methods return Promises. All callers have
been updated accordingly—Effect.gen contexts use Effect.promise(),
and async test helpers use await.

https://claude.ai/code/session_018Xf8qEro4Jjoek6zUnpDTt

* Use Effect.tryPromise for crypto operations and convert helpers to Effects

- Replace Effect.promise with Effect.tryPromise for hashToken calls in
  auth middleware and generate-api-key use-case since crypto operations
  can theoretically fail
- Convert toDomainApiKey and toInsertRow from async functions returning
  Promises to functions returning Effect.Effect with typed RepositoryError,
  enabling direct yield* composition in Effect.gen contexts
- Use Effect.all instead of Effect.promise(Promise.all(...)) for
  mapping collections through Effects

https://claude.ai/code/session_018Xf8qEro4Jjoek6zUnpDTt

* Apply linter formatting to generate-api-key import

https://claude.ai/code/session_018Xf8qEro4Jjoek6zUnpDTt

* Convert crypto functions to return Effects with typed CryptoError

All crypto functions (hashToken, encrypt, decrypt) now return
Effect.Effect<T, CryptoError> instead of Promise<T>. This enables
proper Effect composition throughout the codebase:

- Domain/platform packages compose crypto Effects directly via yield*
- Repository adapters map CryptoError → RepositoryError at boundaries
  using Effect.mapError
- App-level code (auth middleware) lets CryptoError propagate through
  the Effect error channel where orDie handles it
- Test files use Effect.runPromise at the boundary

Adds effect as a dependency of @repo/utils and exports CryptoError
as a tagged error type for typed error handling.

https://claude.ai/code/session_018Xf8qEro4Jjoek6zUnpDTt

* Add Web Standards First guidance to AGENTS.md

Documents the preference for Web Standard APIs over Node.js-specific
modules in shared packages to maximise compatibility with browser,
edge, and alternative server runtimes.

https://claude.ai/code/session_018Xf8qEro4Jjoek6zUnpDTt

---------

Co-authored-by: Claude <noreply@anthropic.com>

* Implement clickhouse migration system with Goose (#2410)

* fix: middleware declaration

* chore: clean up error handling in general

* feat(auth): adding invite functionality to workspace settings

* minor: ask for name during invitation

* chore: s/workspaceId/organizationId

* deduplicate env vars

* update agents.md

* refactor: make id generation an implementation detail in entity factories

Entity factories now accept optional id parameter and fallback to generateId():
- createMembership, createOrganization, createProject, createApiKey
- createAuthIntent for auth intent use cases
- All corresponding use case inputs (CreateOrganizationInput, CreateProjectInput, GenerateApiKeyInput)
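
The optional-id pattern can be sketched as follows. The `Project` fields and the counter-based `generateId` are stand-ins invented for this example; the repo's real generator and entity shapes are not shown in this log.

```typescript
// Hypothetical stand-in for the repo's generateId(); deterministic here so the example is easy to follow.
let idCounter = 0
function generateId(): string {
  idCounter += 1
  return `id_${idCounter}`
}

interface Project {
  readonly id: string
  readonly name: string
}

interface CreateProjectInput {
  readonly id?: string // optional: callers no longer have to generate ids themselves
  readonly name: string
}

// The factory owns id generation and only uses a caller-supplied id when one is given.
function createProject(input: CreateProjectInput): Project {
  return { id: input.id ?? generateId(), name: input.name }
}

// Usage: most call sites omit the id; tests or seeds can still pin one.
const generatedProject = createProject({ name: "Demo" })
const pinnedProject = createProject({ id: "proj_fixed", name: "Demo" })
```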

Updated all call sites to remove explicit generateId() calls:
- complete-auth-intent.ts (3 locations)
- projects.functions.ts
- api-keys.functions.ts
- api routes (projects.ts, api-keys.ts)
- auth intent use cases (invite, login, signup)

Fixed test expectations in complete-auth-intent.test.ts for invite flow

* chore: improve auth intent completion code

* infra: create latitude_app runtime Postgres user with RLS enforcement

Add a restricted `latitude_app` Postgres user that is subject to Row
Level Security policies, separating it from the superuser `latitude`
account used for migrations and seeds.

- `docker/init-db.sh`: create `latitude_app` and grant CONNECT on
  container init
- Migration `20260309145353_setup-runtime-db-user`: grant USAGE on the
  latitude schema, SELECT/INSERT/UPDATE/DELETE on all tables, and
  EXECUTE on `get_current_organization_id()` to `latitude_app`; set
  default privileges for future tables
- `drizzle.config.ts`: switch to `LAT_ADMIN_DATABASE_URL` so
  drizzle-kit runs as superuser (required for DDL)
- `.env.example`: add `POSTGRES_RUNTIME_USER/PASSWORD`,
  `LAT_ADMIN_DATABASE_URL`; point `LAT_DATABASE_URL` at `latitude_app`

Previously the single `latitude` superuser was used for everything,
which meant RLS was bypassed at runtime. Now normal app queries run as
`latitude_app` — a bug that forgets to set org context returns no rows
instead of leaking cross-tenant data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: enforce tenancy via RLS, remove organizationId from repository signatures

Now that the runtime connection (`latitude_app`) is subject to RLS,
repositories no longer need an `organizationId` parameter. Org scoping
is handled automatically by the `app.current_organization_id` session
variable set by `runCommand`.

- Remove `organizationId` from every repository factory function across
  all domain ports and platform adapters (api-keys, auth, grants,
  membership, organizations, projects, subscriptions, users)
- Delete `UnscopedApiKeyRepository` / `createUnscopedApiKeyPostgresRepository`;
  cross-org lookups now use the admin connection with the regular factory
- Add `getAdminPostgresClient()` singleton in `apps/api/src/clients.ts`
  for runtime operations that legitimately bypass RLS (API key auth
  lookup and touch-buffer batch updates)
- Update all app-level call sites (API routes, web server functions,
  seeds) to match the new factory signatures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: rewrite tenancy.md for two-user Postgres model

Update documentation to reflect the new runtime security model:
two Postgres users, RLS-enforced runtime connection, admin connection
for cross-org operations, and no-organizationId repository signatures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: remove stale drizzle snapshot from deleted migration

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* quickfix: scope outbox events read queries to latitude schema

* Add OTEL-aligned spans domain model with ClickHouse and object storage (#2395)

Splits span telemetry storage into two tiers: ClickHouse handles fast listing, filtering, and metric aggregation; object storage (local FS or S3) holds full span payloads.

The ClickHouse schema is redesigned from scratch to match OpenTelemetry and GenAI semantic conventions. New columns cover operation type, provider, model, and session, plus five token categories (input, output, cache read, cache creation, reasoning) and split cost tracking with an estimation flag.

A new `@domain/spans` package defines two domain entities — `Span` for the index card, `SpanPayload` for the full blob — along with repository ports for ClickHouse, object storage, and an ephemeral ingest buffer. Three platform adapters implement these ports: a ClickHouse repository with row mapping and parameterized queries, a Flydrive-based object storage layer with gzip compression for permanent payloads, and a Redis read-through cache decorator for span payload reads.

* Move reference ids from text to cuids (#2415)

* Move reference ids from text to cuids

* add migs

* fix: tests

* fix: missing admin client in app/web

* feat: implement new nav layout, project sidebar, and traces UI

- Add global nav header with org name, project breadcrumb, user avatar
  dropdown (logout), Docs/Settings links, and dev-only theme toggle
- Add project sidebar with collapse support, nav items for Traces,
  Issues, Datasets, and Annotation queues (placeholder)
- Make Traces the default project page (/$projectId/), rename /spans
  routes to /traces
- Add placeholder Issues and Datasets pages under each project
- Update projects dashboard: new columns (Issues, Datasets, Traces 7D),
  org-name header, full-row click with accessible name link
- Add Text.H5M design system token (H5 size + medium weight preset)
- Fix modal animation: remove slide-from-top-left, use fade+zoom only
- Fix Container to accept className prop
- Add organizations server functions (getOrganization, countUserOrganizations)
- Fix __root.tsx import order; wire Agentation dev toolbar

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: consistent pt-14 spacing in settings page

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: avoid document access during SSR in ThemeToggle

Initialize isDark state as false and sync from document in useEffect
to prevent "document is not defined" error during server-side rendering.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: simplify ThemeToggle SSR fix

Drive isDark from local state rather than observing DOM mutations.
Single useEffect syncs initial value from document on mount.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* gitignore some extra stuff

* quickfix: minor issues in frontend

* Authenticated route SSR (#2420)

* fix: use ssr data-only in authenticated route to avoid SSR hydration issues

Co-authored-by: Gerard <gerard@latitude.so>

* refactor: remove redundant ClientOnly wrappers under data-only SSR route

Co-authored-by: Gerard <gerard@latitude.so>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>

* Add trace-level aggregation (#2418)

Introduces a materialized `traces` table in ClickHouse that continuously aggregates span-level telemetry into trace-level summaries (total cost, token usage, timing, status, models, providers, root span identity, and first/last LLM messages). This replaces the flat span listing in the web UI with a traces-first navigation hierarchy: project → traces → spans → span detail, giving users a natural top-down view of their LLM operations with per-trace cost and duration visible at a glance.
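
To make the shape of the aggregation concrete, here is an in-memory sketch of rolling spans up into per-trace summaries. It is only an illustration: the real work happens inside ClickHouse, and every field name below is an assumption, not the actual schema.

```typescript
// Hypothetical span row; the real ClickHouse columns are not shown in this log.
interface SpanRow {
  traceId: string
  spanId: string
  parentSpanId: string | null
  startTimeMs: number
  endTimeMs: number
  costUsd: number
  inputTokens: number
  outputTokens: number
  model: string | null
  isError: boolean
}

interface TraceSummary {
  traceId: string
  spanCount: number
  startTimeMs: number
  endTimeMs: number
  durationMs: number
  totalCostUsd: number
  totalTokens: number
  models: string[]
  hasError: boolean
  rootSpanId: string | null
}

function summarizeTraces(spans: readonly SpanRow[]): Map<string, TraceSummary> {
  const summaries = new Map<string, TraceSummary>()
  for (const span of spans) {
    let summary = summaries.get(span.traceId)
    if (summary === undefined) {
      summary = {
        traceId: span.traceId, spanCount: 0,
        startTimeMs: span.startTimeMs, endTimeMs: span.endTimeMs, durationMs: 0,
        totalCostUsd: 0, totalTokens: 0, models: [], hasError: false, rootSpanId: null,
      }
      summaries.set(span.traceId, summary)
    }
    summary.spanCount += 1
    summary.startTimeMs = Math.min(summary.startTimeMs, span.startTimeMs)
    summary.endTimeMs = Math.max(summary.endTimeMs, span.endTimeMs)
    summary.durationMs = summary.endTimeMs - summary.startTimeMs
    summary.totalCostUsd += span.costUsd
    summary.totalTokens += span.inputTokens + span.outputTokens
    if (span.model !== null && summary.models.indexOf(span.model) === -1) summary.models.push(span.model)
    if (span.isError) summary.hasError = true
    if (span.parentSpanId === null) summary.rootSpanId = span.spanId // root span has no parent
  }
  return summaries
}

// Usage: a root span with one child LLM call.
const traceSummaries = summarizeTraces([
  { traceId: "trace_1", spanId: "a", parentSpanId: null, startTimeMs: 1000, endTimeMs: 1600,
    costUsd: 0, inputTokens: 0, outputTokens: 0, model: null, isError: false },
  { traceId: "trace_1", spanId: "b", parentSpanId: "a", startTimeMs: 1100, endTimeMs: 1500,
    costUsd: 0.25, inputTokens: 10, outputTokens: 5, model: "model-a", isError: false },
])
```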

* remove localstoragefile nodeoption

* feat(api): migrate routes to OpenAPIHono with Swagger UI

Replace plain Hono with OpenAPIHono across all API route files. Add
zod-openapi schemas for api-keys, health, and projects routes. Expose
OpenAPI spec at /openapi.json and Swagger UI at /docs. Remove the
old Better Auth proxy route (auth.ts) — auth now lives in the web app.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add CLI device authorization flow

Implements the browser-redirect-and-poll auth pattern for CLI/agent use
cases (similar to GitHub CLI / Vercel CLI):

- POST /v1/auth/cli/initiate — creates a Redis-backed pending session
  and returns a loginUrl the CLI opens in the browser
- GET /v1/auth/cli/poll/:token — polls until the user authorizes;
  returns a durable API key + organizationId on success
- /auth/cli (web) — confirmation page where the logged-in user
  explicitly grants CLI access; calls exchangeCliSession which
  generates an API key and marks the Redis session as authenticated
- Threads the cliSession param through the full web auth flow
  (login → signup → magic link → confirm → /auth/cli) so new users
  can sign up and immediately authorize in one pass
- Fix: Sign up link on login page now preserves the cliSession param
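
The initiate, authorize, and poll steps above can be sketched as a small state machine. This version uses an in-memory `Map` where the real flow uses Redis, takes the token as an argument instead of generating a random one, and assumes the login URL carries the token as a `cliSession` query parameter; all function names are hypothetical.

```typescript
type CliSession =
  | { status: "pending" }
  | { status: "authenticated"; apiKey: string; organizationId: string }

type PollResult = CliSession | { status: "unknown" }

// In-memory stand-in for the Redis-backed session store (assumption for this sketch).
function createCliSessionStore(webBaseUrl: string) {
  const sessions = new Map<string, CliSession>()
  return {
    // Step 1: the CLI asks for a pending session and a URL to open in the browser.
    initiate(token: string): { token: string; loginUrl: string } {
      sessions.set(token, { status: "pending" })
      return { token, loginUrl: `${webBaseUrl}/auth/cli?cliSession=${encodeURIComponent(token)}` }
    },
    // Step 2: the logged-in user confirms in the web app, which attaches an API key.
    authorize(token: string, apiKey: string, organizationId: string): boolean {
      if (!sessions.has(token)) return false
      sessions.set(token, { status: "authenticated", apiKey, organizationId })
      return true
    },
    // Step 3: the CLI polls until the session flips to authenticated.
    poll(token: string): PollResult {
      return sessions.get(token) ?? { status: "unknown" }
    },
  }
}

// Usage
const cliStore = createCliSessionStore("https://app.example.test")
const initiated = cliStore.initiate("tok_abc")
const pollBefore = cliStore.poll("tok_abc")
cliStore.authorize("tok_abc", "key_123", "org_1")
const pollAfter = cliStore.poll("tok_abc")
```

A real implementation would also expire pending sessions and generate the token with a cryptographically secure source.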

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: Datasets domain — data model, RFC, and Zod 4 migration

This PR introduces the Datasets bounded context — a new domain for managing versioned evaluation datasets used in LLM testing, prompt engineering, and annotation workflows. It also includes an opportunistic migration from Zod 3 → Zod 4 across apps/web.

* Implement upload dataset (#2422)

* Add OTEL-compatible span ingestion endpoint (#2423)

## Summary

Adds a `POST /v1/traces` endpoint to the ingestion server that accepts OTLP/JSON and OTLP/Protobuf payloads, authenticates via `Authorization: Bearer <api-key>`, resolves the target project from an `X-Latitude-Project` header, and inserts the resulting spans into ClickHouse.

The endpoint extracts GenAI semantic convention attributes (`gen_ai.system`, `gen_ai.request.model`, token usage, etc.) into dedicated span fields during ingestion. Traces are derived automatically via the existing ClickHouse materialized view — no separate trace write path is needed.
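
A minimal sketch of that extraction step over OTLP/JSON attributes follows. The output field names, the helper names, and the choice of `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens` as the token attributes are assumptions for illustration; the source only names `gen_ai.system` and `gen_ai.request.model` explicitly.

```typescript
// Subset of the OTLP/JSON attribute encoding: int64 values usually arrive as strings.
interface OtlpAnyValue {
  stringValue?: string
  intValue?: string | number
}

interface OtlpKeyValue {
  key: string
  value: OtlpAnyValue
}

// Hypothetical dedicated span fields.
interface GenAiFields {
  provider: string | null
  model: string | null
  inputTokens: number
  outputTokens: number
}

function readString(value: OtlpAnyValue | undefined): string | null {
  return value?.stringValue ?? null
}

function readInt(value: OtlpAnyValue | undefined): number {
  const raw = value?.intValue
  if (raw === undefined) return 0
  const parsed = typeof raw === "number" ? raw : Number.parseInt(raw, 10)
  return Number.isFinite(parsed) ? parsed : 0
}

function extractGenAiFields(attributes: readonly OtlpKeyValue[]): GenAiFields {
  const byKey = new Map<string, OtlpAnyValue>()
  for (const attribute of attributes) byKey.set(attribute.key, attribute.value)
  return {
    provider: readString(byKey.get("gen_ai.system")),
    model: readString(byKey.get("gen_ai.request.model")),
    inputTokens: readInt(byKey.get("gen_ai.usage.input_tokens")),
    outputTokens: readInt(byKey.get("gen_ai.usage.output_tokens")),
  }
}

// Usage: missing attributes fall back to null or 0 instead of failing ingestion.
const genAiFields = extractGenAiFields([
  { key: "gen_ai.system", value: { stringValue: "openai" } },
  { key: "gen_ai.request.model", value: { stringValue: "gpt-4o" } },
  { key: "gen_ai.usage.input_tokens", value: { intValue: "12" } },
])
```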

API key validation reuses the same SHA-256 hash + Redis cache + Postgres lookup pattern used by the public API, with timing-safe enforcement to prevent enumeration attacks.

* feat: transactions and rls rules (#2426)

* feat: transactions and rls rules

Implements Row-Level Security (RLS) policies and SQL transaction management.

Row-Level Security (RLS):
- Database function get_current_organization_id() reads app.current_organization_id from session
- RLS policies on tables with organization_id filter rows automatically
- Schema uses organizationRLSPolicy() helper to enable RLS per table

SQL Transactions:
- Domain Layer (@domain/shared): SqlClient interface for database operations
- Platform Layer (@platform/db-postgres): SqlClientLive with automatic RLS context
- App Layer (apps/*): Boundaries provide SqlClientLive with organization context

Usage patterns:
- Repositories use sqlClient.query() for single operations
- Use cases use sqlClient.transaction() for multi-step operations
- Routes provide SqlClientLive(client, organizationId) for RLS enforcement

Key behaviors:
- Every transaction sets app.current_organization_id session variable
- Nested transactions share connection (pass-through proxy)
- Domain errors propagate; failures trigger automatic rollback

* fix: address PR review comments

- Return Effect from validateApiKey instead of Promise (restores original pattern)
- Replace raw SQL queries with Drizzle ORM in api-keys.test.ts
- Replace raw SQL queries with Drizzle ORM in create-test-app.ts
- Remove unused withRls method from in-memory-postgres.ts

Co-authored-by: Gerard <gerard@latitude.so>

* fix: address review findings across the stack

Critical fixes:
- Fix subscription-repository exists() always returning true (checked array not element)
- Fix exchangeCliSession creating API key with wrong org (pass activeOrganizationId)
- Fix complete-auth-intent test swallowing errors (assert expected behavior)
- Fix softDelete returning 204 for cross-tenant delete (now returns NotFoundError)
- Fix findPendingInvitesByOrganizationId to filter by org via JSONB query
- Preserve domain errors through transaction rollback (DomainErrorWrapper)

Cleanup:
- Replace raw SQL with Drizzle ORM in projects.test.ts
- Remove redundant mapError in auth-intent-repository
- Fix unsafe type cast in membership-repository (use Effect.map)
- Standardize datasets.functions.ts to Effect.provide style
- Fix seeds/run.ts passing undefined to SqlClientLive
- Add role existence check to createRlsMiddleware in tests

Co-authored-by: Gerard <gerard@latitude.so>

* fix: enable atomic transactions in multi-step use cases

Rewrites SqlClientLive to use closure-scoped transaction tracking with
a promise-bridge pattern. The inner effect now runs in the parent fiber
(via Effect.exit) instead of Effect.runPromiseExit, so all provided
services (repositories, cache invalidators, etc.) remain available
inside the transaction scope.

Key changes:
- SqlClientLive tracks activeTx via closure variable shared by query()
  and transaction() methods on the same instance
- transaction() uses promise bridge: Drizzle callback signals tx ready,
  awaits effect completion, then commits or rolls back
- query() checks activeTx at call time, reusing the transaction
  connection when one is active
- Repositories captured at boundary still participate in transactions
  because query() reads activeTx dynamically, not at capture time

Multi-step use cases now properly wrapped in sqlClient.transaction():
- createProjectUseCase: existsByName + existsBySlug + save
- updateProjectUseCase: findById + existsByName + save
- revokeApiKeyUseCase: findById + save + cache invalidation
- changePlan: findActive + revokeBySubscription + saveMany + save

Co-authored-by: Gerard <gerard@latitude.so>

* feat: add ChSqlClient and withPostgres/withClickHouse helpers

Introduces ChSqlClient as the ClickHouse counterpart to SqlClient:
- ChSqlClient domain interface in @domain/shared (pass-through
  transaction, direct query execution)
- ChSqlClientLive implementation in @platform/db-clickhouse
- DatasetRowRepositoryLive migrated from closure-captured client to
  ChSqlClient service pattern (Layer.effect + yield* ChSqlClient)

Adds withPostgres and withClickHouse helpers that bundle repository
layers with their database client in a single call:
- Uses Layer.provideMerge so the SqlClient/ChSqlClient service is
  available both to the repo layers AND to the outer effect (needed
  for use-case-level sqlClient.transaction() calls)
- Repos sharing the same helper call share the same client instance

Boundary callers updated from:
  Effect.provide(RepoLive),
  Effect.provide(SqlClientLive(client, orgId)),
To:
  Effect.provide(withPostgres(client, orgId, RepoLive)),

Co-authored-by: Gerard <gerard@latitude.so>

* add tests for the postgres sql client

* update lockfile

* fix signup/login

* added comment, do not set rls if admin context

* test(db-postgres): fix sql-client test for system org RLS skip

The implementation skips set_config for the system/admin org context
(added in 9256be69c), but the test still expected 1 executed statement.
Update the assertion to expect 0 statements.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(testkit): add PGlite in-memory Postgres adapter and concurrent tx detection

- Add `createInMemoryPostgres` / `closeInMemoryPostgres` to testkit using
  PGlite so tests can run without a real Postgres instance
- Export from testkit index; add `@electric-sql/pglite` and `drizzle-orm`
  as deps to testkit
- Add `vitest.config.ts` to `apps/web` wiring up the vitest preset
- Harden `SqlClientLive`: track `txOpening` flag so concurrent `transaction()`
  calls on the same instance fail fast via `Effect.die` instead of silently
  corrupting connections; add tests for both the error and the sequential-OK path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: make withPostgres/withClickHouse return pipe-compatible providers

Previously withPostgres and withClickHouse returned a Layer, requiring
callers to wrap every call site with Effect.provide(). They also accepted
(client, orgId, ...layers) which was hard to type correctly with variadic
layer arguments.

New signature: (layer, client, orgId?) => pipe-compatible function via
Effect.provide internally. Call sites can now use effect.pipe(withPostgres(...))
directly without the outer Effect.provide wrapper.

- Rename withClickHouse to the same shape as withPostgres
- Migrate all call sites in apps/web and apps/api to the new signature
- Consolidate auth functions to withPostgres instead of multiple
  Effect.provide(RepositoryLive) + Effect.provide(SqlClientLive) chains
- Switch exchangeCliSession from adminClient to scoped postgresClient
  since API key creation is org-scoped

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: migrate remaining ClickHouse repos to new pattern

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Edit and delete dataset rows (#2424)

* Defer span processing to background worker via object storage (#2425)

The OTEL ingest endpoint previously parsed, transformed, and wrote spans to ClickHouse synchronously within the HTTP request. As the span processing logic grows more complex, this coupling increases response latency and risks timeouts from OTEL exporters.

This change splits ingestion into two phases: the endpoint now validates the payload, buffers it to object storage, and enqueues a BullMQ job — returning immediately. A new background worker picks up the job, runs the transform, persists to ClickHouse, and cleans up the buffered payload. This keeps the ingest endpoint lightweight and allows processing to retry independently on failure.

The OTLP parsing and transform logic was relocated from `apps/ingest` to `@domain/spans` so both the ingest app and workers app can share it without cross-app imports.
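
The two phases can be sketched with in-memory stand-ins. Nothing below is the actual implementation: the `Map` stands in for object storage, the array for the BullMQ queue, `persistedRows` for ClickHouse, and the buffer key layout and function names are invented for the example.

```typescript
interface IngestJob {
  organizationId: string
  bufferKey: string
}

const payloadBuffer = new Map<string, string>() // stand-in for object storage
const jobQueue: IngestJob[] = [] // stand-in for the job queue
const persistedRows: unknown[] = [] // stand-in for the ClickHouse table

// Phase 1 (HTTP request): validate cheaply, buffer the raw payload, enqueue, return.
function ingest(organizationId: string, requestId: string, body: string): { accepted: boolean } {
  let parsed: unknown
  try {
    parsed = JSON.parse(body)
  } catch {
    return { accepted: false }
  }
  if (typeof parsed !== "object" || parsed === null || !("resourceSpans" in parsed)) {
    return { accepted: false }
  }
  const bufferKey = `ingest/${organizationId}/${requestId}.json` // hypothetical key layout
  payloadBuffer.set(bufferKey, body)
  jobQueue.push({ organizationId, bufferKey })
  return { accepted: true }
}

// Phase 2 (background worker): load the payload, transform, persist, then clean up.
function processNextJob(): boolean {
  const job = jobQueue.shift()
  if (job === undefined) return false
  const raw = payloadBuffer.get(job.bufferKey)
  if (raw === undefined) return false
  persistedRows.push({ organizationId: job.organizationId, payload: JSON.parse(raw) })
  payloadBuffer.delete(job.bufferKey) // delete only after the write succeeded, so a retry can re-read it
  return true
}

// Usage
const ingestResult = ingest("org_1", "req_1", '{"resourceSpans":[]}')
const processedJob = processNextJob()
```

Deleting the buffered payload only after the write is what lets a failed job be retried without the exporter resending anything.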

* Feat/redpanda (#2428)

* Defer span processing to background worker via object storage

The OTEL ingest endpoint previously parsed, transformed, and wrote spans to ClickHouse synchronously within the HTTP request. As the span processing logic grows more complex, this coupling increases response latency and risks timeouts from OTEL exporters.

This change splits ingestion into two phases: the endpoint now validates the payload, buffers it to object storage, and enqueues a BullMQ job — returning immediately. A new background worker picks up the job, runs the transform, persists to ClickHouse, and cleans up the buffered payload. This keeps the ingest endpoint lightweight and allows processing to retry independently on failure.

The OTLP parsing and transform logic was relocated from `apps/ingest` to `@domain/spans` so both the ingest app and workers app can share it without cross-app imports.

* feat: s/redis/redpanda

Redis won't hold up at real-world scale for the ingestion pipeline, so
we're replacing it with Redpanda. We keep Redis for caching.

* move topics to code

* fix: resolve Biome import/export ordering in queue-redpanda

Co-authored-by: Gerard <gerard@latitude.so>

* feat(queue-redpanda): add span ingestion via Redpanda/Kafka

Replace object storage + BullMQ with Redpanda (Kafka) for span ingestion:

- apps/ingest: Publish OTLP traces to span-ingestion Kafka topic
- apps/workers: Consume from span-ingestion topic and write to ClickHouse
- domain/spans: Move OTLP protobuf decoding from ingest to domain package
- queue-redpanda:
  - Add span-ingestion topic
  - Remove unused redpandaQueueAdapter constant and RedpandaEventsPublisherLive Layer
  - Export RedpandaEventsPublisherConfig type
  - Fix config.test.ts NODE_ENV cleanup to use beforeEach/afterEach
  - Remove unnecessary devDependencies (vitest, @repo/vitest-config)

Review comment fixes:
- Remove unused @domain/spans from apps/ingest
- Remove unused @clickhouse/client from apps/workers

* minor code review fixes

* fix: address PR review items 1-4

1. Add span-ingestion topic to Docker Compose redpanda-init
2. Remove dead SPAN_INGESTION_QUEUE from @domain/shared
3. Remove unused RedpandaQueueAdapterTag from queue-redpanda
4. Add error logging to ingest route catch handlers

* update lockfile

* refactor: let Redpanda handle retries via offset management

Remove manual retry loops and try/catch around processing errors.
Transient failures now propagate to kafkajs, which will retry the
message by not advancing the consumer offset. Permanent errors
(poison pills like decode failures) are still caught and skipped.
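
The split between permanent and transient failures might look roughly like this. It is a synchronous, library-free sketch: `handleMessage`, `makeDecodeError`, and the error-name check are hypothetical, and the real consumer runs inside a kafkajs message handler.

```typescript
type Outcome = "processed" | "skipped"

// Hypothetical marker for permanent failures: an Error whose name is "DecodeError".
function makeDecodeError(message: string): Error {
  const error = new Error(message)
  error.name = "DecodeError"
  return error
}

function handleMessage(
  raw: string,
  decode: (raw: string) => unknown,
  write: (span: unknown) => void,
): Outcome {
  let decoded: unknown
  try {
    decoded = decode(raw)
  } catch (error) {
    // Poison pill: retrying can never succeed, so skip it and let the offset advance.
    if (error instanceof Error && error.name === "DecodeError") return "skipped"
    throw error
  }
  // No try/catch here: a transient write failure propagates to the consumer,
  // the offset is not advanced, and the message is delivered again.
  write(decoded)
  return "processed"
}

function decodeJson(raw: string): unknown {
  try {
    return JSON.parse(raw)
  } catch {
    throw makeDecodeError("invalid JSON")
  }
}

// Usage
const writtenSpans: unknown[] = []
const skippedOutcome = handleMessage("{bad", decodeJson, (span) => { writtenSpans.push(span) })
const processedOutcome = handleMessage('{"a":1}', decodeJson, (span) => { writtenSpans.push(span) })
```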

* cleanup: remove unused createKafkaClientEffect

---------

Co-authored-by: Carlos Sansón <csansoon@gmail.com>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>

* Resolve README.md merge conflict for v2

* Resolve README.md merge conflict - simplify v2 note

* Copy README.md verbatim from main to resolve merge conflict

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Andrés <andresgutgon@gmail.com>
Co-authored-by: Carlos Sansón <csansoon@gmail.com>
Co-authored-by: Alex Rodríguez <me@arn.sh>
Co-authored-by: Alex Rodríguez <alex@latitude.so>
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Carlos Sansón <57395395+csansoon@users.noreply.github.com>
