Engineering handbook #2 (#2288)

maxdeichmann · ellipsis-dev[bot] · web-flow · commit ce5758ef3d2b · 2025-11-12T10:14:06.000Z
* chore: new engineering docs

* chore: new engineering docs

* chore: new engineering docs

* add security advisory

* add security advisory

* fixes

* fixes

* Update pages/handbook/product-engineering/how-we-work/code-review.mdx

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

* Update pages/handbook/product-engineering/how-we-work/code-review.mdx

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

* Update pages/handbook/product-engineering/how-we-work/code-review.mdx

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

* fixes

* fixes

* fixes

* fixes

* fixes

* fixes

* Update pages/handbook/product-engineering/how-we-work/code-review.mdx

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

* fixes

* fixes

* fixes

* fixes

* add tech stack

* add tech stack

* improve engineering docs

* improve engineering docs

* improve engineering docs

* improve engineering docs

---------

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
diff --git a/pages/handbook/product-engineering/architecture.mdx b/pages/handbook/product-engineering/architecture.mdx
@@ -39,6 +39,13 @@ flowchart LR
 - **Redis**: Stores event queue (BullMQ) and caching layer (API keys, prompts).
 - **S3**: Stores raw ingestion events and multi-modal attachments (images, audio).
 
+
+### Why do we need an OLAP database (Clickhouse) for observability data?
+- We initially built Langfuse on Postgres and eventually migrated to Clickhouse. We always knew that Postgres would not be the best fit for our observability data.
+- OLAP databases use a columnar layout, so the database scans only the columns required to produce results for analytical queries (e.g. LLM cost over time).
+- We needed a multi-node database to scale our data inserts.
+- As we are an open source product, we required a database that is available under an open source license.
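To illustrate the columnar-layout point, here is a minimal TypeScript sketch. It is a toy in-memory model, not how Clickhouse is implemented, and the `Observation` fields and sample values are hypothetical rather than the Langfuse schema. It only shows why a "cost over time" query needs two columns and can skip large payload fields entirely.

```typescript
// Hypothetical record shape for illustration only (not the Langfuse schema).
interface Observation {
  day: string;
  model: string;
  costUsd: number;
  input: string; // large payload that a cost query never needs
}

// Row layout: each record is stored together, so a scan touches every field.
const rows: Observation[] = [
  { day: "2025-11-10", model: "model-a", costUsd: 0.25, input: "long prompt..." },
  { day: "2025-11-10", model: "model-b", costUsd: 0.5, input: "long prompt..." },
  { day: "2025-11-11", model: "model-a", costUsd: 1.25, input: "long prompt..." },
];

// Column layout: one array per field, aligned by index.
interface ObservationColumns {
  day: string[];
  model: string[];
  costUsd: number[];
  input: string[];
}

function toColumns(records: Observation[]): ObservationColumns {
  return {
    day: records.map((r) => r.day),
    model: records.map((r) => r.model),
    costUsd: records.map((r) => r.costUsd),
    input: records.map((r) => r.input),
  };
}

// "LLM cost over time" reads only the `day` and `costUsd` columns;
// the `model` and `input` columns are never touched.
function costPerDay(day: string[], costUsd: number[]): Map<string, number> {
  const totals = new Map<string, number>();
  day.forEach((d, i) => {
    totals.set(d, (totals.get(d) ?? 0) + (costUsd[i] ?? 0));
  });
  return totals;
}

const columns = toColumns(rows);
const totals = costPerDay(columns.day, columns.costUsd);
```

In a real columnar store the untouched columns also stay on disk, which is where most of the savings for analytical queries come from.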
+
 ---
 
 
diff --git a/pages/handbook/product-engineering/how-we-work/_meta.tsx b/pages/handbook/product-engineering/how-we-work/_meta.tsx
@@ -1,6 +1,7 @@
 export default {
   "roadmapping": "Roadmapping",
   "workflow": "Workflow",
+  "onboarding": "Onboarding",
   "*": {
     layout: "default",
   },
diff --git a/pages/handbook/product-engineering/how-we-work/code-review.mdx b/pages/handbook/product-engineering/how-we-work/code-review.mdx
@@ -4,7 +4,7 @@
 - Low risk (2 way door): go ahead and merge. Engineer is responsible for monitoring Datadog in case something goes wrong. Notify on-call engineer for roll-backs if required.
   - Product expert review: sometimes we build a larger change in someone else’s area of responsibility. Feel free to ask for feedback from [DRI](https://docs.google.com/spreadsheets/d/1gOvWf_uSAtcXxWkR_gMuWX8-OYmSJNIPXTmPGfZ2DGY/edit?gid=0#gid=0).
 - Risk (1 way door): Assign Max as reviewer (Max sees open PR reviews in Linear, SLA 24h, ping if more urgent). **Non exhaustive list of PRs which require a review**: database migrations, changes in the public API, changes in SDK signatures, auth changes, major changes in the ingestion pipeline, larger infra changes. If you are unsure, ask Max for a review.
-- New joiners should get all their PRs reviewed for the first 1 month at least. This helps a lot with knowledge sharing and making sure we use abstractions within our code base the right way. Afterwards, we move towards no-review merges.
+- New joiners should get all their PRs reviewed for at least their first month. This helps them learn about the code base and how our system works.
 
 
 ## Responsibility of the PR author
diff --git a/pages/handbook/product-engineering/how-we-work/onboarding.mdx b/pages/handbook/product-engineering/how-we-work/onboarding.mdx
@@ -0,0 +1,83 @@
+# Onboarding at Langfuse
+
+Welcome to Langfuse!
+
+This document outlines how onboarding works for every new team member — what to expect on your first day, first week, first month, and beyond. Our goal is to help you quickly gain context, feel confident contributing, and become an owner of your area of the product.
+
+At Langfuse, we lead with context, not control. We share the "why" and "what's important right now," so you can make the best decisions for our users. You'll get support, feedback, and guidance along the way — but you'll also have the space to take ownership early.
+
+
+## Timeline Overview
+
+| Milestone | Focus | Key Outcome |
+|-----------|-------|-------------|
+| **Day 1** | Getting oriented | Understand company priorities and get set up |
+| **Week 1** | Making your first contribution | Deploy your first change to production |
+| **Months 1-3** | Taking ownership | Complete your first independent project end-to-end |
+| **Month 6** | Becoming a go-to person | Own a product area and be a trusted domain expert |
+
+---
+
+## Day 1 — Getting Oriented
+
+**Goal:** Understand company priorities and get set up for work.
+
+- **Welcome meeting with Max (CTO)** — Context on company status, top priorities, challenges, and decision-making
+- **Setup and access** — GitHub, Datadog, AWS, Slack, and local development environment
+- **First task** — Small starter task to explore codebase and workflows. Goal: merge something within the first few days
+
+
+
+## Week 1 — Making Your First Contribution
+
+**Goal:** Deploy your first change to production and understand our systems.
+
+By the end of Week 1:
+- Fully working development setup
+- Merged and deployed your first change
+- Understand high-level architecture, CI/CD process, and monitoring
+- Read and acknowledge policies (security, compliance, data handling)
+- Engage in team communication
+
+_We don't expect speed — we expect curiosity. Take the time to explore and ask "why."_
+
+
+
+## Months 1-3 — Taking Ownership
+
+**Goal:** Deliver your first independent project end-to-end.
+
+By the end of Month 3, you should have:
+- **Shipped an independent project** — Led a small feature or improvement from start to finish (technical planning, product thinking, implementation, documentation, user research)
+- **Joined support rotation** — Started handling support tickets, seeing product through users' eyes, and maintaining your improvement backlog
+- **Learned through reviews** — Received feedback on every PR from the team during your first ~2 months
+- **Balanced priorities** — Demonstrated ability to align your backlog (support, fixes, improvements) with company priorities
+- **Contributed to planning** — Participated in planning discussions
+
+
+## Month 6 — Becoming a Go-To Person
+
+**Goal:** Own a product area and be a trusted domain expert.
+
+By six months:
+- **Domain expertise** — Be the go-to person for a specific area (e.g., evals, datasets, integrations, infrastructure)
+- **Shipped projects** — Multiple projects delivered from idea to production
+- **Mentorship** — Able to mentor new joiners and review PRs in your area
+- **Self-sufficient** — Fully independent in planning, building, and deploying
+- **Impact-driven** — Clear sense of how to prioritize for maximum impact
+
+_You're not just contributing — you're shaping how we build and make product decisions._
+
+
+
+## Ongoing Support
+
+Throughout your onboarding and beyond, you'll have support from:
+
+- **Max and the team**: Available for prioritization help and technical guidance
+- **Pull request reviews**: Continuous feedback on your code
+- **Support rotation**: Learn from real user problems and needs
+- **Context sharing**: Regular updates on company priorities and strategic direction
+
+
+Welcome to the team!
diff --git a/pages/handbook/product-engineering/how-we-work/roadmapping.mdx b/pages/handbook/product-engineering/how-we-work/roadmapping.mdx
@@ -14,7 +14,7 @@ title: Roadmapping
 ## Process
 
 <Callout type="info">
-  All process stages are run in the “Roadmap” Figjam (internal)
+  All process stages are run in the “Roadmap” Figjam (internal). We run this process once a quarter.
 </Callout>
 
 1. Exploration
diff --git a/pages/handbook/product-engineering/how-we-work/workflow.mdx b/pages/handbook/product-engineering/how-we-work/workflow.mdx
@@ -1,61 +1,55 @@
 # Engineering Workflow
 
-## Specification
+## Prioritization
+
+Our goal is to build a company where we do not spend hours each week triaging and prioritizing tickets. Therefore, everyone has to keep track of their own priorities and escalate to Max if the workload is getting too high. This only works if we all have a shared understanding of priorities and SLAs.
+
+In case of uncertainty: Tag Max (24h SLA on Linear inbox). Everyone should have a clear view on what priorities are. The best way to achieve this is by having one Linear project or issue for the current main task. For bugs and smaller improvements, use the [bug view](https://linear.app/langfuse/view/bugs-confirmedopen-97856f9f745c) or “My issues”. Tickets in Linear should always have a prioritization:
+
+### Issue States
+
+| State | Description |
+|-------|-------------|
+| **Triage** | - Unassigned, no labels or prios<br/>- Created via GitHub Issue integration or in Linear by non-engineering team members<br/>- Marc/Max subscribe, triage, assign, add labels, and refine the title<br/>- After assignment, the engineer dedupes with existing tickets and handles communication with users |
+| **Backlog** | - Everyone manages their backlogs in Linear<br/>- Issues always have a label. Use Linear views to look at all tickets of a label. We only create projects for work that has a clear end (no endless bucket of tickets)<br/>- Add user feedback to tickets or projects: If via Plain, use the "Link thread" feature. Link Linear issues to Plain threads straight from Plain. Snooze issues in Plain that we want to review again, e.g. next week<br/>- Only backlog what you plan/hope to do over the next 1-2 quarters; the rest goes to GitHub Discussions<br/>- Add labels to product areas<br/>- Issue titles: Titles have to be good enough that someone reading a list of titles knows what each is about. Descriptions are optional. Write short and precise descriptions so everyone understands |
+| **Todo/Progress** | See priority table below for handling guidelines |
+
+### Priority Levels
 
-Goal: we should make sure to have as few loops as possible per change.
+| Priority | Timeline | Description | Examples |
+|----------|----------|-------------|----------|
+| **P0 (urgent)** | Drop everything and fix | - Security incidents (e.g. data breach)<br/>- Performance issues (ingestion delay, Clickhouse CPU/memory issues)<br/>- Issues that have large-scale impact and break our application | Traces table does not load, login broken |
+| **P1 (high)** | Same week resolution | - Issues with smaller-scale impact<br/>- Improvements that are \<1h work and have a big impact for many customers; moving fast on these encourages users to share more feedback because they are excited that we ship<br/>- Delight a user same day with a small change that helps them | Some edge case does not work for dashboards |
+| **P2 (mid)** | Same month resolution | - Fixes for papercuts that affect our users but do not have wide-ranging impact | Trace tree UI breaks when users have many scores |
+| **P3 (low)** | Backlog | - Additions to Langfuse that are nice to have but not urgent | Create new prompt based on a non-latest prompt version |
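The priority table above could also be expressed as data, for example to flag tickets that have outlived their target. The following is a minimal sketch only: `PRIORITY_POLICIES`, `targetDays`, and `isOverdue` are hypothetical names, and the numeric day counts are rough assumptions standing in for "same week" and "same month", not values defined in the handbook or by Linear.

```typescript
type Priority = "P0" | "P1" | "P2" | "P3";

interface PriorityPolicy {
  label: string;
  timeline: string; // wording taken from the priority table
  // Assumed numeric approximation of the timeline, in days.
  // null means no deadline (backlog).
  targetDays: number | null;
}

const PRIORITY_POLICIES: Record<Priority, PriorityPolicy> = {
  P0: { label: "urgent", timeline: "Drop everything and fix", targetDays: 0 },
  P1: { label: "high", timeline: "Same week resolution", targetDays: 7 },
  P2: { label: "mid", timeline: "Same month resolution", targetDays: 30 },
  P3: { label: "low", timeline: "Backlog", targetDays: null },
};

// Returns true when a ticket has been open longer than the assumed target
// for its priority. Backlog tickets (P3) are never overdue.
function isOverdue(priority: Priority, ageInDays: number): boolean {
  const { targetDays } = PRIORITY_POLICIES[priority];
  return targetDays !== null && ageInDays > targetDays;
}
```

For example, `isOverdue("P1", 10)` is true under these assumed targets, while a P3 ticket of any age is not flagged.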
 
-- Small/mid sized changes (e.g. adding a new filter to a table, adding a new field to a form…)
-  - Engineer leads, very brief notes on linear or just a descriptive title
-  - Open to do brief discussions with Max/Marc if helpful to speed process up and reduce risk/uncertainty
+## Specification on how to build things
+
+- We want to ship fast with a small team. To achieve this while maintaining quality, we need to balance individual agency and speed with making smart design decisions early in the process.
+- As Langfuse has scaled to a platform processing billions of events, the implementation strategy of a feature can have a substantial impact on implementation effort, maintainability, user experience, cloud cost, and performance.
+- For such features, getting the right people in one room early is a massive time saver and allows us to ship fast and with high quality.
+
+### Small/mid sized changes (e.g. adding a new filter to a table, adding a new field to a form…)
+  - Engineer creates a Linear ticket with very brief notes or just a descriptive title
+  - Open to do brief discussions with Max if helpful to speed process up and reduce risk/uncertainty
   - If you plan to have a reviewer of a change, please make sure to involve the reviewer in the planning process.
-- Large projects (e.g. supporting Agents in Langfuse, rebuilding SDKs ..)
-  1. Engineer does initial investigation and then schedules meeting with Marc/Max and other team members who have a lot of relevant context
-  2. Meeting goal: Derisking and clarifying all important topics
+  - Asking Claude Code or ChatGPT "what am I missing?" or "how would you build this?" is a great way to get feedback and reduce risk/uncertainty.
+
+### Large projects (e.g. supporting Agents in Langfuse, rebuilding SDKs ..)
+  1. Engineer does initial research and then schedules a meeting with Marc/Max and other team members who have a lot of relevant context. Ideally, the engineer creates a Google Doc or Linear ticket with the initial research and a rough specification.
+  2. Meeting to make decisions and plan the implementation:
+     - Meeting usually starts with everyone reading and commenting to make sure we are all on the same page.
      - Recorded \-\> meeting includes lots of details, good to generate spec/issue-description/working with Claude Code
      - Discussion on timelines and how we can make cut requirements to build faster
      - Implementation plan afterwards, spin with Max if needed
      - If things change in a meeting (also follow-up discussions), small note logged to Linear ticket / project.
      - Engineer needs to make sure to have relevant people in the room to make a decision ([DRI](https://docs.google.com/spreadsheets/d/1gOvWf_uSAtcXxWkR_gMuWX8-OYmSJNIPXTmPGfZ2DGY/edit?gid=0#gid=0)).
      - Engineer needs to manage Linear project based on the outcome of the meeting.
-  3. Engineer may need input on UI/UX or Clickhouse queries
-     - Get quick feedback on implementation thoughts from respective owners ([DRI](https://docs.google.com/spreadsheets/d/1gOvWf_uSAtcXxWkR_gMuWX8-OYmSJNIPXTmPGfZ2DGY/edit?gid=0#gid=0)).
-     - Based on the initial discussion, ask the owner for PR reviews if needed.
-
-## Prioritization
-
-In case of uncertainty: Tag Max (24h SLA on Linear inbox). Everyone should have a clear view on what priorities are. The best way to achieve this is by having one Linear project or issue for the current main task. For bugs and smaller improvements, use the [bug view](https://linear.app/langfuse/view/bugs-confirmedopen-97856f9f745c) or “My issues”. Tickets in Linear should always have a prioritization:
 
-- **State Triage**
-  - State: unassigned, no labels or prios
-  - Created via
-    - GitHub Issue integration
-    - In Linear by non-engineering team members
-  - Marc/Max subscribe, triage and assign, and add labels, and refine title
-  - After assignment, Engineer dedupes with existing tickets and handles communications with users.
-- **State Backlog**
-  - Everyone manages their backlogs.
-  - Issues always have a label. Use Linear views to look at all tickets of a label. We only create projects for work that has a clear end (no endless bucket of tickets).
-  - Add user feedback to tickets or projects
-    - If via Plain, use the “Link thread” feature of plain. Thereby you get all those threads in “close the loop” once the issue is resolved
-    - Also, link Linear issues to Plain threads straight from Plain.
-    - Snooze issues in Plain for which we want to review again e.g. next week.
-  - Only backlog what you plan/hope to do over the next 1-2 Quarters, rest GitHub Discussions
-  - Add labels to product areas
-  - Issue titles: Titles have to be good enough so that someone who reads a list of titles knows what each is about. Descriptions are optional. Write short and precise descriptions so everyone understands.
-- **State Todo/progress**
-  - P0 (urgent) – drop everything and fix
-    - Incident
-      - Security incident (e.g. data breach)
-      - Performance issues (ingestion delay, clickhouse CPU / memory issues)
-    - issues that have large scale impact and break our application (e.g. traces table does not load, login broken..).
-  - P1 (high) – same week resolution
-    - issues with smaller scale impact (e.g. some edge case does not work for dashboards)
-    - Improvements that are \<1h work and have a big impact for many customers; great to move fast on these to make users share more feedback as they are excited that we ship.
-    - Delight a user same day for a small change that helps them.
-  - P2 (mid) – same month resolution
-    - Fixes which are papercut for our users but do not have a wide range impact (e.g. trace tree UI breaks when users have many scores)
-  - P2 (low) – backlog
-    - Addition to Langfuse which is nice to have but not urgent (e.g. create new prompt based on non latest prompt version)
+### Ways of pulling help from others
+Engineers can ask others for help with their work at any time. 15 minutes of shared discussion can greatly improve the overall output.
+- UI/UX: For larger UI/UX changes, it is helpful to get input from Max or Marc. Otherwise, draw a small sketch and ask anyone from the team for feedback. We have to take time to polish the UI/UX.
+- Clickhouse: For more complex Clickhouse queries, it is helpful to get input from the Clickhouse [DRI](https://docs.google.com/spreadsheets/d/1gOvWf_uSAtcXxWkR_gMuWX8-OYmSJNIPXTmPGfZ2DGY/edit?gid=0#gid=0). Otherwise, we may end up with anti-patterns that hurt performance and maintainability.
 
 ## Implementation
 
@@ -80,19 +74,20 @@ Slack is extremely busy with many noisy channels. We do not want you to have too
 We all need to do busy work and at the same time need to make progress on the most important project we want to drive forward. Hence:
 
 - You should spend max 2h/day on coding bug fixes, support tickets and similar. Sometimes it is necessary and super important for our company to fix bugs. Yet, if you continuously spend more time than 2h/day on bug, talk to Max to get buy-in or distribute work in the team.  
-- If you are a reviewer on a PR, you have to review the same day. It's important to not block others.  
+- If you are a reviewer on a PR, you have to review within 24h. Please see [code review](/handbook/product-engineering/how-we-work/code-review) for more details.
 - You are expected to clear out your Plain inbox 2 times a day. Always acknowledge a request by a user ASAP and then look into it / work on a fix.  
 - You are expected to clear out your Linear inbox once a day.
 
 ## Linear
 
-Let us all heavily use Linear for planning and for technical and product discussions. I think that Linear can help us all individually and as a team:
+We use Linear as our internal project planning and ticketing tool. It helps us to:
 
 - Collect user feedback in one place  
 - Discuss product requirements and implementation details  
 - Understand what work is left to finish a project  
 - Triage and prioritize bugs  
 - Reduce Linear/Slack knowledge split. Keep as much knowledge in Linear as possible.
+- Integrate with different tools (Plain, GitHub, Cursor, etc.) to make the workflow smoother.
 
 ### Conventions