
feat: Projects with shared knowledge (mini-RAG) #2192

Open
tessaherself wants to merge 47 commits into huggingface:main from tessaherself:research-projects-feature

@tessaherself

Summary

Adds a Projects feature — named containers for conversations with shared context:

  • Custom instructions (preprompt) applied to all conversations in a project
  • Default model override per project
  • Knowledge files — upload documents (PDF, TXT, MD, CSV, JSON, etc.) that get injected into every conversation's system prompt

Knowledge injection (two-tier system)

  • Tier 1 — Context stuffing: For small knowledge bases (< 50k chars total), all file text is prepended directly to the system prompt. Zero extra infrastructure needed.
  • Tier 2 — Chunk + retrieve: For larger knowledge bases, files are chunked → embedded via HuggingFace TEI → top-K relevant chunks retrieved per user message via cosine similarity. Requires TEI_ENDPOINT env var.

Tier selection is automatic based on total project knowledge size. Falls back gracefully to Tier 1 if no TEI endpoint is configured.
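
The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `selectTier`, `stuffKnowledge`, and the `KnowledgeFileText` and `TierConfig` shapes are hypothetical names, and treating a total of exactly the threshold as Tier 2 is an assumption. Only the 50,000-character default and the fallback to Tier 1 without a TEI endpoint come from the description.

```typescript
// Hypothetical types and helpers for illustration only.
type KnowledgeTier = "stuff" | "retrieve";

interface KnowledgeFileText {
  name: string;
  text: string;
}

interface TierConfig {
  charThreshold: number; // e.g. PROJECT_KNOWLEDGE_CHAR_THRESHOLD (default 50000)
  teiEndpoint?: string; // e.g. TEI_ENDPOINT; undefined when not configured
}

// Pick a tier from the total extracted text size of the project.
function selectTier(files: KnowledgeFileText[], config: TierConfig): KnowledgeTier {
  const totalChars = files.reduce((sum, file) => sum + file.text.length, 0);
  // Small knowledge bases, and deployments without TEI, use context stuffing.
  if (totalChars < config.charThreshold || !config.teiEndpoint) {
    return "stuff";
  }
  return "retrieve";
}

// Tier 1: prepend every file's text to the project preprompt.
function stuffKnowledge(preprompt: string, files: KnowledgeFileText[]): string {
  const knowledge = files.map((file) => `# ${file.name}\n${file.text}`).join("\n\n");
  return knowledge ? `${knowledge}\n\n${preprompt}` : preprompt;
}
```

In this sketch the fallback is simply the second half of the condition: without an endpoint, even a large knowledge base is stuffed into the prompt, which may exceed the model's context window.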

What's included

| Area | Details |
| --- | --- |
| Types | Project, ProjectKnowledgeFile, ProjectKnowledgeChunk |
| API | Full CRUD for projects + knowledge files (/api/v2/projects/...) |
| DB | 3 new collections with proper indexes, ownership via authCondition |
| UI | Sidebar project section with filter, create/edit modal, file upload manager |
| Pipeline | Single 6-line change in textGeneration/index.ts; knowledge becomes part of preprompt |
| Cleanup | Deleting a project removes all knowledge files, chunks, and GridFS data |

Design decisions

  • Follows existing Conversation/Assistant ownership patterns (userId/sessionId)
  • Text extraction is synchronous at upload (immediate Tier 1 availability)
  • Embedding is async (fire-and-forget, status shown in UI)
  • Cosine similarity computed in application code (works with MongoMemoryServer in dev — no Atlas dependency)
  • New dependency: pdf-parse for PDF text extraction
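
To illustrate the in-application similarity search mentioned in the list above, here is a minimal sketch of cosine-similarity ranking over already-embedded chunks. The `EmbeddedChunk` shape and the `cosineSimilarity` and `topKChunks` names are hypothetical, and the actual implementation in this PR may differ (for example in how it loads chunks from MongoDB or handles ties).

```typescript
// Hypothetical chunk shape: text plus its embedding vector.
interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // A zero vector has no direction; treat it as unrelated.
  return denom === 0 ? 0 : dot / denom;
}

// Return the k chunks most similar to the query embedding, best first.
function topKChunks(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

A linear scan like this needs no vector index, which is what keeps it compatible with MongoMemoryServer, at the cost of scoring every chunk in the project on each user message.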

Configuration

| Env var | Default | Description |
| --- | --- | --- |
| TEI_ENDPOINT | | HuggingFace TEI server URL (enables Tier 2) |
| PROJECT_KNOWLEDGE_CHAR_THRESHOLD | 50000 | Chars above which Tier 2 kicks in |
| PROJECT_KNOWLEDGE_CHUNK_SIZE | 1000 | Characters per chunk |
| PROJECT_KNOWLEDGE_CHUNK_OVERLAP | 200 | Overlap between chunks |
| PROJECT_KNOWLEDGE_TOP_K | 5 | Chunks retrieved per query |
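
As a rough illustration of how the chunk size and overlap settings interact, the sketch below splits text with a fixed-size sliding window. The `chunkText` function is hypothetical, and a plain character window is an assumption: the PR may split on sentence or paragraph boundaries instead. The defaults mirror the values listed above.

```typescript
// Hypothetical helper: split text into fixed-size character windows, where
// each window repeats the last `overlap` characters of the previous one.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) {
      break;
    }
  }
  return chunks;
}
```

With the defaults, a 2,000-character document yields three chunks starting at offsets 0, 800, and 1,600, so a sentence that straddles a chunk boundary is still likely to appear whole in one of the two neighbouring chunks.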

Relationship to existing work

Complementary to the drag-to-group branch (claude/drag-conversations-grouping-GpwPJ) — that adds drag-and-drop as an interaction pattern, while Projects add semantic meaning (instructions, model defaults, knowledge). They could be combined in the future.

Test plan

  • Create a project with name, description, custom instructions, and model override
  • Start a conversation from the project — verify preprompt and model are applied
  • Upload a small text file → verify content appears in system prompt (Tier 1)
  • Configure TEI_ENDPOINT → upload a large file → verify chunks are embedded and retrieved (Tier 2)
  • Delete a project → verify conversations are ungrouped (not deleted) and all knowledge files/chunks are cleaned up
  • Verify sidebar filtering works (click project → only its conversations shown)

🤖 Generated with Claude Code

tessaherself and others added 30 commits November 16, 2025 23:34
- Add BINARY_DOC_ALLOWLIST for PDF, DOCX, XLSX, PPTX etc.
- Add COOKIE_SECURE=true and COOKIE_SAMESITE=lax to production
- Configure HF_ORG_ADMIN=xpartners-admins for admin access
- Fix avatar to use user.avatarUrl from OIDC instead of HuggingFace
- Change 'Add text file' to 'Add file', hide MCP Servers menu
- Add debug logging in prepareFiles() for file upload tracing
- Add file-upload-flow.md documentation
…gface#1)

[Aikido] AI Fix for NoSQL injection attack possible
…uest huggingface#2)

[Aikido] Fix critical issue in @sveltejs/kit via minor version upgrade from 2.21.2 to 2.49.5
tessaherself and others added 16 commits February 18, 2026 17:04
…gface#3)

[Aikido] AI Fix for NoSQL injection attack possible
- Fix NoSQL injection vulnerabilities by adding mongoSanitize.ts utility
- Fix timing attack in adminToken.ts using timingSafeEqual
- Add SSRF protection in isURLLocal.ts and models.ts
- Pin Docker base images to SHA digests in Dockerfile
- Add Kubernetes security context to deployment.yaml (runAsNonRoot, readOnlyRootFilesystem, drop ALL capabilities)
- Add NetworkPolicy for pod network isolation
- Pin GitHub Actions to specific SHA commits
- Fix path traversal in findRepoRoot.ts
- Update vulnerable dependencies (@aws-sdk, @modelcontextprotocol/sdk, elysia, pino)
nanoid(7) uses alphabet A-Za-z0-9_- so the sanitizer regex was too restrictive.
- Add explicit authFilter variables before MongoDB queries
- Add SECURITY comments to explain sanitization happens internally
- Sanitize params.id with sanitizeObjectIdString() before ObjectId
- This makes the data flow clearer for static analysis (Aikido)
elysia 1.4 removed 'error' export, replaced by 'status()'
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CI Dockerfile was running `npm run dev` (Vite dev server) in
production, causing slow page loads and janky UI. Switch to a
multi-stage build that runs `npm run build` and serves the compiled
output via `node build/index.js`, matching the main Dockerfile pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of navigating away to home when the parent app sends a model
switch message, PATCH the current conversation's model and refresh
the page data. This lets users switch models mid-conversation without
losing context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Projects let users organize conversations with shared custom instructions,
default models, and uploaded knowledge files. Knowledge injection uses a
two-tier system: context stuffing for small files (<50k chars) and
chunk+retrieve via HuggingFace TEI embeddings for larger knowledge bases.

New types: Project, ProjectKnowledgeFile, ProjectKnowledgeChunk
New API routes: /projects CRUD + /projects/:id/files CRUD
New UI: sidebar project section, create/edit modal, file upload manager
Pipeline: resolveProjectKnowledge injects into preprompt at generation time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tessaherself
Author

Related discussion for community feedback on the knowledge/RAG approach: #2193

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf5f6533ae

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

- Remove debug console.log statements from prepareFiles and streaming
  update handler that leaked attachment content and conversation data
- Remove hardcoded setTheme("light") that overwrote user theme preference
- Fix sanitizeParamId regex to accept _ and - in nanoid share IDs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>