
feat(rag): scrapy web loader + litellm embeddings #790

Merged
subbaksh merged 23 commits into main from rag-server-scrapy on Feb 12, 2026

@subbaksh
Collaborator

@subbaksh subbaksh commented Feb 10, 2026

Scrapy Web Loader + LiteLLM Embeddings

Replaces basic web scraping with Scrapy for proper crawling. Adds LiteLLM proxy support for embeddings.

Web Ingestor

  • Scrapy-based loader - configurable crawl depth, allowed domains, URL patterns, rate limiting, robots.txt compliance
  • JavaScript rendering - Playwright/Chromium support for dynamic pages (enable with render_js: true)
  • More configuration options - control crawl behavior with max_depth, allowed_domains, url_patterns, download_delay, concurrent_requests, obey_robots_txt
  • Better metadata - correct source URL tracking per document
  • Status messages - UI feedback when JS rendering is active
  • Backward compatibility - deprecated settings fields still work, with a warning that encourages migrating to the new settings object
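
As a rough illustration of how the crawl options listed above could map onto Scrapy, the sketch below translates them into standard Scrapy setting names (`DEPTH_LIMIT`, `DOWNLOAD_DELAY`, `CONCURRENT_REQUESTS`, `ROBOTSTXT_OBEY`). The `WebCrawlOptions` class, its default values, and `to_scrapy_settings` are hypothetical names for this example and are not taken from the PR's code.

```python
from dataclasses import dataclass, field


@dataclass
class WebCrawlOptions:
    """Hypothetical container for the crawl options named in this PR."""
    max_depth: int = 2                      # assumed default
    allowed_domains: list[str] = field(default_factory=list)
    download_delay: float = 0.5             # assumed default, in seconds
    concurrent_requests: int = 8            # assumed default
    obey_robots_txt: bool = True


def to_scrapy_settings(opts: WebCrawlOptions) -> dict:
    """Translate the options into standard Scrapy setting names.

    allowed_domains is not included because Scrapy reads it from the
    spider attribute of the same name rather than from settings.
    """
    return {
        "DEPTH_LIMIT": opts.max_depth,
        "DOWNLOAD_DELAY": opts.download_delay,
        "CONCURRENT_REQUESTS": opts.concurrent_requests,
        "ROBOTSTXT_OBEY": opts.obey_robots_txt,
    }


settings = to_scrapy_settings(WebCrawlOptions(max_depth=3, download_delay=1.0))
print(settings)
```

A dictionary like this could be handed to a Scrapy crawler process as its settings; how the ingestor actually wires it up is not shown in this description.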

Server

  • Chunk count tracking - stores document and chunk counts per knowledge base
  • X-Identity-Token header - extracts ID token claims for auth

Embeddings

  • LiteLLM provider - connects to a LiteLLM proxy for embeddings
    EMBEDDINGS_PROVIDER=litellm
    EMBEDDINGS_MODEL=azure/text-embedding-3-small
    LITELLM_API_BASE=http://localhost:4000
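
A minimal sketch of how these variables might be resolved into client arguments, assuming (as the commit message for this feature states) that `LITELLM_API_BASE` is required and `LITELLM_API_KEY` is optional. The helper name `litellm_embedding_kwargs` and the placeholder key are assumptions for illustration only.

```python
import os


def litellm_embedding_kwargs(env=None) -> dict:
    """Build keyword arguments for an OpenAI-compatible embeddings client."""
    env = os.environ if env is None else env
    base_url = env.get("LITELLM_API_BASE")
    if not base_url:
        raise ValueError("LITELLM_API_BASE is required when EMBEDDINGS_PROVIDER=litellm")
    return {
        "model": env["EMBEDDINGS_MODEL"],
        "base_url": base_url,
        # Assumption: a placeholder is used when the proxy needs no key,
        # since OpenAI-style clients usually reject an empty key.
        "api_key": env.get("LITELLM_API_KEY") or "not-needed",
    }


kwargs = litellm_embedding_kwargs({
    "EMBEDDINGS_MODEL": "azure/text-embedding-3-small",
    "LITELLM_API_BASE": "http://localhost:4000",
})
print(kwargs)
# With langchain-openai installed, these could be passed on as:
#   from langchain_openai import OpenAIEmbeddings
#   embeddings = OpenAIEmbeddings(**kwargs)
```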

UI

  • Shows document and chunk counts in ingest job view
  • Adds new web ingestor settings
  • Sends ID token to RAG server

Docker optimisation

  • Reduced Docker image from 2.58GB to 1.33GB (48% smaller) by moving PyTorch/HuggingFace dependencies to an optional extra
  • Added -hf image variant for users who need local HuggingFace embeddings
  • Reduced ingestors image size by 62% by adding a slim variant without Playwright/Chromium (~950MB vs 2.5GB default), for ingestors that don't need JavaScript rendering

Helm Chart

  • Simplified the Helm chart by routing configuration through an env map instead of dedicated values keys; only a few computed values (for example REDIS_URL and MILVUS_URI) are still derived by the chart

RAG server chart

| Removed Key | Use Instead |
|---|---|
| enableMcp | env.ENABLE_MCP |
| skipInitTests | env.SKIP_INIT_TESTS |
| embeddingsProvider | env.EMBEDDINGS_PROVIDER |
| embeddingsModel | env.EMBEDDINGS_MODEL |
| maxDocumentsPerIngest | env.MAX_DOCUMENTS_PER_INGEST |
| maxResultsPerQuery | env.MAX_RESULTS_PER_QUERY |
| maxIngestionConcurrency | env.MAX_INGESTION_CONCURRENCY |
| logLevel | env.LOG_LEVEL |
| rbac.allowUnauthenticated | env.ALLOW_UNAUTHENTICATED |
| rbac.adminGroups | env.RBAC_ADMIN_GROUPS |
| rbac.readonlyGroups | env.RBAC_READONLY_GROUPS |
| rbac.defaultRole | env.RBAC_DEFAULT_ROLE |

Web Ingestor

| Removed Key | Use Instead |
|---|---|
| webIngestor.logLevel | webIngestor.env.LOG_LEVEL |
| webIngestor.maxConcurrency | webIngestor.env.WEBLOADER_MAX_CONCURRENCY |
| webIngestor.maxIngestionTasks | webIngestor.env.WEBLOADER_MAX_INGESTION_TASKS |
| webIngestor.reloadInterval | webIngestor.env.WEBLOADER_RELOAD_INTERVAL |
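
For anyone upgrading an existing values file, the key renames for the RAG server chart can be applied mechanically. The script below is a hypothetical helper, not part of the chart: it looks up each removed key in an old values dictionary and emits the equivalent env map. Converting booleans to lowercase strings is an assumption based on the commit note that env values are strings; the web ingestor keys would follow the same pattern.

```python
# Removed rag-server values key -> env variable, copied from the PR description.
RAG_SERVER_KEYS = {
    "enableMcp": "ENABLE_MCP",
    "skipInitTests": "SKIP_INIT_TESTS",
    "embeddingsProvider": "EMBEDDINGS_PROVIDER",
    "embeddingsModel": "EMBEDDINGS_MODEL",
    "maxDocumentsPerIngest": "MAX_DOCUMENTS_PER_INGEST",
    "maxResultsPerQuery": "MAX_RESULTS_PER_QUERY",
    "maxIngestionConcurrency": "MAX_INGESTION_CONCURRENCY",
    "logLevel": "LOG_LEVEL",
    "rbac.allowUnauthenticated": "ALLOW_UNAUTHENTICATED",
    "rbac.adminGroups": "RBAC_ADMIN_GROUPS",
    "rbac.readonlyGroups": "RBAC_READONLY_GROUPS",
    "rbac.defaultRole": "RBAC_DEFAULT_ROLE",
}


def lookup(values: dict, dotted_key: str):
    """Return the value at a dotted path such as 'rbac.defaultRole', or None."""
    node = values
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def to_env_string(value) -> str:
    """Render a scalar as a string (assumed convention: lowercase booleans)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def migrate(values: dict, mapping: dict) -> dict:
    """Build the env map for every removed key that is present in values."""
    env = {}
    for old_key, env_name in mapping.items():
        value = lookup(values, old_key)
        if value is not None:
            env[env_name] = to_env_string(value)
    return env


old_values = {"enableMcp": True, "logLevel": "DEBUG", "rbac": {"defaultRole": "readonly"}}
env = migrate(old_values, RAG_SERVER_KEYS)
print(env)
```

List-valued keys such as rbac.adminGroups are rendered with `str()` here, which is unlikely to be the format the server expects; they would need a convention of their own.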

Breaking changes

  • The Helm values keys listed above have been removed - set the matching env entries instead

Type of Change

  • Bugfix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Pre-release Helm Charts (Optional)

For chart changes, you can test pre-release versions before merging:

  • Base repo contributors: Create a branch starting with pre/ for automatic pre-release builds
  • Fork contributors: Ask a maintainer to add the helm-prerelease label
  • Pre-release charts are published to ghcr.io/cnoe-io/pre-release-helm-charts
  • Cleanup happens automatically when the PR closes or label is removed

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

@github-actions
Contributor

✅ No proprietary content detected. This PR is clear for review!

@github-actions
Contributor

github-actions bot commented Feb 10, 2026

🧪 CAIPE UI Test Results

All tests passed

🔴 Overall Coverage: 30%

📊 Detailed Coverage

| Metric | Covered | Total | Percentage |
|---|---|---|---|
| Lines | 3028 | 9497 | 31.88% |
| Statements | 3230 | 10301 | 31.35% |
| Functions | 537 | 2008 | 26.74% |
| Branches | 1897 | 6765 | 28.04% |

✅ Test Suites

  • ✅ auth-guard.test.tsx - Route protection & authorization
  • ✅ token-expiry-guard.test.tsx - Token expiry handling
  • ✅ a2a-sdk-client.test.ts - A2A streaming SDK
  • ✅ auth-utils.test.ts - Authentication utilities (100% coverage)
  • ✅ auth-config.test.ts - OIDC configuration

📈 Coverage Thresholds

| Threshold | Target | Current | Status |
|---|---|---|---|
| Minimum | 40% | 30% | ❌ Fail |
| Good | 60% | 30% | ⚠️ Below target |
| Excellent | 80% | 30% | ⚠️ Below target |

⚠️ Areas Needing Tests

High Priority:

  • hooks/use-a2a-streaming.ts - Core streaming functionality
  • store/chat-store.ts - Chat state management
  • store/agent-config-store.ts - Agent configuration
  • lib/api-client.ts - API communication
  • lib/storage-mode.ts - MongoDB/localStorage switching

Medium Priority:

  • components/chat/ChatPanel.tsx - Main chat interface
  • components/agent-builder/* - Agent builder UI
  • lib/mongodb.ts - MongoDB integration

💡 Run locally: make caipe-ui-tests
📦 Full report: Check workflow artifacts

@github-actions
Contributor

📊 Test Coverage Report

Main Tests Coverage

| Metric | Coverage | Details |
|---|---|---|
| Lines | 47.8% | 4151/8690 lines |
| Branches | 0.0% | 0/0 branches |

📁 Coverage Artifacts

  • Main tests: coverage-reports-main artifact
  • RAG tests: coverage-reports-rag artifact (not available)
  • Download artifacts to view detailed HTML coverage reports

@subbaksh subbaksh changed the title from "Rag server scrapy" to "Scrapy Web Loader + LiteLLM Embeddings" on Feb 10, 2026

Replace legacy loader with a modern Scrapy-based web scraping system:

- Add spider types: sitemap, recursive, and single-url crawling modes
- Add parser framework with auto-detection for common doc sites
  (Docusaurus, MkDocs, Sphinx, ReadTheDocs, VitePress, generic)
- Add document processing pipeline with LangChain integration
- Add worker pool for parallel Scrapy crawls with batching
- Add comprehensive test suite for parsers, pipeline, and spiders

This provides better error handling, progress tracking, and support
for various documentation site frameworks.

Signed-off-by: subbaksh <subbaksh@cisco.com>
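
One common way to auto-detect a documentation framework is to inspect the page's `generator` meta tag. The sketch below illustrates that idea only; whether the PR's parser framework uses this signal is not stated, and `GENERATOR_HINTS` and `detect_parser` are hypothetical names.

```python
import re

# Assumed substrings of the generator meta tag; real parsers may rely on
# other signals such as CSS class names or page structure.
GENERATOR_HINTS = {
    "docusaurus": "docusaurus",
    "mkdocs": "mkdocs",
    "vitepress": "vitepress",
    "sphinx": "sphinx",
}

# Matches <meta name="generator" content="..."> with name before content.
META_GENERATOR = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def detect_parser(html: str) -> str:
    """Return a parser name for the page, falling back to 'generic'."""
    match = META_GENERATOR.search(html)
    if match:
        generator = match.group(1).lower()
        for parser_name, hint in GENERATOR_HINTS.items():
            if hint in generator:
                return parser_name
    return "generic"


print(detect_parser('<meta name="generator" content="Docusaurus v3.1.0">'))
print(detect_parser("<html><body>plain page</body></html>"))
```

The generic fallback matters because many sites omit the tag or are hosted on platforms that do not advertise the generator.
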
- Add ScrapySettings model with crawl_mode, max_pages, allowed_domains,
  exclude_patterns, and request timeout configuration
- Add document_count and chunk_count fields to JobInfo for metrics
- Add increment_document_count() and increment_chunk_count() methods
  to JobManager for atomic counter updates
- Update UrlIngestRequest to use new ScrapySettings for web ingestion

Signed-off-by: subbaksh <subbaksh@cisco.com>
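
The counter methods named above could look roughly like the following. The storage backend used by the real JobManager is not shown in this PR, so this sketch substitutes an in-process lock to make each increment atomic; `InMemoryJobManager` and `get_or_create_job` are hypothetical names.

```python
import threading
from dataclasses import dataclass


@dataclass
class JobInfo:
    job_id: str
    document_count: int = 0
    chunk_count: int = 0


class InMemoryJobManager:
    """Stand-in job manager; a lock guards every read-modify-write."""

    def __init__(self) -> None:
        self._jobs: dict[str, JobInfo] = {}
        self._lock = threading.Lock()

    def get_or_create_job(self, job_id: str) -> JobInfo:
        with self._lock:
            return self._jobs.setdefault(job_id, JobInfo(job_id))

    def increment_document_count(self, job_id: str, amount: int = 1) -> int:
        with self._lock:
            job = self._jobs[job_id]
            job.document_count += amount
            return job.document_count

    def increment_chunk_count(self, job_id: str, amount: int = 1) -> int:
        with self._lock:
            job = self._jobs[job_id]
            job.chunk_count += amount
            return job.chunk_count


manager = InMemoryJobManager()
job = manager.get_or_create_job("job-1")
manager.increment_document_count("job-1", 3)
manager.increment_chunk_count("job-1", 42)
print(job)
```
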
- Replace legacy loader.py with new Scrapy-based scrapy_loader.py
- Remove deprecated url/ scrapers (docsaurus, mkdocs stubs)
- Update ingestor to use ScrapySettings from request
- Add document count tracking via job manager on batch send
- Improve error handling with descriptive failure messages
- Update README with new architecture documentation

Signed-off-by: subbaksh <subbaksh@cisco.com>
- Increment chunk_count via job manager after upserting to vector DB
- Add duplicate primary key cleanup script for Milvus maintenance
- Refactor ingestion processing for improved reliability

Signed-off-by: subbaksh <subbaksh@cisco.com>
- Add document_count and chunk_count to IngestionJob type
- Show metrics (X documents, Y chunks) on collapsed datasource row
- Display Documents and Chunks fields in expanded job details
- UI improvements: terminal-style errors, default sitemap mode,
  moved description outside advanced options

Signed-off-by: subbaksh <subbaksh@cisco.com>
…fields

Support old datasource format with check_for_sitemaps, sitemap_max_urls,
and ingest_type fields while encouraging migration to new settings object:

- Add deprecated optional fields to UrlIngestRequest model
- Add _get_effective_settings() to map old fields to new ScrapySettings
- Log warnings when deprecated fields are detected
- Show warning in job status message with action to delete and re-ingest
- New settings take precedence if both old and new are provided

Signed-off-by: subbaksh <subbaksh@cisco.com>
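
The precedence rule described above (new settings win, deprecated fields are mapped only as a fallback) can be sketched as follows. The field names come from the commit message, but the concrete mapping, the crawl_mode strings, and the defaults are assumptions, and these dataclasses are simplified stand-ins for the real models.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("web_ingestor")


@dataclass
class ScrapySettings:
    crawl_mode: str = "sitemap"   # assumed values: sitemap, recursive, single-url
    max_pages: int = 1000         # assumed default


@dataclass
class UrlIngestRequest:
    url: str
    settings: Optional[ScrapySettings] = None
    # Deprecated fields kept for old datasources.
    check_for_sitemaps: Optional[bool] = None
    sitemap_max_urls: Optional[int] = None
    ingest_type: Optional[str] = None


def get_effective_settings(request: UrlIngestRequest) -> tuple[ScrapySettings, bool]:
    """Return the settings to use and whether deprecated fields were seen."""
    used_deprecated = any(
        value is not None
        for value in (request.check_for_sitemaps, request.sitemap_max_urls, request.ingest_type)
    )
    if used_deprecated:
        logger.warning("Deprecated ingest fields detected for %s", request.url)
    if request.settings is not None:
        # New settings take precedence when both forms are provided.
        return request.settings, used_deprecated
    settings = ScrapySettings()
    if request.check_for_sitemaps is not None:
        settings.crawl_mode = "sitemap" if request.check_for_sitemaps else "single-url"
    if request.sitemap_max_urls is not None:
        settings.max_pages = request.sitemap_max_urls
    return settings, used_deprecated


legacy = UrlIngestRequest(url="https://example.com", check_for_sitemaps=False, sitemap_max_urls=50)
print(get_effective_settings(legacy))
```
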
Move source, language, and generator fields into nested metadata dict
to match DocumentMetadata model structure. The UI expects the source
URL at metadata.metadata.source for displaying clickable links in
search results.

Signed-off-by: subbaksh <subbaksh@cisco.com>
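
To make the nesting concrete, here is a small sketch of the shape this commit describes, in which the source URL ends up at metadata.metadata.source. The helper `build_document_metadata` and the outer `title` field are hypothetical; only the three nested field names come from the commit message.

```python
def build_document_metadata(source_url: str, language: str, generator: str, title: str = "") -> dict:
    """Place source, language and generator inside a nested metadata dict."""
    return {
        "title": title,  # hypothetical top-level field, for illustration
        "metadata": {
            "source": source_url,
            "language": language,
            "generator": generator,
        },
    }


doc = {
    "page_content": "Getting started ...",
    "metadata": build_document_metadata(
        "https://example.com/docs/intro", "en", "mkdocs", title="Intro"
    ),
}
# The UI-facing path for the clickable link:
print(doc["metadata"]["metadata"]["source"])
```
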
- Add Chromium system dependencies (libnss3, libatk, etc.)
- Install Playwright Chromium browser for JavaScript rendering
- Fix AWS CLI architecture detection for both ARM and x86_64
- Enable render_javascript option for SPAs and JS-heavy sites

Signed-off-by: subbaksh <subbaksh@cisco.com>
Signed-off-by: subbaksh <subbaksh@cisco.com>
…xtraction

Some OIDC providers only include user claims (email, groups) in the ID token,
not the access token. This change allows the UI to pass the ID token via the
X-Identity-Token header, which the server validates and uses for claims extraction.

Changes:
- Add validate_id_token() method to OIDCProvider and AuthManager classes
- Add ECDSA algorithm support (ES256, ES384, ES512) for JWT validation
- Modify _authenticate_from_token() to extract and validate X-Identity-Token
- Improve group extraction: use Set for deduplication, check ALL claims
- Support comma-separated OIDC_GROUP_CLAIM for multiple claim names
- Add 'members' to default group claims (for Duo SSO)
- Update UI proxy routes to send X-Identity-Token header
- Update README with documentation

Signed-off-by: subbaksh <subbaksh@cisco.com>
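
The group extraction changes listed above can be illustrated with a short sketch that reads a comma-separated OIDC_GROUP_CLAIM, checks every configured claim, and deduplicates with a set. It operates on already-validated claims and does not perform token validation. The default claim list is partly an assumption: 'members' is named in the commit, 'groups' is assumed.

```python
import os

# 'members' comes from the commit message; 'groups' is an assumed default.
DEFAULT_GROUP_CLAIMS = ("groups", "members")


def claim_names(env=None) -> list[str]:
    """Parse OIDC_GROUP_CLAIM as a comma-separated list of claim names."""
    env = os.environ if env is None else env
    raw = env.get("OIDC_GROUP_CLAIM", "")
    names = [name.strip() for name in raw.split(",") if name.strip()]
    return names or list(DEFAULT_GROUP_CLAIMS)


def extract_groups(claims: dict, names: list[str]) -> set[str]:
    """Collect groups from ALL configured claims, deduplicated."""
    groups: set[str] = set()
    for name in names:
        value = claims.get(name)
        if isinstance(value, str):
            value = [value]  # assumption: a bare string is a single group
        if isinstance(value, (list, tuple)):
            groups.update(str(item) for item in value)
    return groups


id_token_claims = {
    "email": "user@example.com",
    "groups": ["admins", "dev"],
    "members": ["dev", "ops"],
}
print(sorted(extract_groups(id_token_claims, claim_names({}))))
```
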
Add litellm as a new embeddings provider that connects to a LiteLLM proxy.
Uses OpenAIEmbeddings with custom base_url since LiteLLM proxy is OpenAI-compatible.

Env vars:
- EMBEDDINGS_PROVIDER=litellm
- EMBEDDINGS_MODEL=<model-name>
- LITELLM_API_BASE=<proxy-url> (required)
- LITELLM_API_KEY=<api-key> (optional)

Signed-off-by: subbaksh <subbaksh@cisco.com>
Signed-off-by: subbaksh <subbaksh@cisco.com>
Signed-off-by: subbaksh <subbaksh@cisco.com>

Replace individual values.yaml keys with a generic env: map pattern for
rag-server and web-ingestor configuration. This reduces template complexity
and makes it easier to add new environment variables without chart changes.

Kept as computed values (from global):
- REDIS_URL, NEO4J_*, MILVUS_URI, ONTOLOGY_AGENT_RESTAPI_ADDR
- ENABLE_GRAPH_RAG (has global fallback)

All other config now via env: map with string values.
Added migration guide to README for users upgrading from old format.

Signed-off-by: subbaksh <subbaksh@cisco.com>

@subbaksh subbaksh changed the title from "Scrapy Web Loader + LiteLLM Embeddings" to "feat(rag): scrapy web loader + litellm embeddings" on Feb 11, 2026

Signed-off-by: subbaksh <subbaksh@cisco.com>

…onal

- Move langchain-huggingface to optional dependency in common/pyproject.toml
- Remove sentence-transformers and huggingface-hub from server core deps
- Add [huggingface] optional extra for local embedding models
- Implement lazy imports in embeddings_factory.py for all providers
- Add VARIANT build arg to Dockerfile.server (default/huggingface)
- Add .venv cleanup step to reduce image size further
- Add build_hf_variant workflow input for on-demand HF builds
- Document image variants and embedding providers in README

Default image reduced from 2.58GB to 1.33GB (-48%).
HuggingFace variant available with -hf tag suffix when needed.

Signed-off-by: subbaksh <subbaksh@cisco.com>
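
The lazy-import approach mentioned above keeps heavy packages out of the default image by importing a provider's module only when that provider is selected. The following is a sketch of the pattern, not the contents of embeddings_factory.py; `PROVIDER_MODULES` and `load_embeddings_class` are hypothetical names, and the error wording is illustrative.

```python
import importlib

# provider -> (module, class). Modules are imported only on demand.
PROVIDER_MODULES = {
    "openai": ("langchain_openai", "OpenAIEmbeddings"),
    "huggingface": ("langchain_huggingface", "HuggingFaceEmbeddings"),
}


def load_embeddings_class(provider: str):
    """Import and return the embeddings class for the chosen provider."""
    try:
        module_name, class_name = PROVIDER_MODULES[provider]
    except KeyError:
        raise ValueError(f"Unknown embeddings provider: {provider!r}") from None
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Typical cause: the optional extra was not installed in this image.
        raise RuntimeError(
            f"Provider {provider!r} needs the '{module_name}' package; "
            "install the matching optional extra or use the image variant that ships it."
        ) from exc
    return getattr(module, class_name)


try:
    cls = load_embeddings_class("huggingface")
    print("loaded", cls.__name__)
except RuntimeError as exc:
    print(exc)
```

With this structure, the default image fails only when a HuggingFace provider is actually requested, rather than at import time.
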

@github-actions
Contributor

📊 Test Coverage Report

Main Tests Coverage

| Metric | Coverage | Details |
|---|---|---|
| Lines | 46.7% | 4151/8888 lines |
| Branches | 0.0% | 0/0 branches |

📁 Coverage Artifacts

  • Main tests: coverage-reports-main artifact
  • RAG tests: coverage-reports-rag artifact (not available)
  • Download artifacts to view detailed HTML coverage reports

- Add slim ingestors variant without Playwright (~950MB vs 2.5GB)
- Move scrapy-playwright to optional [playwright] extra
- Consolidate all image variants (server-hf, ingestors-slim) into single matrix
- Add .venv cleanup to reduce image sizes

Signed-off-by: subbaksh <subbaksh@cisco.com>

Signed-off-by: subbaksh <subbaksh@cisco.com>

…tch ingestion to job status

- Add sitemap fetch errors to self.errors for job status reporting
- Add robots.txt fetch errors to self.errors for job status reporting
- Add batch ingestion failures to job_manager.add_error_msg() for UI visibility

Signed-off-by: subbaksh <subbaksh@cisco.com>

…nd robots.txt failures

Use _get_failure_reason() to extract proper HTTP status codes and error
details instead of raw failure.value for better error messages.

Signed-off-by: subbaksh <subbaksh@cisco.com>

@subbaksh subbaksh merged commit 09eab8a into main Feb 12, 2026
44 checks passed
@subbaksh subbaksh deleted the rag-server-scrapy branch February 12, 2026 12:54