feat(rag): scrapy web loader + litellm embeddings by subbaksh · Pull Request #790 · cnoe-io/ai-platform-engineering

subbaksh · 2026-02-10T19:02:15Z

Scrapy Web Loader + LiteLLM Embeddings

Replaces basic web scraping with Scrapy for proper crawling. Adds LiteLLM proxy support for embeddings.

Web Ingestor

Scrapy-based loader - configurable crawl depth, allowed domains, URL patterns, rate limiting, robots.txt compliance
JavaScript rendering - Playwright/Chromium support for dynamic pages (enable with render_js: true)
More configuration options - control crawl behavior with max_depth, allowed_domains, url_patterns, download_delay, concurrent_requests, obey_robots_txt
Better metadata - correct source URL tracking per document
Status messages - UI feedback when JS rendering is active
Backwards compatibility - deprecated settings fields (check_for_sitemaps, sitemap_max_urls, ingest_type) still work and log a warning
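To make the option list above concrete, here is a minimal sketch of how such options could translate into Scrapy settings. The `CrawlOptions` class, its defaults, and `to_scrapy_settings` are hypothetical names for illustration, not the loader's actual code. The setting names (`DEPTH_LIMIT`, `DOWNLOAD_DELAY`, `CONCURRENT_REQUESTS`, `ROBOTSTXT_OBEY`) are Scrapy built-ins, and the Playwright handler paths are the ones documented by scrapy-playwright.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlOptions:
    """Hypothetical container for the options named in the PR description.

    The default values are illustrative assumptions, not the project's defaults.
    """
    max_depth: int = 2
    allowed_domains: list[str] = field(default_factory=list)
    url_patterns: list[str] = field(default_factory=list)
    download_delay: float = 0.5
    concurrent_requests: int = 8
    obey_robots_txt: bool = True
    render_js: bool = False


def to_scrapy_settings(opts: CrawlOptions) -> dict:
    """Translate crawl options into a Scrapy settings dict."""
    settings = {
        "DEPTH_LIMIT": opts.max_depth,
        "DOWNLOAD_DELAY": opts.download_delay,
        "CONCURRENT_REQUESTS": opts.concurrent_requests,
        "ROBOTSTXT_OBEY": opts.obey_robots_txt,
    }
    if opts.render_js:
        # scrapy-playwright routes requests through a browser via these settings.
        handler = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
        settings["DOWNLOAD_HANDLERS"] = {"http": handler, "https": handler}
        settings["TWISTED_REACTOR"] = (
            "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
        )
    return settings
```

Note that `allowed_domains` is a spider attribute in Scrapy rather than a setting, and URL patterns would typically feed a link extractor, so neither appears in the returned dict.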

Server

Chunk count tracking - stores document and chunk counts per knowledge base
X-Identity-Token header - extracts ID token claims for auth

Embeddings

LiteLLM provider - connects to a LiteLLM proxy for embeddings

EMBEDDINGS_PROVIDER=litellm
EMBEDDINGS_MODEL=azure/text-embedding-3-small
LITELLM_API_BASE=http://localhost:4000
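As a rough illustration of how these variables could be consumed, the sketch below reads them and builds keyword arguments for an OpenAI-compatible embeddings client. The function name `litellm_embedding_kwargs` and the placeholder API key are assumptions for illustration. The required/optional split for `LITELLM_API_BASE` and `LITELLM_API_KEY` follows the commit message in this PR.

```python
import os
from typing import Mapping, Optional


def litellm_embedding_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build client kwargs from the LiteLLM-related environment variables."""
    env = os.environ if env is None else env

    if env.get("EMBEDDINGS_PROVIDER") != "litellm":
        raise ValueError("EMBEDDINGS_PROVIDER is not 'litellm'")

    model = env.get("EMBEDDINGS_MODEL")
    if not model:
        raise ValueError("EMBEDDINGS_MODEL must be set")

    base = env.get("LITELLM_API_BASE")
    if not base:
        raise ValueError("LITELLM_API_BASE is required for the litellm provider")

    return {
        "model": model,
        "base_url": base.rstrip("/"),
        # Assumption: OpenAI-style clients usually want a non-empty key even
        # when the proxy does not enforce one, so fall back to a placeholder.
        "api_key": env.get("LITELLM_API_KEY") or "not-needed",
    }
```

The resulting dict is shaped so that it could be passed to an OpenAI-compatible embeddings class such as LangChain's `OpenAIEmbeddings(**kwargs)`. That pairing is an assumption based on the commit message, not a copy of the project's factory code.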

UI

Shows document and chunk counts in ingest job view
Adds new web ingestor settings
Sends ID token to RAG server

Docker optimisation

Reduced the default server image from 2.58GB to 1.33GB (48% smaller) by moving PyTorch/HuggingFace dependencies to an optional extra
Added an -hf image variant for users who need local HuggingFace embeddings
Added a slim ingestors variant without Playwright/Chromium for ingestors that don't need JavaScript rendering (~950MB vs 2.5GB default, 62% smaller)

Helm Chart

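Simplified the Helm chart by routing configuration through a generic env map instead of individual values keys. A few values are still computed from global settings (REDIS_URL, NEO4J_*, MILVUS_URI, ONTOLOGY_AGENT_RESTAPI_ADDR, ENABLE_GRAPH_RAG).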

Rag server chart

Removed Key	Use Instead
`enableMcp`	`env.ENABLE_MCP`
`skipInitTests`	`env.SKIP_INIT_TESTS`
`embeddingsProvider`	`env.EMBEDDINGS_PROVIDER`
`embeddingsModel`	`env.EMBEDDINGS_MODEL`
`maxDocumentsPerIngest`	`env.MAX_DOCUMENTS_PER_INGEST`
`maxResultsPerQuery`	`env.MAX_RESULTS_PER_QUERY`
`maxIngestionConcurrency`	`env.MAX_INGESTION_CONCURRENCY`
`logLevel`	`env.LOG_LEVEL`
`rbac.allowUnauthenticated`	`env.ALLOW_UNAUTHENTICATED`
`rbac.adminGroups`	`env.RBAC_ADMIN_GROUPS`
`rbac.readonlyGroups`	`env.RBAC_READONLY_GROUPS`
`rbac.defaultRole`	`env.RBAC_DEFAULT_ROLE`

Web Ingestor

Removed Key	Use Instead
`webIngestor.logLevel`	`webIngestor.env.LOG_LEVEL`
`webIngestor.maxConcurrency`	`webIngestor.env.WEBLOADER_MAX_CONCURRENCY`
`webIngestor.maxIngestionTasks`	`webIngestor.env.WEBLOADER_MAX_INGESTION_TASKS`
`webIngestor.reloadInterval`	`webIngestor.env.WEBLOADER_RELOAD_INTERVAL`
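For anyone upgrading an existing values file, the two tables can be applied mechanically. The following is a minimal sketch, assuming the values file has already been loaded into a plain dict. `migrate_values` and its helpers are hypothetical names, the "existing env entries win" rule is an assumption, and list-valued keys (for example group lists) are not handled and would need their own formatting.

```python
import copy

# Old values path -> new values path, copied from the two tables above.
KEY_MAP = {
    "enableMcp": "env.ENABLE_MCP",
    "skipInitTests": "env.SKIP_INIT_TESTS",
    "embeddingsProvider": "env.EMBEDDINGS_PROVIDER",
    "embeddingsModel": "env.EMBEDDINGS_MODEL",
    "maxDocumentsPerIngest": "env.MAX_DOCUMENTS_PER_INGEST",
    "maxResultsPerQuery": "env.MAX_RESULTS_PER_QUERY",
    "maxIngestionConcurrency": "env.MAX_INGESTION_CONCURRENCY",
    "logLevel": "env.LOG_LEVEL",
    "rbac.allowUnauthenticated": "env.ALLOW_UNAUTHENTICATED",
    "rbac.adminGroups": "env.RBAC_ADMIN_GROUPS",
    "rbac.readonlyGroups": "env.RBAC_READONLY_GROUPS",
    "rbac.defaultRole": "env.RBAC_DEFAULT_ROLE",
    "webIngestor.logLevel": "webIngestor.env.LOG_LEVEL",
    "webIngestor.maxConcurrency": "webIngestor.env.WEBLOADER_MAX_CONCURRENCY",
    "webIngestor.maxIngestionTasks": "webIngestor.env.WEBLOADER_MAX_INGESTION_TASKS",
    "webIngestor.reloadInterval": "webIngestor.env.WEBLOADER_RELOAD_INTERVAL",
}

_MISSING = object()


def _pop_path(values: dict, dotted: str):
    """Remove and return the value at a dotted path, or _MISSING if absent."""
    *parents, leaf = dotted.split(".")
    node = values
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return _MISSING
    return node.pop(leaf, _MISSING)


def _set_default_path(values: dict, dotted: str, value: str) -> None:
    """Set a dotted path unless it already exists (assumed: new-style entries win)."""
    *parents, leaf = dotted.split(".")
    node = values
    for key in parents:
        node = node.setdefault(key, {})
    node.setdefault(leaf, value)


def _as_env_string(value) -> str:
    """The env map takes string values; booleans are lowercased (an assumption)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def migrate_values(values: dict) -> dict:
    """Return a copy of `values` with removed keys moved under the env maps."""
    migrated = copy.deepcopy(values)
    for old_path, new_path in KEY_MAP.items():
        value = _pop_path(migrated, old_path)
        if value is not _MISSING:
            _set_default_path(migrated, new_path, _as_env_string(value))
    return migrated
```

Keys that are not in the tables are left untouched, so the function can be run over a full values file without disturbing unrelated settings.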

Breaking changes

The Helm values keys listed in the tables above are no longer supported; set the corresponding env entries instead.

Type of Change

Pre-release Helm Charts (Optional)

For chart changes, you can test pre-release versions before merging:

Base repo contributors: Create a branch starting with pre/ for automatic pre-release builds
Fork contributors: Ask a maintainer to add the helm-prerelease label
Pre-release charts are published to ghcr.io/cnoe-io/pre-release-helm-charts
Cleanup happens automatically when the PR closes or label is removed

Checklist

I have read the contributing guidelines
Existing issues have been referenced (where applicable)
I have verified this change is not present in other open pull requests
Functionality is documented
All code style checks pass
New code contribution is covered by automated tests
All new and existing tests pass

github-actions · 2026-02-10T19:02:32Z

✅ No proprietary content detected. This PR is clear for review!

github-actions · 2026-02-10T19:03:27Z

🧪 CAIPE UI Test Results

✅ All tests passed

🔴 Overall Coverage: 30%

📊 Detailed Coverage

Metric	Covered	Total	Percentage
Lines	3028	9497	31.88%
Statements	3230	10301	31.35%
Functions	537	2008	26.74%
Branches	1897	6765	28.04%

✅ Test Suites

✅ auth-guard.test.tsx - Route protection & authorization
✅ token-expiry-guard.test.tsx - Token expiry handling
✅ a2a-sdk-client.test.ts - A2A streaming SDK
✅ auth-utils.test.ts - Authentication utilities (100% coverage)
✅ auth-config.test.ts - OIDC configuration

📈 Coverage Thresholds

Threshold	Target	Current	Status
Minimum	40%	30%	❌ Fail
Good	60%	30%	⚠️ Below target
Excellent	80%	30%	⚠️ Below target

⚠️ Areas Needing Tests

High Priority:

hooks/use-a2a-streaming.ts - Core streaming functionality
store/chat-store.ts - Chat state management
store/agent-config-store.ts - Agent configuration
lib/api-client.ts - API communication
lib/storage-mode.ts - MongoDB/localStorage switching

Medium Priority:

components/chat/ChatPanel.tsx - Main chat interface
components/agent-builder/* - Agent builder UI
lib/mongodb.ts - MongoDB integration

💡 Run locally: make caipe-ui-tests
📦 Full report: Check workflow artifacts

github-actions · 2026-02-10T19:03:28Z

📊 Test Coverage Report

Main Tests Coverage

Metric	Coverage	Details
Lines	47.8%	4151/8690 lines
Branches	0.0%	0/0 branches

📁 Coverage Artifacts

Main tests: coverage-reports-main artifact
RAG tests: coverage-reports-rag artifact (not available)
Download artifacts to view detailed HTML coverage reports

Replace legacy loader with a modern Scrapy-based web scraping system:

- Add spider types: sitemap, recursive, and single-url crawling modes
- Add parser framework with auto-detection for common doc sites (Docusaurus, MkDocs, Sphinx, ReadTheDocs, VitePress, generic)
- Add document processing pipeline with LangChain integration
- Add worker pool for parallel Scrapy crawls with batching
- Add comprehensive test suite for parsers, pipeline, and spiders

This provides better error handling, progress tracking, and support for various documentation site frameworks.

Signed-off-by: subbaksh <subbaksh@cisco.com>

- Add ScrapySettings model with crawl_mode, max_pages, allowed_domains, exclude_patterns, and request timeout configuration
- Add document_count and chunk_count fields to JobInfo for metrics
- Add increment_document_count() and increment_chunk_count() methods to JobManager for atomic counter updates
- Update UrlIngestRequest to use new ScrapySettings for web ingestion

Signed-off-by: subbaksh <subbaksh@cisco.com>

- Replace legacy loader.py with new Scrapy-based scrapy_loader.py
- Remove deprecated url/ scrapers (docsaurus, mkdocs stubs)
- Update ingestor to use ScrapySettings from request
- Add document count tracking via job manager on batch send
- Improve error handling with descriptive failure messages
- Update README with new architecture documentation

Signed-off-by: subbaksh <subbaksh@cisco.com>

- Increment chunk_count via job manager after upserting to vector DB
- Add duplicate primary key cleanup script for Milvus maintenance
- Refactor ingestion processing for improved reliability

Signed-off-by: subbaksh <subbaksh@cisco.com>

- Add document_count and chunk_count to IngestionJob type
- Show metrics (X documents, Y chunks) on collapsed datasource row
- Display Documents and Chunks fields in expanded job details
- UI improvements: terminal-style errors, default sitemap mode, moved description outside advanced options

Signed-off-by: subbaksh <subbaksh@cisco.com>

…fields

Support old datasource format with check_for_sitemaps, sitemap_max_urls, and ingest_type fields while encouraging migration to new settings object:

- Add deprecated optional fields to UrlIngestRequest model
- Add _get_effective_settings() to map old fields to new ScrapySettings
- Log warnings when deprecated fields are detected
- Show warning in job status message with action to delete and re-ingest
- New settings take precedence if both old and new are provided

Signed-off-by: subbaksh <subbaksh@cisco.com>
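The precedence rule described in this commit can be sketched as follows. This is not the project's `_get_effective_settings()`: the function name `effective_settings`, the dict-based request shape, and in particular the mapping of each deprecated field onto a new setting are assumptions made for illustration, since the commit message does not spell the mapping out.

```python
import logging

logger = logging.getLogger("web_ingestor_sketch")

DEPRECATED_FIELDS = ("check_for_sitemaps", "sitemap_max_urls", "ingest_type")


def effective_settings(request: dict) -> tuple[dict, list[str]]:
    """Merge deprecated top-level fields with the new `settings` object.

    Returns (settings, warnings). Values from the new `settings` object
    override anything derived from deprecated fields.
    """
    used = [name for name in DEPRECATED_FIELDS if request.get(name) is not None]

    derived: dict = {}
    # ASSUMED mapping: the real field-to-setting mapping is not in the PR text.
    if request.get("check_for_sitemaps"):
        derived["crawl_mode"] = "sitemap"
    if request.get("sitemap_max_urls") is not None:
        derived["max_pages"] = request["sitemap_max_urls"]
    # ingest_type is only reported as deprecated here; no mapping is guessed.

    # Later keys win in a dict merge, so new settings take precedence.
    settings = {**derived, **(request.get("settings") or {})}

    warnings: list[str] = []
    if used:
        message = (
            "Deprecated fields in use: " + ", ".join(used)
            + ". Delete and re-ingest this datasource to migrate."
        )
        logger.warning(message)
        warnings.append(message)
    return settings, warnings
```

Returning the warnings alongside the settings lets the caller both log them and surface them in a job status message, which matches the two behaviours listed in the commit.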

Move source, language, and generator fields into nested metadata dict to match DocumentMetadata model structure. The UI expects the source URL at metadata.metadata.source for displaying clickable links in search results. Signed-off-by: subbaksh <subbaksh@cisco.com>

- Add Chromium system dependencies (libnss3, libatk, etc.)
- Install Playwright Chromium browser for JavaScript rendering
- Fix AWS CLI architecture detection for both ARM and x86_64
- Enable render_javascript option for SPAs and JS-heavy sites

Signed-off-by: subbaksh <subbaksh@cisco.com>

Signed-off-by: subbaksh <subbaksh@cisco.com>

…xtraction

Some OIDC providers only include user claims (email, groups) in the ID token, not the access token. This change allows the UI to pass the ID token via the X-Identity-Token header, which the server validates and uses for claims extraction.

Changes:

- Add validate_id_token() method to OIDCProvider and AuthManager classes
- Add ECDSA algorithm support (ES256, ES384, ES512) for JWT validation
- Modify _authenticate_from_token() to extract and validate X-Identity-Token
- Improve group extraction: use Set for deduplication, check ALL claims
- Support comma-separated OIDC_GROUP_CLAIM for multiple claim names
- Add 'members' to default group claims (for Duo SSO)
- Update UI proxy routes to send X-Identity-Token header
- Update README with documentation

Signed-off-by: subbaksh <subbaksh@cisco.com>
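The group-extraction behaviour described here (several claim names, every claim checked, results deduplicated) might look roughly like the sketch below. It operates on already-validated token claims and does no JWT verification. `parse_claim_names`, `extract_groups`, and the contents of `DEFAULT_GROUP_CLAIMS` beyond 'members' are assumptions for illustration.

```python
# 'members' comes from the commit message; 'groups' is an assumed default.
DEFAULT_GROUP_CLAIMS = ("groups", "members")


def parse_claim_names(raw):
    """Split a comma-separated OIDC_GROUP_CLAIM value into claim names."""
    if not raw:
        return list(DEFAULT_GROUP_CLAIMS)
    return [name.strip() for name in raw.split(",") if name.strip()]


def extract_groups(claims: dict, claim_names) -> set:
    """Collect groups from every listed claim, deduplicated via a set."""
    groups = set()
    for name in claim_names:  # check all claims, no early exit on first match
        value = claims.get(name)
        if isinstance(value, str):
            # Assumption: a bare string claim is treated as a single group.
            groups.add(value)
        elif isinstance(value, (list, tuple, set)):
            groups.update(str(item) for item in value)
    return groups
```

For example, `extract_groups(id_token_claims, parse_claim_names("groups, members"))` would merge both claims into one set, so a user listed under either claim name ends up with the same effective groups.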

Add litellm as a new embeddings provider that connects to a LiteLLM proxy. Uses OpenAIEmbeddings with custom base_url since LiteLLM proxy is OpenAI-compatible.

Env vars:

- EMBEDDINGS_PROVIDER=litellm
- EMBEDDINGS_MODEL=<model-name>
- LITELLM_API_BASE=<proxy-url> (required)
- LITELLM_API_KEY=<api-key> (optional)

Signed-off-by: subbaksh <subbaksh@cisco.com>

Signed-off-by: subbaksh <subbaksh@cisco.com>

Replace individual values.yaml keys with a generic env: map pattern for rag-server and web-ingestor configuration. This reduces template complexity and makes it easier to add new environment variables without chart changes.

Kept as computed values (from global):

- REDIS_URL, NEO4J_*, MILVUS_URI, ONTOLOGY_AGENT_RESTAPI_ADDR
- ENABLE_GRAPH_RAG (has global fallback)

All other config now via env: map with string values. Added migration guide to README for users upgrading from old format.

Signed-off-by: subbaksh <subbaksh@cisco.com>

…eering

Signed-off-by: subbaksh <subbaksh@cisco.com>

…onal

- Move langchain-huggingface to optional dependency in common/pyproject.toml
- Remove sentence-transformers and huggingface-hub from server core deps
- Add [huggingface] optional extra for local embedding models
- Implement lazy imports in embeddings_factory.py for all providers
- Add VARIANT build arg to Dockerfile.server (default/huggingface)
- Add .venv cleanup step to reduce image size further
- Add build_hf_variant workflow input for on-demand HF builds
- Document image variants and embedding providers in README

Default image reduced from 2.58GB to 1.33GB (-48%). HuggingFace variant available with -hf tag suffix when needed.

Signed-off-by: subbaksh <subbaksh@cisco.com>

Signed-off-by: Shubham Bakshi <subbaksh@cisco.com>

github-actions · 2026-02-12T11:28:53Z

📊 Test Coverage Report

Main Tests Coverage

Metric	Coverage	Details
Lines	46.7%	4151/8888 lines
Branches	0.0%	0/0 branches

📁 Coverage Artifacts

Main tests: coverage-reports-main artifact
RAG tests: coverage-reports-rag artifact (not available)
Download artifacts to view detailed HTML coverage reports

- Add slim ingestors variant without Playwright (~950MB vs 2.5GB)
- Move scrapy-playwright to optional [playwright] extra
- Consolidate all image variants (server-hf, ingestors-slim) into single matrix
- Add .venv cleanup to reduce image sizes

Signed-off-by: subbaksh <subbaksh@cisco.com>

Signed-off-by: subbaksh <subbaksh@cisco.com>

…tch ingestion to job status

- Add sitemap fetch errors to self.errors for job status reporting
- Add robots.txt fetch errors to self.errors for job status reporting
- Add batch ingestion failures to job_manager.add_error_msg() for UI visibility

Signed-off-by: subbaksh <subbaksh@cisco.com>

…nd robots.txt failures

Use _get_failure_reason() to extract proper HTTP status codes and error details instead of raw failure.value for better error messages.

Signed-off-by: subbaksh <subbaksh@cisco.com>

subbaksh requested a review from sriaradhyula as a code owner February 10, 2026 19:02

github-project-automation bot added this to SIG/CAIPE Agentic AI (AI Platform Engineering) Feb 10, 2026

subbaksh force-pushed the rag-server-scrapy branch from 6c22e9e to 50cc7c0 Compare February 10, 2026 19:03

subbaksh changed the title ~~Rag server scrapy~~ Scrapy Web Loader + LiteLLM Embeddings Feb 10, 2026

subbaksh added 13 commits February 10, 2026 19:10

feat(ingestors): add status messages for JavaScript rendering mode

9ec3bca

Signed-off-by: subbaksh <subbaksh@cisco.com>

chore(deps): update openai package in lock files

d4b783f

Signed-off-by: subbaksh <subbaksh@cisco.com>

fix(lint): remove unused imports and delete scripts folder

2a3290e

Signed-off-by: subbaksh <subbaksh@cisco.com>

subbaksh force-pushed the rag-server-scrapy branch from 3fddbae to 2a3290e Compare February 10, 2026 19:10

chore: bump chart versions for rag-stack rag-server ai-platform-engin…

2d06959

…eering

subbaksh changed the title ~~Scrapy Web Loader + LiteLLM Embeddings~~ feat(rag): scrapy web loader + litellm embeddings Feb 11, 2026

fix(lint): remove unnecessary f-string prefix

18d73c2

Signed-off-by: subbaksh <subbaksh@cisco.com>

subbaksh mentioned this pull request Feb 12, 2026

fix(webloader): surface sitemap discovery errors to UI and logs #794

Closed

3 tasks

subbaksh and others added 2 commits February 12, 2026 11:27

Merge branch 'main' into rag-server-scrapy

9440ec2

Signed-off-by: Shubham Bakshi <subbaksh@cisco.com>

chore: bump chart versions for ai-platform-engineering

b6ebcc7

fix: uv lock

87fc240

Signed-off-by: subbaksh <subbaksh@cisco.com>

fix(web-ingestor): use consistent HTTP error formatting for sitemap a…

19a13ed

…nd robots.txt failures

Use _get_failure_reason() to extract proper HTTP status codes and error details instead of raw failure.value for better error messages.

Signed-off-by: subbaksh <subbaksh@cisco.com>

suwhang-cisco approved these changes Feb 12, 2026

subbaksh merged commit 09eab8a into main Feb 12, 2026
44 checks passed

github-project-automation bot moved this to Done in SIG/CAIPE Agentic AI (AI Platform Engineering) Feb 12, 2026

subbaksh deleted the rag-server-scrapy branch February 12, 2026 12:54
