feat(rag): scrapy web loader + litellm embeddings #790
✅ No proprietary content detected. This PR is clear for review!
6c22e9e to 50cc7c0
🧪 CAIPE UI Test Results
✅ All tests passed
🔴 Overall Coverage: 30%
Replace legacy loader with a modern Scrapy-based web scraping system:
- Add spider types: sitemap, recursive, and single-url crawling modes
- Add parser framework with auto-detection for common doc sites (Docusaurus, MkDocs, Sphinx, ReadTheDocs, VitePress, generic)
- Add document processing pipeline with LangChain integration
- Add worker pool for parallel Scrapy crawls with batching
- Add comprehensive test suite for parsers, pipeline, and spiders

This provides better error handling, progress tracking, and support for various documentation site frameworks.

Signed-off-by: subbaksh <subbaksh@cisco.com>
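The parser auto-detection mentioned in this commit could work roughly as follows. This is a minimal sketch, not the code from the PR: it assumes detection keys off the page's `<meta name="generator">` tag, and `GENERATOR_HINTS` and `detect_parser` are hypothetical names. ReadTheDocs is left out because it is a hosting platform rather than a generator string, so it would need a different signal.

```python
import re

# Hypothetical hints: substring expected in the generator meta tag -> parser name.
# The PR's real detection rules are not shown in this thread.
GENERATOR_HINTS = {
    "docusaurus": "docusaurus",
    "mkdocs": "mkdocs",
    "sphinx": "sphinx",
    "vitepress": "vitepress",
}

# Matches <meta name="generator" content="..."> with the name attribute first.
_GENERATOR_RE = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def detect_parser(html: str) -> str:
    """Return a parser name based on the generator meta tag, else 'generic'."""
    match = _GENERATOR_RE.search(html)
    if match:
        generator = match.group(1).lower()
        for hint, parser_name in GENERATOR_HINTS.items():
            if hint in generator:
                return parser_name
    # Fall back to the generic parser when nothing recognisable is found.
    return "generic"
```

Falling back to `"generic"` keeps unknown sites ingestible instead of failing the crawl.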
- Add ScrapySettings model with crawl_mode, max_pages, allowed_domains, exclude_patterns, and request timeout configuration
- Add document_count and chunk_count fields to JobInfo for metrics
- Add increment_document_count() and increment_chunk_count() methods to JobManager for atomic counter updates
- Update UrlIngestRequest to use new ScrapySettings for web ingestion

Signed-off-by: subbaksh <subbaksh@cisco.com>
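A rough picture of the settings model and the counters, under stated assumptions. The field names `crawl_mode`, `max_pages`, `allowed_domains`, and `exclude_patterns` come from the commit message; the defaults, the `request_timeout_seconds` name, and the lock-based `JobCounters` class are hypothetical stand-ins. The real JobManager may well keep its counters in an external store rather than in process memory.

```python
import threading
from dataclasses import dataclass, field
from typing import List, Optional

CRAWL_MODES = ("sitemap", "recursive", "single-url")


@dataclass
class ScrapySettings:
    crawl_mode: str = "sitemap"                  # assumed default
    max_pages: int = 100                         # assumed default
    allowed_domains: Optional[List[str]] = None
    exclude_patterns: List[str] = field(default_factory=list)
    request_timeout_seconds: float = 30.0        # hypothetical name and default

    def __post_init__(self) -> None:
        # Reject modes other than the three named in the PR.
        if self.crawl_mode not in CRAWL_MODES:
            raise ValueError(f"unknown crawl_mode: {self.crawl_mode!r}")


class JobCounters:
    """In-memory stand-in for the JobManager counters (illustration only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.document_count = 0
        self.chunk_count = 0

    def increment_document_count(self, by: int = 1) -> int:
        # The lock makes the read-modify-write step atomic across threads.
        with self._lock:
            self.document_count += by
            return self.document_count

    def increment_chunk_count(self, by: int = 1) -> int:
        with self._lock:
            self.chunk_count += by
            return self.chunk_count
```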
- Replace legacy loader.py with new Scrapy-based scrapy_loader.py
- Remove deprecated url/ scrapers (docsaurus, mkdocs stubs)
- Update ingestor to use ScrapySettings from request
- Add document count tracking via job manager on batch send
- Improve error handling with descriptive failure messages
- Update README with new architecture documentation

Signed-off-by: subbaksh <subbaksh@cisco.com>
- Increment chunk_count via job manager after upserting to vector DB
- Add duplicate primary key cleanup script for Milvus maintenance
- Refactor ingestion processing for improved reliability

Signed-off-by: subbaksh <subbaksh@cisco.com>
- Add document_count and chunk_count to IngestionJob type
- Show metrics (X documents, Y chunks) on collapsed datasource row
- Display Documents and Chunks fields in expanded job details
- UI improvements: terminal-style errors, default sitemap mode, moved description outside advanced options

Signed-off-by: subbaksh <subbaksh@cisco.com>
…fields

Support old datasource format with check_for_sitemaps, sitemap_max_urls, and ingest_type fields while encouraging migration to new settings object:
- Add deprecated optional fields to UrlIngestRequest model
- Add _get_effective_settings() to map old fields to new ScrapySettings
- Log warnings when deprecated fields are detected
- Show warning in job status message with action to delete and re-ingest
- New settings take precedence if both old and new are provided

Signed-off-by: subbaksh <subbaksh@cisco.com>
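The precedence rule described in this commit can be sketched as below. Only the three deprecated field names and the "new settings win" rule come from the commit; the concrete mapping (`check_for_sitemaps` to a crawl mode, `sitemap_max_urls` to `max_pages`) is an assumption for illustration, and `ingest_type` is only flagged because its meaning is not described here. The function works on plain dicts rather than the real request model.

```python
import logging
from typing import Any, Dict, List, Tuple

logger = logging.getLogger(__name__)

DEPRECATED_FIELDS = ("check_for_sitemaps", "sitemap_max_urls", "ingest_type")


def get_effective_settings(request: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return (effective settings, deprecated fields that were present)."""
    used = [name for name in DEPRECATED_FIELDS if request.get(name) is not None]
    settings = request.get("settings")

    if settings is not None:
        # New-style settings take precedence over any deprecated fields.
        effective = dict(settings)
    else:
        effective = {}
        # Assumed mapping: the real translation is not shown in this thread.
        if request.get("check_for_sitemaps") is not None:
            effective["crawl_mode"] = (
                "sitemap" if request["check_for_sitemaps"] else "single-url"
            )
        if request.get("sitemap_max_urls") is not None:
            effective["max_pages"] = request["sitemap_max_urls"]

    if used:
        logger.warning("Deprecated ingest fields in request: %s", ", ".join(used))
    return effective, used
```

Returning the list of deprecated fields alongside the settings lets the caller put the same warning into the job status message.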
Move source, language, and generator fields into nested metadata dict to match DocumentMetadata model structure. The UI expects the source URL at metadata.metadata.source for displaying clickable links in search results.

Signed-off-by: subbaksh <subbaksh@cisco.com>
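To make the nesting concrete, here is a small sketch of the reshaping, assuming the document metadata starts out as a flat dict. `nest_source_fields` and the `title` key in the usage are hypothetical; only the three moved field names and the `metadata.metadata.source` path come from the commit.

```python
from typing import Any, Dict

NESTED_KEYS = ("source", "language", "generator")


def nest_source_fields(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Move source/language/generator under a nested 'metadata' dict."""
    outer = {k: v for k, v in flat.items() if k not in NESTED_KEYS}
    outer["metadata"] = {k: flat[k] for k in NESTED_KEYS if k in flat}
    return outer


doc_metadata = nest_source_fields(
    {
        "title": "Intro",  # hypothetical extra field
        "source": "https://example.com/docs/intro",
        "language": "en",
        "generator": "mkdocs",
    }
)
# When stored as a document's metadata, the URL is then reachable as
# document.metadata["metadata"]["source"].
```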
- Add Chromium system dependencies (libnss3, libatk, etc.)
- Install Playwright Chromium browser for JavaScript rendering
- Fix AWS CLI architecture detection for both ARM and x86_64
- Enable render_javascript option for SPAs and JS-heavy sites

Signed-off-by: subbaksh <subbaksh@cisco.com>
Signed-off-by: subbaksh <subbaksh@cisco.com>
…xtraction

Some OIDC providers only include user claims (email, groups) in the ID token, not the access token. This change allows the UI to pass the ID token via the X-Identity-Token header, which the server validates and uses for claims extraction.

Changes:
- Add validate_id_token() method to OIDCProvider and AuthManager classes
- Add ECDSA algorithm support (ES256, ES384, ES512) for JWT validation
- Modify _authenticate_from_token() to extract and validate X-Identity-Token
- Improve group extraction: use Set for deduplication, check ALL claims
- Support comma-separated OIDC_GROUP_CLAIM for multiple claim names
- Add 'members' to default group claims (for Duo SSO)
- Update UI proxy routes to send X-Identity-Token header
- Update README with documentation

Signed-off-by: subbaksh <subbaksh@cisco.com>
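The group-extraction behaviour listed above (deduplicate with a set, check every configured claim, accept a comma-separated `OIDC_GROUP_CLAIM`) could look like this sketch. The function names are hypothetical, `groups` as a default claim is an assumption (`members` is from the commit), and splitting string-valued claims on commas is also an assumption. It operates on already-validated claims and does no token validation itself.

```python
from typing import Any, List, Mapping, Optional, Set

# 'members' is named in the commit; 'groups' is an assumed default.
DEFAULT_GROUP_CLAIMS = ["groups", "members"]


def claim_names(env_value: Optional[str]) -> List[str]:
    """Parse a comma-separated OIDC_GROUP_CLAIM value into claim names."""
    if not env_value:
        return list(DEFAULT_GROUP_CLAIMS)
    return [name.strip() for name in env_value.split(",") if name.strip()]


def extract_groups(claims: Mapping[str, Any], names: List[str]) -> Set[str]:
    """Collect groups from ALL listed claims, deduplicated via a set."""
    groups: Set[str] = set()
    for name in names:
        value = claims.get(name)
        if isinstance(value, str):
            # Assumption: a string claim may hold several comma-separated groups.
            groups.update(part.strip() for part in value.split(",") if part.strip())
        elif isinstance(value, (list, tuple)):
            groups.update(str(item) for item in value)
    return groups
```

Checking every claim instead of stopping at the first match matters when a provider splits memberships across several claims.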
Add litellm as a new embeddings provider that connects to a LiteLLM proxy. Uses OpenAIEmbeddings with custom base_url since LiteLLM proxy is OpenAI-compatible.

Env vars:
- EMBEDDINGS_PROVIDER=litellm
- EMBEDDINGS_MODEL=<model-name>
- LITELLM_API_BASE=<proxy-url> (required)
- LITELLM_API_KEY=<api-key> (optional)

Signed-off-by: subbaksh <subbaksh@cisco.com>
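A minimal sketch of how these env vars might be turned into an embeddings client. The env var names come from the commit; the function names, the placeholder key used when `LITELLM_API_KEY` is unset, and the choice of the `langchain_openai` package for `OpenAIEmbeddings` are assumptions. The import is deferred so the kwargs helper can be used without the package installed.

```python
import os
from typing import Dict, Mapping, Optional


def litellm_embedding_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build keyword arguments for an OpenAI-compatible embeddings client."""
    env = os.environ if env is None else env
    base_url = env.get("LITELLM_API_BASE")
    if not base_url:
        raise ValueError("LITELLM_API_BASE is required when EMBEDDINGS_PROVIDER=litellm")

    kwargs = {
        "base_url": base_url,
        # Assumption: send a placeholder when the proxy needs no key,
        # because the OpenAI client expects some key to be set.
        "api_key": env.get("LITELLM_API_KEY") or "not-needed",
    }
    model = env.get("EMBEDDINGS_MODEL")
    if model:
        kwargs["model"] = model
    return kwargs


def build_litellm_embeddings(env: Optional[Mapping[str, str]] = None):
    # Deferred import: only needed when this provider is actually selected.
    from langchain_openai import OpenAIEmbeddings

    return OpenAIEmbeddings(**litellm_embedding_kwargs(env))
```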
Signed-off-by: subbaksh <subbaksh@cisco.com>
Signed-off-by: subbaksh <subbaksh@cisco.com>
3fddbae to 2a3290e
Replace individual values.yaml keys with a generic env: map pattern for rag-server and web-ingestor configuration. This reduces template complexity and makes it easier to add new environment variables without chart changes.

Kept as computed values (from global):
- REDIS_URL, NEO4J_*, MILVUS_URI, ONTOLOGY_AGENT_RESTAPI_ADDR
- ENABLE_GRAPH_RAG (has global fallback)

All other config now via env: map with string values. Added migration guide to README for users upgrading from old format.

Signed-off-by: subbaksh <subbaksh@cisco.com>
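For anyone upgrading, the key renames can be applied mechanically to a parsed values dict. The sketch below is not part of the chart: the old-to-new pairs are the ones listed in the PR description's migration tables, while `migrate_values` and the string conversion rules (lowercase booleans, comma-joined lists) are assumptions.

```python
import copy
from typing import Any, Dict

# Old values key -> new env-map key, as dotted paths.
KEY_TO_ENV = {
    "enableMcp": "env.ENABLE_MCP",
    "skipInitTests": "env.SKIP_INIT_TESTS",
    "embeddingsProvider": "env.EMBEDDINGS_PROVIDER",
    "embeddingsModel": "env.EMBEDDINGS_MODEL",
    "maxDocumentsPerIngest": "env.MAX_DOCUMENTS_PER_INGEST",
    "maxResultsPerQuery": "env.MAX_RESULTS_PER_QUERY",
    "maxIngestionConcurrency": "env.MAX_INGESTION_CONCURRENCY",
    "logLevel": "env.LOG_LEVEL",
    "rbac.allowUnauthenticated": "env.ALLOW_UNAUTHENTICATED",
    "rbac.adminGroups": "env.RBAC_ADMIN_GROUPS",
    "rbac.readonlyGroups": "env.RBAC_READONLY_GROUPS",
    "rbac.defaultRole": "env.RBAC_DEFAULT_ROLE",
    "webIngestor.logLevel": "webIngestor.env.LOG_LEVEL",
    "webIngestor.maxConcurrency": "webIngestor.env.WEBLOADER_MAX_CONCURRENCY",
    "webIngestor.maxIngestionTasks": "webIngestor.env.WEBLOADER_MAX_INGESTION_TASKS",
    "webIngestor.reloadInterval": "webIngestor.env.WEBLOADER_RELOAD_INTERVAL",
}


def _pop_path(values: Dict[str, Any], path: str) -> Any:
    """Remove and return the value at a dotted path, or None if absent."""
    *parents, leaf = path.split(".")
    node: Any = values
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return None
    return node.pop(leaf, None)


def _set_path(values: Dict[str, Any], path: str, value: str) -> None:
    *parents, leaf = path.split(".")
    node = values
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def _to_env_string(value: Any) -> str:
    # Assumed conversions; the commit only says env values are strings.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (list, tuple)):
        return ",".join(str(v) for v in value)
    return str(value)


def migrate_values(values: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of values with old keys moved under the env: maps."""
    migrated = copy.deepcopy(values)
    for old_path, new_path in KEY_TO_ENV.items():
        value = _pop_path(migrated, old_path)
        if value is not None:
            _set_path(migrated, new_path, _to_env_string(value))
    return migrated
```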
Signed-off-by: subbaksh <subbaksh@cisco.com>
…onal
- Move langchain-huggingface to optional dependency in common/pyproject.toml
- Remove sentence-transformers and huggingface-hub from server core deps
- Add [huggingface] optional extra for local embedding models
- Implement lazy imports in embeddings_factory.py for all providers
- Add VARIANT build arg to Dockerfile.server (default/huggingface)
- Add .venv cleanup step to reduce image size further
- Add build_hf_variant workflow input for on-demand HF builds
- Document image variants and embedding providers in README

Default image reduced from 2.58GB to 1.33GB (-48%). HuggingFace variant available with -hf tag suffix when needed.

Signed-off-by: subbaksh <subbaksh@cisco.com>
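The lazy-import idea is what lets the default image drop the HuggingFace stack: a provider's package is imported only when that provider is selected. A generic sketch of the pattern follows, with a hypothetical registry and function name; the actual layout of embeddings_factory.py is not shown in this thread.

```python
import importlib
from typing import Any, Dict, Optional, Tuple

# provider name -> (module to import, class name). Illustrative entries only.
PROVIDERS: Dict[str, Tuple[str, str]] = {
    "openai": ("langchain_openai", "OpenAIEmbeddings"),
    "huggingface": ("langchain_huggingface", "HuggingFaceEmbeddings"),
}


def load_embeddings_class(
    provider: str, registry: Optional[Dict[str, Tuple[str, str]]] = None
) -> Any:
    """Import the provider's module on demand and return its embeddings class."""
    registry = PROVIDERS if registry is None else registry
    if provider not in registry:
        raise ValueError(f"unsupported embeddings provider: {provider!r}")
    module_name, class_name = registry[provider]
    try:
        # Nothing is imported until a provider is actually requested.
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"provider {provider!r} needs the optional package {module_name!r}"
        ) from exc
    return getattr(module, class_name)
```

With this shape, an image built without the optional extra fails only when someone selects that provider, and the error names the missing package.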
Signed-off-by: Shubham Bakshi <subbaksh@cisco.com>
- Add slim ingestors variant without Playwright (~950MB vs 2.5GB)
- Move scrapy-playwright to optional [playwright] extra
- Consolidate all image variants (server-hf, ingestors-slim) into single matrix
- Add .venv cleanup to reduce image sizes

Signed-off-by: subbaksh <subbaksh@cisco.com>
Signed-off-by: subbaksh <subbaksh@cisco.com>
…tch ingestion to job status
- Add sitemap fetch errors to self.errors for job status reporting
- Add robots.txt fetch errors to self.errors for job status reporting
- Add batch ingestion failures to job_manager.add_error_msg() for UI visibility

Signed-off-by: subbaksh <subbaksh@cisco.com>
…nd robots.txt failures

Use _get_failure_reason() to extract proper HTTP status codes and error details instead of raw failure.value for better error messages.

Signed-off-by: subbaksh <subbaksh@cisco.com>
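One plausible shape for such a helper is sketched below. It is duck-typed and not the PR's implementation: it assumes the errback receives a Twisted-style failure whose `.value` is the exception, and that HTTP errors carry a `.response` with a `.status`, as Scrapy's `HttpError` does. The usage lines rely on stand-in objects instead of real Scrapy failures.

```python
from types import SimpleNamespace
from typing import Any


def get_failure_reason(failure: Any) -> str:
    """Turn a crawl failure into a short, readable message."""
    error = getattr(failure, "value", failure)
    response = getattr(error, "response", None)
    status = getattr(response, "status", None)
    if status is not None:
        # HTTP-level failure: report the status code.
        return f"HTTP {status}"
    # Anything else: exception type plus its message, if any.
    name = type(error).__name__
    detail = str(error)
    return f"{name}: {detail}" if detail else name


# Stand-ins for illustration only.
http_failure = SimpleNamespace(
    value=SimpleNamespace(response=SimpleNamespace(status=404))
)
timeout_failure = SimpleNamespace(value=TimeoutError("timed out after 30s"))

http_reason = get_failure_reason(http_failure)
timeout_reason = get_failure_reason(timeout_failure)
```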
Scrapy Web Loader + LiteLLM Embeddings
Replaces basic web scraping with Scrapy for proper crawling. Adds LiteLLM proxy support for embeddings.
Web Ingestor
render_js: true)
max_depth, allowed_domains, url_patterns, download_delay, concurrent_requests, obey_robots_txt
Server
Embeddings
UI
Docker optimisation
Helm Chart
Rag server chart
| Old key | New key |
|---|---|
| enableMcp | env.ENABLE_MCP |
| skipInitTests | env.SKIP_INIT_TESTS |
| embeddingsProvider | env.EMBEDDINGS_PROVIDER |
| embeddingsModel | env.EMBEDDINGS_MODEL |
| maxDocumentsPerIngest | env.MAX_DOCUMENTS_PER_INGEST |
| maxResultsPerQuery | env.MAX_RESULTS_PER_QUERY |
| maxIngestionConcurrency | env.MAX_INGESTION_CONCURRENCY |
| logLevel | env.LOG_LEVEL |
| rbac.allowUnauthenticated | env.ALLOW_UNAUTHENTICATED |
| rbac.adminGroups | env.RBAC_ADMIN_GROUPS |
| rbac.readonlyGroups | env.RBAC_READONLY_GROUPS |
| rbac.defaultRole | env.RBAC_DEFAULT_ROLE |

Web Ingestor
| Old key | New key |
|---|---|
| webIngestor.logLevel | webIngestor.env.LOG_LEVEL |
| webIngestor.maxConcurrency | webIngestor.env.WEBLOADER_MAX_CONCURRENCY |
| webIngestor.maxIngestionTasks | webIngestor.env.WEBLOADER_MAX_INGESTION_TASKS |
| webIngestor.reloadInterval | webIngestor.env.WEBLOADER_RELOAD_INTERVAL |

Breaking changes
Type of Change
Pre-release Helm Charts (Optional)
For chart changes, you can test pre-release versions before merging:
pre/ for automatic pre-release builds
helm-prerelease label
ghcr.io/cnoe-io/pre-release-helm-charts
Checklist