
Conversation

Contributor

@amirasaran amirasaran commented Aug 22, 2025

User description

Description

feat: Introduce Knowledge Base feature with plan-based credit tracking and usage history

  • Add Knowledge Base as a new core feature:
    • Implement models, signals, and logic for managing knowledge bases and their documents.
    • Integrate KnowledgeBaseProcessor and related tooling.
  • Extend Plan model to support knowledge base quotas:
    • Add fields for max number of knowledge bases, max documents per knowledge base, and retrieval rate limits.
  • Refactor UsageHistory to support generic credit tracking:
    • Use GenericForeignKey to associate usage history with crawl, search, sitemap, and knowledge base document events.
    • Enforce UUID primary key for referenced models.
    • Update admin and serializer logic for new usage history structure.
  • Update plan enforcement and validators:
    • Modularize validation for crawl, search, sitemap, and knowledge base operations.
    • Enforce plan limits for knowledge base creation and document addition.
    • Improve error messages and validation feedback.
  • Add API endpoints and filters for usage history and knowledge base management.
  • Update signals to automate usage history creation for knowledge base document events.
  • Update .gitignore for new directories and artifacts.
  • Add new dependencies for knowledge base and search (elasticsearch, aiohttp, dataclasses-json, etc.).
  • Remove debugging code and improve formatting across updated files.

This commit introduces the Knowledge Base system, enabling teams to create, manage, and track knowledge bases and their documents with full plan-based credit enforcement and usage history auditing.
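
The UsageHistory refactor replaces one nullable foreign key per event kind with a `(content_type, content_id)` pair. A framework-agnostic sketch of the idea (names here are illustrative; the PR itself presumably uses Django's ContentType framework for the lookup):

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical registry mapping content-type labels to object stores.
# In Django this role is played by the ContentType framework.
REGISTRY: dict[str, dict[uuid.UUID, object]] = {}

@dataclass
class UsageRecord:
    """Generic usage-history entry: a (content_type, content_id) pair
    instead of separate crawl/search/sitemap/KB-document foreign keys."""
    content_type: str
    content_id: uuid.UUID
    credits: int = 1

    def resolve(self):
        # UUID primary keys are enforced on referenced models, so the
        # same content_id column works for every event kind.
        return REGISTRY[self.content_type].get(self.content_id)

@dataclass
class CrawlRequest:
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    url: str = ""

# Attach a usage record to a crawl request through the generic pair.
crawl = CrawlRequest(url="https://example.com")
REGISTRY["crawlrequest"] = {crawl.id: crawl}
record = UsageRecord(content_type="crawlrequest", content_id=crawl.id)
assert record.resolve() is crawl
```

The same `UsageRecord` shape then serves knowledge base document events without schema changes, which is what lets the admin and serializer logic stay generic.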

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

UI Changes

Testing

  • Test A
  • Test B

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

PR Type

Enhancement


Description

Knowledge Base System: Complete implementation of a new knowledge base feature with document management, vector search, and AI-powered querying capabilities
LLM Provider Management: Add comprehensive provider configuration system supporting OpenAI and WaterCrawl with model discovery and API key management
Generic Usage Tracking: Refactor usage history to support multiple content types (crawl, search, sitemap, knowledge base) using GenericForeignKey
Plan Extensions: Extend subscription plans with knowledge base quotas (max knowledge bases, documents per KB, retrieval rate limits)
Vector Store Integration: Implement OpenSearch-based vector store with multiple retrieval strategies (similarity, MMR, hybrid search)
Document Processing Pipeline: Add comprehensive document processing with text splitting, embedding, summarization, and keyword extraction
Frontend Components: Complete React frontend with knowledge base management, provider configuration, and usage history interfaces
Admin Interface: Add superuser-only admin panels for managing LLM providers, models, and system configurations
API Endpoints: Comprehensive REST API for knowledge base operations, document import, querying, and provider management
Background Tasks: Celery-based async processing for document indexing, vector store operations, and content processing


Diagram Walkthrough

flowchart LR
  KB["Knowledge Base"] --> DOC["Documents"]
  DOC --> CHUNK["Chunks"]
  CHUNK --> VS["Vector Store"]
  VS --> SEARCH["Search & Query"]
  
  LLM["LLM Providers"] --> EMB["Embeddings"]
  LLM --> SUM["Summarization"]
  EMB --> VS
  SUM --> DOC
  
  PLAN["Plan Model"] --> QUOTA["KB Quotas"]
  USAGE["Usage History"] --> GFK["Generic FK"]
  GFK --> KB
  GFK --> CRAWL["Crawl Requests"]
  GFK --> SITEMAP["Sitemaps"]
  
  FRONTEND["React Frontend"] --> API["REST API"]
  API --> KB
  API --> LLM
  API --> USAGE

File Walkthrough

Relevant files
New feature
14 files
knowledgeBase.ts
Knowledge Base API Service Implementation                               

frontend/src/services/api/knowledgeBase.ts

• Add comprehensive API service for knowledge base operations including CRUD operations, document management, and querying
• Implement methods for importing documents from URLs, crawl results, and files with upload progress tracking
• Add support for context-aware enhancement, chunk retrieval, and retry indexing functionality

+128/-0 
provider.ts
Admin Provider Management API Service                                       

frontend/src/services/api/admin/provider.ts

• Add admin API service for managing provider configurations, LLM
models, and embedding models
• Implement CRUD operations with
pagination support for all provider-related entities
• Add provider
synchronization and configuration testing endpoints

+87/-0   
knowledge.ts
Knowledge Base TypeScript Type Definitions                             

frontend/src/types/knowledge.ts

• Define comprehensive TypeScript interfaces for knowledge base
entities including status enums and form data types
• Add interfaces
for documents, chunks, context-aware enhancement, and import
operations
• Include default values and utility functions for chunk
size calculations

+104/-0 
provider.ts
Provider Configuration API Service                                             

frontend/src/services/api/provider.ts

• Add API service for provider configuration management with CRUD
operations
• Implement provider listing, configuration testing, and
model management endpoints
• Support both paginated and non-paginated
provider configuration retrieval

+66/-0   
provider.ts
Provider and Model Type Definitions                                           

frontend/src/types/provider.ts

• Define TypeScript interfaces for providers, models, embeddings, and
configurations
• Add enums for option requirements and form data
structures
• Include comprehensive type definitions for LLM and
embedding model properties

+74/-0   
provider.ts
Admin Provider Type Definitions                                                   

frontend/src/types/admin/provider.ts

• Add admin-specific TypeScript interfaces for provider management
• Define visibility level enums and request/response types for LLM and embedding models
• Include comprehensive admin provider configuration interfaces

+71/-0   
usage_history.ts
Usage History API Service                                                               

frontend/src/services/api/usage_history.ts

• Add API service for retrieving usage history with pagination and
filtering support
• Implement filtering by team API key and content
type parameters

+21/-0   
usage_history.ts
Usage History Type Definitions                                                     

frontend/src/types/usage_history.ts

• Define TypeScript interfaces for usage history tracking across
different content types
• Add content type enum for crawl requests,
sitemaps, searches, and knowledge base documents
• Include team API
key summary interface for usage attribution

+24/-0   
vectore_store.py
OpenSearch Vector Store Implementation                                     

backend/knowledge_base/tools/vectore_store.py

• Implement comprehensive OpenSearch vector store with pluggable
retrieval strategies
• Add support for similarity search, MMR search,
and hybrid retrieval methods
• Include automatic index creation,
document management, and keyword-based scoring

+734/-0 
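
MMR, one of the retrieval strategies listed above, re-ranks candidates by trading relevance to the query against redundancy with what is already selected. A self-contained sketch of the scoring only (the PR's implementation reportedly operates on OpenSearch-stored embeddings; this toy version uses plain tuples):

```python
import math

def cosine(a, b):
    # Cosine similarity between two 2-d vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def mmr(query, candidates, k=2, lam=0.5):
    """Pick k vectors balancing relevance to the query (weight lam)
    against redundancy with already-selected vectors (weight 1-lam)."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# d0 and d1 are near-duplicates; d2 is less relevant but diverse.
docs = [(1.0, 0.0), (1.0, 0.05), (0.0, 1.0)]
picked = mmr(query=(1.0, 0.4), candidates=docs, k=2)
assert picked == [1, 2]  # most relevant first, then the diverse doc
```

Similarity search is the `lam = 1` degenerate case; hybrid retrieval additionally mixes in keyword (BM25) scores.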
views.py
Knowledge Base API ViewSets                                                           

backend/knowledge_base/views.py

• Add comprehensive ViewSets for knowledge bases, documents, and
chunks with full CRUD operations
• Implement document import from
URLs, crawl results, files, and context-aware enhancement
• Add query
functionality with rate limiting and plan validation integration

+467/-0 
retrieval_strategies.py
Vector Store Retrieval Strategies                                               

backend/knowledge_base/tools/retrieval_strategies.py

• Implement pluggable retrieval strategies for different search
approaches (dense, content, keyword)
• Add BM25-optimized queries with
hybrid search support and keyword boosting
• Include comprehensive
OpenSearch index mapping generation with similarity metrics

+451/-0 
serializers.py
Knowledge Base API Serializers                                                     

backend/knowledge_base/serializers.py

• Add comprehensive serializers for knowledge base entities with
validation
• Implement form data serializers for document import from
various sources
• Include context-aware enhancement and query
serializers with plan validation integration

+337/-0 
factories.py
Knowledge Base Component Factories                                             

backend/knowledge_base/factories.py

• Implement factory pattern for creating knowledge base components
(embedders, vector stores, summarizers)
• Add support for multiple
providers (OpenAI, WaterCrawl) and file format converters
• Include
configurable text splitters and keyword extractors with knowledge base
integration

+249/-0 
serializers.py
LLM Provider and Model Serializers                                             

backend/llm/serializers.py

• Add serializers for LLM models, embedding models, and provider
configurations
• Implement API key encryption/decryption and provider
configuration testing
• Include comprehensive validation for provider
settings and model parameters

+150/-0 
Enhancement
49 files
utils.ts
CSS Class Name Utility Function                                                   

frontend/src/lib/utils.ts

• Add classnames utility function for conditional CSS class generation
• Implement object-based class name filtering and joining functionality

+7/-0     
subscription.ts
Knowledge Base Subscription Fields                                             

frontend/src/types/subscription.ts

• Extend CurrentSubscription interface with knowledge base quota
fields
• Add properties for number of knowledge bases, documents per
knowledge base, and retrieval rate limits

+3/-0     
user.ts
User Profile Superuser Field                                                         

frontend/src/types/user.ts

• Add is_superuser boolean field to the Profile interface
• Extend
user profile with administrative privilege indicator

+1/-0     
validators.py
Modular Plan Validation System                                                     

backend/plan/validators.py

• Refactor plan validation into modular mixins for different request
types
• Add comprehensive knowledge base validation for creation and
document limits
• Implement generic credit validation with daily and
total credit checking

+215/-90
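
The modular validators might look roughly like this (a minimal sketch with invented class names and plan fields, not the PR's actual code):

```python
class CreditValidationMixin:
    """Shared check: does the team have enough remaining credits?"""

    def validate_credits(self, plan, used, requested=1):
        remaining = plan["total_credits"] - used
        if requested > remaining:
            raise ValueError(
                f"Insufficient credits: {requested} requested, {remaining} left"
            )

class KnowledgeBaseValidationMixin(CreditValidationMixin):
    """Plan-limit checks specific to knowledge base operations."""

    def validate_create_knowledge_base(self, plan, current_kb_count):
        if current_kb_count >= plan["max_knowledge_bases"]:
            raise ValueError("Plan limit reached: cannot create more knowledge bases")

    def validate_add_document(self, plan, current_doc_count):
        if current_doc_count >= plan["max_documents_per_kb"]:
            raise ValueError("Plan limit reached: cannot add more documents")

# A concrete validator composes only the mixins its request type needs.
class PlanValidator(KnowledgeBaseValidationMixin):
    pass

plan = {"total_credits": 100, "max_knowledge_bases": 2, "max_documents_per_kb": 50}
v = PlanValidator()
v.validate_create_knowledge_base(plan, current_kb_count=1)  # within quota
v.validate_credits(plan, used=90, requested=5)              # within credits
```

Crawl, search, and sitemap requests would each get an analogous mixin over the same shared credit check.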
services.py
Generic Usage Tracking and Knowledge Base Quotas                 

backend/plan/services.py

• Extend plan services with knowledge base quota support and generic
usage tracking
• Refactor usage history to support multiple content
types via GenericForeignKey
• Add credit calculation and validation
for knowledge base operations

+116/-87
views.py
Add admin API views for LLM management                                     

backend/llm/admin_api/views.py

• Add comprehensive admin API views for LLM provider configurations,
models, and embeddings
• Implement CRUD operations with OpenAPI
documentation and proper permissions
• Include custom actions for
syncing models/embeddings and testing configurations

+230/-0 
services.py
Implement knowledge base core services                                     

backend/knowledge_base/services.py

• Implement core knowledge base service classes for managing knowledge
bases and documents
• Add methods for adding URLs, files, crawl
results and processing documents
• Include document indexing, vector
store operations and content processing logic

+173/-0 
views.py
Add LLM provider configuration API views                                 

backend/llm/views.py

• Add team-specific provider configuration API endpoints with full
CRUD operations
• Implement provider testing, model listing, and
configuration validation
• Include comprehensive OpenAPI documentation
and proper authentication

+178/-0 
models.py
Define knowledge base data models                                               

backend/knowledge_base/models.py

• Define core knowledge base models: KnowledgeBase,
KnowledgeBaseDocument, KnowledgeBaseChunk
• Add fields for chunking
configuration, embedding models, and summarization settings
• Include
status tracking and metadata fields with proper relationships

+171/-0 
processor.py
Add knowledge base processing engine                                         

backend/knowledge_base/tools/processor.py

• Implement main knowledge base processor for text splitting,
embedding, and vector storage
• Add methods for document persistence,
search operations, and vector store management
• Include factory
pattern integration for various processing components

+177/-0 
0001_initial.py
Initial LLM database schema migration                                       

backend/llm/migrations/0001_initial.py

• Create initial database schema for LLM models, provider configs, and
embedding models
• Define relationships between models and teams with
proper constraints
• Add visibility levels and temperature
configuration fields

+75/-0   
serializers.py
Add admin API serializers for LLM management                         

backend/llm/admin_api/serializers.py

• Add serializers for admin API with validation and encryption
handling
• Implement provider configuration testing and API key
encryption
• Include comprehensive field validation and error handling

+137/-0 
models.py
Define LLM and provider configuration models                         

backend/llm/models.py

• Define LLM model classes: LLMModel, ProviderConfig, EmbeddingModel
• Add provider configuration with team relationships and global/team-specific logic
• Include temperature settings, visibility levels, and model metadata

+126/-0 
0001_initial.py
Initial knowledge base database schema                                     

backend/knowledge_base/migrations/0001_initial.py

• Create initial database schema for knowledge base models
• Define
tables for knowledge bases, documents, and chunks with proper indexing

• Add status tracking and configuration fields for processing

+72/-0   
tasks.py
Add knowledge base background tasks                                           

backend/knowledge_base/tasks.py

• Implement Celery tasks for knowledge base operations: creation,
deletion, crawling
• Add document processing pipeline with error
handling and status updates
• Include vector store initialization and
OpenSearch configuration

+126/-0 
models.py
Extend plan model for knowledge base support                         

backend/plan/models.py

• Extend Plan model with knowledge base quotas and rate limiting
fields
• Refactor UsageHistory to use generic foreign keys for
flexible content association
• Add UUID validation for referenced
models and team API key tracking

+44/-20 
services.py
Add LLM provider service implementations                                 

backend/llm/services.py

• Implement provider service classes for configuration management and
testing
• Add OpenAI provider validation and model temperature
handling
• Include team-specific provider configuration retrieval
logic

+108/-0 
summarizers.py
Add document summarization tools                                                 

backend/knowledge_base/tools/summarizers.py

• Implement LLM-based summarizers with standard and context-aware
variants
• Add context enhancement service for improving user-provided
goals
• Include prompt templates and temperature configuration

+93/-0   
0006_auto_20250807_1344.py
Migrate usage history to generic relationships                     

backend/plan/migrations/0006_auto_20250807_1344.py

• Migrate existing usage history foreign key relationships to generic
foreign keys
• Populate content_type and content_id fields from
existing data
• Include reverse migration for rollback capability

+87/-0   
admin.py
Add knowledge base Django admin interface                               

backend/knowledge_base/admin.py

• Add Django admin interface for knowledge base models
• Configure fieldsets, search fields, and list displays for better management
• Include proper field organization and readonly configurations

+105/-0 
providers.py
Add LLM provider implementations                                                 

backend/llm/providers.py

• Implement provider classes for OpenAI and WaterCrawl with model
discovery
• Add temperature configuration logic and embedding model
definitions
• Include client initialization and API interaction
methods

+99/-0   
keyword_extractors.py
Add keyword extraction tools                                                         

backend/knowledge_base/tools/keyword_extractors.py

• Implement keyword extraction using Jieba and LLM-based approaches
• Add configurable keyword count and filtering logic
• Include Pydantic schema for structured LLM output parsing

+93/-0   
factories.py
Add LLM factory classes                                                                   

backend/llm/factories.py

• Implement factory classes for creating chat models and providers
• Add support for OpenAI and WaterCrawl provider configurations
• Include temperature validation and API key decryption

+89/-0   
0002_initial.py
Add knowledge base model relationships                                     

backend/knowledge_base/migrations/0002_initial.py

• Add foreign key relationships between knowledge base and LLM models
• Create proper constraints and indexes for model relationships
• Include team associations and provider configuration links

+57/-0   
signals.py
Update usage tracking signals for generic approach             

backend/plan/signals.py

• Refactor signal handlers to use generic usage history service
methods
• Add knowledge base document usage tracking with credit
calculation
• Update method names for consistency across different
request types

+28/-7   
file_to_markdown.py
Add file format conversion tools                                                 

backend/knowledge_base/tools/file_to_markdown.py

• Implement file-to-markdown converters for various formats (HTML, DOCX, CSV)
• Add base converter class with storage integration
• Include PyPandoc integration for document format conversion

+76/-0   
helpers.py
Add content cleaning utilities                                                     

backend/knowledge_base/helpers.py

• Implement noise removal utility for cleaning markdown content
• Add
methods for removing SVG, base64 images, HTML tags, and fixing
relative URLs
• Include URL parsing and absolute path conversion logic

+56/-0   
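
The cleanup steps listed above could look like this simplified sketch (the patterns are illustrative stand-ins, not the PR's actual regexes):

```python
import re
from urllib.parse import urljoin

def remove_noise(markdown: str, base_url: str) -> str:
    """Strip common scrape noise from markdown and absolutize links."""
    text = re.sub(r"<svg[\s\S]*?</svg>", "", markdown)           # inline SVG
    text = re.sub(r"!\[[^\]]*\]\(data:image/[^)]*\)", "", text)  # base64 images
    text = re.sub(r"<[^>]+>", "", text)                          # stray HTML tags
    # Rewrite root-relative markdown links against the page's base URL.
    text = re.sub(
        r"\]\((/[^)]*)\)",
        lambda m: "](" + urljoin(base_url, m.group(1)) + ")",
        text,
    )
    return text

cleaned = remove_noise("See [docs](/docs) <b>now</b>", "https://example.com/page")
assert cleaned == "See [docs](https://example.com/docs) now"
```

Removing SVG and base64 payloads before chunking keeps embedding input focused on actual prose rather than markup bulk.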
views.py
Add usage history API endpoint                                                     

backend/plan/views.py

• Add usage history API endpoint with filtering and team-based access

• Include proper queryset definitions for existing viewsets
• Add
comprehensive OpenAPI documentation for new endpoints

+31/-2   
models.py
Define agent system data models                                                   

backend/agent/models.py

• Define agent system models: Agent, Tool, Conversation, Message
• Add
relationships with LLM models, provider configs, and teams
• Include
configuration fields for agent behavior and tool integration

+74/-0   
interfaces.py
Define knowledge base component interfaces                             

backend/knowledge_base/interfaces.py

• Define abstract base classes for knowledge base components
• Add interfaces for summarizers, keyword extractors, and file converters
• Include proper inheritance structure and method signatures

+69/-0   
throttle.py
Add team-based rate throttling                                                     

backend/plan/throttle.py

• Implement team-based throttling for knowledge base operations
• Add
configurable rate limiting based on team plan settings
• Include cache
key generation and rate limit enforcement

+52/-0   
admin.py
Add LLM Django admin interface                                                     

backend/llm/admin.py

• Add Django admin interface for LLM models and configurations
• Configure list displays, search fields, and fieldset organization
• Include proper field grouping and readonly configurations

+57/-0   
serializers.py
Extend plan serializers for knowledge base                             

backend/plan/serializers.py

• Extend team plan serializer with knowledge base quota fields
• Add usage history serializer with content type and API key information
• Include proper field serialization and relationship handling

+32/-1   
services.py
Add admin services for provider management                             

backend/llm/admin_api/services.py

• Implement admin service for provider configuration management
• Add
automatic model and embedding synchronization logic
• Include provider
factory integration for model discovery

+45/-0   
storage.py
Add knowledge base file storage service                                   

backend/knowledge_base/tools/storage.py

• Implement storage service for knowledge base file management
• Add
file path generation and storage abstraction
• Include unique ID
generation and file extension handling

+46/-0   
0005_usagehistory_content_id_usagehistory_content_type_and_more.py
Add generic foreign key fields to usage history                   

backend/plan/migrations/0005_usagehistory_content_id_usagehistory_content_type_and_more.py

• Add generic foreign key fields to usage history model
• Include
content type and content ID fields for flexible associations
• Add
team API key relationship for tracking usage attribution

+31/-0   
filters.py
Add usage history filtering capabilities                                 

backend/plan/filters.py

• Implement usage history filtering by content type and API key
• Add
validation for content type format and proper error handling
• Include
support for filtering across different model types

+38/-0   
0004_plan_number_of_each_knowledge_base_documents_and_more.py
Add knowledge base quotas to plan model                                   

backend/plan/migrations/0004_plan_number_of_each_knowledge_base_documents_and_more.py

• Add knowledge base quota fields to plan model
• Include number of
knowledge bases and documents per knowledge base limits
• Set default
values for new plan configuration fields

+23/-0   
serializers.py
Add validation to spider option serializers                           

backend/core/serializers.py

• Add minimum value validation to spider option fields
• Ensure
max_depth and page_limit have minimum value of 1
• Improve input
validation for crawling parameters

+2/-2     
0008_plan_knowledge_base_retrival_rate_limit.py
Add knowledge base retrieval rate limiting                             

backend/plan/migrations/0008_plan_knowledge_base_retrival_rate_limit.py

• Add rate limiting field to plan model for knowledge base retrieval
• Include DRF-style rate string format for flexible rate configuration
• Set default value to None for optional rate limiting

+18/-0   
0007_remove_usagehistory_crawl_request_and_more.py
Remove deprecated usage history foreign keys                         

backend/plan/migrations/0007_remove_usagehistory_crawl_request_and_more.py

• Remove old foreign key fields from usage history model
• Clean up deprecated crawl_request, search_request, and sitemap_request fields
• Complete migration to generic foreign key approach

+25/-0   
admin.py
Update plan admin for knowledge base features                       

backend/plan/admin.py

• Add knowledge base quota fields to plan admin interface
• Update
usage history admin display to show generic content
• Include new
fields in plan fieldset organization

+4/-1     
decorators.py
Add API key context tracking in authentication decorator 

backend/user/decorators.py

• Import and call set_application_context_api_key function to store
API key in application context
• Add context tracking for API key
usage during authentication

+2/-0     
0002_alter_providerconfig_api_key.py
Expand API key field size in ProviderConfig model               

backend/llm/migrations/0002_alter_providerconfig_api_key.py

• Change api_key field type from CharField to TextField in
ProviderConfig model
• Allow for longer API keys storage

+18/-0   
application_context.py
Implement application context for API key tracking             

backend/common/application_context.py

• Create application context management using Django's Local storage
• Add functions to set, get, and clear API key context

+15/-0   
permissions.py
Add superuser permission class                                                     

backend/user/permissions.py

• Add IsSuperUser permission class for superuser-only access

+5/-0     
serializers.py
Include superuser status in user serialization                     

backend/user/serializers.py

• Add is_superuser field to user serializer fields list

+1/-0     
markdown.css
Add markdown styling with syntax highlighting support       

frontend/src/styles/markdown.css

• Add comprehensive CSS styling for markdown content with syntax
highlighting
• Include both light and dark theme support for code
blocks and markdown elements

+175/-0 
SelectCrawlResultsPage.tsx
Add crawl results selection page for knowledge base import

frontend/src/pages/dashboard/knowledge-base/SelectCrawlResultsPage.tsx

• Create comprehensive page for selecting crawl results to import into
knowledge base
• Implement pagination, bulk selection, and import
functionality
• Add breadcrumb navigation and loading states

+392/-0 
Bug fix
2 files
views.py
Fix viewset queryset definitions and cleanup                         

backend/core/views.py

• Add missing queryset attributes to existing viewsets for consistency

• Remove debugging print statement from proxy server testing
• Import
additional models for proper type hints

+12/-2   
views.py
Fix user viewset queryset definitions                                       

backend/user/views.py

• Add missing queryset attributes to existing viewsets for consistency

• Import additional models for proper type hints
• Maintain existing
functionality while fixing queryset definitions

+5/-1     
Configuration changes
10 files
consts.py
Add LLM configuration constants                                                   

backend/llm/consts.py

• Define LLM provider constants, choices, and configuration options
• Add visibility levels, truncation options, and provider information
• Include structured provider configuration with required/optional fields

+48/-0   
consts.py
Add knowledge base configuration constants                             

backend/knowledge_base/consts.py

• Define knowledge base status constants and document source types
• Add summarizer type choices and processing status options
• Include comprehensive choice definitions for model fields

+43/-0   
urls.py
Add knowledge base URL routing                                                     

backend/knowledge_base/urls.py

• Define URL routing for knowledge base API endpoints
• Add nested
routing for documents and chunks within knowledge bases
• Include
proper UUID parameter handling in URL patterns

+23/-0   
urls.py
Add usage history URL routing                                                       

backend/plan/urls.py

• Add usage history endpoint to plan URL routing
• Include proper
router registration for new viewset
• Import required view classes for
URL configuration

+7/-1     
urls.py
Add LLM admin API URL routing                                                       

backend/llm/admin_api/urls.py

• Define URL routing for LLM admin API endpoints
• Add router
registration for provider configs, models, and embeddings
• Include
proper basename configuration for API endpoints

+19/-0   
urls.py
Add knowledge base and LLM URL routing                                     

backend/watercrawl/urls.py

• Add URL patterns for knowledge base and LLM endpoints
• Include
admin API routes for LLM management

+10/-0   
apps.py
Add knowledge base Django app configuration                           

backend/knowledge_base/apps.py

• Create Django app configuration for knowledge base module
• Import
signals module in ready method

+9/-0     
apps.py
Add LLM Django app configuration                                                 

backend/llm/apps.py

• Create Django app configuration for LLM module

+6/-0     
apps.py
Add agent Django app configuration                                             

backend/agent/apps.py

• Create Django app configuration for agent module

+6/-0     
.env.example
Add OpenSearch configuration to environment template         

docker/.env.example

• Add OpenSearch configuration settings with password and dashboard
port
• Fix trailing whitespace formatting

+9/-2     
Tests
4 files
test.py
Add noise removal test script                                                       

backend/test.py

• Add test script for noise removal functionality
• Include sample
text processing and output verification
• Demonstrate usage of
NoiseRemover helper class

+40/-0   
tests.py
Add knowledge base test file placeholder                                 

backend/knowledge_base/tests.py

• Create empty test file placeholder for knowledge base module

+1/-0     
tests.py
Add LLM test file placeholder                                                       

backend/llm/tests.py

• Create empty test file placeholder for LLM module

+1/-0     
tests.py
Add agent test file placeholder                                                   

backend/agent/tests.py

• Create empty test file placeholder for agent module

+1/-0     
Miscellaneous
2 files
admin.py
Add agent admin file placeholder                                                 

backend/agent/admin.py

• Create empty admin file placeholder for agent module

+1/-0     
views.py
Add agent views file placeholder                                                 

backend/agent/views.py

• Create empty views file placeholder for agent module

+1/-0     
Dependencies
1 file
pnpm-lock.yaml
Add markdown rendering dependencies to package lock           

frontend/pnpm-lock.yaml

• Add dependencies for markdown rendering: @tailwindcss/typography,
react-markdown, rehype-highlight, rehype-raw, remark-gfm
• Include all
related transitive dependencies and type definitions

+956/-4 
Additional files
86 files
.env.example +5/-0     
__init__.py [link]   
__init__.py [link]   
__init__.py [link]   
__init__.py [link]   
__init__.py [link]   
signals.py +15/-0   
__init__.py [link]   
__init__.py [link]   
interfaces.py +20/-0   
__init__.py [link]   
urls.py +13/-0   
utils.py +7/-0     
pyproject.toml +13/-2   
settings.py +6/-0     
.env.local +9/-0     
README.md +32/-1   
docker-compose.local.yml +30/-1   
docker-compose.yml +38/-11 
python.md +106/-20
package.json +5/-0     
App.tsx +80/-8   
AdminCard.tsx +58/-0   
TeamSelector.tsx +9/-9     
PageOptionsForm.tsx +1/-1     
EnhanceContextModal.tsx +113/-0 
KnowledgeBaseApiDocumentation.tsx +248/-0 
KnowledgeBasePricingInfo.tsx +152/-0 
KnowledgeBaseQueryForm.tsx +233/-0 
KnowledgeBaseQueryResult.tsx +178/-0 
SearchForm.tsx +0/-1     
ProviderConfigForm.tsx +347/-0 
ProviderConfigList.tsx +208/-0 
ProviderConfigSettings.tsx +179/-0 
Breadcrumbs.tsx +3/-9     
Card.tsx +82/-0   
MarkdownRenderer.tsx +26/-0   
Modal.tsx +85/-0   
OptionCard.tsx +71/-0   
Slider.tsx +127/-0 
StatusBadge.tsx +44/-8   
UsageLimitBox.tsx +57/-0   
WithBreadcrumbs.tsx +0/-44   
SitemapApiDocumentation.tsx +6/-9     
index.d.ts +0/-24   
BreadcrumbContext.tsx +40/-0   
AdminLayout.tsx +238/-0 
DashboardLayout.tsx +34/-6   
AdminDashboard.tsx +43/-0   
ManageLLMProvidersPage.tsx +197/-0 
ManageProxiesPage.tsx +165/-0 
ProviderConfigDetailPage.tsx +251/-0 
ApiKeysPage.tsx +9/-0     
CrawlLogsPage.tsx +11/-2   
CrawlPage.tsx +8/-0     
CrawlRequestDetailPage.tsx +10/-0   
DashboardPage.tsx +17/-5   
ProfilePage.tsx +11/-0   
SearchLogsPage.tsx +9/-0     
SearchPage.tsx +9/-1     
SearchRequestDetailPage.tsx +10/-1   
SettingsPage.tsx +27/-2   
SitemapLogsPage.tsx +10/-0   
SitemapPage.tsx +10/-2   
SitemapRequestDetailPage.tsx +10/-0   
UsageHistoryPage.tsx +470/-0 
UsagePage.tsx +9/-0     
BatchUrlImportPage.tsx +138/-0 
ImportOptionsPage.tsx +224/-0 
ImportProgressPage.tsx +195/-0 
KnowledgeBaseDetailPage.tsx +480/-0 
KnowledgeBaseDocumentDetailPage.tsx +295/-0 
KnowledgeBaseEditPage.tsx +469/-0 
KnowledgeBaseNewPage.tsx +815/-0 
KnowledgeBasePage.tsx +227/-0 
KnowledgeBaseQueryPage.tsx +58/-0   
ManualEntryPage.tsx +156/-0 
NewCrawlPage.tsx +127/-0 
NewSitemapPage.tsx +187/-0 
SelectCrawlPage.tsx +223/-0 
SelectSitemapPage.tsx +217/-0 
UploadDocumentsPage.tsx +193/-0 
UrlSelectorPage.tsx +584/-0 
breadcrumbs.ts +0/-105 
classNames.ts +0/-3     
tailwind.config.mjs +2/-1     

…g and usage history

- Add Knowledge Base as a new core feature:
  - Implement models, signals, and logic for managing knowledge bases and their documents.
  - Integrate KnowledgeBaseProcessor and related tooling.
- Extend Plan model to support knowledge base quotas:
  - Add fields for max number of knowledge bases, max documents per knowledge base, and retrieval rate limits.
- Refactor UsageHistory to support generic credit tracking:
  - Use GenericForeignKey to associate usage history with crawl, search, sitemap, and knowledge base document events.
  - Enforce UUID primary key for referenced models.
  - Update admin and serializer logic for new usage history structure.
- Update plan enforcement and validators:
  - Modularize validation for crawl, search, sitemap, and knowledge base operations.
  - Enforce plan limits for knowledge base creation and document addition.
  - Improve error messages and validation feedback.
- Add API endpoints and filters for usage history and knowledge base management.
- Update signals to automate usage history creation for knowledge base document events.
- Update .gitignore for new directories and artifacts.
- Add new dependencies for knowledge base and search (elasticsearch, aiohttp, dataclasses-json, etc.).
- Remove debugging code and improve formatting across updated files.

This commit introduces the Knowledge Base system, enabling teams to create, manage, and track knowledge bases and their documents with full plan-based credit enforcement and usage history auditing.
… UI, reset migrations

- Improved security by updating default OpenSearch password across configuration files (.env.example, .env.local, docker-compose files).
- Refactored factories to centralize OpenSearch client creation and added decryption logic for embedding API keys.
- Updated ProviderConfig model to use TextField for API keys, improving credential handling.
- Enhanced document status service to always clear the error field.
- Added Celery initializer task for OpenSearch pipeline setup on worker startup.
- Deleted all knowledge_base and llm migrations for migration reset and schema changes.
- Increased top_k default/max values for knowledge base queries in serializers and UI forms.
- Improved validation and filetype checking for uploads.
- Major frontend improvements:
  - Revamped KnowledgeBase pages with beta notices and clarified document labels.
  - Redesigned crawl, sitemap, and crawl results selection pages; switched to table layouts and added pagination.
  - Enhanced selection state handling for crawl import, supporting cross-page result selection and total counters.
  - Updated API usage patterns and documentation to reflect new query signature.
  - Refined feedback messages, added enterprise-only fields, and improved loading/empty states in history and selection pages.
- Removed legacy/commented code and improved code formatting in several backend/frontend files.

BREAKING CHANGE:
- All Django migration files for knowledge_base and llm were deleted.
@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Aug 22, 2025
@amirasaran amirasaran changed the title Feature/knowledge base feat: Introduce Knowledge Base feature with plan-based credit tracking and usage history Aug 22, 2025
@dosubot dosubot bot added the 🧠 feat:workflow Smart crawl planning, route building label Aug 22, 2025
@qodo-code-review qodo-code-review bot changed the title feat: Introduce Knowledge Base feature with plan-based credit tracking and usage history Feature/knowledge base Aug 22, 2025
@qodo-code-review
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

In MMR ranking, embeddings for documents are generated by calling embed_query on each document's text on every call, which is inefficient and may be semantically wrong (embed_documents or stored vectors should be used instead). The selection logic also uses documents.index(documents[i]), a redundant identity lookup that mis-indexes when duplicate documents are present; verify correctness and performance for large k.

)

# Get embeddings for all documents
doc_embeddings = []
for doc in documents:
    doc_embedding = self.embedding.embed_query(doc.page_content)
    doc_embeddings.append(doc_embedding)

# Convert to numpy arrays
query_emb = np.array(query_embedding)
doc_embs = np.array(doc_embeddings)

# Calculate similarities to query
query_similarities = np.dot(doc_embs, query_emb) / (
    np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
)

selected = []
remaining = list(range(len(documents)))

# Select first document (highest similarity to query)
best_idx = np.argmax(query_similarities)
selected.append(remaining.pop(best_idx))

# Select remaining documents using MMR
for _ in range(min(k - 1, len(remaining))):
    mmr_scores = []

    for idx in remaining:
        # Relevance score
        relevance = query_similarities[idx]

        # Diversity score (max similarity to already selected)
        if selected:
            selected_embs = doc_embs[
                [documents.index(documents[i]) for i in selected]
            ]
            current_emb = doc_embs[idx]
            similarities = np.dot(selected_embs, current_emb) / (
                np.linalg.norm(selected_embs, axis=1)
                * np.linalg.norm(current_emb)
            )
            max_similarity = np.max(similarities)
        else:
            max_similarity = 0

        # MMR score
        mmr_score = lambda_mult * relevance - (1 - lambda_mult) * max_similarity
        mmr_scores.append(mmr_score)

    # Select document with highest MMR score
    best_idx = np.argmax(mmr_scores)
    selected.append(remaining.pop(best_idx))

return [documents[i] for i in selected]
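As a side note on the per-document embed_query calls flagged above: the LangChain Embeddings interface exposes embed_documents for batch embedding of texts. A minimal sketch of batching the embedding work once (the `embedding` and document objects here are stand-ins mirroring the snippet, not the PR's actual classes):

```python
def embed_for_mmr(embedding, query, documents):
    """Embed the query and all document texts in one batch call.

    `embedding` is assumed to implement LangChain's Embeddings interface
    (embed_query for a single query string, embed_documents for a list of texts).
    """
    query_emb = embedding.embed_query(query)
    # One batched call instead of N embed_query calls in a loop.
    doc_embs = embedding.embed_documents([doc.page_content for doc in documents])
    return query_emb, doc_embs
```

Batching also lets the embedding backend amortize request overhead, which matters when k and the candidate pool are large.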
Logic Consistency

The knowledge base documents limit check gates the unlimited (-1) case on number_of_knowledge_bases instead of the per-KB documents field. It should likely reference number_of_each_knowledge_base_documents; otherwise an unlimited knowledge base count erroneously grants unlimited documents per knowledge base.

if self.team_plan_service.number_of_knowledge_bases == -1:
    return

total_number_of_documents = (
    knowledge_base.documents.count() + new_document_count
)

if (
    total_number_of_documents
    >= self.team_plan_service.number_of_each_knowledge_base_documents
):
    raise PermissionDenied(
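A minimal sketch of the corrected gate under that reading, with hypothetical names mirroring the snippet (note the snippet's `>=` also rejects reaching the limit exactly, so `>` is used here):

```python
def validate_document_quota(current_count, new_count, max_docs_per_kb):
    """Raise if adding new_count documents would exceed the per-KB limit.

    Hypothetical helper: gates on the per-KB documents quota, not the
    KB-count quota. A limit of -1 means unlimited.
    """
    if max_docs_per_kb == -1:
        return
    if current_count + new_count > max_docs_per_kb:
        raise PermissionError(
            f"Plan allows at most {max_docs_per_kb} documents per knowledge base."
        )
```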
Validation Flow

In FillKnowledgeBaseFromCrawlResultsSerializer.validate, the field validator replaces attrs["crawl_result_uuids"] with a queryset, on which .count() is later called during credit validation; ensure downstream services consuming these attrs expect a queryset rather than a list of UUIDs, to avoid type mismatches.

def validate(self, attrs):
    return PlanLimitValidator(
        team=self.context["team"],
    ).validate_create_knowledge_base_document_from_crawl_results(
        self.context["knowledge_base"], attrs["crawl_result_uuids"].count(), attrs
    )

@qodo-code-review
Copy link
Contributor

qodo-code-review bot commented Aug 22, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category    Suggestion                                                                              Impact
Possible issue
Fix invalid nested router paths

DRF's DefaultRouter does not support nested regex paths in router.register;
these patterns won't route as intended. Use a nested router (e.g.,
drf-nested-routers) or move nested segments into the viewset's lookup and use
standard register prefixes.

backend/knowledge_base/urls.py [6-19]

 router = DefaultRouter()
-router.register(
-    r"knowledge-bases", views.KnowledgeBaseViewSet, basename="knowledge-base"
-)
-router.register(
-    r"knowledge-bases/(?P<knowledge_base_uuid>[0-9a-fA-F-]{36})/documents",
-    views.KnowledgeBaseDocumentViewSet,
-    basename="knowledge-base-document",
-)
-router.register(
-    r"knowledge-bases/(?P<knowledge_base_uuid>[0-9a-fA-F-]{36})/documents/(?P<document_uuid>[0-9a-fA-F-]{36})/chunks",
-    views.KnowledgeBaseChunkViewSet,
-    basename="knowledge-base-chunk",
-)
+router.register(r"knowledge-bases", views.KnowledgeBaseViewSet, basename="knowledge-base")
+router.register(r"documents", views.KnowledgeBaseDocumentViewSet, basename="knowledge-base-document")
+router.register(r"chunks", views.KnowledgeBaseChunkViewSet, basename="knowledge-base-chunk")
Suggestion importance[1-10]: 10


Why: The suggestion correctly identifies that DefaultRouter does not support nested regex paths, which would make the defined document and chunk endpoints completely non-functional.

High
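For the nested-router option the suggestion mentions, a sketch using drf-nested-routers could look like the following (viewset names are taken from the PR snippet; the package and its NestedDefaultRouter are an assumed dependency, not something the PR currently uses):

```python
# pip install drf-nested-routers
from rest_framework.routers import DefaultRouter
from rest_framework_nested.routers import NestedDefaultRouter

from knowledge_base import views

router = DefaultRouter()
router.register(
    r"knowledge-bases", views.KnowledgeBaseViewSet, basename="knowledge-base"
)

# Nest documents under a knowledge base; this yields a `knowledge_base_pk`
# URL kwarg on the document viewset.
kb_router = NestedDefaultRouter(router, r"knowledge-bases", lookup="knowledge_base")
kb_router.register(
    r"documents", views.KnowledgeBaseDocumentViewSet,
    basename="knowledge-base-document",
)

# Nest chunks one level deeper under a document.
doc_router = NestedDefaultRouter(kb_router, r"documents", lookup="document")
doc_router.register(
    r"chunks", views.KnowledgeBaseChunkViewSet, basename="knowledge-base-chunk"
)

urlpatterns = router.urls + kb_router.urls + doc_router.urls
```

This keeps the nested URL shape of the original design while using register prefixes the router machinery actually supports.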
Fix MMR indexing and stability

The MMR implementation recomputes indexes via documents.index(...) and can
mis-index, and it recomputes norms repeatedly, risking division by zero.
Precompute arrays and norms once, use indices directly, and guard zero norms.
This fixes incorrect selection and potential crashes for identical/zero vectors.

backend/knowledge_base/tools/vectore_store.py [350-414]

 def _apply_mmr_ranking(
-    self, documents: List[Document], query: str, k: int, lambda_mult: float
+    self, documents: List[Document], query: str | List[float], k: int, lambda_mult: float
 ) -> List[Document]:
-    """Apply MMR ranking to documents."""
     if not documents or len(documents) <= k:
         return documents
 
-    # Generate query embedding
-    query_embedding = (
-        self.embedding.embed_query(query) if isinstance(query, str) else query
-    )
+    # Prepare embeddings
+    query_embedding = self.embedding.embed_query(query) if isinstance(query, str) else query
+    doc_embs = np.array([self.embedding.embed_query(doc.page_content) for doc in documents], dtype=float)
+    query_emb = np.array(query_embedding, dtype=float)
 
-    # Get embeddings for all documents
-    doc_embeddings = []
-    for doc in documents:
-        doc_embedding = self.embedding.embed_query(doc.page_content)
-        doc_embeddings.append(doc_embedding)
+    # Guard against zero norms
+    doc_norms = np.linalg.norm(doc_embs, axis=1)
+    doc_norms[doc_norms == 0] = 1e-12
+    query_norm = np.linalg.norm(query_emb)
+    if query_norm == 0:
+        query_norm = 1e-12
 
-    # Convert to numpy arrays
-    query_emb = np.array(query_embedding)
-    doc_embs = np.array(doc_embeddings)
+    # Similarity to query
+    query_similarities = (doc_embs @ query_emb) / (doc_norms * query_norm)
 
-    # Calculate similarities to query
-    query_similarities = np.dot(doc_embs, query_emb) / (
-        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
-    )
+    selected: list[int] = []
+    remaining: list[int] = list(range(len(documents)))
 
-    selected = []
-    remaining = list(range(len(documents)))
+    # First pick
+    first_idx = int(np.argmax(query_similarities))
+    selected.append(first_idx)
+    remaining.remove(first_idx)
 
-    # Select first document (highest similarity to query)
-    best_idx = np.argmax(query_similarities)
-    selected.append(remaining.pop(best_idx))
+    # Iteratively pick with MMR
+    for _ in range(min(k - 1, len(remaining))):
+        max_sim_to_selected = np.zeros(len(remaining))
+        if selected:
+            selected_embs = doc_embs[selected]
+            selected_norms = np.linalg.norm(selected_embs, axis=1)
+            selected_norms[selected_norms == 0] = 1e-12
 
-    # Select remaining documents using MMR
-    for _ in range(min(k - 1, len(remaining))):
-        mmr_scores = []
+            # Compute cosine similarity between each remaining and selected, take max
+            rem_embs = doc_embs[remaining]
+            rem_norms = doc_norms[remaining]
+            sims = (rem_embs @ selected_embs.T) / (rem_norms[:, None] * selected_norms[None, :])
+            max_sim_to_selected = sims.max(axis=1)
 
-        for idx in remaining:
-            # Relevance score
-            relevance = query_similarities[idx]
+        relevance = query_similarities[remaining]
+        mmr_scores = lambda_mult * relevance - (1 - lambda_mult) * max_sim_to_selected
+        pick_pos = int(np.argmax(mmr_scores))
+        pick_idx = remaining[pick_pos]
+        selected.append(pick_idx)
+        remaining.pop(pick_pos)
 
-            # Diversity score (max similarity to already selected)
-            if selected:
-                selected_embs = doc_embs[
-                    [documents.index(documents[i]) for i in selected]
-                ]
-                current_emb = doc_embs[idx]
-                similarities = np.dot(selected_embs, current_emb) / (
-                    np.linalg.norm(selected_embs, axis=1)
-                    * np.linalg.norm(current_emb)
-                )
-                max_similarity = np.max(similarities)
-            else:
-                max_similarity = 0
+    return [documents[i] for i in selected[:k]]
 
-            # MMR score
-            mmr_score = lambda_mult * relevance - (1 - lambda_mult) * max_similarity
-            mmr_scores.append(mmr_score)
-
-        # Select document with highest MMR score
-        best_idx = np.argmax(mmr_scores)
-        selected.append(remaining.pop(best_idx))
-
-    return [documents[i] for i in selected]
-
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies and fixes multiple critical issues in the _apply_mmr_ranking method, including a potential crash from division by zero and incorrect results with duplicate documents, while also improving performance.

High
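The core of that fix (zero-norm guards plus index-based MMR selection) can be demonstrated without numpy. A pure-Python sketch over precomputed embeddings, with hypothetical helper names:

```python
import math

def safe_cosine(a, b, eps=1e-12):
    """Cosine similarity with a zero-norm guard (eps floor, as in the suggestion)."""
    na = math.sqrt(sum(x * x for x in a)) or eps
    nb = math.sqrt(sum(x * x for x in b)) or eps
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def mmr_select(query_emb, doc_embs, k, lambda_mult=0.5):
    """Select up to k indices by Maximal Marginal Relevance.

    Works on indices directly, so duplicate documents cannot mis-index
    the way documents.index(...) does.
    """
    if not doc_embs:
        return []
    relevance = [safe_cosine(query_emb, d) for d in doc_embs]
    remaining = list(range(len(doc_embs)))
    first = max(remaining, key=lambda i: relevance[i])
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Penalize similarity to anything already selected.
            max_sim = max(safe_cosine(doc_embs[i], doc_embs[j]) for j in selected)
            return lambda_mult * relevance[i] - (1 - lambda_mult) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lambda_mult below 0.5, a duplicate of an already-selected document scores worse than a diverse one, which is exactly the failure mode the original index-lookup code could mask.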
Fix duplicated file extensions in paths

The generated path currently duplicates the file extension (e.g.,
"file.csv.csv"). Either keep the original filename as-is or strip the extension
before re-appending it. This prevents incorrect paths and broken file retrieval later.

backend/knowledge_base/tools/storage.py [7-22]

 class StorageFile:
-    uuid: str
     name: str
     path: str
 
     def __init__(self, unique_id: str, name: str):
         self.unique_id = unique_id
         self.name = name
 
     @property
     def extension(self):
-        return self.name.split(".")[-1]
+        return self.name.rsplit(".", 1)[-1] if "." in self.name else ""
+
+    @property
+    def basename(self):
+        return self.name.rsplit(".", 1)[0] if "." in self.name else self.name
 
     def make_path(self, knowledge_base_uuid):
-        self.path = f"knowledge_base/{knowledge_base_uuid}/{self.unique_id}/{self.name}.{self.extension}"
+        if self.extension:
+            filename = f"{self.basename}.{self.extension}"
+        else:
+            filename = self.basename
+        self.path = f"knowledge_base/{knowledge_base_uuid}/{self.unique_id}/{filename}"
         return self
Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a bug in make_path where the file extension is appended to the full filename, resulting in incorrect paths like file.txt.txt, which would cause issues with file storage and retrieval.

Medium
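The simplest variant of that fix is to keep the original filename whole rather than re-appending its extension. A standalone sketch (hypothetical helper mirroring StorageFile.make_path):

```python
from pathlib import PurePosixPath

def make_storage_path(knowledge_base_uuid, unique_id, name):
    """Build the object path from the original filename without
    duplicating its extension."""
    return str(
        PurePosixPath("knowledge_base")
        / str(knowledge_base_uuid)
        / str(unique_id)
        / name  # the filename already carries its extension, if any
    )
```

Extension-less uploads (e.g. "README") also come out correctly, which the split-and-reappend approach has to special-case.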
Align stored and indexed chunk content

Keep vector-store page_content consistent with what you persist in DB. Currently
the DB chunk includes the summary, but the vector doc does not, causing
retrieval/traceability mismatches. Use the same chunk.content for both.

backend/knowledge_base/tools/processor.py [89-113]

 def persist_to_vector_store(self, document: KnowledgeBaseDocument) -> List[str]:
-    ...
+    self.remove_from_vector_store(document)
+    document.chunks.all().delete()
+    chunks = []
+    index = 1
+    summary = ""
+    if self.summarizer:
+        summary = self.summarizer.summarize(document.content)
     for chunk_text in self.text_splitter.split_text(document.content):
+        enriched_text = f"{summary}\n\n{chunk_text}" if summary else chunk_text
         chunk = KnowledgeBaseChunk(
             document=document,
             index=index,
-            content=f"{summary}\n\n{chunk_text}" if summary else chunk_text,
+            content=enriched_text,
             keywords=self.keyword_extractor.extract_keywords(chunk_text),
         )
         chunk.save()
         chunk_uuid = str(chunk.uuid)
         chunks.append(
             Document(
-                page_content=chunk_text,
+                page_content=enriched_text,
                 id=chunk_uuid,
                 metadata={
                     "index": index,
                     "title": document.title,
                     "uuid": chunk_uuid,
                     "source": document.source,
                     "knowledge_base_id": str(document.knowledge_base.uuid),
                     "document_id": str(document.uuid),
                     "keywords": chunk.keywords,
                 },
             )
         )
         index += 1
+    return self.vector_store.add_documents(chunks)

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies a critical inconsistency where the content saved to the database (KnowledgeBaseChunk) includes a summary, but the content indexed in the vector store does not, which would lead to retrieval mismatches and degrade search quality.

Medium
Ensure tool returns a string

_run returns None despite the type hint -> str, which will break tool pipelines
expecting a string result. Return a meaningful status or scraped output, and
handle exceptions to surface errors instead of failing silently.

backend/agent/tools/scraper.py [13-19]

 class ScrapperTool(BaseTool):
     name = "scrapper"
     description = "scrapper"
     args_schema: Type[BaseModel] = ScraperParameters
 
     def _run(self, url: str) -> str:
-        CrawlerService.make_with_urls([url], self.agent.knowledge_base.team).run()
+        try:
+            CrawlerService.make_with_urls([url], self.agent.knowledge_base.team).run()
+            return f"Scrape started for: {url}"
+        except Exception as e:
+            return f"Scrape failed for {url}: {e}"
Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies that the _run method violates its -> str type hint by returning None, which would cause a runtime issue in the LangChain tool pipeline.

Medium
High-level
Risky OpenSearch integration

The vector store assumes OpenSearch availability and creates indices/pipelines
at runtime with broad settings, but there’s no environment/health gating or
graceful degradation path. Add a clear abstraction to disable or swap retrieval
backends and gate all index/pipeline creation and queries behind connectivity
checks with fail-closed behavior to avoid startup/task crashes and partial data
corruption.

Examples:

backend/knowledge_base/tasks.py [90-126]
@shared_task
def initializer():
    vector_store_type = getattr(settings, "KB_VECTOR_STORE_TYPE", "opensearch")
    if vector_store_type == "opensearch":
        client = VectorStoreFactory.create_opensearch_client()
        if settings.DEBUG:
            # Set low watermark to 99%
            client.cluster.put_settings(
                body={
                    "persistent": {

 ... (clipped 27 lines)
backend/knowledge_base/tools/vectore_store.py [18-86]
    def __init__(
        self,
        opensearch_client: OpenSearch,
        index_name: str,
        embedding: Embeddings,
        retrieval_strategy: Optional[RetrievalStrategy] = None,
        text_field: str = "text",
        vector_field: str = "vector_field",
        similarity_metric: str = "l2",
    ):

 ... (clipped 59 lines)

Solution Walkthrough:

Before:

# backend/knowledge_base/tasks.py
@worker_ready.connect
def initializer_on_worker_ready(sender, **kwargs):
    client = VectorStoreFactory.create_opensearch_client()
    # This will crash the worker if OpenSearch is down
    client.transport.perform_request(
        "PUT",
        "/_search/pipeline/rrf-pipeline",
        body={...}
    )

# backend/knowledge_base/tools/vectore_store.py
class WaterCrawlOpenSearchVectorStore(VectorStore):
    def __init__(self, opensearch_client, ...):
        self.client = opensearch_client
        # This is called directly and can raise an exception
        self._create_index_if_not_exists()

    def _create_index_if_not_exists(self):
        try:
            if not self.client.indices.exists(...):
                self.client.indices.create(...)
        except Exception as e:
            logger.error(...)
            raise # Crashes the caller

After:

# backend/knowledge_base/health.py
class OpenSearchHealth:
    @staticmethod
    def is_available():
        try:
            client = VectorStoreFactory.create_opensearch_client()
            return client.ping()
        except Exception:
            return False

# backend/knowledge_base/tasks.py
@worker_ready.connect
def initializer_on_worker_ready(sender, **kwargs):
    if not OpenSearchHealth.is_available():
        logger.warning("OpenSearch is not available. Skipping pipeline creation.")
        return
    # ... create pipeline ...

# backend/knowledge_base/tools/vectore_store.py
class WaterCrawlOpenSearchVectorStore(VectorStore):
    def __init__(self, opensearch_client, ...):
        self.client = opensearch_client
        self.is_healthy = self.client.ping()
        if self.is_healthy:
            self._create_index_if_not_exists()
        else:
            logger.error("OpenSearch client is not healthy.")

    def similarity_search(self, query, ...):
        if not self.is_healthy:
            # Gracefully fail instead of crashing
            raise ServiceUnavailable("Vector store is currently unavailable.")
        # ... existing logic ...
Suggestion importance[1-10]: 9


Why: This suggestion correctly identifies a critical architectural flaw where the system's stability is tightly coupled to OpenSearch's availability, which can cause startup and runtime failures.

High
General
Let browser set multipart header

The endpoint returns 204 with no body, so awaiting and returning void is fine; however, the explicit multipart Content-Type header prevents the browser from setting the multipart boundary. Remove the manual Content-Type so the browser sets the header (including the boundary) itself, and return the response status so the caller can detect completion.

frontend/src/services/api/knowledgeBase.ts [87-107]

 async importFromFiles(
   knowledgeBaseUuid: string,
   files: File[],
   onUploadProgress: (progressEvent: any) => void
-) {
+): Promise<number> {
   const formData = new FormData();
   files.forEach(file => {
     formData.append('files', file);
   });
 
-  await api.post(
+  const resp = await api.post(
     `/api/v1/knowledge-base/knowledge-bases/${knowledgeBaseUuid}/documents/from-files/`,
     formData,
     {
-      headers: {
-        'Content-Type': 'multipart/form-data',
-      },
       onUploadProgress,
     }
   );
+  return resp.status;
 },

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7


Why: The suggestion correctly points out that manually setting the Content-Type for multipart/form-data is problematic and should be removed, which is a valid and important fix for file uploads.

Medium
Enforce status check before adding files

Validate the knowledge base status before accepting files to avoid ingesting
into inactive/deleted bases. Early-reject when can_add_documents() is false to
prevent inconsistent state and wasted processing.

backend/knowledge_base/services.py [87-103]

-class KnowledgeBaseService:
-    ...
-    def add_files(
-        self, files: List[TemporaryUploadedFile]
-    ) -> List[KnowledgeBaseDocument]:
-        documents = []
+def add_files(
+    self, files: List[TemporaryUploadedFile]
+) -> List[KnowledgeBaseDocument]:
+    if not self.can_add_documents():
+        raise ValueError("Cannot add documents to a non-active knowledge base.")
+    documents = []
+    storage_service = KnowledgeBaseStorageService.from_knowledge_base(self.knowledge_base)
+    for file in files:
+        storage_file = storage_service.save_file(file)
+        documents.append(
+            self.make_document(
+                title=storage_file.name,
+                source=storage_file.path,
+                source_type=consts.DOCUMENT_SOURCE_TYPE_FILE,
+            ),
+        )
+    return documents
 
-        for file in files:
-            storage_file = KnowledgeBaseStorageService.from_knowledge_base(
-                self.knowledge_base,
-            ).save_file(file)
-            documents.append(
-                self.make_document(
-                    title=storage_file.name,
-                    source=storage_file.path,
-                    source_type=consts.DOCUMENT_SOURCE_TYPE_FILE,
-                ),
-            )
-        return documents
-

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7


Why: The suggestion correctly points out that add_files lacks a status check, which could lead to adding documents to an inactive knowledge base. Adding the can_add_documents() check improves the robustness and correctness of the business logic.

Medium
Remove misleading field name

The filter declares field_name="content_type_filter" but filtering is
implemented via a custom method that targets content_type__app_label/model. The
mismatched field_name is misleading and can cause double filtering; set
field_name to None (or remove it) to rely solely on the method-driven filtering.

backend/plan/filters.py [8-19]

 class UsageHistoryFilter(django_filters.FilterSet):
     content_type = django_filters.ChoiceFilter(
-        field_name="content_type_filter",
         label=_("Content type"),
         choices=[
             ("core.crawlrequest", "Crawl request"),
             ("core.searchrequest", "Search request"),
             ("core.sitemaprequest", "Sitemap request"),
             ("knowledge_base.knowledgebasedocument", "Knowledge base document"),
         ],
         method="filter_content_type",
     )

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 5


Why: The suggestion correctly points out that field_name is redundant and misleading when a custom method is used for filtering, improving code clarity and preventing potential future bugs.

Low

amirasaran and others added 19 commits August 22, 2025 21:33
…ttings

- Introduce KNOWLEDGE_BASE_ENABLED flag to conditionally enable Knowledge Base features in backend and frontend.
- Rename and consolidate KB_* environment variables to KNOWLEDGE_BASE_*.
- Feature-gate API endpoints, signals, and background tasks for Knowledge Base using new flag.
- Conditionally render Knowledge Base navigation and routes in frontend based on settings.
- Update .env.example and documentation to reflect new variable names and startup instructions.
- Clean up variable usage for consistency and future maintainability.
Update Development branch
fix(SelectCrawlPage): handle empty crawl URL to prevent potential rendering issues
amirasaran and others added 30 commits November 4, 2025 00:22
Update with development branch

Labels

🧠 feat:workflow Smart crawl planning, route building Review effort 4/5 size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants