Feature/knowledge base #116
base: main
…g and usage history
- Add Knowledge Base as a new core feature:
  - Implement models, signals, and logic for managing knowledge bases and their documents.
  - Integrate KnowledgeBaseProcessor and related tooling.
- Extend Plan model to support knowledge base quotas:
  - Add fields for max number of knowledge bases, max documents per knowledge base, and retrieval rate limits.
- Refactor UsageHistory to support generic credit tracking:
  - Use GenericForeignKey to associate usage history with crawl, search, sitemap, and knowledge base document events.
  - Enforce UUID primary key for referenced models.
  - Update admin and serializer logic for new usage history structure.
- Update plan enforcement and validators:
  - Modularize validation for crawl, search, sitemap, and knowledge base operations.
  - Enforce plan limits for knowledge base creation and document addition.
  - Improve error messages and validation feedback.
- Add API endpoints and filters for usage history and knowledge base management.
- Update signals to automate usage history creation for knowledge base document events.
- Update .gitignore for new directories and artifacts.
- Add new dependencies for knowledge base and search (elasticsearch, aiohttp, dataclasses-json, etc.).
- Remove debugging code and improve formatting across updated files.

This commit introduces the Knowledge Base system, enabling teams to create, manage, and track knowledge bases and their documents with full plan-based credit enforcement and usage history auditing.
… UI, reset migrations
- Improved security by updating default OpenSearch password across configuration files (.env.example, .env.local, docker-compose files).
- Refactored factories to centralize OpenSearch client creation and added decryption logic for embedding API keys.
- Updated ProviderConfig model to use TextField for API keys, improving credential handling.
- Enhanced document status service to always clear the error field.
- Added Celery initializer task for OpenSearch pipeline setup on worker startup.
- Deleted all knowledge_base and llm migrations for migration reset and schema changes.
- Increased top_k default/max values for knowledge base queries in serializers and UI forms.
- Improved validation and filetype checking for uploads.
- Major frontend improvements:
  - Revamped KnowledgeBase pages with beta notices and clarified document labels.
  - Redesigned crawl, sitemap, and crawl results selection pages; switched to table layouts and added pagination.
  - Enhanced selection state handling for crawl import, supporting cross-page result selection and total counters.
  - Updated API usage patterns and documentation to reflect new query signature.
  - Refined feedback messages, added enterprise-only fields, and improved loading/empty states in history and selection pages.
- Removed legacy/commented code and improved code formatting in several backend/frontend files.

BREAKING CHANGE:
- All Django migration files for knowledge_base and llm were deleted.
…RL, update temperature handling in models
…ttings
- Introduce KNOWLEDGE_BASE_ENABLED flag to conditionally enable Knowledge Base features in backend and frontend.
- Rename and consolidate KB_* environment variables to KNOWLEDGE_BASE_*.
- Feature-gate API endpoints, signals, and background tasks for Knowledge Base using new flag.
- Conditionally render Knowledge Base navigation and routes in frontend based on settings.
- Update .env.example and documentation to reflect new variable names and startup instructions.
- Clean up variable usage for consistency and future maintainability.
…xt-aware enhancer
Update Development branch
Feature/knowledge base
fix(SelectCrawlPage): handle empty crawl URL to prevent potential rendering issues
Feature/knowledge base
Merge with main
Feature/knowledge base
Merge With main
Merge With Main
New Crowdin updates
Update with development branch
User description
Description
feat: Introduce Knowledge Base feature with plan-based credit tracking and usage history
This commit introduces the Knowledge Base system, enabling teams to create, manage, and track knowledge bases and their documents with full plan-based credit enforcement and usage history auditing.
Type of Change
UI Changes
Testing
Checklist
PR Type
Enhancement
Description
• Knowledge Base System: Complete implementation of a new knowledge base feature with document management, vector search, and AI-powered querying capabilities
• LLM Provider Management: Add comprehensive provider configuration system supporting OpenAI and WaterCrawl with model discovery and API key management
• Generic Usage Tracking: Refactor usage history to support multiple content types (crawl, search, sitemap, knowledge base) using GenericForeignKey
• Plan Extensions: Extend subscription plans with knowledge base quotas (max knowledge bases, documents per KB, retrieval rate limits)
• Vector Store Integration: Implement OpenSearch-based vector store with multiple retrieval strategies (similarity, MMR, hybrid search)
• Document Processing Pipeline: Add comprehensive document processing with text splitting, embedding, summarization, and keyword extraction
• Frontend Components: Complete React frontend with knowledge base management, provider configuration, and usage history interfaces
• Admin Interface: Add superuser-only admin panels for managing LLM providers, models, and system configurations
• API Endpoints: Comprehensive REST API for knowledge base operations, document import, querying, and provider management
• Background Tasks: Celery-based async processing for document indexing, vector store operations, and content processing
Diagram Walkthrough
File Walkthrough
14 files
knowledgeBase.ts
Knowledge Base API Service Implementation
frontend/src/services/api/knowledgeBase.ts
• Add comprehensive API service for knowledge base operations including CRUD operations, document management, and querying
• Implement methods for importing documents from URLs, crawl results, and files with upload progress tracking
• Add support for context-aware enhancement, chunk retrieval, and retry indexing functionality
provider.ts
Admin Provider Management API Service
frontend/src/services/api/admin/provider.ts
• Add admin API service for managing provider configurations, LLM models, and embedding models
• Implement CRUD operations with pagination support for all provider-related entities
• Add provider synchronization and configuration testing endpoints
knowledge.ts
Knowledge Base TypeScript Type Definitions
frontend/src/types/knowledge.ts
• Define comprehensive TypeScript interfaces for knowledge base entities including status enums and form data types
• Add interfaces for documents, chunks, context-aware enhancement, and import operations
• Include default values and utility functions for chunk size calculations
provider.ts
Provider Configuration API Service
frontend/src/services/api/provider.ts
• Add API service for provider configuration management with CRUD operations
• Implement provider listing, configuration testing, and model management endpoints
• Support both paginated and non-paginated provider configuration retrieval
provider.ts
Provider and Model Type Definitions
frontend/src/types/provider.ts
• Define TypeScript interfaces for providers, models, embeddings, and configurations
• Add enums for option requirements and form data structures
• Include comprehensive type definitions for LLM and embedding model properties
provider.ts
Admin Provider Type Definitions
frontend/src/types/admin/provider.ts
• Add admin-specific TypeScript interfaces for provider management
• Define visibility level enums and request/response types for LLM and embedding models
• Include comprehensive admin provider configuration interfaces
usage_history.ts
Usage History API Service
frontend/src/services/api/usage_history.ts
• Add API service for retrieving usage history with pagination and filtering support
• Implement filtering by team API key and content type parameters
usage_history.ts
Usage History Type Definitions
frontend/src/types/usage_history.ts
• Define TypeScript interfaces for usage history tracking across different content types
• Add content type enum for crawl requests, sitemaps, searches, and knowledge base documents
• Include team API key summary interface for usage attribution
vectore_store.py
OpenSearch Vector Store Implementation
backend/knowledge_base/tools/vectore_store.py
• Implement comprehensive OpenSearch vector store with pluggable retrieval strategies
• Add support for similarity search, MMR search, and hybrid retrieval methods
• Include automatic index creation, document management, and keyword-based scoring
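To illustrate the MMR (maximal marginal relevance) retrieval mentioned above, here is a minimal, framework-free sketch. It is not the code in `vectore_store.py`; the function names `cosine` and `mmr_select` and the `lambda_mult` parameter are assumptions chosen for illustration, and a real implementation would likely operate on vectors returned by OpenSearch.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k documents chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values penalise
    candidates that are similar to documents already selected.
    """
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate vectors and one distinct vector, a lower `lambda_mult` tends to skip the duplicate in favour of the distinct document, which is the behaviour that distinguishes MMR from plain similarity search.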
views.py
Knowledge Base API ViewSets
backend/knowledge_base/views.py
• Add comprehensive ViewSets for knowledge bases, documents, and chunks with full CRUD operations
• Implement document import from URLs, crawl results, files, and context-aware enhancement
• Add query functionality with rate limiting and plan validation integration
retrieval_strategies.py
Vector Store Retrieval Strategies
backend/knowledge_base/tools/retrieval_strategies.py
• Implement pluggable retrieval strategies for different search approaches (dense, content, keyword)
• Add BM25-optimized queries with hybrid search support and keyword boosting
• Include comprehensive OpenSearch index mapping generation with similarity metrics
serializers.py
Knowledge Base API Serializers
backend/knowledge_base/serializers.py
• Add comprehensive serializers for knowledge base entities with validation
• Implement form data serializers for document import from various sources
• Include context-aware enhancement and query serializers with plan validation integration
factories.py
Knowledge Base Component Factories
backend/knowledge_base/factories.py
• Implement factory pattern for creating knowledge base components (embedders, vector stores, summarizers)
• Add support for multiple providers (OpenAI, WaterCrawl) and file format converters
• Include configurable text splitters and keyword extractors with knowledge base integration
serializers.py
LLM Provider and Model Serializers
backend/llm/serializers.py
• Add serializers for LLM models, embedding models, and provider configurations
• Implement API key encryption/decryption and provider configuration testing
• Include comprehensive validation for provider settings and model parameters
49 files
utils.ts
CSS Class Name Utility Function
frontend/src/lib/utils.ts
• Add `classnames` utility function for conditional CSS class generation
• Implement object-based class name filtering and joining functionality
subscription.ts
Knowledge Base Subscription Fields
frontend/src/types/subscription.ts
• Extend `CurrentSubscription` interface with knowledge base quota fields
• Add properties for number of knowledge bases, documents per knowledge base, and retrieval rate limits
user.ts
User Profile Superuser Field
frontend/src/types/user.ts
• Add `is_superuser` boolean field to the `Profile` interface
• Extend user profile with administrative privilege indicator
validators.py
Modular Plan Validation System
backend/plan/validators.py
• Refactor plan validation into modular mixins for different request types
• Add comprehensive knowledge base validation for creation and document limits
• Implement generic credit validation with daily and total credit checking
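The daily and total credit check described above can be sketched roughly as follows. This is a hedged illustration, not the validator in `backend/plan/validators.py`: `CreditBalance`, `CreditError`, and `validate_credits` are hypothetical names, and treating `None` as "no daily limit" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


class CreditError(Exception):
    """Raised when a request would exceed the plan's credit allowance."""


@dataclass
class CreditBalance:
    # Hypothetical fields for illustration; None means "no daily limit".
    remaining_total: int
    remaining_daily: Optional[int] = None


def validate_credits(balance: CreditBalance, requested: int) -> None:
    """Raise CreditError if `requested` credits exceed the daily or total balance."""
    if requested <= 0:
        raise ValueError("requested credits must be positive")
    # Check the tighter, shorter-lived limit first so the message is specific.
    if balance.remaining_daily is not None and requested > balance.remaining_daily:
        raise CreditError(
            f"Daily credit limit exceeded: {requested} requested, "
            f"{balance.remaining_daily} remaining today"
        )
    if requested > balance.remaining_total:
        raise CreditError(
            f"Total credit limit exceeded: {requested} requested, "
            f"{balance.remaining_total} remaining"
        )
```

Keeping the check in one function that only needs a balance and a requested amount is what would let crawl, search, sitemap, and knowledge base validators share it.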
services.py
Generic Usage Tracking and Knowledge Base Quotas
backend/plan/services.py
• Extend plan services with knowledge base quota support and generic usage tracking
• Refactor usage history to support multiple content types via GenericForeignKey
• Add credit calculation and validation for knowledge base operations
views.py
Add admin API views for LLM management
backend/llm/admin_api/views.py
• Add comprehensive admin API views for LLM provider configurations, models, and embeddings
• Implement CRUD operations with OpenAPI documentation and proper permissions
• Include custom actions for syncing models/embeddings and testing configurations
services.py
Implement knowledge base core services
backend/knowledge_base/services.py
• Implement core knowledge base service classes for managing knowledge bases and documents
• Add methods for adding URLs, files, crawl results and processing documents
• Include document indexing, vector store operations and content processing logic
views.py
Add LLM provider configuration API views
backend/llm/views.py
• Add team-specific provider configuration API endpoints with full CRUD operations
• Implement provider testing, model listing, and configuration validation
• Include comprehensive OpenAPI documentation and proper authentication
models.py
Define knowledge base data models
backend/knowledge_base/models.py
• Define core knowledge base models: `KnowledgeBase`, `KnowledgeBaseDocument`, `KnowledgeBaseChunk`
• Add fields for chunking configuration, embedding models, and summarization settings
• Include status tracking and metadata fields with proper relationships
processor.py
Add knowledge base processing engine
backend/knowledge_base/tools/processor.py
• Implement main knowledge base processor for text splitting, embedding, and vector storage
• Add methods for document persistence, search operations, and vector store management
• Include factory pattern integration for various processing components
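As a rough picture of the text-splitting step, the sketch below cuts text into fixed-size chunks with overlap. The processor in this PR may well use a library splitter with separator awareness; `split_text`, `chunk_size`, and `chunk_overlap` are assumed names for illustration only.

```python
from typing import List


def split_text(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> List[str]:
    """Split text into character chunks where consecutive chunks share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, so no overlap-only tail is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of embedding some characters twice.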
0001_initial.py
Initial LLM database schema migration
backend/llm/migrations/0001_initial.py
• Create initial database schema for LLM models, provider configs, and embedding models
• Define relationships between models and teams with proper constraints
• Add visibility levels and temperature configuration fields
serializers.py
Add admin API serializers for LLM management
backend/llm/admin_api/serializers.py
• Add serializers for admin API with validation and encryption handling
• Implement provider configuration testing and API key encryption
• Include comprehensive field validation and error handling
models.py
Define LLM and provider configuration models
backend/llm/models.py
• Define LLM model classes: `LLMModel`, `ProviderConfig`, `EmbeddingModel`
• Add provider configuration with team relationships and global/team-specific logic
• Include temperature settings, visibility levels, and model metadata
0001_initial.py
Initial knowledge base database schema
backend/knowledge_base/migrations/0001_initial.py
• Create initial database schema for knowledge base models
• Define tables for knowledge bases, documents, and chunks with proper indexing
• Add status tracking and configuration fields for processing
tasks.py
Add knowledge base background tasks
backend/knowledge_base/tasks.py
• Implement Celery tasks for knowledge base operations: creation, deletion, crawling
• Add document processing pipeline with error handling and status updates
• Include vector store initialization and OpenSearch configuration
models.py
Extend plan model for knowledge base support
backend/plan/models.py
• Extend `Plan` model with knowledge base quotas and rate limiting fields
• Refactor `UsageHistory` to use generic foreign keys for flexible content association
• Add UUID validation for referenced models and team API key tracking
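The idea behind the generic association is that a usage row stores which kind of object it refers to plus that object's key, rather than one foreign key column per object type. The framework-free sketch below mirrors that shape and the UUID check. It deliberately avoids Django so it runs standalone; `UsageRecord` and `make_usage_record` are hypothetical names, not the `UsageHistory` model itself.

```python
import uuid
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class UsageRecord:
    content_type: str      # which kind of object was billed (label is illustrative)
    content_id: uuid.UUID  # primary key of that object
    credits: int


def make_usage_record(
    content_type: str, content_id: Union[str, uuid.UUID], credits: int
) -> UsageRecord:
    """Build a usage record, rejecting targets whose key is not a valid UUID."""
    if not isinstance(content_id, uuid.UUID):
        try:
            content_id = uuid.UUID(str(content_id))
        except ValueError as exc:
            raise ValueError(f"{content_type} must use a UUID primary key") from exc
    return UsageRecord(content_type, content_id, credits)
```

Requiring a UUID key is what allows a single `content_id` column type to point at crawl, search, sitemap, and knowledge base document rows alike.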
services.py
Add LLM provider service implementations
backend/llm/services.py
• Implement provider service classes for configuration management and testing
• Add OpenAI provider validation and model temperature handling
• Include team-specific provider configuration retrieval logic
summarizers.py
Add document summarization tools
backend/knowledge_base/tools/summarizers.py
• Implement LLM-based summarizers with standard and context-aware variants
• Add context enhancement service for improving user-provided goals
• Include prompt templates and temperature configuration
0006_auto_20250807_1344.py
Migrate usage history to generic relationships
backend/plan/migrations/0006_auto_20250807_1344.py
• Migrate existing usage history foreign key relationships to generic foreign keys
• Populate `content_type` and `content_id` fields from existing data
• Include reverse migration for rollback capability
admin.py
Add knowledge base Django admin interface
backend/knowledge_base/admin.py
• Add Django admin interface for knowledge base models
• Configure fieldsets, search fields, and list displays for better management
• Include proper field organization and readonly configurations
providers.py
Add LLM provider implementations
backend/llm/providers.py
• Implement provider classes for OpenAI and WaterCrawl with model discovery
• Add temperature configuration logic and embedding model definitions
• Include client initialization and API interaction methods
keyword_extractors.py
Add keyword extraction tools
backend/knowledge_base/tools/keyword_extractors.py
• Implement keyword extraction using Jieba and LLM-based approaches
• Add configurable keyword count and filtering logic
• Include Pydantic schema for structured LLM output parsing
factories.py
Add LLM factory classes
backend/llm/factories.py
• Implement factory classes for creating chat models and providers
• Add support for OpenAI and WaterCrawl provider configurations
• Include temperature validation and API key decryption
0002_initial.py
Add knowledge base model relationships
backend/knowledge_base/migrations/0002_initial.py
• Add foreign key relationships between knowledge base and LLM models
• Create proper constraints and indexes for model relationships
• Include team associations and provider configuration links
signals.py
Update usage tracking signals for generic approach
backend/plan/signals.py
• Refactor signal handlers to use generic usage history service methods
• Add knowledge base document usage tracking with credit calculation
• Update method names for consistency across different request types
file_to_markdown.py
Add file format conversion tools
backend/knowledge_base/tools/file_to_markdown.py
• Implement file-to-markdown converters for various formats (HTML, DOCX, CSV)
• Add base converter class with storage integration
• Include PyPandoc integration for document format conversion
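For the CSV case, a converter can be approximated with the standard library alone. The sketch below is an assumption-laden stand-in rather than the converter in this PR (which may delegate to PyPandoc); `csv_to_markdown` is a hypothetical name.

```python
import csv
import io
from typing import List


def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a pipe-delimited markdown table; the first row is the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, *body = rows

    def fmt(cells: List[str]) -> str:
        # Pad or truncate to the header width, and escape pipes inside cells.
        padded = (list(cells) + [""] * len(header))[: len(header)]
        return "| " + " | ".join(c.strip().replace("|", "\\|") for c in padded) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)
```

Converting tabular files to markdown before chunking keeps every source format on the same downstream path for splitting and embedding.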
helpers.py
Add content cleaning utilities
backend/knowledge_base/helpers.py
• Implement noise removal utility for cleaning markdown content
• Add methods for removing SVG, base64 images, HTML tags, and fixing relative URLs
• Include URL parsing and absolute path conversion logic
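The cleaning steps listed above could look roughly like this. It is a simplified sketch under stated assumptions, not the `NoiseRemover` class from the PR: the regular expressions are deliberately naive, and `clean_markdown` is a hypothetical name.

```python
import re
from urllib.parse import urljoin

# Naive patterns for illustration; real-world HTML may need a proper parser.
SVG_RE = re.compile(r"<svg\b.*?</svg>", re.IGNORECASE | re.DOTALL)
BASE64_IMG_RE = re.compile(r"!\[[^\]]*\]\(data:image/[^)]*\)")
HTML_TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")
MD_LINK_RE = re.compile(r"(\[[^\]]*\]\()([^)\s]+)(\))")


def clean_markdown(text: str, base_url: str) -> str:
    """Strip inline SVG, base64 images, and HTML tags, then absolutize link targets."""
    text = SVG_RE.sub("", text)
    text = BASE64_IMG_RE.sub("", text)
    text = HTML_TAG_RE.sub("", text)
    # urljoin resolves relative targets and leaves absolute URLs unchanged.
    text = MD_LINK_RE.sub(
        lambda m: m.group(1) + urljoin(base_url, m.group(2)) + m.group(3), text
    )
    return text.strip()
```

Removing SVG blocks before the generic tag pass matters: stripping tags first would leave the SVG's inner text behind as noise.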
views.py
Add usage history API endpoint
backend/plan/views.py
• Add usage history API endpoint with filtering and team-based access
• Include proper queryset definitions for existing viewsets
• Add comprehensive OpenAPI documentation for new endpoints
models.py
Define agent system data models
backend/agent/models.py
• Define agent system models: `Agent`, `Tool`, `Conversation`, `Message`
• Add relationships with LLM models, provider configs, and teams
• Include configuration fields for agent behavior and tool integration
interfaces.py
Define knowledge base component interfaces
backend/knowledge_base/interfaces.py
• Define abstract base classes for knowledge base components
• Add interfaces for summarizers, keyword extractors, and file converters
• Include proper inheritance structure and method signatures
throttle.py
Add team-based rate throttling
backend/plan/throttle.py
• Implement team-based throttling for knowledge base operations
• Add configurable rate limiting based on team plan settings
• Include cache key generation and rate limit enforcement
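A team-scoped throttle driven by a rate string such as `10/min` can be sketched as below. This is an in-memory illustration, not the throttle class in `backend/plan/throttle.py`, which presumably builds on Django REST framework and a shared cache; `parse_rate`, `TeamThrottle`, and the cache key format are assumptions.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

_PERIOD_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_rate(rate: Optional[str]) -> Optional[Tuple[int, int]]:
    """Parse '<count>/<period>' into (count, window_seconds); None disables limiting."""
    if not rate:
        return None
    count, period = rate.split("/")
    # Only the first letter of the period matters: 'min', 'm', 'minute' all work.
    return int(count), _PERIOD_SECONDS[period[0].lower()]


class TeamThrottle:
    """Sliding-window limiter keyed by team, with the plan's rate string as input."""

    def __init__(self, rate: Optional[str], clock: Callable[[], float] = time.monotonic):
        self.limit = parse_rate(rate)
        self.clock = clock
        self.history: Dict[str, List[float]] = {}  # stand-in for a shared cache

    @staticmethod
    def cache_key(team_id: str) -> str:
        # Hypothetical key format; one bucket per team.
        return f"throttle_kb_retrieval_{team_id}"

    def allow(self, team_id: str) -> bool:
        if self.limit is None:
            return True
        count, window = self.limit
        now = self.clock()
        key = self.cache_key(team_id)
        # Keep only timestamps that still fall inside the window.
        recent = [t for t in self.history.get(key, []) if t > now - window]
        allowed = len(recent) < count
        if allowed:
            recent.append(now)
        self.history[key] = recent
        return allowed
```

Keying the bucket by team rather than by user or IP means every API key belonging to one team draws from the same retrieval allowance.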
admin.py
Add LLM Django admin interface
backend/llm/admin.py
• Add Django admin interface for LLM models and configurations
• Configure list displays, search fields, and fieldset organization
• Include proper field grouping and readonly configurations
serializers.py
Extend plan serializers for knowledge base
backend/plan/serializers.py
• Extend team plan serializer with knowledge base quota fields
• Add usage history serializer with content type and API key information
• Include proper field serialization and relationship handling
services.py
Add admin services for provider management
backend/llm/admin_api/services.py
• Implement admin service for provider configuration management
• Add automatic model and embedding synchronization logic
• Include provider factory integration for model discovery
storage.py
Add knowledge base file storage service
backend/knowledge_base/tools/storage.py
• Implement storage service for knowledge base file management
• Add file path generation and storage abstraction
• Include unique ID generation and file extension handling
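Path generation with a unique ID and preserved extension might look like the following. The directory layout and the name `build_storage_path` are assumptions made for illustration; the actual storage service may organise files differently.

```python
import uuid
from pathlib import PurePosixPath


def build_storage_path(knowledge_base_id: str, original_name: str) -> str:
    """Return a collision-resistant storage path that keeps the file's extension."""
    # Normalise the extension so 'Report.PDF' and 'report.pdf' are treated alike.
    ext = PurePosixPath(original_name).suffix.lower()
    unique_id = uuid.uuid4().hex
    return str(PurePosixPath("knowledge_base") / knowledge_base_id / f"{unique_id}{ext}")
```

Discarding the user-supplied base name avoids both collisions between uploads and awkward characters in storage keys, while the extension stays available for choosing a converter.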
0005_usagehistory_content_id_usagehistory_content_type_and_more.py
Add generic foreign key fields to usage history
backend/plan/migrations/0005_usagehistory_content_id_usagehistory_content_type_and_more.py
• Add generic foreign key fields to usage history model
• Include content type and content ID fields for flexible associations
• Add team API key relationship for tracking usage attribution
filters.py
Add usage history filtering capabilities
backend/plan/filters.py
• Implement usage history filtering by content type and API key
• Add validation for content type format and proper error handling
• Include support for filtering across different model types
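Validating a content type filter value before using it in a query can be sketched as follows. The `app_label.model` format is an assumption about what the filter accepts, and `parse_content_type` is a hypothetical helper rather than code from `backend/plan/filters.py`.

```python
import re
from typing import Tuple

# Assumed format: "<app_label>.<model>", lowercase identifiers on both sides.
CONTENT_TYPE_RE = re.compile(r"^[a-z_][a-z0-9_]*\.[a-z_][a-z0-9_]*$")


def parse_content_type(value: str) -> Tuple[str, str]:
    """Split 'app_label.model' into its parts, raising ValueError on bad input."""
    normalized = value.strip().lower()
    if not CONTENT_TYPE_RE.match(normalized):
        raise ValueError("content_type must look like 'app_label.model'")
    app_label, model = normalized.split(".")
    return app_label, model
```

Rejecting malformed values up front lets the API return a clear validation error instead of an empty result set or a lookup failure.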
0004_plan_number_of_each_knowledge_base_documents_and_more.py
Add knowledge base quotas to plan model
backend/plan/migrations/0004_plan_number_of_each_knowledge_base_documents_and_more.py
• Add knowledge base quota fields to plan model
• Include number of knowledge bases and documents per knowledge base limits
• Set default values for new plan configuration fields
serializers.py
Add validation to spider option serializers
backend/core/serializers.py
• Add minimum value validation to spider option fields
• Ensure `max_depth` and `page_limit` have minimum value of 1
• Improve input validation for crawling parameters
0008_plan_knowledge_base_retrival_rate_limit.py
Add knowledge base retrieval rate limiting
backend/plan/migrations/0008_plan_knowledge_base_retrival_rate_limit.py
• Add rate limiting field to plan model for knowledge base retrieval
• Include DRF-style rate string format for flexible rate configuration
• Set default value to None for optional rate limiting
0007_remove_usagehistory_crawl_request_and_more.py
Remove deprecated usage history foreign keys
backend/plan/migrations/0007_remove_usagehistory_crawl_request_and_more.py
• Remove old foreign key fields from usage history model
• Clean up deprecated crawl_request, search_request, and sitemap_request fields
• Complete migration to generic foreign key approach
admin.py
Update plan admin for knowledge base features
backend/plan/admin.py
• Add knowledge base quota fields to plan admin interface
• Update usage history admin display to show generic content
• Include new fields in plan fieldset organization
decorators.py
Add API key context tracking in authentication decorator
backend/user/decorators.py
• Import and call `set_application_context_api_key` function to store API key in application context
• Add context tracking for API key usage during authentication
0002_alter_providerconfig_api_key.py
Expand API key field size in ProviderConfig model
backend/llm/migrations/0002_alter_providerconfig_api_key.py
• Change `api_key` field type from CharField to TextField in ProviderConfig model
• Allow for longer API keys storage
application_context.py
Implement application context for API key tracking
backend/common/application_context.py
• Create application context management using Django's Local storage
• Add functions to set, get, and clear API key context
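The set/get/clear pattern can be illustrated with the standard library's `contextvars`, swapped in here for the local storage the PR uses so that the example runs standalone. Only `set_application_context_api_key` is named in this PR; the getter and clearer names below are assumptions.

```python
from contextvars import ContextVar
from typing import Optional

# One context variable per piece of request-scoped state.
_api_key: ContextVar[Optional[str]] = ContextVar("application_api_key", default=None)


def set_application_context_api_key(api_key: Optional[str]) -> None:
    """Remember which API key authenticated the current request."""
    _api_key.set(api_key)


def get_application_context_api_key() -> Optional[str]:
    """Return the API key for the current context, or None if unset."""
    return _api_key.get()


def clear_application_context_api_key() -> None:
    """Reset the context so a reused worker does not leak the previous key."""
    _api_key.set(None)
```

Storing the key in request-scoped context is what would let code far from the view layer, such as a signal handler writing usage history, attribute usage to the right API key without threading it through every call.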
permissions.py
Add superuser permission class
backend/user/permissions.py
• Add `IsSuperUser` permission class for superuser-only access
serializers.py
Include superuser status in user serialization
backend/user/serializers.py
• Add `is_superuser` field to user serializer fields list
markdown.css
Add markdown styling with syntax highlighting support
frontend/src/styles/markdown.css
• Add comprehensive CSS styling for markdown content with syntax highlighting
• Include both light and dark theme support for code blocks and markdown elements
SelectCrawlResultsPage.tsx
Add crawl results selection page for knowledge base import
frontend/src/pages/dashboard/knowledge-base/SelectCrawlResultsPage.tsx
• Create comprehensive page for selecting crawl results to import into knowledge base
• Implement pagination, bulk selection, and import functionality
• Add breadcrumb navigation and loading states
2 files
views.py
Fix viewset queryset definitions and cleanup
backend/core/views.py
• Add missing `queryset` attributes to existing viewsets for consistency
• Remove debugging print statement from proxy server testing
• Import additional models for proper type hints
views.py
Fix user viewset queryset definitions
backend/user/views.py
• Add missing `queryset` attributes to existing viewsets for consistency
• Import additional models for proper type hints
• Maintain existing functionality while fixing queryset definitions
10 files
consts.py
Add LLM configuration constants
backend/llm/consts.py
• Define LLM provider constants, choices, and configuration options
• Add visibility levels, truncation options, and provider information
• Include structured provider configuration with required/optional fields
consts.py
Add knowledge base configuration constants
backend/knowledge_base/consts.py
• Define knowledge base status constants and document source types
• Add summarizer type choices and processing status options
• Include comprehensive choice definitions for model fields
urls.py
Add knowledge base URL routing
backend/knowledge_base/urls.py
• Define URL routing for knowledge base API endpoints
• Add nested routing for documents and chunks within knowledge bases
• Include proper UUID parameter handling in URL patterns
urls.py
Add usage history URL routing
backend/plan/urls.py
• Add usage history endpoint to plan URL routing
• Include proper router registration for new viewset
• Import required view classes for URL configuration
urls.py
Add LLM admin API URL routing
backend/llm/admin_api/urls.py
• Define URL routing for LLM admin API endpoints
• Add router registration for provider configs, models, and embeddings
• Include proper basename configuration for API endpoints
urls.py
Add knowledge base and LLM URL routing
backend/watercrawl/urls.py
• Add URL patterns for knowledge base and LLM endpoints
• Include admin API routes for LLM management
apps.py
Add knowledge base Django app configuration
backend/knowledge_base/apps.py
• Create Django app configuration for knowledge base module
• Import signals module in ready method
apps.py
Add LLM Django app configuration
backend/llm/apps.py
• Create Django app configuration for LLM module
apps.py
Add agent Django app configuration
backend/agent/apps.py
• Create Django app configuration for agent module
.env.example
Add OpenSearch configuration to environment template
docker/.env.example
• Add OpenSearch configuration settings with password and dashboard port
• Fix trailing whitespace formatting
4 files
test.py
Add noise removal test script
backend/test.py
• Add test script for noise removal functionality
• Include sample text processing and output verification
• Demonstrate usage of `NoiseRemover` helper class
tests.py
Add knowledge base test file placeholder
backend/knowledge_base/tests.py
• Create empty test file placeholder for knowledge base module
tests.py
Add LLM test file placeholder
backend/llm/tests.py
• Create empty test file placeholder for LLM module
tests.py
Add agent test file placeholder
backend/agent/tests.py
• Create empty test file placeholder for agent module
2 files
admin.py
Add agent admin file placeholder
backend/agent/admin.py
• Create empty admin file placeholder for agent module
views.py
Add agent views file placeholder
backend/agent/views.py
• Create empty views file placeholder for agent module
1 file
pnpm-lock.yaml
Add markdown rendering dependencies to package lock
frontend/pnpm-lock.yaml
• Add dependencies for markdown rendering: `@tailwindcss/typography`, `react-markdown`, `rehype-highlight`, `rehype-raw`, `remark-gfm`
• Include all related transitive dependencies and type definitions
86 files