- System Overview
- Architecture Components
- Data Flow Diagrams
- Backend Services
- AI System Components
- Database Architecture
- API Endpoints
- Multilingual Support
- Configuration
- Deployment
## System Overview

Avokat AI is a legal document analysis system that combines PDF processing, knowledge graph construction, and multilingual AI-powered chat capabilities. The system is designed to provide grounded legal assistance by analyzing uploaded documents and creating session-isolated knowledge graphs.

Key features:
- Session Isolation: Each chat session maintains its own knowledge graph and document context
- Multilingual Support: Automatic language detection for Arabic, English, and mixed-language documents
- Knowledge Graph: Neo4j-based graph storage with entity and relationship extraction
- Streaming Chat: Real-time response generation with Server-Sent Events (SSE)
- Document Processing: PyMuPDF-based PDF text extraction and chunking
- Legal Disclaimer: Built-in legal disclaimers in all responses
## Architecture Components

```mermaid
graph TB
subgraph "Client Layer"
UI[React Frontend]
end
subgraph "API Layer"
API[FastAPI Backend]
CORS[CORS Middleware]
end
subgraph "Service Layer"
PDF[PDF Processor]
LD[Language Detector]
KG[Knowledge Graph Builder]
LLM[LLM Service]
RET[Retrieval Service]
EMB[Embedding Service]
end
subgraph "Data Layer"
SQLITE[(SQLite Database)]
NEO4J[(Neo4j Aura Cloud)]
FILES[File Storage]
end
subgraph "External Services"
GEMINI[Gemini 2.5 Flash]
end
UI -->|HTTPS/SSE| API
API --> CORS
API --> PDF
API --> LD
API --> KG
API --> LLM
API --> RET
API --> EMB
PDF --> FILES
LD --> KG
KG --> NEO4J
KG --> GEMINI
LLM --> GEMINI
RET --> NEO4J
EMB --> KG
API --> SQLITE
classDef client fill:#e1f5fe
classDef api fill:#f3e5f5
classDef service fill:#e8f5e8
classDef data fill:#fff3e0
classDef external fill:#ffebee
class UI client
class API,CORS api
class PDF,LD,KG,LLM,RET,EMB service
class SQLITE,NEO4J,FILES data
class GEMINI external
```
## Data Flow Diagrams

```mermaid
sequenceDiagram
participant U as User
participant F as Frontend
participant A as FastAPI
participant S as SQLite
participant P as PDF Processor
participant L as Language Detector
participant K as KG Builder
participant N as Neo4j
participant G as Gemini LLM
participant R as Retrieval Service
participant E as Embedding Service
Note over U,E: Session Creation Flow
U->>F: Create Session
F->>A: POST /sessions
A->>S: Create session record
S-->>A: Session ID
A-->>F: {session_id}
F-->>U: Session created
Note over U,E: Document Upload & Processing Flow
U->>F: Upload PDF
F->>A: POST /ingest (multipart)
A->>S: Validate session
A->>P: Process PDF
P->>P: Extract text & chunk
P-->>A: Document chunks
A->>L: Detect language
L-->>A: Language (arabic/english/mixed)
A->>K: Extract knowledge graph
K->>G: Generate entities/relationships
G-->>K: Structured data
K->>N: Store nodes & relationships
K->>E: Generate embeddings
E-->>K: Embedding vectors
K->>N: Store document chunks with embeddings
A->>S: Record upload metadata
A-->>F: Processing complete
F-->>U: Upload successful
Note over U,E: Chat Flow
U->>F: Send message
F->>A: POST /chat
A->>S: Store user message
A->>R: Retrieve relevant context
R->>N: Query knowledge graph
N-->>R: Entities & relationships
R->>N: Get document chunks
N-->>R: Context chunks
R-->>A: Retrieval result
A->>L: Detect query language
L-->>A: Query language
A->>G: Generate response (streaming)
G-->>A: Response chunks
A->>S: Store assistant message
A-->>F: Stream response
F-->>U: Display response
```
Document processing pipeline:

```mermaid
flowchart TD
START[PDF Upload] --> VALIDATE{Validate File}
VALIDATE -->|Invalid| ERROR[Return Error]
VALIDATE -->|Valid| EXTRACT[Extract Text with PyMuPDF]
EXTRACT --> CHUNK[Chunk Documents]
CHUNK --> DETECT[Detect Language]
DETECT --> ARABIC{Language?}
ARABIC -->|Arabic| ARABIC_PROMPT[Enhanced Arabic Prompts]
ARABIC -->|English| ENGLISH_PROMPT[Standard English Prompts]
ARABIC -->|Mixed| MIXED_PROMPT[Mixed Language Prompts]
ARABIC_PROMPT --> LLM_EXTRACT[Gemini LLM Extraction]
ENGLISH_PROMPT --> LLM_EXTRACT
MIXED_PROMPT --> LLM_EXTRACT
LLM_EXTRACT --> PARSE[Parse JSON Response]
PARSE --> VALIDATE_JSON{Valid JSON?}
VALIDATE_JSON -->|No| FALLBACK[Use Fallback Extraction]
VALIDATE_JSON -->|Yes| CREATE_NODES[Create Graph Nodes]
FALLBACK --> CREATE_NODES
CREATE_NODES --> ADD_METADATA[Add Session & Language Metadata]
ADD_METADATA --> STORE_NEO4J[Store in Neo4j]
STORE_NEO4J --> GENERATE_EMB[Generate Embeddings]
GENERATE_EMB --> STORE_CHUNKS[Store Document Chunks]
STORE_CHUNKS --> SUCCESS[Processing Complete]
classDef process fill:#e3f2fd
classDef decision fill:#fff3e0
classDef error fill:#ffebee
classDef success fill:#e8f5e8
class EXTRACT,CHUNK,DETECT,LLM_EXTRACT,PARSE,CREATE_NODES,ADD_METADATA,STORE_NEO4J,GENERATE_EMB,STORE_CHUNKS process
class VALIDATE,ARABIC,VALIDATE_JSON decision
class ERROR,FALLBACK error
class SUCCESS success
```
Knowledge graph construction flow:

```mermaid
flowchart TD
DOCUMENT[Document Chunk] --> LANGUAGE[Language Detection]
LANGUAGE --> PROMPT_ENHANCE[Prompt Enhancement]
PROMPT_ENHANCE --> GEMINI[Gemini 2.5 Flash]
GEMINI --> EXTRACT_ENTITIES[Extract Entities]
GEMINI --> EXTRACT_RELATIONS[Extract Relationships]
EXTRACT_ENTITIES --> VALIDATE_ENTITIES[Validate Entity Structure]
EXTRACT_RELATIONS --> VALIDATE_RELATIONS[Validate Relationship Structure]
VALIDATE_ENTITIES --> ADD_SESSION_META[Add Session Metadata]
VALIDATE_RELATIONS --> ADD_SESSION_META
ADD_SESSION_META --> CREATE_NODES[Create Neo4j Nodes]
ADD_SESSION_META --> CREATE_EDGES[Create Neo4j Relationships]
CREATE_NODES --> INDEX_NODES[Create Node Indexes]
CREATE_EDGES --> INDEX_EDGES[Create Relationship Indexes]
INDEX_NODES --> EMBED_CHUNK[Generate Chunk Embedding]
INDEX_EDGES --> EMBED_CHUNK
EMBED_CHUNK --> STORE_CHUNK[Store Document Chunk]
STORE_CHUNK --> COMPLETE[Knowledge Graph Complete]
classDef input fill:#e1f5fe
classDef process fill:#e3f2fd
classDef storage fill:#e8f5e8
classDef complete fill:#c8e6c9
class DOCUMENT input
class LANGUAGE,PROMPT_ENHANCE,GEMINI,EXTRACT_ENTITIES,EXTRACT_RELATIONS,VALIDATE_ENTITIES,VALIDATE_RELATIONS,ADD_SESSION_META,CREATE_NODES,CREATE_EDGES,INDEX_NODES,INDEX_EDGES,EMBED_CHUNK,STORE_CHUNK process
class COMPLETE complete
```
Chat and retrieval flow:

```mermaid
flowchart TD
USER_MSG[User Message] --> VALIDATE_SESSION[Validate Session]
VALIDATE_SESSION --> STORE_USER[Store User Message]
STORE_USER --> GET_HISTORY[Get Chat History]
GET_HISTORY --> RETRIEVE_CONTEXT[Retrieve Knowledge Context]
RETRIEVE_CONTEXT --> SEMANTIC_SEARCH[Semantic Search Chunks]
RETRIEVE_CONTEXT --> GRAPH_TRAVERSAL[Graph Traversal Search]
RETRIEVE_CONTEXT --> EXPAND_CONTEXT[Expand Context by Relationships]
SEMANTIC_SEARCH --> COMBINE_RESULTS[Combine Retrieval Results]
GRAPH_TRAVERSAL --> COMBINE_RESULTS
EXPAND_CONTEXT --> COMBINE_RESULTS
COMBINE_RESULTS --> DETECT_QUERY_LANG[Detect Query Language]
DETECT_QUERY_LANG --> BUILD_PROMPT[Build Multilingual Prompt]
BUILD_PROMPT --> SYSTEM_PROMPT[System Prompt with Disclaimer]
BUILD_PROMPT --> CONTEXT_PROMPT[Context from Knowledge Graph]
BUILD_PROMPT --> HISTORY_PROMPT[Recent Chat History]
BUILD_PROMPT --> USER_PROMPT[User Question]
SYSTEM_PROMPT --> GEMINI_STREAM[Gemini Streaming Response]
CONTEXT_PROMPT --> GEMINI_STREAM
HISTORY_PROMPT --> GEMINI_STREAM
USER_PROMPT --> GEMINI_STREAM
GEMINI_STREAM --> STREAM_CHUNKS[Stream Response Chunks]
STREAM_CHUNKS --> STORE_ASSISTANT[Store Assistant Response]
STORE_ASSISTANT --> EXTRACT_SOURCES[Extract Sources for Citations]
EXTRACT_SOURCES --> RETURN_RESPONSE[Return Response with Sources]
classDef input fill:#e1f5fe
classDef process fill:#e3f2fd
classDef ai fill:#f3e5f5
classDef output fill:#e8f5e8
class USER_MSG input
class VALIDATE_SESSION,STORE_USER,GET_HISTORY,RETRIEVE_CONTEXT,SEMANTIC_SEARCH,GRAPH_TRAVERSAL,EXPAND_CONTEXT,COMBINE_RESULTS,DETECT_QUERY_LANG,BUILD_PROMPT,STORE_ASSISTANT,EXTRACT_SOURCES process
class SYSTEM_PROMPT,CONTEXT_PROMPT,HISTORY_PROMPT,USER_PROMPT,GEMINI_STREAM,STREAM_CHUNKS ai
class RETURN_RESPONSE output
```
## Backend Services

### FastAPI Application

The main FastAPI application orchestrates all services and provides the REST API endpoints.
Key Features:
- CORS middleware for cross-origin requests
- Lifespan management for service initialization
- Router inclusion for modular API structure
- Health check endpoints
Service Initialization Order:
1. SQLite database initialization
2. Neo4j connection establishment
3. Retrieval service setup
4. Embedding service initialization
5. LLM service configuration
### Sessions Router

Manages chat sessions with full CRUD operations and session isolation.
Endpoints:
- `POST /sessions` - Create new session
- `GET /sessions` - List all sessions
- `GET /sessions/{id}` - Get specific session
- `PUT /sessions/{id}` - Update session
- `DELETE /sessions/{id}` - Delete session and all associated data
Session Isolation:
- Each session maintains separate knowledge graphs
- Cascade deletion removes all related data
- Neo4j data clearing on session deletion
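A minimal sketch of the create and delete handlers described above. This is illustrative only: the in-memory dict stands in for the SQLite store, and the cascade step is shown as a comment.

```python
import uuid

from fastapi import APIRouter, HTTPException

router = APIRouter(prefix="/sessions", tags=["sessions"])
_sessions = {}  # illustrative stand-in for the SQLite sessions table

@router.post("")
async def create_session():
    session_id = str(uuid.uuid4())
    _sessions[session_id] = {"id": session_id}
    return {"session_id": session_id}

@router.delete("/{session_id}")
async def delete_session(session_id: str):
    if session_id not in _sessions:
        raise HTTPException(status_code=404, detail="Session not found")
    del _sessions[session_id]
    # Cascade deletion: the real service also removes messages/uploads rows
    # and clears the session's Neo4j data, e.g.
    # MATCH (n {session_id: $session_id}) DETACH DELETE n
    return {"deleted": session_id}
```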
### Ingest Router

Handles PDF upload, processing, and knowledge graph creation.
Processing Pipeline:
1. File validation and storage
2. PDF text extraction with PyMuPDF
3. Document chunking with configurable parameters
4. Language detection for multilingual support
5. Knowledge graph extraction with Gemini LLM
6. Neo4j storage with session isolation
7. Embedding generation and storage
Rate Limiting:
- 4-second delay between LLM requests
- Configurable chunk processing limits
- Error handling with graceful degradation
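A sketch of that rate-limiting loop, assuming an async `extract_fn` wrapping the Gemini call (both the function and the chunk shape are illustrative):

```python
import asyncio
import logging

LLM_REQUEST_DELAY_SECONDS = 4  # pause between successive LLM requests

async def extract_all_chunks(chunks, extract_fn, max_chunks=None):
    """Run LLM extraction chunk by chunk, respecting the request delay."""
    results = []
    for i, chunk in enumerate(chunks[:max_chunks]):
        if i > 0:
            await asyncio.sleep(LLM_REQUEST_DELAY_SECONDS)
        try:
            results.append(await extract_fn(chunk))
        except Exception as exc:
            # Graceful degradation: log and skip the failed chunk.
            logging.warning("Extraction failed for chunk %d: %s", i, exc)
    return results
```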
### Chat Router

Provides streaming and non-streaming chat capabilities with knowledge graph integration.
Features:
- Server-Sent Events (SSE) for real-time streaming
- Non-streaming fallback option
- Chat history management
- Source extraction for citations
- Error handling with user-friendly messages
Response Generation:
1. Store user message in SQLite
2. Retrieve recent chat history
3. Query knowledge graph for relevant context
4. Generate response with Gemini LLM
5. Stream response chunks to client
6. Store assistant response
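A sketch of the streaming step using FastAPI's `StreamingResponse` with SSE framing; `generate_stream` is a hypothetical wrapper around the Gemini streaming API, not the system's actual function:

```python
import json

from fastapi import APIRouter
from fastapi.responses import StreamingResponse

router = APIRouter()

@router.post("/chat")
async def chat(payload: dict):
    async def event_stream():
        # generate_stream is a hypothetical stand-in for the Gemini call.
        async for chunk in generate_stream(payload["message"]):
            yield f"data: {json.dumps({'delta': chunk})}\n\n"  # one SSE frame
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```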
## AI System Components

### Language Detector

Purpose: Automatic language detection for multilingual document processing.
Supported Languages:
- Arabic (Unicode ranges: \u0600-\u06FF, \u0750-\u077F, etc.)
- English (Latin characters)
- Mixed (combination of Arabic and English)
Detection Algorithm:
```python
import re

# Patterns for the Unicode ranges listed above (a sketch: the production
# pattern may cover additional Arabic blocks).
arabic_pattern = re.compile(r"[\u0600-\u06FF\u0750-\u077F]")
english_pattern = re.compile(r"[A-Za-z]")

def detect_language(self, text: str) -> str:
    arabic_chars = len(arabic_pattern.findall(text))
    english_chars = len(english_pattern.findall(text))
    total_chars = arabic_chars + english_chars
    if total_chars == 0:
        return 'english'  # guard for input with no script characters (assumed default)
    arabic_ratio = arabic_chars / total_chars
    english_ratio = english_chars / total_chars
    if arabic_ratio > 0.3:
        return 'arabic' if english_ratio <= 0.2 else 'mixed'
    elif english_ratio > 0.5:
        return 'english'
    else:
        return 'mixed'
```

Language-Specific Enhancements:
- Arabic: Enhanced prompts with Arabic legal terminology
- Mixed: Preservation of both language contexts
- English: Standard processing
### PDF Processor

Technology: PyMuPDF (fitz) for high-quality text extraction.
Features:
- Page-by-page text extraction
- Metadata preservation (page numbers, file info)
- Document chunking with RecursiveCharacterTextSplitter
- Error handling for corrupted PDFs
Chunking Configuration:
- Default chunk size: 1000 characters
- Default overlap: 100 characters
- Separators: `["\n\n", "\n", " ", ""]`
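A combined sketch of extraction plus chunking with the defaults above. The splitter import path varies across LangChain versions, and `contract.pdf` is a placeholder:

```python
import fitz  # PyMuPDF
from langchain_text_splitters import RecursiveCharacterTextSplitter

doc = fitz.open("contract.pdf")
pages = [(i + 1, page.get_text()) for i, page in enumerate(doc)]

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n", " ", ""],
)
chunks = [
    {"text": piece, "page": page_no}  # preserve page numbers as metadata
    for page_no, text in pages
    for piece in splitter.split_text(text)
]
```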
### Knowledge Graph Builder

Technology: LangChain + Neo4j + Gemini LLM
Entity Extraction:
- Legal entities (persons, organizations, contracts, cases)
- Legal relationships (agreements, obligations, rights)
- Key legal concepts and terms
- Dates, amounts, and important details
Graph Construction Process:
1. Language-specific prompt enhancement
2. Gemini LLM-based entity extraction
3. JSON response parsing and validation
4. Neo4j node and relationship creation
5. Session and language metadata addition
6. Document chunk storage with embeddings
Error Handling:
- JSON parsing fallbacks
- Graceful degradation on LLM failures
- Comprehensive logging for debugging
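The JSON fallback might look roughly like this: strip any markdown fences the LLM wraps around its output, then degrade to an empty graph if parsing still fails (a sketch, not the system's actual parser):

```python
import json
import re

def parse_llm_graph(raw: str) -> dict:
    """Parse the LLM's entity/relationship JSON, tolerating markdown fences."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return {"entities": [], "relationships": []}  # graceful degradation
    if not isinstance(data, dict):
        return {"entities": [], "relationships": []}
    data.setdefault("entities", [])
    data.setdefault("relationships", [])
    return data
```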
### LLM Service

Model: Gemini 2.5 Flash Lite
Features:
- Multilingual prompt building
- Streaming and non-streaming responses
- Context-aware response generation
- Legal disclaimer integration
Prompt Structure:
```
System Prompt (with disclaimer)
+ Context from Knowledge Graph
+ Recent Chat History
+ User Question
```
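Assembly of those four sections might look like this; the disclaimer wording and section labels are placeholders, not the system's actual prompt text:

```python
def build_prompt(context: str, history: str, question: str, language: str) -> str:
    """Concatenate the four prompt sections in the order shown above."""
    system = (
        "You are a legal assistant. You do not give legal advice; always "
        "include a disclaimer recommending consultation with a qualified lawyer."
    )
    if language == "arabic":
        system += " Respond in Arabic using appropriate legal terminology."
    return "\n\n".join([
        system,
        f"Context from knowledge graph:\n{context}",
        f"Recent chat history:\n{history}",
        f"User question:\n{question}",
    ])
```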
Language-Specific Enhancements:
- Arabic: Cultural and legal context awareness
- Mixed: Preservation of both language contexts
- English: Standard legal assistance
### Retrieval Service

Purpose: Enhanced knowledge graph querying with multilingual support.
Retrieval Strategy:
- Semantic Search: Retrieve all document chunks for comprehensive context
- Graph Traversal: Search entities and relationships based on query terms
- Context Expansion: Follow relationships to find connected entities
- Language Filtering: Support for language-specific queries
Search Features:
- Meaningful term extraction with Arabic compound word handling
- Comprehensive Cypher queries across all node properties
- Relevance scoring based on content type
- Relationship traversal for context expansion
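A session-scoped entity search with the official Neo4j Python driver might look like this; the label and property names follow the scheme described in this document, but the query itself is illustrative:

```python
from neo4j import GraphDatabase

# driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

ENTITY_SEARCH = """
MATCH (e:Entity {session_id: $session_id})
WHERE any(term IN $terms WHERE toLower(e.name) CONTAINS toLower(term))
OPTIONAL MATCH (e)-[r]-(related {session_id: $session_id})
RETURN e, r, related
LIMIT $limit
"""

def search_entities(driver, session_id, terms, limit=25):
    """Find matching entities plus their one-hop neighbours, session-filtered."""
    with driver.session() as neo_session:
        result = neo_session.run(
            ENTITY_SEARCH, session_id=session_id, terms=terms, limit=limit
        )
        return [record.data() for record in result]
```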
### Embedding Service

Technology: Sentence Transformers with fallback options
Models (in order of preference):
1. `paraphrase-MiniLM-L6-v2` (384 dimensions)
2. `distilbert-base-nli-mean-tokens` (768 dimensions)
3. `all-mpnet-base-v2` (768 dimensions)
4. `all-MiniLM-L12-v2` (384 dimensions)
Fallback Strategy:
- Simple local embedding using word hashing
- Fixed 100-dimensional vectors
- Basic text processing for reliability
Features:
- Batch embedding generation
- Cosine similarity computation
- Text cleaning and normalization
- Embedding serialization for storage
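The word-hashing fallback can be as simple as bucketing words into a fixed-size vector and L2-normalizing so cosine similarity stays meaningful; a sketch of that idea:

```python
import hashlib
import math

def fallback_embedding(text: str, dim: int = 100) -> list:
    """Hash each word into one of `dim` buckets, then L2-normalize."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```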
## Database Architecture

### SQLite Database

Tables:
- `sessions`: Chat session metadata
- `messages`: Chat history with token counts
- `uploads`: File upload records
Features:
- WAL mode for better concurrency
- Async SQLAlchemy integration
- Automatic timestamp management
- Cascade deletion for data integrity
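SQLite's WAL mode is enabled per connection; with async SQLAlchemy that is typically done through a `connect` event on the underlying sync engine. A sketch, reusing the `DATABASE_URL` shown in the Configuration section:

```python
from sqlalchemy import event
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine("sqlite+aiosqlite:///./avokat.db")

@event.listens_for(engine.sync_engine, "connect")
def _enable_wal(dbapi_connection, connection_record):
    # WAL lets readers proceed while a writer holds the database.
    cursor = dbapi_connection.cursor()
    cursor.execute("PRAGMA journal_mode=WAL")
    cursor.close()
```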
### Neo4j Knowledge Graph

Node Types:
- `Entity`: Legal entities with properties
- `Fact`: Legal facts and evidence
- `Document`: Document metadata
- `LegalConcept`: Legal terms and concepts
- `Case`: Legal cases and proceedings
- `DocumentChunk`: Text chunks with embeddings
Relationship Types:
- `ABOUT`: Facts about entities
- `CONTAINS`: Documents containing facts
- `MENTIONS`: Documents mentioning entities
- `RELATED_TO`: Entity relationships
- `APPLIES_TO`: Legal concepts applying to entities
- `INVOLVES`: Cases involving entities
Indexes:
- Session isolation indexes on all node types
- Language-specific indexes for multilingual support
- Entity type and relationship type indexes
- Performance optimization indexes
Session Isolation:
- All nodes and relationships tagged with `session_id`
- All queries filtered by session ID
- Automatic cleanup on session deletion
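Because every node carries `session_id`, cleanup on deletion can be a single Cypher statement (illustrative, using the same driver pattern as the retrieval sketch above):

```python
CLEAR_SESSION = """
MATCH (n {session_id: $session_id})
DETACH DELETE n
"""

def clear_session_graph(driver, session_id):
    """Remove every node and relationship belonging to one session."""
    with driver.session() as neo_session:
        neo_session.run(CLEAR_SESSION, session_id=session_id)
```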
## API Endpoints

```
POST   /sessions                     # Create session
GET    /sessions                     # List sessions
GET    /sessions/{id}                # Get session
PUT    /sessions/{id}                # Update session
DELETE /sessions/{id}                # Delete session
GET    /sessions/{id}/messages       # Get session messages
GET    /sessions/{id}/uploads        # Get session uploads
GET    /sessions/{id}/full           # Get complete session data

POST   /ingest                       # Upload and process PDF
POST   /chat                         # Streaming chat
POST   /chat/non-streaming           # Non-streaming chat
GET    /chat/history/{session_id}    # Get chat history

POST   /neo4j/query                  # Execute Cypher query
GET    /neo4j/stats/{session_id}     # Get session statistics
```
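A typical client round trip against these endpoints using `httpx`; the multipart field names and request bodies are assumptions for illustration, not confirmed parts of this API:

```python
import httpx

with httpx.Client(base_url="http://localhost:8000") as client:
    session_id = client.post("/sessions").json()["session_id"]
    with open("contract.pdf", "rb") as f:  # placeholder file name
        client.post(
            "/ingest",
            data={"session_id": session_id},
            files={"file": ("contract.pdf", f, "application/pdf")},
        )
    reply = client.post(
        "/chat/non-streaming",
        json={"session_id": session_id, "message": "Summarize the contract."},
    )
    print(reply.json())
```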
## Multilingual Support

- Automatic detection using Unicode character analysis
- Support for Arabic, English, and mixed-language content
- Language-specific prompt enhancements
- Enhanced prompts with Arabic legal terminology
- Cultural context awareness
- Right-to-left text handling considerations
- Arabic compound word mapping
- Preservation of original language in entities
- Cross-language relationship detection
- Language-specific indexing for efficient queries
- Arabic documents receive enhanced prompts
- English documents use standard processing
- Mixed-language documents maintain both contexts
## Configuration

Environment variables:

```
# Database Configuration
DATABASE_URL=sqlite+aiosqlite:///./avokat.db
NEO4J_URI=neo4j+s://your-instance.databases.neo4j.io
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your-password
NEO4J_DATABASE=neo4j
# AI Services
GEMINI_API_KEY=your-gemini-api-key
# API Configuration
API_TITLE=Avokat AI API
API_VERSION=1.0.0
DEBUG=false
```

Service configuration:

- PDF Processing: Configurable chunk size and overlap
- LLM Service: Rate limiting and error handling
- Embedding Service: Model selection and fallback options
- Retrieval Service: Search limits and language filtering
## Deployment

Prerequisites:

- Python 3.8+
- Neo4j Aura Cloud instance
- Gemini API key
- PyMuPDF installation
1. Environment Setup:

   ```
   py -m venv venv
   venv\Scripts\Activate.ps1
   pip install -r requirements.txt
   ```

2. Configuration:
   - Set environment variables
   - Configure Neo4j Aura connection
   - Set Gemini API key

3. Database Initialization:
   - SQLite tables created automatically
   - Neo4j indexes created on startup

4. Service Startup:

   ```
   uvicorn backend.app.main:app --host 0.0.0.0 --port 8000
   ```
Monitoring:

- `/health` endpoint for service status
- Database connectivity verification
- External service availability checks
- Comprehensive logging throughout the system
- Error tracking and graceful degradation
- Performance metrics for key operations
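A minimal `/health` endpoint along those lines; `check_sqlite` and `check_neo4j` are hypothetical async connectivity probes, not functions from this codebase:

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
async def health():
    # check_sqlite / check_neo4j are hypothetical stand-ins for real probes.
    status = {"sqlite": await check_sqlite(), "neo4j": await check_neo4j()}
    status["ok"] = all(status.values())
    return status
```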
This documentation provides a comprehensive overview of the Avokat AI backend and AI system architecture. The system is designed for scalability, multilingual support, and robust error handling while maintaining session isolation and legal compliance through built-in disclaimers.