Work-in-progress Flask microservice (aiming for production readiness) that exposes APIs for:
- Upload-free (URL-based) document ingestion
- Text chunking + embedding generation + persistence
- Semantic similarity search over stored document embeddings
- Retrieval-Augmented Generation (RAG) chat endpoint
- (Pluggable) Large Language Model (LLM) inference via a GenAI hub / OpenAI compatible proxy
- Programmatic model configuration + deployment (helper utilities)
- Supabase object storage file fetching
- Modular Flask architecture using Blueprints:
  - `/api/documents` – (demo placeholder) returns sample documents.
  - `/api/genai` – token retrieval, document embedding + storage, semantic search.
  - `/api/rag-pipeline` – end-to-end Retrieval → Augmentation → Generation chat API.
- Asynchronous (threaded) background ingestion for remote files (by URL).
- Chunked document processing with overlap for better embedding recall.
- Batched embedding computation for efficiency (`get_embeddings_batch`; see the sketch after this list).
- Vector similarity search backed by (presumably) SAP HANA or a similar DB (per naming).
- Supabase storage integration for downloading protected assets.
- Utilities for dynamic model configuration and deployment via an external AI API.
- Environment-driven configuration for portability (Cloud Run / container friendly).
- Clear layering:
  - API layer (Flask Blueprints)
  - Processing layer (`document_processing`)
  - Embedding + persistence layer (`hana_db_connection`, `embedding`)
  - LLM inference layer (`utils/LLM.py` via GenAI hub proxy)
  - Deployment/config utilities (`model_config.py`, `model_deployment.py`)
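
For orientation, here is a minimal sketch of what the batched embedding step could look like. The real `get_embeddings_batch` lives in `utils/embedding.py` and may differ; the `embed` callable and default batch size here are illustrative assumptions:

```python
from typing import Callable, List

def get_embeddings_batch(
    texts: List[str],
    embed: Callable[[List[str]], List[List[float]]],  # assumed: one call embeds a list of texts
    batch_size: int = 16,                             # illustrative default
) -> List[List[float]]:
    """Embed texts in fixed-size batches to cut per-request overhead."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed(batch))  # one API round-trip per batch, not per chunk
    return vectors
```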
```
                      +------------------------------+
Client ------------>  |     Flask App (app.py)       |
                      |    Registers Blueprints      |
                      +---------------+--------------+
                                      |
            +-------------------------+-------------------------+
            |                         |                         |
       /api/genai            /api/rag-pipeline           /api/documents
            |                         |                         |
 (Token, Embed, Search)          (RAG Chat)                (Demo JSON)
            |                         |
            v                         v
   +----------------+       +-----------------+
   | Document Proc. |       |  RAG Pipeline   |
   | Download &     |       | 1. Embed Query  |
   | Chunk Text     |       | 2. Retrieve     |
   | Batch Embed    |       | 3. Augment      |
   +--------+-------+       | 4. Generate     |
            |               +--------+--------+
            v                        |
     +-------------+                 v
     | Embeddings  | <--- Similarity Search (DB / Vector Store)
     | (DB/HANA)   |
     +------+------+
            ^
            |
   External File Sources (Supabase / URL)
```
GET /
Returns: "home" (basic liveness check).
GET /api/documents/
Returns static sample JSON.
Issues:
- Duplicate route decorators; one has `methods=['']` (invalid) – should be removed or corrected (see the sketch below).
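
A corrected version could look like this (sketch; the blueprint and function names are assumptions, not the actual `api/document.py`):

```python
from flask import Blueprint, jsonify

documents_bp = Blueprint("documents", __name__)

# A single decorator with a valid HTTP method list replaces the
# duplicate decorators (one of which had methods=['']).
@documents_bp.route("/", methods=["GET"])
def get_documents():
    return jsonify({"documents": ["sample-1", "sample-2"]})
```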
| Endpoint | Method | Purpose | Request Body (JSON) | Notes |
|---|---|---|---|---|
| `/token` | POST | Obtain access token | none | Uses `get_access_token()` |
| `/create-store-embedding` | POST | Download file by `file_url`, process, embed, store | `file_url` (str), `username` (str), `doc_type` (str) | Metadata is created but not actually passed into the worker thread (bug; see the fix sketch after this table) |
| `/search-similiar-documents` | POST | Semantic search | `query` (str), `k` (int, opt), `username` (opt) | Several variable / method-call bugs (see Issues) |
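
One way to fix the metadata bug flagged above is to extend the worker's signature and pass the metadata through when spawning the thread. A sketch, assuming `process_and_embed_file_from_url` gains a `metadata` parameter:

```python
import threading

metadata = {"username": username, "doc_type": doc_type}

# Hand the metadata to the worker instead of discarding it.
worker = threading.Thread(
    target=process_and_embed_file_from_url,
    args=(file_url,),
    kwargs={"metadata": metadata},  # requires extending the function signature
    daemon=True,
)
worker.start()
```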
POST `/api/rag-pipeline/chat`

Request JSON:

```json
{
  "query": "What is ...?",
  "k": 3,
  "username": "optional-filter",
  "temperature": 0.1,
  "max_tokens": 500
}
```

Response:

```json
{
  "success": true,
  "query": "...",
  "answer": "Generated answer..."
}
```
Flow:
- Embed user query.
- Retrieve top-K similar documents.
- Concatenate text into context.
- Construct prompt and call the LLM (`gpt-4o-mini` via `get_llm_response`).
- Return generated answer (no source citations yet).
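
The flow above maps onto a handler roughly like this. A condensed sketch, not the actual `RAG_pipeline.py`; the helper imports assume the inferred module locations listed later in this README, and the helper signatures are assumptions:

```python
from flask import Blueprint, jsonify, request

from utils.embedding import get_embedding                       # inferred location
from utils.hana_db_connection import search_similiar_documents  # inferred location
from utils.LLM import get_llm_response

rag_bp = Blueprint("rag_pipeline", __name__)

@rag_bp.route("/chat", methods=["POST"])
def chat():
    data = request.get_json()
    query = data["query"]

    query_embedding = get_embedding(query)                       # 1. Embed
    docs = search_similiar_documents(                            # 2. Retrieve
        query_embedding, k=data.get("k", 3), username=data.get("username")
    )
    context = "\n\n".join(doc["text"] for doc in docs)           # 3. Augment
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    answer = get_llm_response(                                   # 4. Generate
        prompt,
        temperature=data.get("temperature", 0.1),
        max_tokens=data.get("max_tokens", 500),
    )
    return jsonify({"success": True, "query": query, "answer": answer})
```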
File: utils/document_processing.py
Key steps:
- `process_and_embed_file_from_url(file_url)` orchestrates:
  - Download + extract text (PDF / DOC / DOCX).
  - `split_text_into_chunks` with overlap (default 1000 chars, 200 overlap).
  - `preprocess_text_chunks` (not shown in the partial source; assumed cleanup).
  - Batch embeddings via `get_embeddings_batch`.
  - Build rows and persist via `batch_insertion_embedding`.
- Chunk metadata includes `chunk_index`, `chunk_size`, `source_url`.
- Resilient HTTP downloads with retries (`urllib3` `Retry` + `requests` session).
- Temporary file usage to handle large files efficiently.
(Sections after the visible truncation should be documented similarly if they include PDF/DOCX parsing logic.)
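
A minimal sketch of the overlapping chunker described above, using the defaults stated in this README (the real `split_text_into_chunks` may differ):

```python
from typing import List

def split_text_into_chunks(text: str, chunk_size: int = 1000,
                           overlap: int = 200) -> List[str]:
    """Fixed-size chunks whose tails overlap, so sentences cut at a
    chunk boundary still appear whole in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance 800 chars per chunk by default
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```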
File: utils/LLM.py
- Wrapper around a GenAI hub / OpenAI-like proxy.
- Uses `chat.completions.create` with:
  - System prompt: "You are a helpful assistant."
  - Deterministic generation (`temperature=0.0` in the base helper; the RAG pipeline conceptually overrides this with a custom temperature).
- Model currently hardcoded in the RAG pipeline to `"gpt-4o-mini"`.
- `utils/model_config.py`: creates a configuration with a remote AI API (`/v2/lm/configurations`).
- `utils/model_deployment.py`: deploys a configuration (`/v2/lm/deployments`).
- Likely used in a provisioning workflow (not yet exposed via API endpoints).
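
A hedged sketch of what the configuration helper likely does; the endpoint path comes from this README, while the headers and payload fields are assumptions about the remote AI API:

```python
import os
import requests

def create_configuration(token: str, name: str, model_name: str) -> dict:
    """POST a model configuration to the AI API (payload fields assumed)."""
    url = f"{os.environ['AI_API_URL']}/v2/lm/configurations"
    headers = {
        "Authorization": f"Bearer {token}",
        "AI-Resource-Group": os.environ.get("AI_RESOURCE_GROUP", "demo"),  # assumed header
    }
    payload = {  # illustrative payload shape
        "name": name,
        "parameterBindings": [{"key": "modelName", "value": model_name}],
    }
    resp = requests.post(url, json=payload, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()
```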
| Integration | Purpose | Notes |
|---|---|---|
| Supabase (`supabase_connection.py`) | Download files from storage bucket | Needs `NEXT_PUBLIC_SUPABASE_URL` + `NEXT_PUBLIC_SUPABASE_ANON_KEY` |
| SAP HANA / Vector DB (`hana_db_connection`, placeholder) | Store embeddings & run similarity searches | Functions referenced: `insert_embedding`, `batch_insertion_embedding`, `search_similiar_documents`, `get_all_data` |
| OAuth Token (`utils.oauth_token`) | Access token for AI API | Used by `/api/genai/token` |
| GenAI Hub / OpenAI Proxy (`gen_ai_hub.proxy.native.openai`) | LLM & embeddings | Abstraction layer for multiple models |
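
For reference, downloading a stored object with `supabase-py` looks roughly like this (the bucket name and object path are placeholders):

```python
import os
from supabase import create_client

supabase = create_client(
    os.environ["NEXT_PUBLIC_SUPABASE_URL"],
    os.environ["NEXT_PUBLIC_SUPABASE_ANON_KEY"],
)

# Download a file from a storage bucket into memory, then persist it locally.
file_bytes = supabase.storage.from_("docs").download("sample.pdf")
with open("sample.pdf", "wb") as f:
    f.write(file_bytes)
```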
Create a .env file (example):
```env
# Flask / Server
PORT=5000

# Supabase
NEXT_PUBLIC_SUPABASE_URL=your_supabase_url
NEXT_PUBLIC_SUPABASE_ANON_KEY=your_supabase_anon_key

# AI Platform
AI_API_URL=https://your-ai-api-base
AI_RESOURCE_GROUP=demo

# Auth / Tokens (examples; adapt to your real naming)
GENAI_API_KEY=...
OPENAI_API_KEY=...
OAUTH_CLIENT_ID=...
OAUTH_CLIENT_SECRET=...
OAUTH_TOKEN_URL=...

# Database (HANA / Vector Store)
HANA_HOST=...
HANA_PORT=...
HANA_USER=...
HANA_PASSWORD=...
HANA_SCHEMA=...
```
(Adjust to the actual variable names used in missing modules.)
Prerequisites:
- Python 3.10+
- (Optional) SAP HANA client / driver libraries installed
- (Optional) Supabase credentials
- (Optional) Proper AI platform access
Steps:
```bash
git clone https://github.com/kunal-kumar-chaudhary/AI_FLASK_SERVICE.git
cd AI_FLASK_SERVICE
python -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install -r requirements.txt    # create this file if it does not exist yet
cp .env.example .env               # populate with real values
python app.py
```
The service listens on PORT (defaults to 5000 if unset).
```
AI_FLASK_SERVICE/
├── app.py
├── api/
│   ├── __init__.py
│   ├── document.py
│   ├── genai_hub.py
│   └── RAG_pipeline.py
├── utils/
│   ├── LLM.py
│   ├── document_processing.py
│   ├── supabase_connection.py
│   ├── model_config.py
│   ├── model_deployment.py
│   ├── embedding.py            # (not listed, inferred)
│   ├── hana_db_connection.py   # (not listed, inferred)
│   ├── oauth_token.py          # (not listed, inferred)
│   └── ...
└── requirements.txt            # (add if missing)
```
- Client posts query → `/api/rag-pipeline/chat`
- Embed query → `get_embedding`
- Retrieve similar documents → `search_similiar_documents`
- Concatenate retrieved texts → context
- Build augmented prompt → LLM
- Generate final response → return JSON
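
The retrieval step boils down to a nearest-neighbour lookup over the stored vectors. If the database did not offer one natively, a NumPy fallback would look like this (illustrative only; the actual search runs in the DB layer):

```python
import numpy as np

def top_k_similar(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3):
    """Return (index, score) pairs for the k most cosine-similar rows."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    top = np.argsort(scores)[::-1][:k]  # highest scores first
    return [(int(i), float(scores[i])) for i in top]
```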
| Area | Issue | Recommendation |
|---|---|---|
| `api/document.py` | Duplicate route decorator; `methods=['']` is invalid | Remove the second decorator or specify proper HTTP methods |
| `api/genai_hub.py` (`search_documents`) | Uses `request.json()` instead of `request.get_json()` | Replace with `request.get_json()` |
| | Variable mismatch: `username_filter` vs `username` when filtering | Standardize variable naming |
| | Calls `search_similiar_documents(query=query, k=k, username=username)`, but the RAG pipeline passes `query_embedding` | Align function signatures (decide on embedding vs raw text) |
| Spelling | "similiar" | Rename to `search_similar_documents` (function + route) |
| `/create-store-embedding` | Metadata created but not passed into the thread (`process_and_embed_file_from_url` accepts only `file_url`) | Extend the function signature and pass metadata |
| Threading | Fire-and-forget thread; no error callback | Consider a task queue (Celery / RQ) or logging of failures |
| Embeddings persistence | Embedding stored as `str(embedding_vector)` | Prefer binary / JSON / array columns; ensure consistent decoding for search |
| RAG pipeline | No source citations returned | Return a list of chunk IDs / metadata for transparency |
| Security | No auth or rate limiting on endpoints | Add authentication (token / JWT) + throttling |
| Error handling | Generic `except Exception as e` → 500 with a raw string | Standardize the error schema & logging (see the sketch after this table) |
| Input validation | Limited schema checking | Introduce pydantic / marshmallow models |
| Observability | Minimal logging; no correlation IDs | Add structured logging (JSON) and request IDs |
| Testing | No tests present | Add unit + integration tests (pytest) |
| Config | Scattered `os.getenv` calls | Centralize a configuration module |
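
One way to standardize the error schema called out above (a sketch; the response fields are a suggestion, not the service's current behaviour):

```python
import logging
from flask import jsonify

logger = logging.getLogger(__name__)

def register_error_handlers(app):
    @app.errorhandler(Exception)
    def handle_unexpected_error(exc):
        """Log the traceback and return a consistent JSON error body."""
        logger.exception("Unhandled error")
        body = {
            "success": False,
            "error": {"type": exc.__class__.__name__, "message": str(exc)},
        }
        return jsonify(body), 500
```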
Potential enhancements:
- Add an `/api/genai/status` endpoint for background job tracking.
- Implement source attribution in RAG responses: return a list of `(chunk_index, similarity_score, source_url)` tuples.
- Add hybrid search (semantic + keyword).
- Pluggable embedding backends (OpenAI, HuggingFace, Azure).
- Caching layer (Redis) for repeated queries.
- Streaming responses for large LLM outputs.
- Structured prompt templates with variable injection + guardrails.
- Add retry/backoff and a circuit breaker around external API calls (see the sketch below).
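
As a starting point for the retry/backoff item, a dependency-free exponential backoff around an external call (retry count and delays are illustrative):

```python
import time
import requests

def post_with_backoff(url: str, payload: dict, retries: int = 3) -> dict:
    """POST with exponential backoff on transient network failures."""
    for attempt in range(retries):
        try:
            resp = requests.post(url, json=payload, timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise                # exhausted retries: surface the error
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
```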
Create & store embeddings:

```http
POST /api/genai/create-store-embedding
Content-Type: application/json

{
  "file_url": "https://your-supabase-url/storage/v1/object/public/docs/sample.pdf",
  "username": "alice",
  "doc_type": "policy"
}
```

RAG chat:

```http
POST /api/rag-pipeline/chat
Content-Type: application/json

{
  "query": "Summarize the policy document",
  "k": 4,
  "temperature": 0.2
}
```
| Concern | Recommendation |
|---|---|
| Gunicorn / Production Server | Wrap the Flask app with Gunicorn (or uvicorn workers, if migrating to ASGI) |
| Concurrency | Replace raw threads with a task queue for robust scaling |
| Containerization | Provide Dockerfile + multi-stage build |
| Secrets | Use a secret manager (GCP Secret Manager / AWS Secrets Manager) |
| Migrations | If DB schema evolves, adopt Alembic or similar |
| Monitoring | Integrate Prometheus metrics & health checks |
| Symptom | Possible Cause | Fix |
|---|---|---|
| 500 on embedding creation | Missing env vars / DB connection failure | Verify .env + DB credentials |
| Empty RAG answers | No documents retrieved / embedding mismatch | Check embedding model consistency |
| Token endpoint failure | OAuth provider unreachable | Validate network + credentials |
| Slow ingestion | Large file + single-threaded processing | Increase worker pool / chunk size tuning |
(Choose a license, e.g. MIT, Apache-2.0, add a LICENSE file.)
- Fork + branch (`feat/your-feature`)
- Write tests
- Open a PR with a clear description
- Ensure endpoints follow a consistent error schema
- Fix route & parameter naming inconsistencies
- Add source citations to RAG responses
- Introduce authentication & rate limiting
- Add Dockerfile + CI pipeline
- Implement retry logic & structured logging
- Add unit/integration tests
- Support streaming LLM responses
- Provide OpenAPI/Swagger spec
- Implement background job queue
- Add evaluation harness for RAG quality