Agentic AI backend that turns local and cloud LLMs into tool-using, web-aware personalized assistants
API Reference • Download Android APK
An Agentic AI backend that orchestrates local and cloud LLMs through LangChain agents with real-time web access, multi-modal vision, autonomous tool use, and adaptive context management. The system is fully model-agnostic -- it works with any Ollama-compatible model and automatically adapts its behavior (search depth, observation budgets, tool strategy) to the model's size and capabilities.
Web Client Demo: ollama-web-gold.vercel.app
Agentic Tool Orchestration -- A LangChain agent (create_agent) with custom middleware handles autonomous tool selection and multi-step reasoning. The agent can chain tools sequentially (e.g., search the web, then check the weather for a location found in search results). A custom ReActParsingMiddleware enables tool calling even for models that lack native function-calling support by detecting pseudo-JSON tool calls in the LLM's text output and converting them to structured tool invocations.
Vision and Multi-Modal Support -- When images are detected in a request, the system automatically routes to a vision-capable model (e.g., Qwen3-VL, Llama 3.2 Vision) via a "vision bridge." If the primary model is not vision-capable, images are translated to structured text descriptions using a dedicated vision model, and those descriptions are injected back into the conversation. Supports base64 image processing, resizing/compression, and multi-image analysis across conversation history.
RAG Pipeline with Smart Caching -- The search pipeline performs multi-step retrieval: intent detection, query generation, external search (Tavily for Google/News/Wikipedia, Reddit RSS, OpenWeatherMap), and result synthesis. A similarity-aware cache uses Jaccard/Cosine similarity, SimHash, and WordNet synonyms to avoid redundant searches.
Dynamic Context Scaling -- The backend inspects each model's parameter count and context window to dynamically adjust scraping depth, observation budget, and search result volume. Smaller models receive shorter, more focused content; larger models get deeper scrapes and larger context windows. If a prompt overflows the context, the middleware automatically retries with progressively reduced observation limits.
Streaming with Real-Time Status -- Responses are delivered via Server-Sent Events (SSE) with granular status updates throughout the pipeline. A StreamingTokenSanitizer filters raw JSON tool-call artifacts and code fences from the token stream before they reach the client.
User Memory -- A dedicated /memory/update endpoint accepts recent conversation history and uses the LLM itself to synthesize an updated, concise user profile (preferences, facts, context). The agent can also save user preferences mid-conversation via a save_user_preference tool. Memory is injected into the system prompt on subsequent requests.
Task Lifecycle Management -- Each streaming request is assigned a UUID (request_id). A singleton TaskManager tracks active asyncio tasks, enabling the /chat/stop endpoint to cancel in-progress generations. The system also detects client disconnects and cleans up resources automatically.
File Attachment Processing -- Supports image, PDF, text, and code file uploads through both JSON and multipart/form-data requests.
Reasoning / Thinking Stream -- A ThinkingTokenProcessor state machine extracts Reasoning blocks from model output in real time. Thinking tokens are streamed as a separate SSE event type (event: thought), keeping internal reasoning visible to the client without polluting the final response. This provides first-class support for reasoning models.
Centralized Error Handling -- Custom exception hierarchy maps errors to structured JSON responses with specific error codes both in REST responses and within the SSE stream.
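As a rough illustration of the ReAct fallback described above, a middleware can scan a model's free-text output for a pseudo-JSON action block and promote it to a structured tool call. The sketch below is illustrative only: the `action`/`action_input` key convention and the regex are assumptions, not the repository's actual `ReActParsingMiddleware` implementation.

```python
import json
import re

# Detect a pseudo-JSON action block (one level of nesting) in free text.
ACTION_BLOCK = re.compile(
    r'\{(?:[^{}]|\{[^{}]*\})*"action"(?:[^{}]|\{[^{}]*\})*\}', re.DOTALL
)

def parse_tool_call(text):
    """Return {'tool': ..., 'args': ...} if `text` contains an action
    block, else None (the output is treated as a plain answer)."""
    match = ACTION_BLOCK.search(text)
    if match is None:
        return None
    try:
        block = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None  # malformed pseudo-JSON: pass through as ordinary text
    if "action" not in block:
        return None
    return {"tool": block["action"], "args": block.get("action_input", {})}
```

A middleware built on a function like this can inject tool instructions into the system prompt and then convert any detected block into the same structured invocation that natively tool-calling models produce.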
| Layer | Technologies |
|---|---|
| Framework | FastAPI, LangChain, Uvicorn, Pydantic v2 |
| LLM Providers | Ollama (local), Ollama Cloud API |
| Vision | Pillow (image processing), multi-model vision bridge |
| Search and Data | Tavily Search API (Google/News/Wikipedia), Reddit RSS, OpenWeatherMap One Call 3.0 |
| Web Scraping | httpx (async, HTTP/2, Brotli), BeautifulSoup4, lxml, Jina Reader API (JS fallback) |
| Caching | Redis (primary) + SQLite (disk), Jaccard/Cosine/SimHash similarity |
| NLP | Open English WordNet (synonyms), Jellyfish, Levenshtein |
| Streaming | Server-Sent Events (SSE), asyncio queues |
| Auth | API key middleware |
| Testing | pytest (unit + integration) |
| Deployment | Docker |
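The similarity metrics listed under Caching can be sketched in a few lines. The word-level tokenization and hash choice below are illustrative assumptions; the repository's `text_similarity.py` may differ in detail.

```python
import hashlib

def jaccard(a, b):
    """Jaccard similarity over lowercase word sets (1.0 = identical sets)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def simhash(text, bits=64):
    """64-bit SimHash: similar texts produce hashes with small Hamming distance."""
    weights = [0] * bits
    for word in text.lower().split():
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a, b):
    """Number of differing bits between two SimHash values."""
    return bin(a ^ b).count("1")
```

A cache lookup can first try an exact key match, then fall back to comparing the query against cached queries with these metrics before deciding a stored result is "close enough" to reuse.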
sequenceDiagram
participant C as Client
participant API as FastAPI
participant AG as Agent
participant VLLM as Vision LLM
participant LLM as Primary LLM
participant T as Tools
participant Cache as Smart Cache
C->>API: POST /chat/stream
API-->>C: event: status (initializing)
API->>AG: Build agent with model + history + tools
opt Images in payload and primary model is not vision-capable
AG-->>C: event: status (analyzing images...)
AG->>VLLM: Images + prompt context
VLLM-->>AG: JSON — image descriptions + observations
Note over AG: Strips images, injects<br/>text descriptions into prompt
end
AG->>LLM: System prompt + history + user message
opt Model lacks native tool support
Note over AG,LLM: ReAct middleware injects tool<br/>instructions into system prompt.<br/>Parses JSON action blocks from<br/>text response at runtime.
end
LLM-->>AG: Tool call (native or ReAct-parsed JSON)
AG-->>C: event: status (searching...)
AG->>Cache: Lookup — exact match, then fuzzy similarity
alt Cache HIT
Cache-->>AG: Cached results (truncated to model context size)
else Cache MISS
Cache-->>AG: Miss
AG->>T: Execute tool (web search / weather)
T-->>AG: Raw results
AG->>Cache: Store results with TTL
end
AG->>LLM: Observation + full context
loop Streaming response
LLM-->>AG: Token or thought chunk
AG-->>C: event: thought (reasoning trace)
AG-->>C: event: token (response text)
end
API-->>C: event: done (sources + usage metadata)
Note over C,API: Asynchronous — triggered post-conversation
C->>API: POST /memory/update (recent conversations)
API->>LLM: Synthesize user profile from conversation history
LLM-->>API: Updated memory string
API-->>C: Updated memory
See more in diagrams/
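The vision-bridge step in the diagram can be sketched as a pure function: when the target model cannot see images, each image is replaced by a text description produced by a separate vision model. The function signature and message field names here are hypothetical, not the repository's `vision_helpers` API.

```python
def bridge_images(messages, model_is_vision, describe):
    """Replace image attachments with text descriptions when the target
    model cannot process images. `describe` stands in for a call to a
    dedicated vision model (hypothetical interface)."""
    if model_is_vision:
        return messages  # vision-capable model: pass images through untouched
    out = []
    for msg in messages:
        images = msg.get("images") or []
        if not images:
            out.append(msg)
            continue
        # Summarize every attached image and fold the text into the content.
        notes = "; ".join(describe(img) for img in images)
        stripped = {k: v for k, v in msg.items() if k != "images"}
        stripped["content"] = msg["content"] + f"\n[Image context: {notes}]"
        out.append(stripped)
    return out
```

The key property is that the primary LLM never receives raw image data it cannot handle; it only ever sees text, so the rest of the agent pipeline stays model-agnostic.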
Ollama-Mobile-Bridge/
├── main.py # FastAPI app, lifespan, exception handlers
├── auth.py # X-API-Key authentication middleware
├── config.py # Dynamic config, scraping tables, model detection
├── models/
│ └── api_models.py # Pydantic models (ChatRequest, MemoryUpdateRequest, etc.)
├── routes/
│ ├── chat.py # Standard (non-streaming) chat endpoint
│ ├── chat_stream.py # SSE streaming endpoint + /chat/stop
│ ├── chat_debug.py # Debug endpoints (dev only)
│ ├── memory.py # POST /memory/update - LLM-synthesized memory
│ └── models_route.py # GET /list - available models
├── services/
│ ├── langchain_agent.py # Agent orchestration, middleware, vision bridge
│ ├── tools.py # Tool definitions (web_search, extract_page, recall, weather, save_pref)
│ ├── search.py # Tavily search, Reddit RSS, Wikipedia, scraping pipeline
│ └── weather.py # OpenWeatherMap One Call 3.0 integration
├── utils/
│ ├── streaming.py # StreamingCallbackHandler, ThinkingTokenProcessor
│ ├── response_cleaner.py # Token sanitization, JSON/fence filtering
│ ├── vision_helpers.py # Vision bridge, image-to-text translation
│ ├── image_processor.py # Base64 image resize/compression (Pillow)
│ ├── model_registry.py # Per-device model metadata from Ollama API
│ ├── task_manager.py # Active task tracking, cancellation
│ ├── cache.py # Redis + SQLite cache with similarity matching
│ ├── text_similarity.py # Jaccard, Cosine, SimHash algorithms
│ ├── file_parser.py # PDF and text file extraction
│ ├── html_parser.py # HTML content extraction
│ ├── http_client.py # httpx connection pool management
│ ├── prompts.py # System prompt templates
│ ├── errors.py # Exception hierarchy and error codes
│ ├── context.py # ToolContext runtime helpers
│ └── auth_helpers.py # API key chain resolution
├── tests/
│ ├── unit/ # Unit tests (auth, cache, streaming, vision, etc.)
│ └── integration/ # API route and usage limit tests
├── Dockerfile # Production Docker image
└── requirements.txt # Python dependencies
- Python 3.12+
- Ollama installed (for local models)
- Ollama Cloud account (optional, for cloud models)
Optional API keys for extended functionality:
- Tavily API -- web search (Google, News, Wikipedia)
- OpenWeatherMap API -- weather data (One Call 3.0)
- Jina Reader API -- JavaScript-rendered page fallback
- Redis -- distributed instant cache
# Clone the repository
git clone https://github.com/atpritam/Ollama-Mobile-Bridge.git
cd Ollama-Mobile-Bridge
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Configure environment
cp .env.example .env

Edit `.env` with your API keys:
API_KEY=your_app_api_key # Required: authenticates client requests
TAVILY_KEY=tvly-dev-... # Optional: enables web search
OPENWEATHER_API_KEY=your_key # Optional: enables weather tool
OLLAMA_API_KEY=your_ollama_key # Optional: enables Ollama Cloud models
JINA_READER_API_KEY=your_key # Optional: JS-rendered content fallback
REDIS_URL=redis://localhost:6379 # Optional: distributed instant cache

Option A: With local Ollama
# Start Ollama
ollama serve
# Pull a model
ollama pull llama3.2:3b
# Start the bridge
python main.py

Option B: With Ollama Cloud API
Set OLLAMA_API_KEY in your .env file. The bridge automatically discovers and routes to cloud models.
python main.py

The server starts on http://0.0.0.0:8000 by default.
Local network:
hostname -I # Find your LAN IP
# Connect from your device: http://<YOUR_IP>:8000

Public URL (ngrok):
ngrok http 8000
# Use the generated https://...ngrok-free.app URL

Run the test suite:

pytest

All endpoints require the `X-API-Key` header.
Returns all available Ollama models (local and cloud). Requires X-API-Key and X-Device-Id headers.
curl -H "X-API-Key: your_key" -H "X-Device-Id: device-001" http://localhost:8000/list

Primary endpoint. Streams responses via SSE with real-time status updates, tool execution events, and token-by-token output.
{
"model": "llama3.2:3b",
"prompt": "What is the latest news on the Artemis program?",
"history": [
{ "role": "user", "content": "Tell me about NASA." },
{ "role": "assistant", "content": "NASA is the US space agency..." }
],
"images": ["base64_encoded_image_data"],
"user_memory": "I am interested in space exploration.",
"web_access": true,
"default_vision_model": "llama3.2-vision:11b",
"request_id": "optional-uuid"
}

SSE Event Types:
| Event | Description |
|---|---|
| `event: status` | Pipeline stage updates (initializing, thinking, searching, generating, etc.) |
| `event: token` | Streamed response tokens |
| `event: thought` | Reasoning/thinking tokens |
| `event: done` | Final metadata (model, tools used, sources, search IDs) |
| `event: error` | Structured error with code |
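A client consuming these events only needs a minimal SSE line parser. The sketch below operates on any iterable of lines (in practice, e.g. the lines of a streaming HTTP response); it is a simplified reading of the SSE wire format, not the project's client code.

```python
def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE lines."""
    event, data = "message", []
    for line in lines:
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":
            # A blank line terminates one event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []

# Example stream shaped like the event table above.
stream = [
    "event: status", 'data: {"stage": "searching"}', "",
    "event: token", "data: Hello", "",
    "event: done", 'data: {"model": "llama3.2:3b"}', "",
]
for ev, payload in parse_sse(stream):
    print(ev, payload)
```

A real client would dispatch on the event name: append `token` data to the visible reply, show `thought` data in a collapsible reasoning view, and read sources and usage from the final `done` payload.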
Non-streaming variant. Returns the complete response as JSON.
{
"model": "llama3.2:3b",
"prompt": "Recommend a sci-fi book.",
"user_memory": "I have already read Dune and The Expanse series."
}

The /chat and /chat/stream endpoints support multipart/form-data for uploading files.
POST /chat
Content-Type: multipart/form-data; boundary=WebAppBoundary
X-API-Key: your_key
X-Device-Id: device-001
--WebAppBoundary
Content-Disposition: form-data; name="model"
llama3.2:3b
--WebAppBoundary
Content-Disposition: form-data; name="prompt"
What is in the attached file and image?
--WebAppBoundary
Content-Disposition: form-data; name="files"; filename="test_sample.txt"
Content-Type: text/plain
<content of text file>
--WebAppBoundary
Content-Disposition: form-data; name="images"; filename="photo.jpg"
Content-Type: image/jpeg
<binary image data>
--WebAppBoundary--

Note: For images, you can either upload them as files (field name `images`) or send them as base64 strings in the JSON payload or as form fields.
Cancel an in-progress streaming generation.
{
"request_id": "uuid-from-stream-request"
}

Synthesize an updated user profile from recent conversations using the LLM.
{
"model": "llama3.2:3b",
"recent_conversations": [
{
"title": "Trip planning",
"messages": [
{ "role": "user", "content": "I'm planning a trip to Japan." },
{ "role": "assistant", "content": "Great choice! When are you going?" }
]
}
],
"current_memory": "I am interested in travel."
}

Response:
{
"updated_memory": "I am interested in travel, specifically planning a trip to Japan.",
"model": "llama3.2:3b"
}

{
"full_response": "The Artemis II mission is scheduled for...",
"model": "llama3.2:3b",
"context_messages_count": 3,
"search_performed": true,
"search_ids": [42],
"request_id": "abc-123-def-456",
"conversation_title": "Artemis II Mission Update",
"tool_metadata": {
"total_tools_used": 1,
"tool_results": [
{
"search_id": 42,
"search_query": "Artemis program latest news 2026",
"search_type": "news",
"sources": [
{
"url": "https://www.nasa.gov/artemis/",
"favicon": "https://www.nasa.gov/favicon.ico"
},
{
"url": "https://spacenews.com/artemis-ii-update/",
"favicon": "https://spacenews.com/favicon.ico"
},
{
"url": "https://techcrunch.com/artemis/",
"favicon": null
}
],
"images": [
"https://www.nasa.gov/images/artemis2-crew.jpg",
"https://spacenews.com/img/sls-launch.jpg"
]
}
]
}
}

Notes:
- `conversation_title` -- only present on the first message of a conversation (LLM-generated title)
- `search_performed` -- `true` when any web tool was used (search, weather, extract, etc.)
- `search_ids` -- list of cache IDs for the searches performed; used for follow-up context on future requests
- `tool_metadata.tool_results[].sources` -- `[{url, favicon}]` array
- `tool_metadata.tool_results[].images` -- image URLs returned by Tavily for the search query
- `user_memory` and `attachments` fields appear only when relevant to the request
docker build -t ollama-mobile-bridge .
docker run -d \
--env-file .env \
-p 8000:8000 \
ollama-mobile-bridge

The Docker image uses Python 3.13-slim, runs as a non-root user, and pre-downloads the WordNet dataset at build time.
Note: If using local Ollama models with Docker, ensure the container can reach your Ollama instance (e.g., `--network host` on Linux, or set `OLLAMA_LOCAL_URL` to your host IP).
A typical request flows through these stages:
- Authentication -- The `APIKeyMiddleware` validates the `X-API-Key` header.
- Request Parsing -- Pydantic models validate the payload. File attachments (if any) are parsed.
- Model Resolution -- The `model_registry` looks up the model's source (local/cloud), context window, parameter size, and capabilities (vision, native tool support).
- Vision Bridge -- If images are present and the primary model is not vision-capable, a vision model translates images to structured text descriptions.
- Agent Initialization -- A LangChain agent is created with model-appropriate prompts, registered tools, and custom middleware (ReAct parsing, context overflow retry, call limits).
- Streaming Execution -- The agent runs asynchronously. A `StreamingCallbackHandler` captures tokens, thinking blocks, tool events, and errors into a queue consumed by the SSE event loop.
- Tool Execution -- When the agent invokes a tool (web search, weather, recall, memory update), the system performs the operation, caches results, and returns observations to the agent for synthesis.
- Response Finalization -- The streamed response is sanitized (removing leaked JSON/tool artifacts), and a `done` event with full metadata is emitted.
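The thinking-stream extraction in the streaming stage (the `StreamingCallbackHandler` feeding the `ThinkingTokenProcessor`) can be approximated by a small state machine that routes text between `<think>...</think>` tags to a separate channel, carrying a partial-tag tail across chunk boundaries. The `<think>` tag format is an assumption (common among reasoning models), and this is a simplified sketch, not the repository's implementation.

```python
class ThinkingSplitter:
    """Route streamed text to a 'thought' or 'token' channel based on
    <think>...</think> tags, tolerating tags split across chunks."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.buf = ""
        self.thinking = False

    def feed(self, chunk):
        """Consume one chunk; return a list of (channel, text) events."""
        self.buf += chunk
        events = []
        while True:
            tag = self.CLOSE if self.thinking else self.OPEN
            idx = self.buf.find(tag)
            channel = "thought" if self.thinking else "token"
            if idx == -1:
                # Emit everything except a tail that could start a split tag.
                safe = len(self.buf) - (len(tag) - 1)
                if safe > 0:
                    events.append((channel, self.buf[:safe]))
                    self.buf = self.buf[safe:]
                return [e for e in events if e[1]]
            events.append((channel, self.buf[:idx]))
            self.buf = self.buf[idx + len(tag):]
            self.thinking = not self.thinking

    def flush(self):
        """Emit whatever remains at end of stream."""
        channel = "thought" if self.thinking else "token"
        out, self.buf = [(channel, self.buf)], ""
        return [e for e in out if e[1]]
```

In the SSE pipeline, `thought` events would be emitted as `event: thought` and `token` events as `event: token`, which keeps reasoning visible to the client without polluting the final response.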
Contributions are welcome. Please open an issue to discuss what you would like to change before submitting a pull request.
For questions, issues, or feature requests, please open an issue.