
Ollama Mobile Bridge

Agentic AI backend that turns local and cloud LLMs into tool-using, web-aware personalized assistants



An Agentic AI backend that orchestrates local and cloud LLMs through LangChain agents with real-time web access, multi-modal vision, autonomous tool use, and adaptive context management. The system is fully model-agnostic -- it works with any Ollama-compatible model and automatically adapts its behavior (search depth, observation budgets, tool strategy) to the model's size and capabilities.

Web Client Demo: ollama-web-gold.vercel.app

Features

Agentic Tool Orchestration -- A LangChain agent (create_agent) with custom middleware handles autonomous tool selection and multi-step reasoning. The agent can chain tools sequentially (e.g., search the web, then check the weather for a location found in search results). A custom ReActParsingMiddleware enables tool calling even for models that lack native function-calling support by detecting pseudo-JSON tool calls in the LLM's text output and converting them to structured tool invocations.

Vision and Multi-Modal Support -- When images are detected in a request, the system automatically routes to a vision-capable model (e.g., Qwen3-VL, Llama 3.2 Vision) via a "vision bridge." If the primary model is not vision-capable, images are translated to structured text descriptions using a dedicated vision model, and those descriptions are injected back into the conversation. Supports base64 image processing, resizing/compression, and multi-image analysis across conversation history.
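The resize/compression step might look like this minimal Pillow sketch (hypothetical function name; the real image_processor.py handles more formats and error cases):

```python
import base64
import io

from PIL import Image

def shrink_b64_image(b64: str, max_side: int = 1024, quality: int = 85) -> str:
    """Decode a base64 image, cap its longest side, and re-encode as JPEG."""
    img = Image.open(io.BytesIO(base64.b64decode(b64))).convert("RGB")
    img.thumbnail((max_side, max_side))  # in-place; preserves aspect ratio, never upscales
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return base64.b64encode(buf.getvalue()).decode("ascii")
```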

RAG Pipeline with Smart Caching -- The search pipeline performs multi-step retrieval: intent detection, query generation, external search (Tavily for Google/News/Wikipedia, Reddit RSS, OpenWeatherMap), and result synthesis. A similarity-aware cache uses Jaccard/Cosine similarity, SimHash, and WordNet synonyms to avoid redundant searches.

Dynamic Context Scaling -- The backend inspects each model's parameter count and context window to dynamically adjust scraping depth, observation budget, and search result volume. Smaller models receive shorter, more focused content; larger models get deeper scrapes and larger context windows. If a prompt overflows the context, the middleware automatically retries with progressively reduced observation limits.
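Conceptually, the scaling works like a lookup table keyed on parameter count (illustrative numbers only -- the real thresholds and budgets live in config.py):

```python
# Hypothetical scaling table: each row is
# (max parameter count in billions, chars kept per scraped page, search results fetched).
SCRAPE_BUDGETS = [
    (4, 2_000, 3),               # small models: short, focused observations
    (14, 6_000, 5),              # mid-size models
    (float("inf"), 12_000, 8),   # large models: deeper scrapes, more results
]

def budgets_for(param_billions: float) -> tuple[int, int]:
    """Pick (chars_per_page, result_count) for a model of the given size."""
    for max_b, chars, results in SCRAPE_BUDGETS:
        if param_billions <= max_b:
            return chars, results
    raise AssertionError("unreachable: table ends with inf")
```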

Streaming with Real-Time Status -- Responses are delivered via Server-Sent Events (SSE) with granular status updates throughout the pipeline. A StreamingTokenSanitizer filters raw JSON tool-call artifacts and code fences from the token stream before they reach the client.

User Memory -- A dedicated /memory/update endpoint accepts recent conversation history and uses the LLM itself to synthesize an updated, concise user profile (preferences, facts, context). The agent can also save user preferences mid-conversation via a save_user_preference tool. Memory is injected into the system prompt on subsequent requests.

Task Lifecycle Management -- Each streaming request is assigned a UUID (request_id). A singleton TaskManager tracks active asyncio tasks, enabling the /chat/stop endpoint to cancel in-progress generations. The system also detects client disconnects and cleans up resources automatically.
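A minimal sketch of the pattern (simplified from the actual TaskManager, which also handles disconnect detection):

```python
import asyncio

class TaskManager:
    """Track running generations by request_id so /chat/stop can cancel them."""

    def __init__(self) -> None:
        self._tasks: dict[str, asyncio.Task] = {}

    def register(self, request_id: str, task: asyncio.Task) -> None:
        self._tasks[request_id] = task
        # Self-cleaning: remove the entry when the task finishes for any reason.
        task.add_done_callback(lambda _: self._tasks.pop(request_id, None))

    def cancel(self, request_id: str) -> bool:
        task = self._tasks.get(request_id)
        if task is not None and not task.done():
            task.cancel()
            return True
        return False
```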

File Attachment Processing -- Supports image, PDF, text, and code file uploads through both JSON and multipart/form-data requests.

Reasoning / Thinking Stream -- A ThinkingTokenProcessor state machine extracts Reasoning blocks from model output in real time. Thinking tokens are streamed as a separate SSE event type (event: thought), keeping internal reasoning visible to the client without polluting the final response. This provides first-class support for reasoning models.
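Assuming the model wraps its reasoning in `<think>` tags (a common convention for reasoning models; the actual tag handling may differ), the extraction can be sketched in non-streaming form as:

```python
def split_thinking(text: str, open_tag: str = "<think>",
                   close_tag: str = "</think>") -> tuple[str, str]:
    """Separate reasoning blocks from the visible answer."""
    thoughts, answer, pos = [], [], 0
    while True:
        start = text.find(open_tag, pos)
        if start == -1:
            answer.append(text[pos:])
            break
        answer.append(text[pos:start])
        end = text.find(close_tag, start)
        if end == -1:  # unterminated block: treat the rest as reasoning
            thoughts.append(text[start + len(open_tag):])
            break
        thoughts.append(text[start + len(open_tag):end])
        pos = end + len(close_tag)
    return "".join(thoughts), "".join(answer)
```

The streaming version is a state machine over token chunks rather than whole strings, emitting the first element as `event: thought` and the second as `event: token`.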

Centralized Error Handling -- Custom exception hierarchy maps errors to structured JSON responses with specific error codes both in REST responses and within the SSE stream.

Tech Stack

Layer Technologies
Framework FastAPI, LangChain, Uvicorn, Pydantic v2
LLM Providers Ollama (local), Ollama Cloud API
Vision Pillow (image processing), multi-model vision bridge
Search and Data Tavily Search API (Google/News/Wikipedia), Reddit RSS, OpenWeatherMap One Call 3.0
Web Scraping httpx (async, HTTP/2, Brotli), BeautifulSoup4, lxml, Jina Reader API (JS fallback)
Caching Redis (primary) + SQLite (disk), Jaccard/Cosine/SimHash similarity
NLP Open English WordNet (synonyms), Jellyfish, Levenshtein
Streaming Server-Sent Events (SSE), asyncio queues
Auth API key middleware
Testing pytest (unit + integration)
Deployment Docker

Architecture

sequenceDiagram
    participant C as Client
    participant API as FastAPI
    participant AG as Agent
    participant VLLM as Vision LLM
    participant LLM as Primary LLM
    participant T as Tools
    participant Cache as Smart Cache

    C->>API: POST /chat/stream
    API-->>C: event: status (initializing)
    API->>AG: Build agent with model + history + tools

    opt Images in payload and primary model is not vision-capable
        AG-->>C: event: status (analyzing images...)
        AG->>VLLM: Images + prompt context
        VLLM-->>AG: JSON — image descriptions + observations
        Note over AG: Strips images, injects<br/>text descriptions into prompt
    end

    AG->>LLM: System prompt + history + user message

    opt Model lacks native tool support
        Note over AG,LLM: ReAct middleware injects tool<br/>instructions into system prompt.<br/>Parses JSON action blocks from<br/>text response at runtime.
    end

    LLM-->>AG: Tool call (native or ReAct-parsed JSON)
    AG-->>C: event: status (searching...)

    AG->>Cache: Lookup — exact match, then fuzzy similarity
    alt Cache HIT
        Cache-->>AG: Cached results (truncated to model context size)
    else Cache MISS
        Cache-->>AG: Miss
        AG->>T: Execute tool (web search / weather)
        T-->>AG: Raw results
        AG->>Cache: Store results with TTL
    end

    AG->>LLM: Observation + full context

    loop Streaming response
        LLM-->>AG: Token or thought chunk
        AG-->>C: event: thought (reasoning trace)
        AG-->>C: event: token (response text)
    end

    API-->>C: event: done (sources + usage metadata)

    Note over C,API: Asynchronous — triggered post-conversation

    C->>API: POST /memory/update (recent conversations)
    API->>LLM: Synthesize user profile from conversation history
    LLM-->>API: Updated memory string
    API-->>C: Updated memory

See more in diagrams/

Key Modules

Ollama-Mobile-Bridge/
├── main.py                        # FastAPI app, lifespan, exception handlers
├── auth.py                        # X-API-Key authentication middleware
├── config.py                      # Dynamic config, scraping tables, model detection
├── models/
│   └── api_models.py              # Pydantic models (ChatRequest, MemoryUpdateRequest, etc.)
├── routes/
│   ├── chat.py                    # Standard (non-streaming) chat endpoint
│   ├── chat_stream.py             # SSE streaming endpoint + /chat/stop
│   ├── chat_debug.py              # Debug endpoints (dev only)
│   ├── memory.py                  # POST /memory/update - LLM-synthesized memory
│   └── models_route.py            # GET /list - available models
├── services/
│   ├── langchain_agent.py         # Agent orchestration, middleware, vision bridge
│   ├── tools.py                   # Tool definitions (web_search, extract_page, recall, weather, save_pref)
│   ├── search.py                  # Tavily search, Reddit RSS, Wikipedia, scraping pipeline
│   └── weather.py                 # OpenWeatherMap One Call 3.0 integration
├── utils/
│   ├── streaming.py               # StreamingCallbackHandler, ThinkingTokenProcessor
│   ├── response_cleaner.py        # Token sanitization, JSON/fence filtering
│   ├── vision_helpers.py          # Vision bridge, image-to-text translation
│   ├── image_processor.py         # Base64 image resize/compression (Pillow)
│   ├── model_registry.py          # Per-device model metadata from Ollama API
│   ├── task_manager.py            # Active task tracking, cancellation
│   ├── cache.py                   # Redis + SQLite cache with similarity matching
│   ├── text_similarity.py         # Jaccard, Cosine, SimHash algorithms
│   ├── file_parser.py             # PDF and text file extraction
│   ├── html_parser.py             # HTML content extraction
│   ├── http_client.py             # httpx connection pool management
│   ├── prompts.py                 # System prompt templates
│   ├── errors.py                  # Exception hierarchy and error codes
│   ├── context.py                 # ToolContext runtime helpers
│   └── auth_helpers.py            # API key chain resolution
├── tests/
│   ├── unit/                      # Unit tests (auth, cache, streaming, vision, etc.)
│   └── integration/               # API route and usage limit tests
├── Dockerfile                     # Production Docker image
└── requirements.txt               # Python dependencies

Getting Started

Prerequisites

  • Python 3.12+
  • Ollama installed (for local models)
  • Ollama Cloud account (optional, for cloud models)

Optional API keys (Tavily, OpenWeatherMap, Ollama Cloud, Jina Reader) unlock extended functionality -- see the .env configuration under Installation.

Installation

# Clone the repository
git clone https://github.com/atpritam/Ollama-Mobile-Bridge.git
cd Ollama-Mobile-Bridge

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env

Edit .env with your API keys:

API_KEY=your_app_api_key             # Required: authenticates client requests
TAVILY_KEY=tvly-dev-...              # Optional: enables web search
OPENWEATHER_API_KEY=your_key         # Optional: enables weather tool
OLLAMA_API_KEY=your_ollama_key       # Optional: enables Ollama Cloud models
JINA_READER_API_KEY=your_key         # Optional: JS-rendered content fallback
REDIS_URL=redis://localhost:6379     # Optional: distributed instant cache

Running Locally

Option A: With local Ollama

# Start Ollama
ollama serve

# Pull a model
ollama pull llama3.2:3b

# Start the bridge
python main.py

Option B: With Ollama Cloud API

Set OLLAMA_API_KEY in your .env file. The bridge automatically discovers and routes to cloud models.

python main.py

The server starts on http://0.0.0.0:8000 by default.

Connecting from Other Devices

Local network:

hostname -I   # Find your LAN IP
# Connect from your device: http://<YOUR_IP>:8000

Public URL (ngrok):

ngrok http 8000
# Use the generated https://...ngrok-free.app URL

Running Tests

pytest

API Reference

All endpoints require the X-API-Key header.

GET /list

Returns all available Ollama models (local and cloud). Requires X-API-Key and X-Device-Id headers.

curl -H "X-API-Key: your_key" -H "X-Device-Id: device-001" http://localhost:8000/list

POST /chat/stream

Primary endpoint. Streams responses via SSE with real-time status updates, tool execution events, and token-by-token output.

{
  "model": "llama3.2:3b",
  "prompt": "What is the latest news on the Artemis program?",
  "history": [
    { "role": "user", "content": "Tell me about NASA." },
    { "role": "assistant", "content": "NASA is the US space agency..." }
  ],
  "images": ["base64_encoded_image_data"],
  "user_memory": "I am interested in space exploration.",
  "web_access": true,
  "default_vision_model": "llama3.2-vision:11b",
  "request_id": "optional-uuid"
}

SSE Event Types:

Event Description
event: status Pipeline stage updates (initializing, thinking, searching, generating, etc.)
event: token Streamed response tokens
event: thought Reasoning/thinking tokens
event: done Final metadata (model, tools used, sources, search IDs)
event: error Structured error with code
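A client can split the stream into (event, data) pairs with a few lines of Python (a simplified parser -- the SSE spec also allows `id:`/`retry:` fields and comment lines, which this sketch ignores):

```python
def parse_sse(raw: str) -> list[tuple[str, str]]:
    """Parse a raw SSE body into (event, data) pairs."""
    events = []
    for frame in raw.strip().split("\n\n"):  # blank line separates frames
        event, data = "message", []
        for line in frame.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data.append(line[len("data:"):].strip())
        events.append((event, "\n".join(data)))
    return events
```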

POST /chat

Non-streaming variant. Returns the complete response as JSON.

JSON Request

{
  "model": "llama3.2:3b",
  "prompt": "Recommend a sci-fi book.",
  "user_memory": "I have already read Dune and The Expanse series."
}

Attachments & Multi-part Requests

The /chat and /chat/stream endpoints support multipart/form-data for uploading files.

POST /chat
Content-Type: multipart/form-data; boundary=WebAppBoundary
X-API-Key: your_key
X-Device-Id: device-001

--WebAppBoundary
Content-Disposition: form-data; name="model"

llama3.2:3b
--WebAppBoundary
Content-Disposition: form-data; name="prompt"

What is in the attached file and image?
--WebAppBoundary
Content-Disposition: form-data; name="files"; filename="test_sample.txt"
Content-Type: text/plain

<content of text file>
--WebAppBoundary
Content-Disposition: form-data; name="images"; filename="photo.jpg"
Content-Type: image/jpeg

<binary image data>
--WebAppBoundary--

Note: Images can be uploaded as files (field name images), sent as base64 strings in the JSON payload, or passed as base64 form fields.

POST /chat/stop

Cancel an in-progress streaming generation.

{
  "request_id": "uuid-from-stream-request"
}

POST /memory/update

Synthesize an updated user profile from recent conversations using the LLM.

{
  "model": "llama3.2:3b",
  "recent_conversations": [
    {
      "title": "Trip planning",
      "messages": [
        { "role": "user", "content": "I'm planning a trip to Japan." },
        { "role": "assistant", "content": "Great choice! When are you going?" }
      ]
    }
  ],
  "current_memory": "I am interested in travel."
}

Response:

{
  "updated_memory": "I am interested in travel, specifically planning a trip to Japan.",
  "model": "llama3.2:3b"
}

Example done Event Metadata

{
  "full_response": "The Artemis II mission is scheduled for...",
  "model": "llama3.2:3b",
  "context_messages_count": 3,
  "search_performed": true,
  "search_ids": [42],
  "request_id": "abc-123-def-456",
  "conversation_title": "Artemis II Mission Update",
  "tool_metadata": {
    "total_tools_used": 1,
    "tool_results": [
      {
        "search_id": 42,
        "search_query": "Artemis program latest news 2026",
        "search_type": "news",
        "sources": [
          {
            "url": "https://www.nasa.gov/artemis/",
            "favicon": "https://www.nasa.gov/favicon.ico"
          },
          {
            "url": "https://spacenews.com/artemis-ii-update/",
            "favicon": "https://spacenews.com/favicon.ico"
          },
          {
            "url": "https://techcrunch.com/artemis/",
            "favicon": null
          }
        ],
        "images": [
          "https://www.nasa.gov/images/artemis2-crew.jpg",
          "https://spacenews.com/img/sls-launch.jpg"
        ]
      }
    ]
  }
}

Notes:

  • conversation_title — present only on the first message of a conversation (LLM-generated title)
  • search_performed — true when any web tool was used (search, weather, extract, etc.)
  • search_ids — list of cache IDs for the searches performed; used for follow-up context on future requests
  • tool_metadata.tool_results[].sources — array of {url, favicon} objects; favicon may be null
  • tool_metadata.tool_results[].images — image URLs returned by Tavily for the search query
  • The user_memory and attachment fields appear only when relevant to the request

Deployment

Docker

docker build -t ollama-mobile-bridge .
docker run -d \
  --env-file .env \
  -p 8000:8000 \
  ollama-mobile-bridge

The Docker image uses Python 3.13-slim, runs as a non-root user, and pre-downloads the WordNet dataset at build time.

Note: If using local Ollama models with Docker, ensure the container can reach your Ollama instance (e.g., --network host on Linux, or set OLLAMA_LOCAL_URL to your host IP).

How It Works

A typical request flows through these stages:

  1. Authentication -- The APIKeyMiddleware validates the X-API-Key header.
  2. Request Parsing -- Pydantic models validate the payload. File attachments (if any) are parsed.
  3. Model Resolution -- The model_registry looks up the model's source (local/cloud), context window, parameter size, and capabilities (vision, native tool support).
  4. Vision Bridge -- If images are present and the primary model is not vision-capable, a vision model translates images to structured text descriptions.
  5. Agent Initialization -- A LangChain agent is created with model-appropriate prompts, registered tools, and custom middleware (ReAct parsing, context overflow retry, call limits).
  6. Streaming Execution -- The agent runs asynchronously. A StreamingCallbackHandler captures tokens, thinking blocks, tool events, and errors into a queue consumed by the SSE event loop.
  7. Tool Execution -- When the agent invokes a tool (web search, weather, recall, memory update), the system performs the operation, caches results, and returns observations to the agent for synthesis.
  8. Response Finalization -- The streamed response is sanitized (removing leaked JSON/tool artifacts), and a done event with full metadata is emitted.

Contributing

Contributions are welcome. Please open an issue to discuss what you would like to change before submitting a pull request.

Support

For questions, issues, or feature requests, please open an issue.
