add text output format (#20)

Adityav369 · web-flow · commit ad9fa7361b6e · 2025-12-03T10:56:35.000-08:00
diff --git a/concepts/colpali.mdx b/concepts/colpali.mdx
@@ -68,6 +68,39 @@ db.query("At what time-step did we see the highest GDP growth rate?", use_colpal
 
 So instead of having to implement the ColPali pipeline from scratch, you can use Morphik to do it for you in a single line of code!
 
+## Controlling Output Format
+
+When retrieving ColPali chunks (which are page images), you can control how the images are returned using the `output_format` parameter:
+
+```python
+# Return as base64-encoded data (default)
+chunks = db.retrieve_chunks("quarterly results", use_colpali=True)
+
+# Return as presigned URLs (useful for web UIs)
+chunks = db.retrieve_chunks("quarterly results", use_colpali=True, output_format="url")
+
+# Convert images to markdown text via OCR
+chunks = db.retrieve_chunks("quarterly results", use_colpali=True, output_format="text")
+```
+
+The three output formats are:
+- **`"base64"`** (default): Returns base64-encoded image data
+- **`"url"`**: Returns presigned HTTPS URLs, convenient for LLMs and UIs that accept remote image URLs
+- **`"text"`**: Converts page images to markdown text via OCR
+
+### Choosing Between Formats
+
+**base64 vs url**: Both formats pass page images to a vision-capable LLM and produce similar inference results. With `url`, only a short presigned URL is sent to your application and the LLM fetches the image directly, so less data crosses the network. This can reduce response times, especially when a query returns multiple images.
+
+**When to use text**: Passing images to LLMs for inference can be slow and consume significant context tokens. Use `output_format="text"` when:
+- You need **faster inference**
+- Your documents are **primarily text-based** (reports, articles, contracts)
+- You're hitting **context length limits**
+
+<Note>
+If you're experiencing context limit issues with image-based retrieval, it may be because images aren't being passed correctly to the model. See [Generating Completions with Retrieved Chunks](/cookbooks/generating-completions-with-retrieved-chunks) for examples of properly passing images (both base64 and URLs) to vision-capable models like GPT-4o.
+</Note>
+
 
 
 
diff --git a/python-sdk/retrieve_chunks.mdx b/python-sdk/retrieve_chunks.mdx
@@ -45,7 +45,10 @@ description: "Retrieve relevant chunks from Morphik"
 - `use_colpali` (bool, optional): Whether to use ColPali-style embedding model to retrieve the chunks (only works for documents ingested with `use_colpali=True`). Defaults to True.
 - `folder_name` (str | List[str], optional): Optional folder scope. Accepts a single folder name or a list of folder names.
 - `padding` (int, optional): Number of additional chunks/pages to retrieve before and after matched chunks (ColPali only). Defaults to 0.
-- `output_format` (str, optional): Controls how image chunks are returned. Set to `"url"` to receive presigned URLs; omit or set to `"base64"` (default) to receive base64 content.
+- `output_format` (str, optional): Controls how image chunks are returned:
+  - `"base64"` (default): Returns base64-encoded image data
+  - `"url"`: Returns presigned HTTPS URLs
+  - `"text"`: Converts images to markdown text via OCR
 - `query_image` (str, optional): Base64-encoded image for reverse image search. Mutually exclusive with `query`. Requires `use_colpali=True`.
 
 ## Metadata Filters
@@ -135,13 +138,30 @@ The `FinalChunkResult` objects returned by this method have the following proper
 - `filename` (Optional[str]): Original filename
 - `download_url` (Optional[str]): URL to download full document 
 
-## Image URL output
+## Output Format Options
 
-- When `output_format="url"` is provided, image chunks are returned as presigned HTTPS URLs in `content`. This is convenient for UIs and LLMs that accept remote image URLs (e.g., via `image_url`).
-- When `output_format` is omitted or set to `"base64"` (default), image chunks are returned as base64 data (the SDK attempts to decode these into a `PIL.Image` for `FinalChunkResult.content`).
+- **`"base64"` (default)**: Image chunks are returned as base64 data (the SDK attempts to decode these into a `PIL.Image` for `FinalChunkResult.content`).
+- **`"url"`**: Image chunks are returned as presigned HTTPS URLs in `content`. This is convenient for UIs and LLMs that accept remote image URLs (e.g., via `image_url`).
+- **`"text"`**: Image chunks are converted to markdown text via OCR. Use this when you need faster inference or when documents are mostly text-based.
 - Text chunks are unaffected by `output_format` and are always returned as strings.
 - The `download_url` field may be populated for image chunks. When using `output_format="url"`, it will typically match `content` for those chunks.
 
+### When to Use Each Format
+
+| Format | Best For |
+|--------|----------|
+| `base64` | Direct image processing, local applications |
+| `url` | Web UIs, LLMs with vision capabilities (less data sent to your application) |
+| `text` | Faster inference, text-heavy documents, context length concerns |
+
+<Note>
+**base64 vs url**: Both formats pass page images to a vision-capable LLM and produce similar results. With `url`, only a short presigned URL is sent to your application and the LLM fetches the image directly, so less data crosses the network. This can reduce response times, especially with multiple images.
+
+**When to use text**: Passing images to LLMs for inference can be slow and consume significant context tokens. Use `output_format="text"` when you need faster inference or when your documents are primarily text-based.
+
+If you're hitting context limits with images, it may be because they aren't being passed correctly to the model. See [Generating Completions with Retrieved Chunks](/cookbooks/generating-completions-with-retrieved-chunks) for examples of properly passing images (both base64 and URLs) to vision-capable models like GPT-4o.
+</Note>
+
 Tip: To download the original raw file for a document, use [`get_document_download_url`](./get_document_download_url).
 
 ## Reverse Image Search
diff --git a/python-sdk/retrieve_chunks_grouped.mdx b/python-sdk/retrieve_chunks_grouped.mdx
@@ -57,7 +57,10 @@ description: "Retrieve relevant chunks with grouping for UI display"
 - `folder_name` (str | List[str], optional): Optional folder scope (single name or list of names)
 - `end_user_id` (str, optional): Optional end-user scope
 - `padding` (int, optional): Number of additional chunks/pages to retrieve before and after matched chunks. Defaults to 0.
-- `output_format` (str, optional): Controls how image chunks are returned. Set to `"url"` for presigned URLs or `"base64"` (default) for base64 content.
+- `output_format` (str, optional): Controls how image chunks are returned:
+  - `"base64"` (default): Returns base64-encoded image data
+  - `"url"`: Returns presigned HTTPS URLs
+  - `"text"`: Converts images to markdown text via OCR (faster inference, best for text-heavy documents)
 - `graph_name` (str, optional): Name of the graph to use for knowledge graph-enhanced retrieval
 - `hop_depth` (int, optional): Number of relationship hops to traverse in the graph. Defaults to 1.
 - `include_paths` (bool, optional): Whether to include relationship paths in the response. Defaults to False.
diff --git a/self-hosting.mdx b/self-hosting.mdx
@@ -103,22 +103,21 @@ For users who need to run Morphik on their own infrastructure, we provide two in
             <Tab title="macOS">
               ```bash
               # Install via Homebrew
-              brew install poppler tesseract libmagic
+              brew install poppler libmagic
               ```
             </Tab>
             <Tab title="Ubuntu/Debian">
               ```bash
               # Install via apt
               sudo apt-get update
-              sudo apt-get install -y poppler-utils tesseract-ocr libmagic-dev
+              sudo apt-get install -y poppler-utils libmagic-dev
               ```
             </Tab>
             <Tab title="Windows">
               For Windows, you may need to install these dependencies manually:
 
               1. **Poppler**: Download from [poppler for Windows](https://github.com/oschwartz10612/poppler-windows/releases/)
-              2. **Tesseract**: Download the installer from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki)
-              3. **libmagic**: This is included in the python-magic-bin package which will be installed with pip
+              2. **libmagic**: This is included in the python-magic-bin package which will be installed with pip
             </Tab>
           </Tabs>
           If you encounter database initialization issues within Docker, you may need to manually initialize the schema: