
Commit 3336d0c: add text output format

1 parent e7524dd

File tree: 4 files changed, +64 −9 lines

concepts/colpali.mdx

Lines changed: 33 additions & 0 deletions
````diff
@@ -68,6 +68,39 @@ db.query("At what time-step did we see the highest GDP growth rate?", use_colpal
 
 So instead of having to implement the ColPali pipeline from scratch, you can use Morphik to do it for you in a single line of code!
 
+## Controlling Output Format
+
+When retrieving ColPali chunks (which are page images), you can control how the images are returned using the `output_format` parameter:
+
+```python
+# Return as base64-encoded data (default)
+chunks = db.retrieve_chunks("quarterly results", use_colpali=True)
+
+# Return as presigned URLs (useful for web UIs)
+chunks = db.retrieve_chunks("quarterly results", use_colpali=True, output_format="url")
+
+# Convert images to markdown text via OCR
+chunks = db.retrieve_chunks("quarterly results", use_colpali=True, output_format="text")
+```
+
+The three output formats are:
+- **`"base64"`** (default): Returns base64-encoded image data
+- **`"url"`**: Returns presigned HTTPS URLs, convenient for LLMs and UIs that accept remote image URLs
+- **`"text"`**: Converts page images to markdown text via OCR
+
+### Choosing Between Formats
+
+**base64 vs url**: Both formats pass images to LLMs for visual understanding and produce similar inference results. However, `url` is lighter on network transfer since only the URL is sent to your application (the LLM fetches the image directly). This can result in faster response times, especially when working with multiple images.
+
+**When to use text**: Passing images to LLMs for inference can be slow and consume significant context tokens. Use `output_format="text"` when:
+- You need **faster inference** speeds
+- Your documents are **primarily text-based** (reports, articles, contracts)
+- You're hitting **context length limits**
+
+<Note>
+If you're experiencing context limit issues with image-based retrieval, it may be because images aren't being passed correctly to the model. See [Generating Completions with Retrieved Chunks](/cookbooks/generating-completions-with-retrieved-chunks) for examples of properly passing images (both base64 and URLs) to vision-capable models like GPT-4o.
+</Note>
+
````
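The added docs explain that `base64` and `url` chunks both end up as images in front of the LLM. As an illustrative sketch (not part of this commit; `make_image_part` is a hypothetical helper), a chunk's `content` can be normalized into an OpenAI-style `image_url` message part regardless of which format was requested:

```python
import base64

def make_image_part(content: str) -> dict:
    """Wrap chunk content as an OpenAI-style image_url message part.

    Presigned HTTPS URLs are passed through; anything else is treated
    as base64 image data and wrapped in a data URI.
    """
    if content.startswith("https://") or content.startswith("http://"):
        url = content  # remote URL: the LLM fetches the image itself
    else:
        url = f"data:image/png;base64,{content}"  # base64 travels inline
    return {"type": "image_url", "image_url": {"url": url}}

# Fake base64 and presigned-URL chunks, for illustration only
b64_chunk = base64.b64encode(b"\x89PNG...").decode()
url_chunk = "https://storage.example.com/page-7.png?X-Amz-Signature=abc"

inline_part = make_image_part(b64_chunk)
remote_part = make_image_part(url_chunk)
```

Both parts can then be appended to a chat message's `content` list for a vision-capable model.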
python-sdk/retrieve_chunks.mdx

Lines changed: 24 additions & 4 deletions
```diff
@@ -45,7 +45,10 @@ description: "Retrieve relevant chunks from Morphik"
 - `use_colpali` (bool, optional): Whether to use ColPali-style embedding model to retrieve the chunks (only works for documents ingested with `use_colpali=True`). Defaults to True.
 - `folder_name` (str | List[str], optional): Optional folder scope. Accepts a single folder name or a list of folder names.
 - `padding` (int, optional): Number of additional chunks/pages to retrieve before and after matched chunks (ColPali only). Defaults to 0.
-- `output_format` (str, optional): Controls how image chunks are returned. Set to `"url"` to receive presigned URLs; omit or set to `"base64"` (default) to receive base64 content.
+- `output_format` (str, optional): Controls how image chunks are returned:
+  - `"base64"` (default): Returns base64-encoded image data
+  - `"url"`: Returns presigned HTTPS URLs
+  - `"text"`: Converts images to markdown text via OCR
 - `query_image` (str, optional): Base64-encoded image for reverse image search. Mutually exclusive with `query`. Requires `use_colpali=True`.
 
 ## Metadata Filters
@@ -135,13 +138,30 @@ The `FinalChunkResult` objects returned by this method have the following proper
 - `filename` (Optional[str]): Original filename
 - `download_url` (Optional[str]): URL to download full document
 
-## Image URL output
+## Output Format Options
 
-- When `output_format="url"` is provided, image chunks are returned as presigned HTTPS URLs in `content`. This is convenient for UIs and LLMs that accept remote image URLs (e.g., via `image_url`).
-- When `output_format` is omitted or set to `"base64"` (default), image chunks are returned as base64 data (the SDK attempts to decode these into a `PIL.Image` for `FinalChunkResult.content`).
+- **`"base64"` (default)**: Image chunks are returned as base64 data (the SDK attempts to decode these into a `PIL.Image` for `FinalChunkResult.content`).
+- **`"url"`**: Image chunks are returned as presigned HTTPS URLs in `content`. This is convenient for UIs and LLMs that accept remote image URLs (e.g., via `image_url`).
+- **`"text"`**: Image chunks are converted to markdown text via OCR. Use this when you need faster inference or when documents are mostly text-based.
 - Text chunks are unaffected by `output_format` and are always returned as strings.
 - The `download_url` field may be populated for image chunks. When using `output_format="url"`, it will typically match `content` for those chunks.
 
+### When to Use Each Format
+
+| Format | Best For |
+|--------|----------|
+| `base64` | Direct image processing, local applications |
+| `url` | Web UIs, LLMs with vision capabilities (lighter on network) |
+| `text` | Faster inference, text-heavy documents, context length concerns |
+
+<Note>
+**base64 vs url**: Both formats pass images to LLMs for visual understanding and produce similar results. However, `url` is lighter on network transfer since only the URL is sent to your application (the LLM fetches the image directly). This can result in faster response times, especially with multiple images.
+
+**When to use text**: Passing images to LLMs for inference can be slow and consume significant context tokens. Use `output_format="text"` when you need faster inference speeds or when your documents are primarily text-based.
+
+If you're hitting context limits with images, it may be because they aren't being passed correctly to the model. See [Generating Completions with Retrieved Chunks](/cookbooks/generating-completions-with-retrieved-chunks) for examples of properly passing images (both base64 and URLs) to vision-capable models like GPT-4o.
+</Note>
+
 Tip: To download the original raw file for a document, use [`get_document_download_url`](./get_document_download_url).
 
 ## Reverse Image Search
```
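The "lighter on network" claim in the table can be made concrete with a quick back-of-the-envelope sketch (illustrative numbers, not from this commit): base64 inflates image bytes by about a third, while a presigned URL stays a few hundred bytes no matter how large the page image is:

```python
import base64

page_image = bytes(300_000)  # pretend this is a 300 KB rendered page
b64_payload = base64.b64encode(page_image)
presigned_url = "https://storage.example.com/doc/page-3.png?X-Amz-Signature=..."

# base64 emits 4 characters per 3 input bytes, i.e. ~33% overhead
overhead = len(b64_payload) / len(page_image)
```

With ten such pages in a prompt, the base64 path ships ~4 MB through your application, while the URL path ships a few kilobytes and lets the LLM provider fetch the images directly.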

python-sdk/retrieve_chunks_grouped.mdx

Lines changed: 4 additions & 1 deletion
```diff
@@ -57,7 +57,10 @@ description: "Retrieve relevant chunks with grouping for UI display"
 - `folder_name` (str | List[str], optional): Optional folder scope (single name or list of names)
 - `end_user_id` (str, optional): Optional end-user scope
 - `padding` (int, optional): Number of additional chunks/pages to retrieve before and after matched chunks. Defaults to 0.
-- `output_format` (str, optional): Controls how image chunks are returned. Set to `"url"` for presigned URLs or `"base64"` (default) for base64 content.
+- `output_format` (str, optional): Controls how image chunks are returned:
+  - `"base64"` (default): Returns base64-encoded image data
+  - `"url"`: Returns presigned HTTPS URLs
+  - `"text"`: Converts images to markdown text via OCR (faster inference, best for text-heavy documents)
 - `graph_name` (str, optional): Name of the graph to use for knowledge graph-enhanced retrieval
 - `hop_depth` (int, optional): Number of relationship hops to traverse in the graph. Defaults to 1.
 - `include_paths` (bool, optional): Whether to include relationship paths in the response. Defaults to False.
```
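Since `output_format` now accepts three string values, a caller may want to fail fast on typos before hitting the API. A minimal client-side guard (a hypothetical helper, not part of the SDK):

```python
from typing import Optional

VALID_OUTPUT_FORMATS = {"base64", "url", "text"}

def check_output_format(value: Optional[str]) -> str:
    """Apply the documented default and reject unknown formats early."""
    if value is None:
        return "base64"  # documented default
    if value not in VALID_OUTPUT_FORMATS:
        raise ValueError(
            f"output_format must be one of {sorted(VALID_OUTPUT_FORMATS)}, got {value!r}"
        )
    return value
```

This keeps an invalid value like `"jpeg"` from surfacing only as a server-side error.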

self-hosting.mdx

Lines changed: 3 additions & 4 deletions
````diff
@@ -103,22 +103,21 @@ For users who need to run Morphik on their own infrastructure, we provide two in
 <Tab title="macOS">
 ```bash
 # Install via Homebrew
-brew install poppler tesseract libmagic
+brew install poppler libmagic
 ```
 </Tab>
 <Tab title="Ubuntu/Debian">
 ```bash
 # Install via apt
 sudo apt-get update
-sudo apt-get install -y poppler-utils tesseract-ocr libmagic-dev
+sudo apt-get install -y poppler-utils libmagic-dev
 ```
 </Tab>
 <Tab title="Windows">
 For Windows, you may need to install these dependencies manually:
 
 1. **Poppler**: Download from [poppler for Windows](https://github.com/oschwartz10612/poppler-windows/releases/)
-2. **Tesseract**: Download the installer from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki)
-3. **libmagic**: This is included in the python-magic-bin package which will be installed with pip
+2. **libmagic**: This is included in the python-magic-bin package which will be installed with pip
 </Tab>
 </Tabs>
 If you encounter database initialization issues within Docker, you may need to manually initialize the schema:
````
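This hunk drops Tesseract from the system dependencies, leaving poppler and libmagic. One way to sanity-check the remaining tools after install is to look for their binaries on `PATH` with `shutil.which` (a hypothetical helper, not part of Morphik; `pdftoppm` ships with poppler/poppler-utils, and the `file` command is a rough proxy for libmagic being present):

```python
import shutil

def missing_dependencies(tools=("pdftoppm", "file")) -> list:
    """Return the binaries from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_dependencies()
if missing:
    print(f"Missing system dependencies: {', '.join(missing)}")
```

An empty list suggests the system packages above were installed correctly.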
