
Commit 0da6839

Add core functions section
1 parent 421d414 commit 0da6839

18 files changed (+641, -338 lines)

api-reference/creating-apps.mdx

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
---
title: 'Creating Apps'
description: 'Provision isolated Morphik apps and generate connection URIs.'
---

Morphik apps are isolated data environments. Each app has its own documents, embeddings, and auth token, so data stays separated even when apps live on the same cluster. Think of an app as a separate Morphik instance with a shared control plane.

Common uses:
- Create one app per customer or tenant to keep data segregated.
- Split environments (prod, staging, sandbox) without running multiple clusters.
- Separate projects with different data retention or access policies.

## Create a new app (cloud)

**POST** `/cloud/generate_uri`

This endpoint creates an app and returns a Morphik URI that clients use to connect to it.

### Authentication

Provide a Bearer token in `Authorization: Bearer <JWT>`. Use an existing Morphik API token to create apps and mint new URIs programmatically.

### Request Body

<Properties>
  <Property name="app_id" type="string">
    Optional client-generated app id (recommended: UUID). If omitted, the server generates one.
  </Property>
  <Property name="name" type="string" required={true}>
    Human-friendly app name. Used in the Morphik URI.
  </Property>
  <Property name="expiry_days" type="integer">
    Days until the token expires (default: 3650).
  </Property>
</Properties>

### Example request

```bash
curl -X POST \
  https://api.morphik.ai/cloud/generate_uri \
  -H 'Authorization: Bearer YOUR_JWT_TOKEN' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "customer-acme"
  }'
```

### Response

<Properties>
  <Property name="uri" type="string">
    Connection URI in the format `morphik://name:token@host`.
  </Property>
  <Property name="app_id" type="string">
    The app id associated with the URI.
  </Property>
</Properties>

**Example response:**

```json
{
  "uri": "morphik://customer-acme:eyJhbGciOi...@api.morphik.ai",
  "app_id": "f5c5e51a-7a1b-4c8d-8d7e-3c5ed3c6c7b2"
}
```

### Notes

- The response always contains a newly minted token for the app.
- If `app_id` is omitted, the server generates one.
- `name` is required.
- App names must be unique per owner or org; duplicates return 409.
- If the account tier has reached its app limit, the API returns 403.
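Since the response documents the URI layout as `morphik://name:token@host`, the parts can be recovered with a standard URL parser when you need the raw token or host rather than the whole URI. A minimal illustrative sketch using Python's standard library (not part of the Morphik SDK, which accepts the URI directly):

```python
from urllib.parse import urlparse

def parse_morphik_uri(uri: str) -> dict:
    """Split a morphik://name:token@host URI into its components."""
    parsed = urlparse(uri)
    return {
        "name": parsed.username,   # app name
        "token": parsed.password,  # newly minted token for the app
        "host": parsed.hostname,   # API host to connect to
    }

parts = parse_morphik_uri("morphik://customer-acme:eyJhbGciOi...@api.morphik.ai")
print(parts["name"], parts["host"])  # customer-acme api.morphik.ai
```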

api-reference/management-api.mdx

Lines changed: 0 additions & 80 deletions
This file was deleted.

concepts/colpali.mdx

Lines changed: 4 additions & 4 deletions
@@ -5,7 +5,7 @@ description: 'Using Late-interaction and Contrastive learning to achieve state-o

## Introduction

- Upto now, we've seen RAG techniques that **i)** parse a given document, **ii)** convert it to text, and **iii)** embed the text for retrieval. These techniques have been particualrly text-heavy. Embedding models expect text in, knowledge graphs expect text in, and prasers break down when provided with documents that aren't text-dominant. This motivates the question:
+ Up to now, we've seen RAG techniques that **i)** parse a given document, **ii)** convert it to text, and **iii)** embed the text for retrieval. These techniques have been particularly text-heavy. Embedding models expect text in, knowledge graphs expect text in, and parsers break down when provided with documents that aren't text-dominant. This motivates the question:

> When was the last time you looked at a document and only saw text?

@@ -17,7 +17,7 @@ In this guide, we'll explore a series of models, starting with *ColPali* that ar

## What is ColPali?

- The core idea behind ColPali is simple: the core bottleneck in retrieval is not the performance of the embedding model, but **prior data ingestion pipeline**. As a result, this new techniques proposes doing away with any data preprocessing - embedding the entire document as a list of images instead.
+ The core idea behind ColPali is simple: the core bottleneck in retrieval is not the performance of the embedding model, but **the prior data ingestion pipeline**. As a result, this new technique proposes doing away with any data preprocessing - embedding the entire document as a list of images instead.

![ColPali Architecture](/assets/colpali.png)

@@ -26,7 +26,7 @@ The diagram above shows the ColPali pipeline when compared with traditional layo

## How does it work?

### Embedding Process
- The embedding process for ColPali borrows heavily from models like CLIP. That is, the vision encoder part of the model (as seen in the diagram above) is trained via a technique called **Contrastive Learning**. As we've discussed in previous explainers, an encoder is a function (usually a neural network or a transformer) that maps a given input to a fixed-length vector. Contrastive learning is a technique that allows us to train two encoders of different input types (such as image and text) to produce vectors in the "same embedding space". That is, the embedding of the word "dog" would be very close the embedding of the image of a dog. The way we can achieve this is simple in theory:
+ The embedding process for ColPali borrows heavily from models like CLIP. That is, the vision encoder part of the model (as seen in the diagram above) is trained via a technique called **Contrastive Learning**. As we've discussed in previous explainers, an encoder is a function (usually a neural network or a transformer) that maps a given input to a fixed-length vector. Contrastive learning is a technique that allows us to train two encoders of different input types (such as image and text) to produce vectors in the "same embedding space". That is, the embedding of the word "dog" would be very close to the embedding of the image of a dog. The way we can achieve this is simple in theory:

1) Take a large dataset of image and text pairs.
2) Pass the image and text through the vision and text encoders respectively.

@@ -40,7 +40,7 @@ So, we have a system that, given an image, can provide a vector embedding that l

### Retrieval Process

- The retrieval process for ColPali borrows from late-interaction based reranking techniques such as [ColBERT](https://arxiv.org/abs/2004.12832). The idea is that instead of directly embedding an image or an entire block of text, we can embed individual patches or tokens instead. Then, instead of using the regular dot product or the cosine similarity, we can employ a slightly different scoring function. This scoring funciton looks at the most similar patches and tokens, and then sums those similarities up to obtain a final score.
+ The retrieval process for ColPali borrows from late-interaction based reranking techniques such as [ColBERT](https://arxiv.org/abs/2004.12832). The idea is that instead of directly embedding an image or an entire block of text, we can embed individual patches or tokens instead. Then, instead of using the regular dot product or the cosine similarity, we can employ a slightly different scoring function. This scoring function looks at the most similar patches and tokens, and then sums those similarities up to obtain a final score.

![ColBERT Architecture](/assets/colbert.png)
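The late-interaction scoring described in the new Retrieval Process text (for each query token, take the best-matching document patch, then sum those maxima) can be sketched in a few lines. This is an illustrative NumPy version, not ColPali's actual implementation, and it assumes both sets of vectors are already L2-normalized so dot products are cosine similarities:

```python
import numpy as np

def late_interaction_score(query_tokens: np.ndarray, doc_patches: np.ndarray) -> float:
    """MaxSim-style score: for each query token embedding, find its most
    similar document patch embedding, then sum those maxima."""
    sims = query_tokens @ doc_patches.T   # (n_tokens, n_patches) similarity matrix
    return float(sims.max(axis=1).sum())  # best patch per token, summed

# Toy example: two orthogonal unit query tokens, each matched by one patch.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
score = late_interaction_score(q, d)
```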

concepts/metadata-filtering.mdx

Lines changed: 1 addition & 1 deletion
@@ -196,4 +196,4 @@ response = scoped.list_documents(filters=filters, include_total_count=True)

- **“Metadata field … expects type …”** – The server couldn’t coerce the operand to the declared type. Ensure numbers/dates are valid JSON scalars or native Python types before serialization.
- **Range query returns nothing** – Confirm the target documents were ingested/updated with the corresponding `metadata_types`. Re-ingest or call `update_document_metadata` with the proper type hints if necessary.

- Still stuck? Share your filter payload and endpoint at `founders@morphik.ai` or on [Discord](https://discord.gg/H7RN3XdGu3).
+ Still stuck? Share your filter payload and endpoint at `founders@morphik.ai` or on [Discord](https://discord.com/invite/BwMtv3Zaju).

concepts/naive-rag.mdx

Lines changed: 3 additions & 3 deletions
@@ -15,12 +15,12 @@ it would be something like

> "seems like you're assembling chair CX-184. You may have skipped step 8 in the assembly process, since the rear leg is screwed backwards. Here is a step-by-step solution from the assembly guide: ...".

- Note how both answers recognized the issue correctly, but since the LLM had additional context in the second answer, it was also able to provide a solution and more specific details. That's the jist of RAG - LLMs provide **higher-quality responses** when provided with **more context** surrounding a query.
+ Note how both answers recognized the issue correctly, but since the LLM had additional context in the second answer, it was also able to provide a solution and more specific details. That's the gist of RAG - LLMs provide **higher-quality responses** when provided with **more context** surrounding a query.

While the core concept itself is quite obvious, the complexity arises in _how_ we can effectively retrieve the correct information. In the following sections, we explain one way to effectively perform RAG based on the concept of vector embeddings and similarity search (we'll explain what these mean\!).

<Note>
- In reality, Morphik uses a combination of different RAG techniques to achieve the best solution. We intend to talk about each of the techniques we implement in the [concepts](/concepts/) section of our documentation. If you're looking for a particular RAG technique, such as [ColPali](/concepts/colpali.mdx) or [Knowledge Graphs](/concepts/knowledge-graphs.mdx), you'll find it there. In this explainer, however, we'll restrict ourselves to talk about single vector-search based retrieval.
+ In reality, Morphik uses a combination of different RAG techniques to achieve the best solution. We intend to talk about each of the techniques we implement in the [concepts](/concepts/) section of our documentation. If you're looking for a particular RAG technique, such as [ColPali](/concepts/colpali) or [Knowledge Graphs](/concepts/knowledge-graphs), you'll find it there. In this explainer, however, we'll restrict ourselves to talking about single vector-search-based retrieval.
</Note>

## How does RAG work?

@@ -33,7 +33,7 @@ In order to help add context to a prompt, we first need that context to exist. T

**Chunking** involves breaking down documents into smaller, manageable pieces. While LLMs have context windows that can handle thousands of tokens, we want to retrieve only the most relevant information for a given query. Chunking strategies vary based on the content type - code documentation might be chunked by function or class, while textbooks might be chunked by section or paragraph. The ideal chunk size balances granularity (smaller chunks for precise retrieval) with context preservation (larger chunks for maintaining semantic meaning).

- **Embedding** transforms these text chunks into vector representations - essentially converting semantic meaning into mathematical space. This is done using embedding models that distill the essence of text into dense vectors. The [math and ML behind embeddings](https://www.3blue1brown.com/lessons/gpt#embedding) is really interesting. They have a [long history](https://en.wikipedia.org/wiki/Word_embedding) of development - with origins as old as 1957. Over time, models that produce word embeddings have gone through mulitple iterations - different domains, novel neural network architectures, as well as different training paradigms.
+ **Embedding** transforms these text chunks into vector representations - essentially converting semantic meaning into mathematical space. This is done using embedding models that distill the essence of text into dense vectors. The [math and ML behind embeddings](https://www.3blue1brown.com/lessons/gpt#embedding) is really interesting. They have a [long history](https://en.wikipedia.org/wiki/Word_embedding) of development - with origins as old as 1957. Over time, models that produce word embeddings have gone through multiple iterations - different domains, novel neural network architectures, as well as different training paradigms.

Here's a gif we made using [Manim](https://www.manim.community/) to explain word embeddings:
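The fixed-size variant of the chunking step described in the diff above can be sketched in a few lines. This is an illustrative sketch, not Morphik's ingestion code; sizes are in characters for simplicity, and the overlap carries context across chunk boundaries:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share
    `overlap` characters so meaning isn't cut mid-thought."""
    chunks = []
    step = size - overlap  # how far each chunk's start advances
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last chunk reached the end of the text
    return chunks
```

Real pipelines usually chunk on semantic boundaries (sections, paragraphs, functions), as the surrounding text notes; fixed-size-with-overlap is the simplest baseline.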

configuration.mdx

Lines changed: 1 addition & 1 deletion
@@ -82,5 +82,5 @@ When running Morphik in Docker:

## Need Help?

- 1. Join our [Discord community](https://discord.gg/BwMtv3Zaju)
+ 1. Join our [Discord community](https://discord.com/invite/BwMtv3Zaju)
2. Check [GitHub](https://github.com/morphik-org/morphik-core) for issues
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
---
title: "Batch Get Chunks"
description: "Retrieve specific chunks by document ID and chunk number"
---

Retrieve specific chunks by their document ID and chunk number in a single batch operation. Useful for fetching exact chunks after retrieval or for building custom pipelines.

<Tabs>
<Tab title="Python">
```python
from morphik import Morphik

db = Morphik("your-uri")

chunks = db.batch_get_chunks(
    sources=[
        {"document_id": "doc_abc123", "chunk_number": 0},
        {"document_id": "doc_abc123", "chunk_number": 1},
        {"document_id": "doc_xyz789", "chunk_number": 5}
    ],
    folder_name="/reports",
    use_colpali=True,
    output_format="url"
)

for chunk in chunks:
    print(f"Doc {chunk.document_id}, Chunk {chunk.chunk_number}")
    print(f"Content: {chunk.content[:200]}...")
```
</Tab>
<Tab title="TypeScript">
```typescript
import Morphik from 'morphik';

// For Teams/Enterprise, use your dedicated host: https://companyname-api.morphik.ai
const client = new Morphik({
  apiKey: process.env.MORPHIK_API_KEY,
  baseURL: 'https://api.morphik.ai'
});

const chunks = await client.batch.retrieveChunks({
  sources: [
    { document_id: 'doc_abc123', chunk_number: 0 },
    { document_id: 'doc_abc123', chunk_number: 1 },
    { document_id: 'doc_xyz789', chunk_number: 5 }
  ],
  folder_name: '/reports',
  use_colpali: true,
  output_format: 'url'
});

chunks.forEach(chunk => {
  console.log(`Doc ${chunk.document_id}, Chunk ${chunk.chunk_number}`);
  console.log(`Content: ${chunk.content.slice(0, 200)}...`);
});
```
</Tab>
<Tab title="cURL">
```bash
curl -X POST "https://api.morphik.ai/batch/chunks" \
  -H "Authorization: Bearer $MORPHIK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "sources": [
      {"document_id": "doc_abc123", "chunk_number": 0},
      {"document_id": "doc_abc123", "chunk_number": 1},
      {"document_id": "doc_xyz789", "chunk_number": 5}
    ],
    "folder_name": "/reports",
    "use_colpali": true,
    "output_format": "url"
  }'
```
</Tab>
</Tabs>

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `sources` | array | required | List of `{document_id, chunk_number}` objects |
| `use_colpali` | boolean | `true` | Use Morphik multimodal embeddings when available |
| `output_format` | string | `"base64"` | Image format: `base64`, `url`, or `text` |
| `folder_name` | string | `null` | Optional folder scope |

## Response

```json
[
  {
    "document_id": "doc_abc123",
    "chunk_number": 0,
    "content": "Introduction to the quarterly report...",
    "content_type": "text/plain",
    "score": 1.0,
    "metadata": { "department": "sales" }
  },
  {
    "document_id": "doc_abc123",
    "chunk_number": 1,
    "content": "Revenue highlights for Q4...",
    "content_type": "text/plain",
    "score": 1.0,
    "metadata": { "department": "sales" }
  }
]
```

<Note>
This is useful when you already know which chunks you need (e.g., from a previous retrieval result) and want to fetch their full content efficiently.
</Note>
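Because `sources` is just a list of `{document_id, chunk_number}` objects, a common pattern is expanding a retrieval hit into its neighboring chunks before calling the endpoint. A small hypothetical helper (`neighbor_sources` is our name for illustration, not part of the SDK):

```python
def neighbor_sources(document_id: str, chunk_number: int, window: int = 1) -> list[dict]:
    """Build a batch `sources` list covering a chunk and its neighbors,
    clamping at chunk 0 so we never request a negative chunk number."""
    lo = max(0, chunk_number - window)
    return [
        {"document_id": document_id, "chunk_number": n}
        for n in range(lo, chunk_number + window + 1)
    ]

# e.g. pass neighbor_sources("doc_abc123", 5) as `sources` to batch_get_chunks
# to fetch chunks 4, 5, and 6 of doc_abc123 for extra surrounding context.
```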
