Commit 0f7f7af

[DOC]: web sync (#5798)
Co-authored-by: propel-code-bot[bot] <203372662+propel-code-bot[bot]@users.noreply.github.com>
1 parent 605fef4 commit 0f7f7af

1 file changed, +62 -10 lines changed

docs/docs.trychroma.com/markdoc/content/cloud/sync/overview.md

Lines changed: 62 additions & 10 deletions
@@ -35,6 +35,10 @@ The platform tier requires you to grant Chroma access to a GitHub App that you o

 The platform tier grants access to the Chroma Sync API and is ideal for companies and organizations that offer services which access their users’ codebases. The platform tier is available on Chroma’s Team plan. If you are interested in using it, please reach out to us at [[email protected]](mailto:[email protected]).

+## Web
+
+The web source type allows developers to scrape the contents of web pages into Chroma. Given a starting URL, Sync will crawl the page and its links up to a specified depth.
+
 # Sources

 A source is a specific instance of a source type configured according to the global and source type-specific configuration schema. The global source configuration schema refers to the configuration parameters that are required across sources of all types, while the source-type specific configuration schema refers to the configuration parameters required for a specific source type.
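
Note on the hunk above: the added `## Web` paragraph describes a crawl bounded by depth. As a rough illustration only, and not Chroma's implementation, a depth-limited breadth-first traversal could look like the sketch below. The function `crawl_order` and the in-memory `LINKS` graph are hypothetical; the parameter names `max_depth` and `page_limit` are borrowed from the web source configuration in this commit, and counting depth as link hops from the starting URL is an assumption.

```python
from collections import deque


def crawl_order(links, starting_url, max_depth, page_limit=None):
    """Return URLs in breadth-first order, at most max_depth link hops away.

    `links` is a hypothetical mapping of page -> pages it links to; a real
    crawler would fetch and parse each page instead.
    """
    seen = {starting_url}
    queue = deque([(starting_url, 0)])
    visited = []
    while queue:
        # Stop once the optional page budget is used up.
        if page_limit is not None and len(visited) >= page_limit:
            break
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth bound
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Hypothetical link graph used only for illustration.
LINKS = {
    "/": ["/cloud", "/blog"],
    "/cloud": ["/cloud/sync"],
    "/cloud/sync": ["/cloud/sync/overview"],
}

print(crawl_order(LINKS, "/", max_depth=1))
```

With this toy graph, a depth of 1 reaches only the start page and the pages it links to directly; raising `max_depth` or lowering `page_limit` changes how far the traversal gets.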
@@ -44,12 +48,16 @@ The global source configuration schema requires the following parameters:
 ```json
 {
     "database_name": "string",
-    "embedding_model": "Qwen/Qwen3-Embedding-0.6B"
+    "embedding": {
+        "dense": {
+            "model": "Qwen/Qwen3-Embedding-0.6B"
+        }
+    }
 }
 ```

 - `database_name` defines the Chroma database in which collections should be created by invocations run on this source. A database must exist before creating sources that point to it.
-- `embedding_model` defines the embedding model that should be used to generate embeddings for chunked documents. Currently, only the [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) model is supported, but if there is a model you would like to use, please let us know by reaching out to [[email protected]](mailto:[email protected]).
+- `embedding.dense.model` defines the embedding model that should be used to generate dense embeddings for chunked documents. Currently, only the [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) model is supported, but if there is a model you would like to use, please let us know by reaching out to [[email protected]](mailto:[email protected]).

 ## GitHub Repositories

@@ -67,6 +75,21 @@ A source of the GitHub repository type is an individual GitHub repository config
 - `app_id` defines the GitHub App ID of the GitHub App that has access to the provided `repository`. This parameter should only be supplied if the provided repository is private. If you are unsure of the GitHub App ID you should use, [see more](https://www.notion.so/Chroma-Sync-Docs-28b58a6d81918062b6ebf00deedde0ab?pvs=21) about the two tiers Chroma offers for the GitHub repository source type.
 - `include_globs` defines a set of glob patterns for which matching files should be synced. If this parameter is not provided, files matching `"*"` will be synced. Note that Chroma will not sync binary data, images, and other large or non-UTF-8 files.

+## Web
+
+A source of the web type is configured with a starting URL and a few other optional parameters:
+
+```json
+{
+    "starting_url": "https://docs.trychroma.com",
+    // all below are optional
+    "page_limit": 5,
+    "include_path_regexes": ["/cloud/*"],
+    "exclude_path_regexes": ["/blog/*"],
+    "max_depth": 2
+}
+```
+
 # Invocations

 Invocations refer to runs of the Sync Function over the data in a source. One invocation corresponds to one sync pass through all of the data in a source. A single invocation will result in the creation of exactly one collection in the database specified by the invocation’s source. This collection will contain the chunked, embedded, and indexed data that represents the state of the source at the time of the invocation’s creation. Invocations, like sources, have some global configuration parameters, as well as parameters specific to the type of the source for which the invocation is being run.
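
Note on the hunk above: the web source configuration names its filters `include_path_regexes` and `exclude_path_regexes`, yet the example values (`/cloud/*`, `/blog/*`) read like globs. The diff does not say which regex dialect Sync uses or whether patterns must match the whole path, so the following is only a sketch under assumed semantics: Python's `re.search` against the URL path, with excludes taking precedence over includes. The helper `path_allowed` is hypothetical.

```python
import re
from urllib.parse import urlparse


def path_allowed(url, include_path_regexes=None, exclude_path_regexes=None):
    """Decide whether a URL passes the path filters (assumed semantics)."""
    path = urlparse(url).path or "/"
    # Assumption: any matching exclude pattern rejects the URL outright.
    if exclude_path_regexes and any(re.search(p, path) for p in exclude_path_regexes):
        return False
    # Assumption: when include patterns are given, at least one must match.
    if include_path_regexes:
        return any(re.search(p, path) for p in include_path_regexes)
    return True


include = ["/cloud/*"]
exclude = ["/blog/*"]
print(path_allowed("https://docs.trychroma.com/cloud/sync/overview", include, exclude))
print(path_allowed("https://docs.trychroma.com/blog/post", include, exclude))
```

Under these assumed semantics, `/cloud/*` as a regular expression means `/cloud` followed by zero or more `/` characters, so it would also match a path such as `/cloudy`. If prefix matching is what is intended, an anchored pattern such as `^/cloud/` may be closer, depending on how Sync actually evaluates these fields.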
@@ -118,10 +141,16 @@ Creates a new source of the specified type with the provided configuration.

 **Request Body**

+For a GitHub repository source:
+
 ```json
 {
     "database_name": "string",
-    "embedding_model": "Qwen/Qwen3-Embedding-0.6B",
+    "embedding": {
+        "dense": {
+            "model": "Qwen/Qwen3-Embedding-0.6B"
+        }
+    },
     "github": {
         "repository": "string",
         "app_id": "string" | null, // optional
@@ -130,6 +159,23 @@ Creates a new source of the specified type with the provided configuration.
 }
 ```

+For a web source:
+
+```json
+{
+    "database_name": "string",
+    "embedding": {
+        "dense": {
+            "model": "Qwen/Qwen3-Embedding-0.6B"
+        }
+    },
+    "web_scrape": {
+        "starting_url": "https://docs.trychroma.com",
+        "page_limit": 5
+    }
+}
+```
+
 **Responses**
 - `200 OK` If the source is successfully created.

@@ -171,14 +217,16 @@ Retrieve a specific source by its ID.
 {
     "id": "string",
     "database_name": "string",
-    "embedding_model": "string",
-    "source_type": {
-        "github": {
-            "repository": "string",
-            "app_id": "string" | null,
-            "include_globs": ["string", ...]
+    "embedding": {
+        "dense": {
+            "model": "string"
         }
     },
+    "github": {
+        "repository": "string",
+        "app_id": "string" | null,
+        "include_globs": ["string", ...]
+    },
     "created_at": "string"
 }
 ```
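
Note on the hunk above: the documented response for retrieving a source moves from a flat `embedding_model` string and a `source_type.github` wrapper to a nested `embedding.dense.model` field and a top-level `github` object, while the list hunk in this same commit still shows `source_type` as context. A client that may see either shape could normalize them with something like the sketch below. The helpers `dense_model` and `github_config` are hypothetical and not part of any Chroma client; only the field names come from the diff.

```python
def dense_model(source):
    """Return the dense embedding model name from either response shape."""
    embedding = source.get("embedding")
    if isinstance(embedding, dict):
        # New shape: {"embedding": {"dense": {"model": "..."}}}
        return (embedding.get("dense") or {}).get("model")
    # Old shape: {"embedding_model": "..."}
    return source.get("embedding_model")


def github_config(source):
    """Return the GitHub settings from either response shape, or None."""
    if "github" in source:
        return source["github"]  # new shape: top-level "github"
    return (source.get("source_type") or {}).get("github")  # old shape


old_shape = {
    "embedding_model": "Qwen/Qwen3-Embedding-0.6B",
    "source_type": {"github": {"repository": "org/repo"}},
}
new_shape = {
    "embedding": {"dense": {"model": "Qwen/Qwen3-Embedding-0.6B"}},
    "github": {"repository": "org/repo"},
}
print(dense_model(old_shape) == dense_model(new_shape))
```

The repository value `org/repo` is placeholder data for the example, not taken from the documentation.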
@@ -221,7 +269,11 @@ List sources with optional filtering.
 {
     "id": "string",
     "database_name": "string",
-    "embedding_model": "string",
+    "embedding": {
+        "dense": {
+            "model": "string"
+        }
+    },
     "source_type": {
         "github": {
             "repository": "string",
