docs/docs.trychroma.com/markdoc/content/cloud/sync/overview.md
62 additions & 10 deletions
@@ -35,6 +35,10 @@ The platform tier requires you to grant Chroma access to a GitHub App that you o

 The platform tier grants access to the Chroma Sync API and is ideal for companies and organizations that offer services which access their users’ codebases. The platform tier is available on Chroma’s Team plan. If you are interested in using it, please reach out to us at [[email protected]](mailto:[email protected]).

+## Web
+
+The web source type allows developers to scrape the contents of web pages into Chroma. Given a starting URL, Sync will crawl the page and its links up to a specified depth.
+
 # Sources

 A source is a specific instance of a source type configured according to the global and source type-specific configuration schema. The global source configuration schema refers to the configuration parameters that are required across sources of all types, while the source-type specific configuration schema refers to the configuration parameters required for a specific source type.
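Review note on the new `## Web` section: the diff says Sync crawls "the page and its links up to a specified depth" but does not define how depth is counted. The sketch below shows one common reading, a breadth-first traversal in which the starting URL is depth 0. The function name `crawl_order`, the in-memory `links` map, and the way `page_limit` is applied are all assumptions made for illustration; this is not Chroma's implementation.

```python
from collections import deque


def crawl_order(links, start, max_depth, page_limit=None):
    """Return pages in breadth-first order, following links at most max_depth hops.

    `links` is a hypothetical in-memory map of page -> outgoing links that stands
    in for real HTTP fetches. The starting page counts as depth 0 (an assumption).
    """
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(page, []):
            if nxt in seen:
                continue
            if page_limit is not None and len(visited) >= page_limit:
                return visited  # stop once the page budget is used up
            seen.add(nxt)
            visited.append(nxt)
            queue.append((nxt, depth + 1))
    return visited


site = {"/": ["/a", "/b"], "/a": ["/c"], "/c": ["/d"]}
print(crawl_order(site, "/", max_depth=2))
```

Under this reading, `max_depth: 2` reaches pages two links away from the starting URL and no further. It may be worth stating the intended definition explicitly in the docs.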
@@ -44,12 +48,16 @@ The global source configuration schema requires the following parameters:
 ```json
 {
   "database_name": "string",
-  "embedding_model": "Qwen/Qwen3-Embedding-0.6B"
+  "embedding": {
+    "dense": {
+      "model": "Qwen/Qwen3-Embedding-0.6B"
+    }
+  }
 }
 ```

 - `database_name` defines the Chroma database in which collections should be created by invocations run on this source. A database must exist before creating sources that point to it.
-- `embedding_model` defines the embedding model that should be used to generate embeddings for chunked documents. Currently, only the [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) model is supported, but if there is a model you would like to use, please let us know by reaching out to [[email protected]](mailto:[email protected]).
+- `embedding.dense.model` defines the embedding model that should be used to generate dense embeddings for chunked documents. Currently, only the [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) model is supported, but if there is a model you would like to use, please let us know by reaching out to [[email protected]](mailto:[email protected]).

 ## GitHub Repositories

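Review note: this hunk replaces the flat `embedding_model` field with a nested `embedding.dense.model` field. For readers holding configurations in the old shape, the rewrite might look like the sketch below. The helper name `migrate_source_config` is hypothetical, and the diff does not say whether the API still accepts the old field.

```python
import copy


def migrate_source_config(config: dict) -> dict:
    """Return a copy of `config` with `embedding_model` moved to `embedding.dense.model`.

    Hypothetical helper: it only reshapes the dictionary and makes no API calls.
    """
    migrated = copy.deepcopy(config)  # leave the caller's dict untouched
    if "embedding_model" in migrated:
        model = migrated.pop("embedding_model")
        dense = migrated.setdefault("embedding", {}).setdefault("dense", {})
        dense.setdefault("model", model)  # keep an existing nested value if present
    return migrated


old = {"database_name": "my-db", "embedding_model": "Qwen/Qwen3-Embedding-0.6B"}
print(migrate_source_config(old))
```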
@@ -67,6 +75,21 @@ A source of the GitHub repository type is an individual GitHub repository config
 - `app_id` defines the GitHub App ID of the GitHub App that has access to the provided `repository`. This parameter should only be supplied if the provided repository is private. If you are unsure of the GitHub App ID you should use, [see more](https://www.notion.so/Chroma-Sync-Docs-28b58a6d81918062b6ebf00deedde0ab?pvs=21) about the two tiers Chroma offers for the GitHub repository source type.
 - `include_globs` defines a set of glob patterns for which matching files should be synced. If this parameter is not provided, files matching `"*"` will be synced. Note that Chroma will not sync binary data, images, and other large or non-UTF-8 files.

+## Web
+
+A source of the web type is configured with a starting URL and a few other optional parameters:
+
+```json
+{
+  "starting_url": "https://docs.trychroma.com",
+  // all below are optional
+  "page_limit": 5,
+  "include_path_regexes": ["/cloud/*"],
+  "exclude_path_regexes": ["/blog/*"],
+  "max_depth": 2
+}
+```
+
 # Invocations

 Invocations refer to runs of the Sync Function over the data in a source. One invocation corresponds to one sync pass through all of the data in a source. A single invocation will result in the creation of exactly one collection in the database specified by the invocation’s source. This collection will contain the chunked, embedded, and indexed data that represents the state of the source at the time of the invocation’s creation. Invocations, like sources, have some global configuration parameters, as well as parameters specific to the type of the source for which the invocation is being run.
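Review note on the web source configuration added in this hunk: the example values `"/cloud/*"` and `"/blog/*"` look like globs, but the parameter names say regexes. Read as a regular expression, `/cloud/*` means `/cloud` followed by zero or more slashes. The sketch below shows how such filters could behave if they were applied with Python's `re.search`. The matching mode, the precedence of exclude over include, and the function name `is_path_allowed` are assumptions, since the diff does not specify them.

```python
import re


def is_path_allowed(path, include_path_regexes=(), exclude_path_regexes=()):
    """Decide whether a URL path passes the include/exclude filters.

    Assumptions (not confirmed by the docs): patterns are applied with
    re.search, excludes win over includes, and an empty include list allows all.
    """
    if any(re.search(pattern, path) for pattern in exclude_path_regexes):
        return False
    if not include_path_regexes:
        return True
    return any(re.search(pattern, path) for pattern in include_path_regexes)


for candidate in ["/cloud/sync/overview", "/blog/launch", "/cloudy"]:
    print(candidate, is_path_allowed(candidate, ["/cloud/*"], ["/blog/*"]))
```

Under these assumptions `/cloudy` is also included, because the trailing `/*` permits zero slashes. If only paths under `/cloud/` are intended, an anchored pattern such as `^/cloud/` would express that more precisely.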
@@ -118,10 +141,16 @@ Creates a new source of the specified type with the provided configuration.

 **Request Body**

+For a GitHub repository source:
+
 ```json
 {
   "database_name": "string",
-  "embedding_model": "Qwen/Qwen3-Embedding-0.6B",
+  "embedding": {
+    "dense": {
+      "model": "Qwen/Qwen3-Embedding-0.6B"
+    }
+  },
   "github": {
     "repository": "string",
     "app_id": "string" | null, // optional
@@ -130,6 +159,23 @@ Creates a new source of the specified type with the provided configuration.
 }
 ```

+For a web source:
+
+```json
+{
+  "database_name": "string",
+  "embedding": {
+    "dense": {
+      "model": "Qwen/Qwen3-Embedding-0.6B"
+    }
+  },
+  "web_scrape": {
+    "starting_url": "https://docs.trychroma.com",
+    "page_limit": 5
+  }
+}
+```
+
 **Responses**

 - `200 OK` If the source is successfully created.

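Review note: the web source request body added above can be assembled programmatically. The sketch below builds only the fields that appear in this diff (`database_name`, `embedding.dense.model`, `web_scrape.starting_url`, `web_scrape.page_limit`). The function name `build_web_source_body` is hypothetical, and the endpoint URL and authentication headers are deliberately left out because this hunk does not show them.

```python
import json


def build_web_source_body(database_name, starting_url, page_limit=None,
                          model="Qwen/Qwen3-Embedding-0.6B"):
    """Build a create-source request body for a web source.

    Hypothetical helper: it mirrors the JSON shape shown in the diff and
    omits `page_limit` when the caller does not set it.
    """
    web_scrape = {"starting_url": starting_url}
    if page_limit is not None:
        web_scrape["page_limit"] = page_limit
    return {
        "database_name": database_name,
        "embedding": {"dense": {"model": model}},
        "web_scrape": web_scrape,
    }


body = build_web_source_body("my-db", "https://docs.trychroma.com", page_limit=5)
print(json.dumps(body, indent=2))
```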
@@ -171,14 +217,16 @@ Retrieve a specific source by its ID.
 {
   "id": "string",
   "database_name": "string",
-  "embedding_model": "string",
-  "source_type": {
-    "github": {
-      "repository": "string",
-      "app_id": "string" | null,
-      "include_globs": ["string", ...]
+  "embedding": {
+    "dense": {
+      "model": "string"
     }
   },
+  "github": {
+    "repository": "string",
+    "app_id": "string" | null,
+    "include_globs": ["string", ...]
+  },
   "created_at": "string"
 }
 ```
@@ -221,7 +269,11 @@ List sources with optional filtering.