@@ -63,32 +63,218 @@ The BYOK approach offers more flexibility and control over your datasets. You ca
6363
64641. Create an account at [Trieve](https://trieve.ai)
65652. Create a new dataset using Trieve's dashboard
66+
67+ ![Create dataset in Trieve](../../static/images/knowledge-base/create-dataset.png)
68+
69+ When creating your dataset in Trieve, choose an embedding model that matches your speed and accuracy requirements. Some of the available options:
70+
71+ ### jina-base-en
72+
73+ - **Provider**: Jina AI (Hosted by Trieve)
74+ - **Performance**: Fast
75+ - **Description**: This model is designed for speed and efficiency, making it suitable for applications where quick response times are critical. It provides a good balance of performance and accuracy for general use cases.
76+
77+ ### text-embedding-3-small
78+
79+ - **Provider**: OpenAI
80+ - **Performance**: Moderate
81+ - **Description**: A smaller model from OpenAI that offers a compromise between speed and accuracy. It is suitable for applications that require a balance between computational efficiency and the quality of embeddings.
82+
83+ ### text-embedding-3-large
84+
85+ - **Provider**: OpenAI
86+ - **Performance**: Slow
87+ - **Description**: This larger model provides the highest accuracy among the options but at the cost of slower processing times. It is ideal for applications where the quality of embeddings is prioritized over speed.
88+
66893. Add content through various methods:
6790
68- #### Document Upload
91+ #### Upload Documents
92+
93+ Upload documents directly through Trieve's interface:
94+
95+ ![Upload files in Trieve](../../static/images/knowledge-base/upload-files.png)
96+
97+ When uploading files, you can configure advanced chunking options:
98+
99+ ![Upload files advanced options in Trieve](../../static/images/knowledge-base/upload-files-advanced.png)
100+
101+ #### Edit Individual Chunks
102+
103+ After uploading documents, you can edit individual chunks to refine their content:
104+
105+ ![Edit chunk interface in Trieve](../../static/images/knowledge-base/edit-chunk.png)
106+
107+ ##### Editing Options
108+
109+ - **Chunk Content**: Modify the text directly in the rich text editor
110+
111+ - Fix formatting issues
112+ - Correct errors or typos
113+ - Split or combine chunks manually
114+ - Add or remove content
115+
116+ - **Metadata Fields**:
117+ - Date: Update document timestamps
118+ - Number Value: Adjust numeric metadata for filtering
119+ - Location: Set or modify geographical coordinates
120+ - Weight: Fine-tune search relevance with custom weights
121+ - Fulltext Boost: Add terms to enhance search visibility
122+ - Semantic Boost: Adjust vector embedding influence
123+
124+ ##### Best Practices for Chunk Editing
125+
126+ 1. **Content Length**
127+
128+ - Keep chunks between 200-1000 tokens
129+ - Maintain logical content boundaries
130+ - Ensure complete thoughts within each chunk
131+
132+ 2. **Metadata Optimization**
133+
134+ - Use consistent date formats
135+ - Add relevant numeric values for filtering
136+ - Apply weights strategically for important content
137+
138+ 3. **Search Enhancement**
139+ - Use boost terms for critical keywords
140+ - Balance semantic and fulltext boosts
141+ - Test search results after significant edits
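
The 200-1000 token guideline above can be checked with a short script before or after editing. The following is a minimal sketch, not part of Trieve: `approx_token_count` and `check_chunk_lengths` are hypothetical helpers, and tokens are approximated by whitespace-separated words, which will differ from the count a real tokenizer produces.

```python
def approx_token_count(text):
    """Rough token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def check_chunk_lengths(chunks, min_tokens=200, max_tokens=1000):
    """Return (index, count) pairs for chunks outside the target range."""
    out_of_range = []
    for index, chunk in enumerate(chunks):
        count = approx_token_count(chunk)
        if count < min_tokens or count > max_tokens:
            out_of_range.append((index, count))
    return out_of_range


if __name__ == "__main__":
    # Three synthetic chunks: too short, in range, too long.
    sample = ["word " * 50, "word " * 500, "word " * 1500]
    for index, count in check_chunk_lengths(sample):
        print(f"chunk {index}: ~{count} tokens is outside 200-1000")
```

Chunks flagged by a check like this are candidates for the manual split or combine step described under Editing Options.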
142+
143+ ### Advanced Chunking Options
144+
145+ #### Metadata
146+
147+ - Add custom metadata as JSON to associate with your chunks
148+ - Useful for filtering and organizing content (e.g., `{"author": "John Doe", "category": "technical"}`)
149+ - Keep metadata concise and relevant to avoid storage overhead
150+ - Use consistent keys across related documents for better searchability
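
Inconsistent keys across related documents are easy to miss by eye. As a rough illustration, the sketch below compares a set of metadata objects and reports any key that is not present in all of them. The helper name `find_inconsistent_keys` and the sample values are hypothetical; only the `author` and `category` keys come from the example above.

```python
import json


def find_inconsistent_keys(metadata_list):
    """Return keys that are missing from at least one metadata object."""
    all_keys = set()
    for metadata in metadata_list:
        all_keys.update(metadata)
    missing = set()
    for metadata in metadata_list:
        missing.update(all_keys - set(metadata))
    return sorted(missing)


if __name__ == "__main__":
    docs = [
        {"author": "John Doe", "category": "technical"},
        {"author": "Jane Roe", "Category": "technical"},  # key casing differs
    ]
    print(find_inconsistent_keys(docs))
    # Serialize one object as the JSON string you would paste into the form.
    print(json.dumps(docs[0]))
```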
151+
152+ #### Date
153+
154+ - Specify the creation or relevant date for the document
155+ - Important for version control and content freshness
156+ - Helps with filtering outdated information
157+ - Use actual document creation dates when possible
158+
159+ #### Split Delimiters
160+
161+ - Define custom delimiters (e.g., ".,?\n") to control where chunks are split
162+ - Recommended defaults: ".,?\n" for general content
163+ - Add semicolons (;) for technical documentation
164+ - Use "\n\n" for markdown or structured content
165+ - Avoid over-aggressive splitting that might break context
166+
167+ #### Target Splits Per Chunk
69168
70- - Upload documents directly through Trieve's interface
71- - Supported formats: PDF, DOCX, TXT, MD
72- - Configure chunking parameters:
73- - Chunk size
74- - Overlap
75- - Split delimiters
169+ - Set the desired number of splits per chunk
170+ - Default: 20 splits
171+ - Recommended ranges:
172+ - 15-25 for general content
173+ - 10-15 for technical documentation
174+ - 25-30 for narrative content
175+ - Lower values create more granular chunks, better for precise retrieval
176+ - Higher values maintain more context but may retrieve irrelevant information
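
To make the interaction between split delimiters and target splits per chunk concrete, here is a minimal sketch of one plausible reading: text is cut after every delimiter character, and consecutive splits are then grouped into chunks. This is an assumption about the general technique, not Trieve's actual implementation, and both function names are hypothetical.

```python
def split_on_delimiters(text, delimiters=".,?\n"):
    """Cut text after each delimiter character, keeping the delimiter."""
    splits = []
    current = ""
    for char in text:
        current += char
        # Close the split only if it holds more than whitespace.
        if char in delimiters and current.strip():
            splits.append(current)
            current = ""
    if current.strip():
        splits.append(current)
    return splits


def group_splits(splits, target_splits_per_chunk=20):
    """Join consecutive splits into chunks of at most the target size."""
    return [
        "".join(splits[start:start + target_splits_per_chunk])
        for start in range(0, len(splits), target_splits_per_chunk)
    ]


if __name__ == "__main__":
    text = "Hello, world. How are you? Fine."
    splits = split_on_delimiters(text)
    print(len(splits), "splits")
    for chunk in group_splits(splits, target_splits_per_chunk=2):
        print(repr(chunk))
```

Under this reading, lowering the target produces more, smaller chunks from the same text, which matches the trade-off described in the list above.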
177+
178+ #### Rebalance Chunks
179+
180+ - Enable to redistribute content evenly across chunks
181+ - Recommended for documents with varying section lengths
182+ - Helps maintain consistent chunk sizes
183+ - May slightly impact natural content boundaries
184+ - Best used with technical documentation or structured content
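
One way to picture rebalancing: instead of filling each chunk to the target and leaving a short remainder, spread the splits so chunk sizes differ by at most one. The sketch below illustrates that idea only. How Trieve rebalances internally is not described here, so treat the `rebalance` function as a hypothetical stand-in.

```python
import math


def rebalance(splits, target_splits_per_chunk=20):
    """Distribute splits across chunks so sizes differ by at most one."""
    if not splits:
        return []
    chunk_count = math.ceil(len(splits) / target_splits_per_chunk)
    base, extra = divmod(len(splits), chunk_count)
    chunks = []
    start = 0
    for index in range(chunk_count):
        # The first `extra` chunks take one additional split.
        size = base + (1 if index < extra else 0)
        chunks.append(splits[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    splits = [f"s{i}" for i in range(45)]
    # Naive grouping would give sizes 20, 20, 5.
    print([len(chunk) for chunk in rebalance(splits, 20)])
```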
185+
186+ #### Use gpt4o chunking
187+
188+ - Enable GPT-4o chunking for improved semantic coherence
189+ - Recommended for:
190+ - Complex technical documentation
191+ - Content with intricate relationships
192+ - Documents where context preservation is crucial
193+ - Note: Increases processing time and cost
194+ - Best for high-value content where accuracy is paramount
195+
196+ #### Heading Based Chunking
197+
198+ - Split content based on document headings
199+ - Ideal for well-structured documents (e.g., documentation, reports)
200+ - Works best with consistent heading hierarchy
201+ - Consider enabling for:
202+ - Technical documentation
203+ - User manuals
204+ - Research papers
205+ - May create uneven chunk sizes based on section lengths
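
For Markdown input, heading-based chunking can be approximated in a few lines. This is a simplified sketch under stated assumptions: it treats any line starting with one to six `#` characters as a heading, it does not skip headings inside fenced code blocks, and `chunk_by_headings` is a hypothetical name rather than a Trieve function.

```python
import re

# ATX-style Markdown heading: 1-6 '#' characters, whitespace, then text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_by_headings(markdown_text):
    """Start a new chunk at every heading line."""
    chunks = []
    current = []
    for line in markdown_text.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that ended up empty (for example, leading blank lines).
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    doc = "# Title\nIntro\n\n## Setup\nStep one\n## Usage\nRun it"
    for chunk in chunk_by_headings(doc):
        print(repr(chunk))
```

The uneven chunk sizes mentioned above follow directly from this approach: each chunk is as long as its section.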
206+
207+ #### System Prompt
208+
209+ - Provide custom instructions for the chunking process
210+ - Optional but powerful for specific use cases
211+ - Example prompts:
212+ - "Preserve code blocks as single chunks"
213+ - "Keep API endpoint descriptions together"
214+ - "Maintain question-answer pairs in the same chunk"
215+ - Keep prompts clear and specific
216+ - Test different prompts with sample content to optimize results
76217
77218#### Website Crawling
78219
79- Trieve offers powerful website crawling capabilities:
220+ Trieve offers powerful website crawling capabilities with extensive configuration options:
80221
81- ```json
82- {
83-   "url": "https://yourdomain.com",
84-   "configuration": {
85-     "maxPages": 100,
86-     "allowedDomains": ["yourdomain.com"],
87-     "excludePatterns": ["/admin/*", "/login"],
88-     "includePatterns": ["/docs/*", "/blog/*"]
89-   }
90- }
91- ```
222+ ![Website crawling in Trieve](../../static/images/knowledge-base/crawl.png)
223+
224+ ##### Crawl Configuration Options
225+
226+ - **Crawl Interval**: Set how often to refresh content
227+
228+ - Options: Daily, Weekly, Monthly
229+ - Recommended: Daily for frequently updated content
230+
231+ - **Page Limit**: Control the maximum number of pages to crawl
232+
233+ - Default: 1000 pages
234+ - Adjust based on your site size and content relevance
235+
236+ - **URL Patterns**
237+
238+ - Include/Exclude specific URL patterns using regex
239+ - Example includes: ` https://docs.example.com/* `
240+ - Example excludes: ` https://example.com/internal/* `
241+
242+ - **Query Selectors**
243+
244+ - Include specific HTML elements for targeted content extraction
245+ - Exclude navigation, footers, and other non-content elements
246+ - Common excludes: `navbar`, `footer`, `aside`, `nav`, `form`
247+
248+ - **Special Content Types**
249+
250+ - OpenAPI Spec: Toggle for API documentation crawling
251+ - Shopify: Enable for e-commerce content
252+ - YouTube Channel: Include video transcripts and descriptions
253+
254+ - **Advanced Options**
255+ - Boost Titles: Increase weight of page titles in search results
256+ - Allow External Links: Include content from linked domains
257+ - Ignore Sitemap: Skip sitemap-based crawling
258+ - Remove Strings: Clean up headers and body content
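
Before starting a crawl, it can help to dry-run your include and exclude patterns against a handful of known URLs. The sketch below assumes the patterns behave like the wildcard examples above (`*` matches any run of characters) and uses Python's `fnmatch` module for that. The options list describes the patterns as regex, so confirm the exact matching rules in Trieve before relying on this. `should_crawl` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def should_crawl(url, include_patterns, exclude_patterns):
    """Apply excludes first, then require a match on some include pattern."""
    if any(fnmatchcase(url, pattern) for pattern in exclude_patterns):
        return False
    if not include_patterns:
        # No include list means everything not excluded is allowed.
        return True
    return any(fnmatchcase(url, pattern) for pattern in include_patterns)


if __name__ == "__main__":
    includes = ["https://docs.example.com/*"]
    excludes = ["https://example.com/internal/*"]
    for url in [
        "https://docs.example.com/guide/intro",
        "https://example.com/internal/admin",
        "https://example.com/blog/post",
    ]:
        print(url, should_crawl(url, includes, excludes))
```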
259+
260+ ##### Best Practices for Crawling
261+
262+ 1. **Start Small**
263+
264+ - Begin with a low page limit
265+ - Test with specific sections of your site
266+ - Gradually expand coverage
267+
268+ 2. **Optimize Selectors**
269+
270+ - Remove navigation and UI elements
271+ - Focus on main content areas
272+ - Use browser inspector to identify key selectors
273+
274+ 3. **Monitor Performance**
275+ - Check crawl logs regularly
276+ - Adjust patterns based on results
277+ - Balance frequency with server load
92278
93279### Step 2: Test and Refine
94280
@@ -97,12 +283,18 @@ Use Trieve's search playground to:
97283- Test semantic search queries
98284- Adjust chunk sizes
99285- Edit chunks manually
286+
100287- Visualize vector embeddings
101288- Fine-tune relevance scores
102289
290+ ![Search playground in Trieve](../../static/images/knowledge-base/search-playground.png)
291+
103292### Step 3: Import to Vapi
104293
105- Once your dataset is optimized in Trieve, import it to Vapi:
294+ 1. Create a Trieve API key in [Trieve's dashboard](https://dashboard.trieve.ai/org/keys)
295+ 2. Add the Trieve API key to Vapi under [Provider Credentials](https://dashboard.vapi.ai/keys)
296+ ![Add Trieve API key in Vapi](../../static/images/knowledge-base/trieve-credential.png)
297+ 3. Once your dataset is optimized in Trieve, import it to Vapi:
106298
107299``` json
108300{
@@ -123,46 +315,46 @@ Once your dataset is optimized in Trieve, import it to Vapi:
123315
1243161. **Dataset Organization**
125317
126- - Keep datasets focused on specific topics
127- - Use meaningful dataset names
128- - Document your chunking configurations
318+ - Segment datasets by domain knowledge boundaries
319+ - Use descriptive dataset names (e.g., "api-docs-v2", "user-guides-2024")
320+ - Version control chunking configurations in your codebase
129321
1303222. **Content Quality**
131323
132- - Clean and preprocess documents before uploading
133- - Review and edit chunks in Trieve's interface
134- - Test search relevance before importing to Vapi
324+ - Implement text normalization (Unicode normalization, whitespace standardization)
325+ - Use regex patterns to clean formatting artifacts
326+ - Validate chunk semantic coherence through embedding similarity scores
135327
1363283. **Performance Optimization**
137329
138- - Monitor chunk sizes (recommended: 200-1000 tokens)
139- - Use appropriate search types for your use case
140- - Adjust score thresholds based on testing
330+ - Target chunk sizes: 200-1000 tokens (optimal for current embedding models)
331+ - Configure hybrid search with BM25 boost = 0.3 for technical content
332+ - Set score thresholds dynamically based on embedding model (0.2 for text-embedding-3-small, 0.25 for text-embedding-3-large)
141333
1423344. **Maintenance**
143- - Regularly update content in Trieve
144- - Monitor search performance
145- - Keep API keys secure and updated
335+ - Implement automated content refresh cycles via Trieve's API
336+ - Track search result relevance metrics (MRR, NDCG)
337+ - Rotate API keys on 90-day cycles
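
The two relevance metrics named under Maintenance can be computed from a small set of labeled test queries. The following is a minimal sketch using the standard definitions of MRR and NDCG. The input format (per-query relevance flags or graded gains in ranked order) is an assumption for illustration, and nothing here calls Trieve's API.

```python
import math


def mean_reciprocal_rank(ranked_relevance):
    """ranked_relevance: one list of 0/1 flags per query, in ranked order."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Query 1: first relevant result at rank 2. Query 2: at rank 1.
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
    # Graded relevance for one query's top three results.
    print(ndcg([1, 3, 2]))
```

Tracking these numbers over a fixed query set before and after a content refresh shows whether the refresh helped or hurt retrieval.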
146338
147339## Troubleshooting
148340
149341Common issues and solutions:
150342
151- 1. **Poor Search Results**
343+ 1. **Search Relevance Issues**
152344
153- - Adjust score threshold
154- - Try different search types (semantic, hybrid, BM25)
155- - Review chunk sizes and content quality
345+ - Implement cross-encoder reranking for critical queries
346+ - Fine-tune BM25 vs semantic weights (recommended ratio: 0.3:0.7)
347+ - Analyze chunk boundary overlap percentage (aim for 15-20%)
156348
157- 2. **Integration Issues**
349+ 2. **Integration Errors**
158350
159- - Verify API keys are correct
160- - Ensure dataset IDs are valid
161- - Check network connectivity
351+ - Validate dataset permissions (READ_DATASET scope required)
352+ - Check for dataset ID format compliance (UUID v4)
353+ - Monitor rate limits (default: 100 requests/min)
162354
163- 3. **Performance Problems**
164- - Reduce chunk sizes
165- - Optimize search configurations
166- - Consider splitting large datasets
355+ 3. **Performance Optimization**
356+ - Implement chunk size normalization (max variance: 20%)
357+ - Enable query caching for frequent searches
358+ - Use batch operations for bulk updates (max 100 chunks/request)
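
Two of the checks above are easy to automate on the client side: confirming that a dataset ID parses as a UUID v4, and splitting bulk updates into batches of at most 100 chunks. The sketch below uses only the Python standard library. The helper names are hypothetical, and the 100-chunk limit is taken from the list above rather than verified against Trieve's API.

```python
import uuid


def is_uuid_v4(value):
    """Return True if value parses as a version 4 UUID."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # Note: uuid.UUID also accepts braces and un-hyphenated hex.
    return parsed.version == 4


def batches(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    print(is_uuid_v4(str(uuid.uuid4())))
    print(is_uuid_v4("not-a-uuid"))
    chunks = [f"chunk-{i}" for i in range(250)]
    print([len(batch) for batch in batches(chunks)])
```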
167359
168360Need help? Contact
[ [email protected] ] ( mailto:[email protected] ) for assistance.