| 1 | +# Document Chunking and Embedding Example |
| 2 | + |
| 3 | +This example demonstrates how to chunk a document, generate embeddings, and store them in Chroma Cloud for semantic search and retrieval. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The example runs in one of two modes: |
| 8 | + |
| 9 | +1. **Ingest Mode**: Chunks a document (`document.txt`) into smaller pieces, generates embeddings with Jina AI, and stores the chunks and their embeddings in Chroma Cloud. |
| 10 | +2. **Query Mode**: Runs a semantic search over the stored chunks using natural language queries. |
| 11 | + |
| 12 | +## Prerequisites |
| 13 | + |
| 14 | +- PHP 8.1 or higher |
| 15 | +- Chroma Cloud account with API key |
| 16 | +- Jina AI API key (for embeddings) |
| 17 | +- Composer dependencies installed (`composer install`) |
| 18 | + |
| 19 | +## Setup |
| 20 | + |
| 21 | +Set your API keys as environment variables: |
| 22 | + |
| 23 | +```bash |
| 24 | +export CHROMA_API_KEY="your-chroma-cloud-api-key" |
| 25 | +export JINA_API_KEY="your-jina-api-key" |
| 26 | +``` |
| 27 | + |
| 28 | +Or pass them via CLI arguments (see Usage below). |
| 29 | + |
| 30 | +## Usage |
| 31 | + |
| 32 | +### Ingest Mode |
| 33 | + |
| 34 | +Chunk the document and store it in Chroma Cloud: |
| 35 | + |
| 36 | +```bash |
| 37 | +php index.php -mode ingest |
| 38 | +``` |
| 39 | + |
| 40 | +With custom options: |
| 41 | + |
| 42 | +```bash |
| 43 | +php index.php -mode ingest \ |
| 44 | + --api-key "your-chroma-api-key" \ |
| 45 | + --jina-key "your-jina-api-key" \ |
| 46 | + --tenant "my-tenant" \ |
| 47 | + --database "my-database" |
| 48 | +``` |
| 49 | + |
| 50 | +### Query Mode |
| 51 | + |
| 52 | +Search the stored documents: |
| 53 | + |
| 54 | +```bash |
| 55 | +php index.php -mode query --query "What happened at the Dartmouth Workshop?" |
| 56 | +``` |
| 57 | + |
| 58 | +With custom options: |
| 59 | + |
| 60 | +```bash |
| 61 | +php index.php -mode query \ |
| 62 | + --query "Who proposed the Turing Test?" \ |
| 63 | + --api-key "your-chroma-api-key" \ |
| 64 | + --jina-key "your-jina-api-key" \ |
| 65 | + --tenant "my-tenant" \ |
| 66 | + --database "my-database" |
| 67 | +``` |
| 68 | + |
| 69 | +## CLI Arguments |
| 70 | + |
| 71 | +| Argument | Description | Default | Required | |
| 72 | +|----------|-------------|---------|----------| |
| 73 | +| `-mode` | Operation mode: `ingest` or `query` | - | Yes | |
| 74 | +| `--query` | Query text for search (query mode only) | "Which event marked the birth of symbolic AI?" | No | |
| 75 | +| `--api-key` | Chroma Cloud API key | `CHROMA_API_KEY` env var | Yes | |
| 76 | +| `--jina-key` | Jina AI API key for embeddings | `JINA_API_KEY` env var | Yes | |
| 77 | +| `--tenant` | Chroma Cloud tenant name | `default_tenant` | No | |
| 78 | +| `--database` | Chroma Cloud database name | `default_database` | No | |
| 79 | +| `--collection-name` | Collection name to use | `history_of_ai` | No | |
| 80 | + |
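The table above implies a precedence order: a CLI argument wins, then an environment variable, then a built-in default. A minimal sketch of that lookup follows. The helper name `resolveSetting()` and its signature are hypothetical and are not taken from `index.php`, which may resolve its options differently.

```php
<?php
// Hypothetical helper, not part of index.php: resolves one setting using the
// precedence "CLI argument, then environment variable, then default".
function resolveSetting(
    array $cliOptions,
    string $option,
    ?string $envName = null,
    ?string $default = null
): ?string {
    // 1. An explicit, non-empty CLI value takes priority.
    if (isset($cliOptions[$option]) && $cliOptions[$option] !== '') {
        return (string) $cliOptions[$option];
    }

    // 2. Fall back to the environment variable, if one is associated.
    if ($envName !== null) {
        $value = getenv($envName);
        if ($value !== false && $value !== '') {
            return $value;
        }
    }

    // 3. Otherwise use the default (null means "required but missing").
    return $default;
}
```

Under these assumptions, a `null` result for `--api-key` or `--jina-key` would correspond to the "API Key is required" errors listed under Troubleshooting.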
| 81 | +## Example Queries |
| 82 | + |
| 83 | +Try these example queries to test the semantic search: |
| 84 | + |
| 85 | +```bash |
| 86 | +# Historical events |
| 87 | +php index.php -mode query --query "What happened at the Dartmouth Workshop?" |
| 88 | + |
| 89 | +# People and contributions |
| 90 | +php index.php -mode query --query "Who proposed the Turing Test?" |
| 91 | + |
| 92 | +# Technical breakthroughs |
| 93 | +php index.php -mode query --query "What was the significance of AlexNet in 2012?" |
| 94 | + |
| 95 | +# Concepts and explanations |
| 96 | +php index.php -mode query --query "How do Large Language Models and Generative AI work?" |
| 97 | + |
| 98 | +# Historical figures |
| 99 | +php index.php -mode query --query "Who is considered the first computer programmer?" |
| 100 | +``` |
| 101 | + |
| 102 | +## How It Works |
| 103 | + |
| 104 | +### Document Chunking |
| 105 | + |
| 106 | +The document is chunked based on: |
| 107 | +- **CHAPTER markers**: New chapters create new chunks |
| 108 | +- **PAGE markers**: New pages create new chunks |
| 109 | +- **Text accumulation**: Text between markers is accumulated into chunks |
| 110 | + |
| 111 | +Each chunk includes: |
| 112 | +- Unique ID |
| 113 | +- Document text |
| 114 | +- Metadata (chapter and page information) |
| 115 | + |
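The marker-based chunking described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the `chunkDocument()` implementation from `index.php`: the function name `sketchChunkDocument()`, the ID scheme, and the assumption that markers are lines beginning with `CHAPTER` or `PAGE` are all hypothetical.

```php
<?php
// Hypothetical sketch: assumes markers are lines that start with "CHAPTER"
// or "PAGE". The real chunkDocument() in index.php may differ.
function sketchChunkDocument(string $text): array
{
    $chunks = [];
    $chapter = '';
    $page = '';
    $buffer = [];

    // Emit the accumulated text as one chunk, then reset the buffer.
    $flush = function () use (&$chunks, &$chapter, &$page, &$buffer): void {
        $body = trim(implode("\n", $buffer));
        $buffer = [];
        if ($body === '') {
            return; // nothing accumulated between two markers
        }
        $chunks[] = [
            'id' => 'chunk_' . (count($chunks) + 1),
            'document' => $body,
            'metadata' => ['chapter' => $chapter, 'page' => $page],
        ];
    };

    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if (str_starts_with($line, 'CHAPTER')) {
            $flush();          // a new chapter closes the current chunk
            $chapter = $line;
        } elseif (str_starts_with($line, 'PAGE')) {
            $flush();          // a new page closes the current chunk
            $page = $line;
        } else {
            $buffer[] = $line; // plain text accumulates into the chunk
        }
    }
    $flush(); // emit whatever follows the last marker

    return $chunks;
}
```

Each returned entry carries the three fields listed above (ID, text, and chapter/page metadata), so the arrays can be split into parallel lists of IDs, documents, and metadata before embedding.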
| 116 | +### Embedding Generation |
| 117 | + |
| 118 | +- Uses Jina AI's embedding function to convert text chunks into vector embeddings |
| 119 | +- Embeddings are generated in batch for efficiency |
| 120 | +- All chunks are embedded before storage |
| 121 | + |
| 122 | +### Storage |
| 123 | + |
| 124 | +- Chunks are stored in a Chroma Cloud collection |
| 125 | +- The collection is recreated on each ingestion (previous data is deleted) |
| 126 | +- Each chunk maintains its metadata for filtering and context |
| 127 | + |
| 128 | +### Querying |
| 129 | + |
| 130 | +- Natural language queries are converted to embeddings using the same Jina AI function |
| 131 | +- Vector similarity search finds the most relevant chunks |
| 132 | +- Results include distance scores, documents, and metadata |
| 133 | + |
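To make the distance scores concrete, here is a small sketch of ranking chunks by distance to a query vector. It is illustrative only: Chroma performs the search server-side, the metric depends on how the collection is configured, and cosine distance is assumed here. The function names `cosineDistance()` and `rankByDistance()` are hypothetical.

```php
<?php
// Illustrative only: cosine distance is an assumption, and Chroma computes
// distances itself. Both vectors must be non-zero and of equal length.
function cosineDistance(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    return 1.0 - $dot / (sqrt($normA) * sqrt($normB));
}

// Returns the $limit closest chunk IDs mapped to their distances,
// ordered from most to least similar.
function rankByDistance(array $queryVector, array $chunkVectors, int $limit = 3): array
{
    $distances = [];
    foreach ($chunkVectors as $id => $vector) {
        $distances[$id] = cosineDistance($queryVector, $vector);
    }
    asort($distances); // ascending: smaller distance means more similar

    return array_slice($distances, 0, $limit, true);
}
```

With this metric, a distance near 0 means the chunk points in almost the same direction as the query embedding, which is why lower scores in the results indicate closer matches.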
| 134 | +## Output |
| 135 | + |
| 136 | +### Ingest Mode |
| 137 | + |
| 138 | +``` |
| 139 | +--- Chroma Cloud Example: ingest Mode --- |
| 140 | +Tenant: default_tenant, Database: default_database |
| 141 | +Connected to Chroma Cloud version: 0.1.0 |
| 142 | +Starting Ingestion... |
| 143 | +Parsed 9 chunks from document. |
| 144 | +Embedding and adding 9 items... |
| 145 | +Ingestion Complete! |
| 146 | +``` |
| 147 | + |
| 148 | +### Query Mode |
| 149 | + |
| 150 | +``` |
| 151 | +--- Chroma Cloud Example: query Mode --- |
| 152 | +Tenant: default_tenant, Database: default_database |
| 153 | +Connected to Chroma Cloud version: 0.1.0 |
| 154 | +Querying: "What happened at the Dartmouth Workshop?" |
| 155 | + |
| 156 | +--- Results --- |
| 157 | +[0] (Distance: 0.123) |
| 158 | +Location: CHAPTER 1: The Dawn of Thinking Machines, PAGE 3 |
| 159 | +Content: The 1956 Dartmouth Workshop is widely considered the founding event of AI as a field. John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon brought together... |
| 160 | +--------------------------- |
| 161 | +``` |
| 162 | + |
| 163 | +## Customization |
| 164 | + |
| 165 | +### Using a Different Document |
| 166 | + |
| 167 | +Replace `document.txt` with your own document. The chunking logic splits on CHAPTER and PAGE markers, so the replacement file needs to contain those markers to be chunked the same way. |
| 168 | + |
| 169 | +### Using a Different Embedding Function |
| 170 | + |
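| 171 | +Modify `index.php` to use a different embedding function. The OpenAI example below assumes you have added an `openai_key` entry to `$config`, since the CLI arguments above do not provide one: |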
| 172 | + |
| 173 | +```php |
| 174 | +use Codewithkyrian\ChromaDB\Embeddings\OpenAIEmbeddingFunction; |
| 175 | + |
| 176 | +$ef = new OpenAIEmbeddingFunction($config['openai_key']); |
| 177 | +``` |
| 178 | + |
| 179 | +### Custom Chunking Strategy |
| 180 | + |
| 181 | +Modify the `chunkDocument()` function to implement your own chunking logic, such as splitting by sentence, by paragraph, or into fixed-size chunks. |
| 182 | + |
| 183 | +## Troubleshooting |
| 184 | + |
| 185 | +**Error: Chroma Cloud API Key is required** |
| 186 | +- Set the `CHROMA_API_KEY` environment variable or pass the `--api-key` argument |
| 187 | + |
| 188 | +**Error: Jina API Key is required** |
| 189 | +- Set the `JINA_API_KEY` environment variable or pass the `--jina-key` argument |
| 190 | + |
| 191 | +**Error: Collection not found** |
| 192 | +- Run ingest mode first to create and populate the collection |
| 193 | + |
| 194 | +**No results returned** |
| 195 | +- Ensure ingestion completed successfully and the collection is populated |
| 196 | +- Try different query phrasings |
| 197 | +- Check that the query is related to the document content |
| 198 | + |