# code-chunk

AST-aware code chunking for semantic search and RAG pipelines.

Uses tree-sitter to split source code at semantic boundaries (functions, classes, methods) rather than arbitrary character limits. Each chunk includes rich context: scope chain, imports, siblings, and entity signatures.

## Table of Contents

- [Features](#features)
- [How It Works](#how-it-works)
- [Installation](#installation)
- [Quickstart](#quickstart)
- [API Reference](#api-reference)
- [License](#license)

## Features

- **AST-aware**: Splits at semantic boundaries, never mid-function
- **Rich context**: Scope chain, imports, siblings, entity signatures
- **Contextualized text**: Pre-formatted for embedding models
- **Multi-language**: TypeScript, JavaScript, Python, Rust, Go, Java
- **Streaming**: Process large files incrementally
- **Effect support**: First-class Effect integration

## How It Works

Traditional text splitters chunk code by character count or line breaks, often cutting functions in half or separating related code. `code-chunk` takes a different approach:

### 1. Parse

Source code is parsed into an Abstract Syntax Tree (AST) using [tree-sitter](https://tree-sitter.github.io/tree-sitter/). This gives us a structured representation of the code that understands language grammar.

### 2. Extract

We traverse the AST to extract semantic entities: functions, methods, classes, interfaces, types, and imports. For each entity, we capture (see the sketch after this list):
- Name and type
- Full signature (e.g., `async getUser(id: string): Promise<User>`)
- Docstring/comments if present
- Byte and line ranges

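Conceptually, each extracted entity can be pictured as a small record. The sketch below is illustrative only. `name` and `type` show up in `context.entities` later in this README; the remaining field names and shapes are assumptions, not the library's published types:

```typescript
// Illustrative sketch only; not the library's exact type definitions.
interface ExtractedEntity {
  name: string        // e.g. 'getUser'
  type: 'function' | 'method' | 'class' | 'interface' | 'type' | 'import'
  signature: string   // e.g. 'async getUser(id: string): Promise<User>'
  docstring?: string  // leading comment/docstring, when present
  byteRange: [number, number]  // assumed [start, end] byte offsets
  lineRange: [number, number]  // assumed [start, end] line numbers
}
```
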
### 3. Build Scope Tree

Entities are organized into a hierarchical scope tree that captures nesting relationships. A method inside a class knows its parent; a nested function knows its containing function. This enables us to provide scope context like `UserService > getUser`.

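For example, a chunk cut from inside a class reports its enclosing scope through `context.scope`, the same array shown in the quickstart below. A minimal sketch (the source string here is hypothetical):

```typescript
import { chunk } from 'code-chunk'

// Hypothetical input: a method nested inside a class.
const source = `
class UserService {
  async getUser(id) {
    return this.db.query('SELECT * FROM users WHERE id = ?', [id])
  }
}
`

const chunks = await chunk('user.js', source)
// A chunk containing getUser carries its enclosing scope, e.g.
// chunks[0].context.scope -> [{ name: 'UserService', type: 'class' }]
```
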
### 4. Chunk

Code is split at semantic boundaries while respecting the `maxChunkSize` limit (see the sketch after this list). The chunker:
- Prefers to keep complete entities together
- Splits oversized entities at logical points (statement boundaries)
- Never cuts mid-expression or mid-statement
- Merges small adjacent chunks to reduce fragmentation

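In practice this means you can tighten `maxChunkSize` without ending up with syntactically broken fragments. A minimal sketch (the size value and file path are illustrative):

```typescript
import { chunk } from 'code-chunk'

// With a small budget, each top-level function typically lands in its own
// chunk; oversized entities are split at statement boundaries instead of
// mid-expression.
const chunks = await chunk('src/utils.ts', sourceCode, { maxChunkSize: 512 })

for (const c of chunks) {
  console.log(c.index, c.context.entities.map((e) => e.name))
}
```
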
### 5. Enrich with Context

Each chunk is enriched with contextual metadata:
- **Scope chain**: Where this code lives (e.g., inside which class/function)
- **Entities**: What's defined in this chunk
- **Siblings**: What comes before/after (for continuity)
- **Imports**: What dependencies are used

This context is formatted into `contextualizedText`, which is optimized so embedding models can pick up the semantic relationships around each chunk.

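Putting the pieces together, a chunk can be pictured roughly as follows. This sketch is assembled from the fields used elsewhere in this README (`text`, `contextualizedText`, `index`, `totalChunks`, `lineRange`, `context.scope`, `context.entities`); it is not the library's exact type:

```typescript
// Illustrative sketch of a chunk's shape; not the library's exact types.
interface ChunkSketch {
  text: string                // raw source slice
  contextualizedText: string  // text with the semantic context header prepended
  index: number               // position of this chunk within the file
  totalChunks: number         // -1 in streaming mode
  lineRange: [number, number] // assumed [start, end] line numbers
  context: {
    scope: Array<{ name: string; type: string }>    // enclosing class/function chain
    entities: Array<{ name: string; type: string }> // what this chunk defines
    // plus sibling and import metadata, depending on contextMode
  }
}
```
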
## Installation

```bash
bun add code-chunk
# or
npm install code-chunk
```

## Quickstart

### Basic Usage

```typescript
import { chunk } from 'code-chunk'

const chunks = await chunk('src/user.ts', sourceCode)

for (const c of chunks) {
  console.log(c.text)
  console.log(c.context.scope)    // [{ name: 'UserService', type: 'class' }]
  console.log(c.context.entities) // [{ name: 'getUser', type: 'method', ... }]
}
```

### Using Contextualized Text for Embeddings

Use `contextualizedText` for better embedding quality in RAG systems:

```typescript
for (const c of chunks) {
  const embedding = await embed(c.contextualizedText)
  await vectorDB.upsert({
    id: `${filepath}:${c.index}`,
    embedding,
    metadata: { filepath, lines: c.lineRange }
  })
}
```

The `contextualizedText` field prepends semantic context to the raw code:

```
# src/services/user.ts
# Scope: UserService
# Defines: async getUser(id: string): Promise<User>
# Uses: Database
# After: constructor

  async getUser(id: string): Promise<User> {
    return this.db.query('SELECT * FROM users WHERE id = ?', [id])
  }
```

### Streaming Large Files

Process chunks incrementally without loading everything into memory:

```typescript
import { chunkStream } from 'code-chunk'

for await (const c of chunkStream('src/large.ts', code)) {
  await process(c)
}
```

### Reusable Chunker

Create a chunker instance when processing multiple files with the same config:

```typescript
import { createChunker } from 'code-chunk'

const chunker = createChunker({
  maxChunkSize: 2048,
  contextMode: 'full',
  siblingDetail: 'signatures',
})

for (const file of files) {
  const chunks = await chunker.chunk(file.path, file.content)
}
```

### Effect Integration

For Effect-based pipelines:

```typescript
import { chunkStreamEffect } from 'code-chunk'
import { Effect, Stream } from 'effect'

const program = Stream.runForEach(
  chunkStreamEffect('src/utils.ts', code),
  (chunk) => Effect.log(chunk.text)
)

await Effect.runPromise(program)
```

## API Reference

### `chunk(filepath, code, options?)`

Chunk source code into semantic pieces with context.

**Parameters:**
- `filepath`: File path (used for language detection)
- `code`: Source code string
- `options`: Optional configuration

**Returns:** `Promise<Chunk[]>`

**Throws:** `ChunkingError`, `UnsupportedLanguageError`

---

### `chunkStream(filepath, code, options?)`

Stream chunks as they're generated. Useful for large files.

**Returns:** `AsyncGenerator<Chunk>`

Note: `chunk.totalChunks` is `-1` in streaming mode (unknown upfront).

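Since the total is unknown while streaming, track the count yourself if you need it. A minimal sketch (`code` stands in for your file contents, as in the quickstart):

```typescript
import { chunkStream } from 'code-chunk'

let count = 0
for await (const c of chunkStream('src/large.ts', code)) {
  console.log(c.index, c.totalChunks) // totalChunks is -1 while streaming
  count += 1
}
console.log(`emitted ${count} chunks`)
```
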
---

### `chunkStreamEffect(filepath, code, options?)`

Effect-native streaming API for composable pipelines.

**Returns:** `Stream.Stream<Chunk, ChunkingError | UnsupportedLanguageError>`

---

### `createChunker(options?)`

Create a reusable chunker instance with default options.

**Returns:** `Chunker` with `chunk()` and `stream()` methods

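The Quickstart shows `chunker.chunk()`; below is a sketch of the streaming side, assuming `stream()` mirrors `chunkStream`'s `(filepath, code)` signature (this README doesn't spell it out) and using a hypothetical `indexChunk` consumer:

```typescript
import { createChunker } from 'code-chunk'

const chunker = createChunker({ maxChunkSize: 2048 })

for await (const c of chunker.stream('src/big-file.ts', code)) {
  await indexChunk(c) // hypothetical downstream step
}
```
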
---

### `formatChunkWithContext(text, context, overlapText?)`

Format chunk text with semantic context prepended. Useful for custom embedding pipelines.

**Returns:** `string`

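A usage sketch, assuming `text` and `context` come from a chunk produced by `chunk()` and that `overlapText` is optional trailing text from the previous chunk:

```typescript
import { chunk, formatChunkWithContext } from 'code-chunk'

const chunks = await chunk('src/user.ts', sourceCode)

for (const c of chunks) {
  // Rebuild the contextualized form yourself, e.g. to feed a custom pipeline.
  console.log(formatChunkWithContext(c.text, c.context))
}
```
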
---

### `detectLanguage(filepath)`

Detect programming language from file extension.

**Returns:** `Language | null`

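For example, to skip files the chunker can't handle before calling `chunk()` (a minimal sketch; `files` is assumed to carry `path` and `content` as in the Reusable Chunker example):

```typescript
import { chunk, detectLanguage } from 'code-chunk'

for (const file of files) {
  if (detectLanguage(file.path) === null) continue // unsupported extension
  const chunks = await chunk(file.path, file.content)
  console.log(file.path, chunks.length)
}
```
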
---

### Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `maxChunkSize` | `number` | `1500` | Maximum chunk size in bytes |
| `contextMode` | `'none' \| 'minimal' \| 'full'` | `'full'` | How much context to include |
| `siblingDetail` | `'none' \| 'names' \| 'signatures'` | `'signatures'` | Level of sibling detail |
| `filterImports` | `boolean` | `false` | Filter out import statements |
| `language` | `Language` | auto | Override language detection |
| `overlapLines` | `number` | `10` | Lines from previous chunk to include in `contextualizedText` |

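Options can be passed per call to `chunk()` / `chunkStream()` or set once via `createChunker()`. For example (the values shown are illustrative):

```typescript
import { chunk } from 'code-chunk'

const chunks = await chunk('src/user.py', sourceCode, {
  maxChunkSize: 1024,     // bytes
  contextMode: 'minimal',
  siblingDetail: 'names',
  filterImports: true,
  overlapLines: 5,
})
```
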
---

### Supported Languages

| Language | Extensions |
|----------|------------|
| TypeScript | `.ts`, `.tsx`, `.mts`, `.cts` |
| JavaScript | `.js`, `.jsx`, `.mjs`, `.cjs` |
| Python | `.py`, `.pyi` |
| Rust | `.rs` |
| Go | `.go` |
| Java | `.java` |

---

### Errors

**`ChunkingError`**: Thrown when chunking fails (parsing error, extraction error, etc.)

**`UnsupportedLanguageError`**: Thrown when the file extension is not supported

Both errors have a `_tag` property for Effect-style error handling.

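A minimal sketch of telling them apart when using the promise-based API (the exact `_tag` string values are assumed to match the error names; check the library's typings. `'src/legacy.cob'` is just a hypothetical unsupported file):

```typescript
import { chunk } from 'code-chunk'

try {
  await chunk('src/legacy.cob', sourceCode)
} catch (err: any) {
  if (err?._tag === 'UnsupportedLanguageError') {
    console.warn('skipping unsupported file')
  } else if (err?._tag === 'ChunkingError') {
    console.error('chunking failed', err)
  } else {
    throw err
  }
}
```
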
## License

MIT