Commit 6ad9efa

bump version

1 parent d797f19 commit 6ad9efa

2 files changed: +258 -1 lines changed

packages/code-chunk/README.md

Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
# code-chunk

AST-aware code chunking for semantic search and RAG pipelines.

Uses tree-sitter to split source code at semantic boundaries (functions, classes, methods) rather than arbitrary character limits. Each chunk includes rich context: scope chain, imports, siblings, and entity signatures.

## Table of Contents

- [Features](#features)
- [How It Works](#how-it-works)
- [Installation](#installation)
- [Quickstart](#quickstart)
- [API Reference](#api-reference)
- [License](#license)

## Features

- **AST-aware**: Splits at semantic boundaries, never mid-function
- **Rich context**: Scope chain, imports, siblings, entity signatures
- **Contextualized text**: Pre-formatted for embedding models
- **Multi-language**: TypeScript, JavaScript, Python, Rust, Go, Java
- **Streaming**: Process large files incrementally
- **Effect support**: First-class Effect integration

## How It Works

Traditional text splitters chunk code by character count or line breaks, often cutting functions in half or separating related code. `code-chunk` takes a different approach:

### 1. Parse

Source code is parsed into an Abstract Syntax Tree (AST) using [tree-sitter](https://tree-sitter.github.io/tree-sitter/). This gives us a structured representation of the code that understands language grammar.
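
As a rough sketch of what this step involves (illustrative only; the library's internal setup may differ), parsing with the tree-sitter Node bindings looks like this:

```typescript
// Illustrative only: code-chunk's internal parser setup may look different.
import Parser from 'tree-sitter'
import TypeScript from 'tree-sitter-typescript'

const parser = new Parser()
parser.setLanguage(TypeScript.typescript)

const tree = parser.parse('export function add(a: number, b: number) { return a + b }')

// Every node knows its type and position, which is what makes semantic splitting possible.
console.log(tree.rootNode.firstChild?.type)          // e.g. 'export_statement'
console.log(tree.rootNode.firstChild?.startPosition) // e.g. { row: 0, column: 0 }
```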

### 2. Extract

We traverse the AST to extract semantic entities: functions, methods, classes, interfaces, types, and imports. For each entity, we capture:
- Name and type
- Full signature (e.g., `async getUser(id: string): Promise<User>`)
- Docstring/comments if present
- Byte and line ranges
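
For example, a `getUser` method might be captured roughly like this (an illustrative shape, not the package's exact types):

```typescript
// Illustrative only: field names here are assumptions, not the package's exact types.
const entity = {
  name: 'getUser',
  type: 'method',
  signature: 'async getUser(id: string): Promise<User>',
  docstring: '/** Fetch a user by id. */',
  byteRange: [412, 598], // [start, end] offsets into the source
  lineRange: [18, 24],   // [start, end] lines in the file
}
```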

### 3. Build Scope Tree

Entities are organized into a hierarchical scope tree that captures nesting relationships. A method inside a class knows its parent; a nested function knows its containing function. This enables us to provide scope context like `UserService > getUser`.
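
Conceptually, the scope tree looks something like the sketch below (illustrative structure, not the actual internal data type):

```typescript
// Illustrative only: a minimal scope-tree shape and how a scope chain falls out of it.
interface ScopeNode {
  name: string
  type: 'class' | 'function' | 'method'
  children: ScopeNode[]
}

const tree: ScopeNode = {
  name: 'UserService',
  type: 'class',
  children: [{ name: 'getUser', type: 'method', children: [] }],
}

// Walking from the root to a node yields the scope chain.
const chain = [tree.name, tree.children[0].name].join(' > ')
console.log(chain) // 'UserService > getUser'
```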

### 4. Chunk

Code is split at semantic boundaries while respecting the `maxChunkSize` limit. The chunker:
- Prefers to keep complete entities together
- Splits oversized entities at logical points (statement boundaries)
- Never cuts mid-expression or mid-statement
- Merges small adjacent chunks to reduce fragmentation
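
In simplified pseudocode, the strategy is roughly the following (a sketch of the idea, not the actual implementation):

```typescript
// Simplified sketch of the strategy described above; not the actual implementation.
function chunkEntities(entities: { text: string }[], maxChunkSize: number): string[] {
  const chunks: string[] = []
  let current = ''

  for (const entity of entities) {
    if (entity.text.length > maxChunkSize) {
      // Oversized entity: flush the current chunk, then split the entity itself.
      if (current) {
        chunks.push(current)
        current = ''
      }
      chunks.push(...splitAtStatementBoundaries(entity.text, maxChunkSize))
    } else if (current.length + entity.text.length > maxChunkSize) {
      // Adding this entity would overflow the chunk, so start a new one.
      chunks.push(current)
      current = entity.text
    } else {
      // Merge small adjacent entities to reduce fragmentation.
      current += (current ? '\n\n' : '') + entity.text
    }
  }

  if (current) chunks.push(current)
  return chunks
}

// Stand-in for the real splitter: the actual chunker uses the AST to avoid cutting
// mid-statement, whereas this placeholder just groups whole lines.
function splitAtStatementBoundaries(text: string, maxChunkSize: number): string[] {
  const parts: string[] = []
  let current = ''
  for (const line of text.split('\n')) {
    if (current && current.length + line.length + 1 > maxChunkSize) {
      parts.push(current)
      current = line
    } else {
      current += (current ? '\n' : '') + line
    }
  }
  if (current) parts.push(current)
  return parts
}
```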

### 5. Enrich with Context

Each chunk is enriched with contextual metadata:
- **Scope chain**: Where this code lives (e.g., inside which class/function)
- **Entities**: What's defined in this chunk
- **Siblings**: What comes before/after (for continuity)
- **Imports**: What dependencies are used

This context is formatted into `contextualizedText`, optimized for embedding models to understand semantic relationships.
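
Roughly, the resulting context metadata looks like this (field names follow the Quickstart below; the exact shape, especially `siblings`, is illustrative):

```typescript
// Illustrative only: an approximate shape of a chunk's context metadata.
const context = {
  scope: [{ name: 'UserService', type: 'class' }],              // scope chain, outermost first
  entities: [{ name: 'getUser', type: 'method' }],              // what this chunk defines
  siblings: { before: ['constructor'], after: ['deleteUser'] }, // hypothetical neighbours
  imports: ['Database'],                                        // dependencies referenced here
}
```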

## Installation

```bash
bun add code-chunk
# or
npm install code-chunk
```

## Quickstart

### Basic Usage

```typescript
import { chunk } from 'code-chunk'

const chunks = await chunk('src/user.ts', sourceCode)

for (const c of chunks) {
  console.log(c.text)
  console.log(c.context.scope) // [{ name: 'UserService', type: 'class' }]
  console.log(c.context.entities) // [{ name: 'getUser', type: 'method', ... }]
}
```

### Using Contextualized Text for Embeddings

Use `contextualizedText` for better embedding quality in RAG systems:

```typescript
for (const c of chunks) {
  const embedding = await embed(c.contextualizedText)
  await vectorDB.upsert({
    id: `${filepath}:${c.index}`,
    embedding,
    metadata: { filepath, lines: c.lineRange }
  })
}
```

The `contextualizedText` prepends semantic context to the raw code:

```
# src/services/user.ts
# Scope: UserService
# Defines: async getUser(id: string): Promise<User>
# Uses: Database
# After: constructor

async getUser(id: string): Promise<User> {
  return this.db.query('SELECT * FROM users WHERE id = ?', [id])
}
```

### Streaming Large Files

Process chunks incrementally without loading everything into memory:

```typescript
import { chunkStream } from 'code-chunk'

for await (const c of chunkStream('src/large.ts', code)) {
  await process(c)
}
```

### Reusable Chunker

Create a chunker instance when processing multiple files with the same config:

```typescript
import { createChunker } from 'code-chunk'

const chunker = createChunker({
  maxChunkSize: 2048,
  contextMode: 'full',
  siblingDetail: 'signatures',
})

for (const file of files) {
  const chunks = await chunker.chunk(file.path, file.content)
}
```

### Effect Integration

For Effect-based pipelines:

```typescript
import { chunkStreamEffect } from 'code-chunk'
import { Effect, Stream } from 'effect'

const program = Stream.runForEach(
  chunkStreamEffect('src/utils.ts', code),
  (chunk) => Effect.log(chunk.text)
)

await Effect.runPromise(program)
```

## API Reference

### `chunk(filepath, code, options?)`

Chunk source code into semantic pieces with context.

**Parameters:**
- `filepath`: File path (used for language detection)
- `code`: Source code string
- `options`: Optional configuration

**Returns:** `Promise<Chunk[]>`

**Throws:** `ChunkingError`, `UnsupportedLanguageError`

---

### `chunkStream(filepath, code, options?)`

Stream chunks as they're generated. Useful for large files.

**Returns:** `AsyncGenerator<Chunk>`

Note: `chunk.totalChunks` is `-1` in streaming mode (unknown upfront).

---

### `chunkStreamEffect(filepath, code, options?)`

Effect-native streaming API for composable pipelines.

**Returns:** `Stream.Stream<Chunk, ChunkingError | UnsupportedLanguageError>`

---

### `createChunker(options?)`

Create a reusable chunker instance with default options.

**Returns:** `Chunker` with `chunk()` and `stream()` methods

---

### `formatChunkWithContext(text, context, overlapText?)`

Format chunk text with semantic context prepended. Useful for custom embedding pipelines.

**Returns:** `string`
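
A minimal sketch of using it in a custom pipeline (assuming the `chunk` output from the Quickstart and your own `embed` function):

```typescript
import { chunk, formatChunkWithContext } from 'code-chunk'

const chunks = await chunk('src/user.ts', sourceCode)

for (const c of chunks) {
  // Rebuild the contextualized text yourself, e.g. to control exactly what gets embedded.
  const text = formatChunkWithContext(c.text, c.context)
  const embedding = await embed(text) // embed() is your own embedding call
}
```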

---

### `detectLanguage(filepath)`

Detect programming language from file extension.

**Returns:** `Language | null`
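
For example (the exact `Language` values shown are illustrative):

```typescript
import { detectLanguage } from 'code-chunk'

detectLanguage('src/user.ts') // a Language value such as 'typescript' (exact values are illustrative)
detectLanguage('notes.txt')   // null — extension not supported
```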

---

### Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `maxChunkSize` | `number` | `1500` | Maximum chunk size in bytes |
| `contextMode` | `'none' \| 'minimal' \| 'full'` | `'full'` | How much context to include |
| `siblingDetail` | `'none' \| 'names' \| 'signatures'` | `'signatures'` | Level of sibling detail |
| `filterImports` | `boolean` | `false` | Filter out import statements |
| `language` | `Language` | auto | Override language detection |
| `overlapLines` | `number` | `10` | Lines from previous chunk to include in `contextualizedText` |
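
Options are passed as the last argument to `chunk`, `chunkStream`, and `chunkStreamEffect`, or as defaults to `createChunker`. For example:

```typescript
import { chunk } from 'code-chunk'

const chunks = await chunk('src/user.ts', sourceCode, {
  maxChunkSize: 1024,
  contextMode: 'minimal',
  filterImports: true,
})
```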

---

### Supported Languages

| Language | Extensions |
|----------|------------|
| TypeScript | `.ts`, `.tsx`, `.mts`, `.cts` |
| JavaScript | `.js`, `.jsx`, `.mjs`, `.cjs` |
| Python | `.py`, `.pyi` |
| Rust | `.rs` |
| Go | `.go` |
| Java | `.java` |

---

### Errors

**`ChunkingError`**: Thrown when chunking fails (parsing error, extraction error, etc.)

**`UnsupportedLanguageError`**: Thrown when the file extension is not supported

Both errors have a `_tag` property for Effect-style error handling.
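
For example, a caller might branch on `_tag` like this (a minimal sketch; it assumes the tag value matches the error name):

```typescript
import { chunk } from 'code-chunk'

try {
  const chunks = await chunk('src/user.ts', sourceCode)
  // ...embed or index the chunks here
} catch (error) {
  if (error instanceof Error && '_tag' in error && error._tag === 'UnsupportedLanguageError') {
    // e.g. fall back to a plain-text splitter for unsupported file types
  } else {
    throw error
  }
}
```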

## License

MIT

packages/code-chunk/package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "code-chunk",
-  "version": "0.1.0",
+  "version": "0.1.11",
   "description": "AST-aware code chunking for semantic search and RAG",
   "homepage": "https://github.com/supermemoryai/code-chunk#readme",
   "bugs": {
