Commit 3fecc7b

add docs
1 parent b3051ce commit 3fecc7b

2 files changed

+241
-10
lines changed

README.md

Lines changed: 240 additions & 8 deletions
@@ -1,24 +1,256 @@
1-
# astchunk
1+
# code-chunk
22

3-
Type-safe components for scalable applications
3+
AST-aware code chunking for semantic search and RAG pipelines.
4+
5+
Uses tree-sitter to split source code at semantic boundaries (functions, classes, methods) rather than arbitrary character limits. Each chunk includes rich context: scope chain, imports, siblings, and entity signatures.
6+
7+
## Table of Contents
8+
9+
- [Features](#features)
10+
- [How It Works](#how-it-works)
11+
- [Installation](#installation)
12+
- [Quickstart](#quickstart)
13+
- [API Reference](#api-reference)
14+
- [License](#license)
15+
16+
## Features
17+
18+
- **AST-aware** — Splits at semantic boundaries, never mid-statement
19+
- **Rich context** — Scope chain, imports, siblings, entity signatures
20+
- **Contextualized text** — Pre-formatted for embedding models
21+
- **Multi-language** — TypeScript, JavaScript, Python, Rust, Go, Java
22+
- **Streaming** — Process large files incrementally
23+
- **Effect support** — First-class Effect integration
24+
25+
## How It Works
26+
27+
Traditional text splitters chunk code by character count or line breaks, often cutting functions in half or separating related code. `code-chunk` takes a different approach:
28+
29+
### 1. Parse
30+
31+
Source code is parsed into an Abstract Syntax Tree (AST) using [tree-sitter](https://tree-sitter.github.io/tree-sitter/). This gives us a structured representation of the code that understands language grammar.
32+
33+
### 2. Extract
34+
35+
We traverse the AST to extract semantic entities: functions, methods, classes, interfaces, types, and imports. For each entity, we capture:
36+
- Name and type
37+
- Full signature (e.g., `async getUser(id: string): Promise<User>`)
38+
- Docstring/comments if present
39+
- Byte and line ranges
40+
41+
### 3. Build Scope Tree
42+
43+
Entities are organized into a hierarchical scope tree that captures nesting relationships. A method inside a class knows its parent; a nested function knows its containing function. This enables us to provide scope context like `UserService > getUser`.
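
The nesting lookup described above can be illustrated with a minimal sketch. It assumes each entity carries a byte range and derives the scope chain from range containment; the `Entity` shape and the `scopeChain` helper are hypothetical and are not taken from the library's source.

```typescript
// Hypothetical entity shape for illustration; not the library's actual types.
interface Entity {
  name: string
  type: 'class' | 'function' | 'method'
  startByte: number
  endByte: number
}

// Collects every entity whose byte range encloses `target`, orders them
// from outermost to innermost, and appends the target itself.
function scopeChain(entities: Entity[], target: Entity): string[] {
  return entities
    .filter(e => e !== target && e.startByte <= target.startByte && e.endByte >= target.endByte)
    .sort((a, b) => a.startByte - b.startByte || b.endByte - a.endByte)
    .concat(target)
    .map(e => e.name)
}

const userService: Entity = { name: 'UserService', type: 'class', startByte: 0, endByte: 200 }
const getUser: Entity = { name: 'getUser', type: 'method', startByte: 40, endByte: 120 }
const helper: Entity = { name: 'helper', type: 'function', startByte: 210, endByte: 260 }

console.log(scopeChain([userService, getUser, helper], getUser).join(' > '))
// prints "UserService > getUser"
```

A real implementation would more likely build the tree once while walking the AST rather than filtering per lookup; the containment test is only the simplest way to show the idea.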
44+
45+
### 4. Chunk
46+
47+
Code is split at semantic boundaries while respecting the `maxChunkSize` limit. The chunker:
48+
- Prefers to keep complete entities together
49+
- Splits oversized entities at logical points (statement boundaries)
50+
- Never cuts mid-expression or mid-statement
51+
- Merges small adjacent chunks to reduce fragmentation
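
As a rough illustration of the merging rule in the list above, the sketch below greedily packs adjacent pieces into chunks that stay under a size limit. It is an assumption about how such a step could look, not the library's algorithm: `mergeAdjacent` is a hypothetical helper, size is measured with `string.length` as a stand-in for bytes, and a single piece larger than the limit is emitted on its own rather than split at statement boundaries.

```typescript
// Greedily merges adjacent pieces (for example, one entity's source each)
// while the combined text stays within maxChunkSize.
// `size` defaults to string length as a simple stand-in for a byte count.
function mergeAdjacent(
  pieces: string[],
  maxChunkSize: number,
  size: (s: string) => number = s => s.length,
): string[] {
  const chunks: string[] = []
  let current = ''
  for (const piece of pieces) {
    const candidate = current === '' ? piece : current + '\n' + piece
    if (current !== '' && size(candidate) > maxChunkSize) {
      // Adding this piece would exceed the limit: close the current chunk.
      chunks.push(current)
      current = piece
    } else {
      current = candidate
    }
  }
  if (current !== '') chunks.push(current)
  return chunks
}

console.log(mergeAdjacent(['aaaa', 'bb', 'cccc', 'd'], 7))
// prints [ 'aaaa\nbb', 'cccc\nd' ]
```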
52+
53+
### 5. Enrich with Context
54+
55+
Each chunk is enriched with contextual metadata:
56+
- **Scope chain** — Where this code lives (e.g., inside which class/function)
57+
- **Entities** — What's defined in this chunk
58+
- **Siblings** — What comes before/after (for continuity)
59+
- **Imports** — What dependencies are used
60+
61+
This context is formatted into `contextualizedText`, a version of the chunk with its semantic context prepended, intended as input for embedding models.
462

563
## Installation
664

765
```bash
8-
bun add astchunk
66+
bun add code-chunk
67+
# or
68+
npm install code-chunk
69+
```
70+
71+
## Quickstart
72+
73+
### Basic Usage
74+
75+
```typescript
76+
import { chunk } from 'code-chunk'
77+
78+
const chunks = await chunk('src/user.ts', sourceCode)
79+
80+
for (const c of chunks) {
81+
  console.log(c.text)
82+
  console.log(c.context.scope) // [{ name: 'UserService', type: 'class' }]
83+
  console.log(c.context.entities) // [{ name: 'getUser', type: 'method', ... }]
84+
}
85+
```
86+
87+
### Using Contextualized Text for Embeddings
88+
89+
Use `contextualizedText` for better embedding quality in RAG systems:
90+
91+
```typescript
92+
for (const c of chunks) {
93+
  const embedding = await embed(c.contextualizedText)
94+
  await vectorDB.upsert({
95+
    id: `${filepath}:${c.index}`,
96+
    embedding,
97+
    metadata: { filepath, lines: c.lineRange }
98+
  })
99+
}
100+
```
101+
102+
The `contextualizedText` prepends semantic context to the raw code:
103+
104+
```
105+
# src/services/user.ts
106+
# Scope: UserService
107+
# Defines: async getUser(id: string): Promise<User>
108+
# Uses: Database
109+
# After: constructor
110+
111+
async getUser(id: string): Promise<User> {
112+
  return this.db.query('SELECT * FROM users WHERE id = ?', [id])
113+
}
114+
```
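
For readers building a custom pipeline, a header like the one above could be assembled roughly as follows. This is a sketch under assumptions: `ChunkContext` and `formatWithContext` are hypothetical names chosen for illustration, and the library's own `formatChunkWithContext` may use a different context shape and layout.

```typescript
// Hypothetical, flattened context shape used only for this sketch.
interface ChunkContext {
  filepath: string
  scope: string[]    // outermost to innermost, e.g. ['UserService']
  defines: string[]  // signatures defined in the chunk
  uses: string[]     // imported names the chunk relies on
  after?: string     // name of the preceding sibling, if any
}

// Builds comment-style header lines, skipping empty sections,
// then appends the raw chunk text after a blank line.
function formatWithContext(text: string, ctx: ChunkContext): string {
  const header = [`# ${ctx.filepath}`]
  if (ctx.scope.length > 0) header.push(`# Scope: ${ctx.scope.join(' > ')}`)
  if (ctx.defines.length > 0) header.push(`# Defines: ${ctx.defines.join(', ')}`)
  if (ctx.uses.length > 0) header.push(`# Uses: ${ctx.uses.join(', ')}`)
  if (ctx.after !== undefined) header.push(`# After: ${ctx.after}`)
  return header.join('\n') + '\n\n' + text
}

const formatted = formatWithContext('async getUser() {}', {
  filepath: 'src/services/user.ts',
  scope: ['UserService'],
  defines: ['async getUser(id: string): Promise<User>'],
  uses: ['Database'],
  after: 'constructor',
})
console.log(formatted)
```

Skipping empty sections keeps the prefix short for top-level code that has no enclosing scope or preceding sibling.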
115+
116+
### Streaming Large Files
117+
118+
Process chunks one at a time as they are generated, instead of collecting the full array first:
119+
120+
```typescript
121+
import { chunkStream } from 'code-chunk'
122+
123+
for await (const c of chunkStream('src/large.ts', code)) {
124+
  await process(c)
125+
}
126+
```
127+
128+
### Reusable Chunker
129+
130+
Create a chunker instance when processing multiple files with the same config:
131+
132+
```typescript
133+
import { createChunker } from 'code-chunk'
134+
135+
const chunker = createChunker({
136+
  maxChunkSize: 2048,
137+
  contextMode: 'full',
138+
  siblingDetail: 'signatures',
139+
})
140+
141+
for (const file of files) {
142+
  const chunks = await chunker.chunk(file.path, file.content)
143+
}
9144
```
10145

11-
## Usage
146+
### Effect Integration
147+
148+
For Effect-based pipelines:
12149

13150
```typescript
14-
import { greet } from 'astchunk';
151+
import { chunkStreamEffect } from 'code-chunk'
152+
import { Effect, Stream } from 'effect'
15153

16-
console.log(greet('World')); // Hello, World!
154+
const program = Stream.runForEach(
155+
  chunkStreamEffect('src/utils.ts', code),
156+
  (chunk) => Effect.log(chunk.text)
157+
)
158+
159+
await Effect.runPromise(program)
17160
```
18161

19-
## Contributing
162+
## API Reference
163+
164+
### `chunk(filepath, code, options?)`
165+
166+
Chunk source code into semantic pieces with context.
167+
168+
**Parameters:**
169+
- `filepath` — File path (used for language detection)
170+
- `code` — Source code string
171+
- `options` — Optional configuration
172+
173+
**Returns:** `Promise<Chunk[]>`
174+
175+
**Throws:** `ChunkingError`, `UnsupportedLanguageError`
176+
177+
---
178+
179+
### `chunkStream(filepath, code, options?)`
180+
181+
Stream chunks as they're generated. Useful for large files.
182+
183+
**Returns:** `AsyncGenerator<Chunk>`
184+
185+
Note: `chunk.totalChunks` is `-1` in streaming mode (unknown upfront).
186+
187+
---
188+
189+
### `chunkStreamEffect(filepath, code, options?)`
190+
191+
Effect-native streaming API for composable pipelines.
192+
193+
**Returns:** `Stream.Stream<Chunk, ChunkingError | UnsupportedLanguageError>`
194+
195+
---
196+
197+
### `createChunker(options?)`
198+
199+
Create a reusable chunker instance with default options.
200+
201+
**Returns:** `Chunker` with `chunk()` and `stream()` methods
202+
203+
---
204+
205+
### `formatChunkWithContext(text, context, overlapText?)`
206+
207+
Format chunk text with semantic context prepended. Useful for custom embedding pipelines.
208+
209+
**Returns:** `string`
210+
211+
---
212+
213+
### `detectLanguage(filepath)`
214+
215+
Detect programming language from file extension.
216+
217+
**Returns:** `Language | null`
218+
219+
---
220+
221+
### Options
222+
223+
| Option | Type | Default | Description |
224+
|--------|------|---------|-------------|
225+
| `maxChunkSize` | `number` | `1500` | Maximum chunk size in bytes |
226+
| `contextMode` | `'none' \| 'minimal' \| 'full'` | `'full'` | How much context to include |
227+
| `siblingDetail` | `'none' \| 'names' \| 'signatures'` | `'signatures'` | Level of sibling detail |
228+
| `filterImports` | `boolean` | `false` | Filter out import statements |
229+
| `language` | `Language` | auto | Override language detection |
230+
| `overlapLines` | `number` | `10` | Lines from previous chunk to include in `contextualizedText` |
231+
232+
---
233+
234+
### Supported Languages
235+
236+
| Language | Extensions |
237+
|----------|------------|
238+
| TypeScript | `.ts`, `.tsx`, `.mts`, `.cts` |
239+
| JavaScript | `.js`, `.jsx`, `.mjs`, `.cjs` |
240+
| Python | `.py`, `.pyi` |
241+
| Rust | `.rs` |
242+
| Go | `.go` |
243+
| Java | `.java` |
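
The extension table above amounts to a lookup from file suffix to language. A minimal sketch of such a lookup follows; `languageFromPath` is a hypothetical stand-in rather than the library's `detectLanguage`, the language identifiers other than `'typescript'` are assumed spellings, and lower-casing the extension is an added assumption.

```typescript
// Assumed identifier spellings; only 'typescript' appears in the source diff.
type Language = 'typescript' | 'javascript' | 'python' | 'rust' | 'go' | 'java'

// Mirrors the extension table in the README.
const EXTENSIONS: Record<string, Language> = {
  '.ts': 'typescript', '.tsx': 'typescript', '.mts': 'typescript', '.cts': 'typescript',
  '.js': 'javascript', '.jsx': 'javascript', '.mjs': 'javascript', '.cjs': 'javascript',
  '.py': 'python', '.pyi': 'python',
  '.rs': 'rust',
  '.go': 'go',
  '.java': 'java',
}

// Returns the language for a path, or null when the extension is unknown
// or the path has no extension at all.
function languageFromPath(filepath: string): Language | null {
  const dot = filepath.lastIndexOf('.')
  if (dot === -1) return null
  const ext = filepath.slice(dot).toLowerCase()
  return EXTENSIONS[ext] ?? null
}

console.log(languageFromPath('src/user.tsx')) // prints "typescript"
console.log(languageFromPath('notes.txt'))    // prints null
```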
244+
245+
---
246+
247+
### Errors
248+
249+
**`ChunkingError`** — Thrown when chunking fails (parsing error, extraction error, etc.)
250+
251+
**`UnsupportedLanguageError`** — Thrown when the file extension is not supported
20252

21-
Please see [CONTRIBUTING.md](./CONTRIBUTING.md) for contribution guidelines.
253+
Both errors have a `_tag` property for Effect-style error handling.
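
To show what `_tag`-based handling can look like, here is a small sketch that switches on the tag instead of using `instanceof`. The two classes below are local stand-ins defined only so the example is self-contained; they are assumed to resemble the library's errors in having a `_tag` and a message, and `explain` is a hypothetical helper.

```typescript
// Local stand-ins for the library's error classes (assumed shape).
class ChunkingError extends Error {
  readonly _tag = 'ChunkingError' as const
}

class UnsupportedLanguageError extends Error {
  readonly _tag = 'UnsupportedLanguageError' as const
}

type ChunkFailure = ChunkingError | UnsupportedLanguageError

// Discriminates on `_tag`; TypeScript checks that every case is covered.
function explain(err: ChunkFailure): string {
  switch (err._tag) {
    case 'ChunkingError':
      return `chunking failed: ${err.message}`
    case 'UnsupportedLanguageError':
      return `skipped: ${err.message}`
  }
}

console.log(explain(new ChunkingError('parse error')))
// prints "chunking failed: parse error"
console.log(explain(new UnsupportedLanguageError('no grammar for .txt')))
// prints "skipped: no grammar for .txt"
```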
22254

23255
## License
24256

packages/astchunk/src/chunking/index.ts

Lines changed: 1 addition & 2 deletions
@@ -38,12 +38,11 @@ export class ChunkError extends Error {
3838
/**
3939
* Default chunk options
4040
*/
41-
export const DEFAULT_CHUNK_OPTIONS: Required<ChunkOptions> = {
41+
export const DEFAULT_CHUNK_OPTIONS: Omit<Required<ChunkOptions>, 'language'> = {
4242
  maxChunkSize: 1500,
4343
  contextMode: 'full',
4444
  siblingDetail: 'signatures',
4545
  filterImports: false,
46-
  language: 'typescript',
4746
  overlapLines: 10,
4847
}
4948
