# code-chunk

AST-aware code chunking for semantic search and RAG pipelines.

Uses tree-sitter to split source code at semantic boundaries (functions, classes, methods) rather than arbitrary character limits. Each chunk includes rich context: scope chain, imports, siblings, and entity signatures.

## Table of Contents

- [Features](#features)
- [How It Works](#how-it-works)
- [Installation](#installation)
- [Quickstart](#quickstart)
- [API Reference](#api-reference)
- [License](#license)

## Features

- **AST-aware**: Splits at semantic boundaries, never mid-function
- **Rich context**: Scope chain, imports, siblings, entity signatures
- **Contextualized text**: Pre-formatted for embedding models
- **Multi-language**: TypeScript, JavaScript, Python, Rust, Go, Java
- **Streaming**: Process large files incrementally
- **Effect support**: First-class Effect integration

## How It Works

Traditional text splitters chunk code by character count or line breaks, often cutting functions in half or separating related code. `code-chunk` takes a different approach:

### 1. Parse

Source code is parsed into an Abstract Syntax Tree (AST) using [tree-sitter](https://tree-sitter.github.io/tree-sitter/). This gives us a structured representation of the code that understands language grammar.

### 2. Extract

We traverse the AST to extract semantic entities: functions, methods, classes, interfaces, types, and imports. For each entity, we capture (see the sketch after this list):
- Name and type
- Full signature (e.g., `async getUser(id: string): Promise<User>`)
- Docstring/comments if present
- Byte and line ranges

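Conceptually, each extracted entity can be pictured as a small record. The sketch below is illustrative only. `name` and `type` show up in `context.entities` later in this README; the remaining field names and shapes are assumptions, not the library's published types:

```typescript
// Illustrative sketch only; not the library's exact type definitions.
interface ExtractedEntity {
  name: string        // e.g. 'getUser'
  type: 'function' | 'method' | 'class' | 'interface' | 'type' | 'import'
  signature: string   // e.g. 'async getUser(id: string): Promise<User>'
  docstring?: string  // leading comment/docstring, when present
  byteRange: [number, number]  // assumed [start, end] byte offsets
  lineRange: [number, number]  // assumed [start, end] line numbers
}
```
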
### 3. Build Scope Tree

Entities are organized into a hierarchical scope tree that captures nesting relationships. A method inside a class knows its parent; a nested function knows its containing function. This enables us to provide scope context like `UserService > getUser`.

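For example, a chunk cut from inside a class reports its enclosing scope through `context.scope`, the same array shown in the quickstart below. A minimal sketch (the source string here is hypothetical):

```typescript
import { chunk } from 'code-chunk'

// Hypothetical input: a method nested inside a class.
const source = `
class UserService {
  async getUser(id) {
    return this.db.query('SELECT * FROM users WHERE id = ?', [id])
  }
}
`

const chunks = await chunk('user.js', source)
// A chunk containing getUser carries its enclosing scope, e.g.
// chunks[0].context.scope -> [{ name: 'UserService', type: 'class' }]
```
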
### 4. Chunk

Code is split at semantic boundaries while respecting the `maxChunkSize` limit (see the sketch after this list). The chunker:
- Prefers to keep complete entities together
- Splits oversized entities at logical points (statement boundaries)
- Never cuts mid-expression or mid-statement
- Merges small adjacent chunks to reduce fragmentation

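In practice this means you can tighten `maxChunkSize` without ending up with syntactically broken fragments. A minimal sketch (the size value and file path are illustrative):

```typescript
import { chunk } from 'code-chunk'

// With a small budget, each top-level function typically lands in its own
// chunk; oversized entities are split at statement boundaries instead of
// mid-expression.
const chunks = await chunk('src/utils.ts', sourceCode, { maxChunkSize: 512 })

for (const c of chunks) {
  console.log(c.index, c.context.entities.map((e) => e.name))
}
```
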
### 5. Enrich with Context

Each chunk is enriched with contextual metadata:
- **Scope chain**: Where this code lives (e.g., inside which class/function)
- **Entities**: What's defined in this chunk
- **Siblings**: What comes before/after (for continuity)
- **Imports**: What dependencies are used

This context is formatted into `contextualizedText`, which is optimized so embedding models can pick up the semantic relationships around each chunk.

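Putting the pieces together, a chunk can be pictured roughly as follows. This sketch is assembled from the fields used elsewhere in this README (`text`, `contextualizedText`, `index`, `totalChunks`, `lineRange`, `context.scope`, `context.entities`); it is not the library's exact type:

```typescript
// Illustrative sketch of a chunk's shape; not the library's exact types.
interface ChunkSketch {
  text: string                // raw source slice
  contextualizedText: string  // text with the semantic context header prepended
  index: number               // position of this chunk within the file
  totalChunks: number         // -1 in streaming mode
  lineRange: [number, number] // assumed [start, end] line numbers
  context: {
    scope: Array<{ name: string; type: string }>    // enclosing class/function chain
    entities: Array<{ name: string; type: string }> // what this chunk defines
    // plus sibling and import metadata, depending on contextMode
  }
}
```
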
## Installation

```bash
bun add code-chunk
# or
npm install code-chunk
```

## Quickstart

### Basic Usage

```typescript
import { chunk } from 'code-chunk'

const chunks = await chunk('src/user.ts', sourceCode)

for (const c of chunks) {
  console.log(c.text)
  console.log(c.context.scope)    // [{ name: 'UserService', type: 'class' }]
  console.log(c.context.entities) // [{ name: 'getUser', type: 'method', ... }]
}
```

### Using Contextualized Text for Embeddings

Use `contextualizedText` for better embedding quality in RAG systems:

```typescript
for (const c of chunks) {
  const embedding = await embed(c.contextualizedText)
  await vectorDB.upsert({
    id: `${filepath}:${c.index}`,
    embedding,
    metadata: { filepath, lines: c.lineRange }
  })
}
```

The `contextualizedText` field prepends semantic context to the raw code:

```
# src/services/user.ts
# Scope: UserService
# Defines: async getUser(id: string): Promise<User>
# Uses: Database
# After: constructor

  async getUser(id: string): Promise<User> {
    return this.db.query('SELECT * FROM users WHERE id = ?', [id])
  }
```

### Streaming Large Files

Process chunks incrementally without loading everything into memory:

```typescript
import { chunkStream } from 'code-chunk'

for await (const c of chunkStream('src/large.ts', code)) {
  await process(c)
}
```

### Reusable Chunker

Create a chunker instance when processing multiple files with the same config:

```typescript
import { createChunker } from 'code-chunk'

const chunker = createChunker({
  maxChunkSize: 2048,
  contextMode: 'full',
  siblingDetail: 'signatures',
})

for (const file of files) {
  const chunks = await chunker.chunk(file.path, file.content)
}
```

### Effect Integration

For Effect-based pipelines:

```typescript
import { chunkStreamEffect } from 'code-chunk'
import { Effect, Stream } from 'effect'

const program = Stream.runForEach(
  chunkStreamEffect('src/utils.ts', code),
  (chunk) => Effect.log(chunk.text)
)

await Effect.runPromise(program)
```

## API Reference

### `chunk(filepath, code, options?)`

Chunk source code into semantic pieces with context.

**Parameters:**
- `filepath`: File path (used for language detection)
- `code`: Source code string
- `options`: Optional configuration

**Returns:** `Promise<Chunk[]>`

**Throws:** `ChunkingError`, `UnsupportedLanguageError`

---

### `chunkStream(filepath, code, options?)`

Stream chunks as they're generated. Useful for large files.

**Returns:** `AsyncGenerator<Chunk>`

Note: `chunk.totalChunks` is `-1` in streaming mode (unknown upfront).

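Since the total is unknown while streaming, track the count yourself if you need it. A minimal sketch (`code` stands in for your file contents, as in the quickstart):

```typescript
import { chunkStream } from 'code-chunk'

let count = 0
for await (const c of chunkStream('src/large.ts', code)) {
  console.log(c.index, c.totalChunks) // totalChunks is -1 while streaming
  count += 1
}
console.log(`emitted ${count} chunks`)
```
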
---

### `chunkStreamEffect(filepath, code, options?)`

Effect-native streaming API for composable pipelines.

**Returns:** `Stream.Stream<Chunk, ChunkingError | UnsupportedLanguageError>`

---

### `createChunker(options?)`

Create a reusable chunker instance with default options.

**Returns:** `Chunker` with `chunk()` and `stream()` methods

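The Quickstart shows `chunker.chunk()`; below is a sketch of the streaming side, assuming `stream()` mirrors `chunkStream`'s `(filepath, code)` signature (this README doesn't spell it out) and using a hypothetical `indexChunk` consumer:

```typescript
import { createChunker } from 'code-chunk'

const chunker = createChunker({ maxChunkSize: 2048 })

for await (const c of chunker.stream('src/big-file.ts', code)) {
  await indexChunk(c) // hypothetical downstream step
}
```
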
---

### `formatChunkWithContext(text, context, overlapText?)`

Format chunk text with semantic context prepended. Useful for custom embedding pipelines.

**Returns:** `string`

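A usage sketch, assuming `text` and `context` come from a chunk produced by `chunk()` and that `overlapText` is optional trailing text from the previous chunk:

```typescript
import { chunk, formatChunkWithContext } from 'code-chunk'

const chunks = await chunk('src/user.ts', sourceCode)

for (const c of chunks) {
  // Rebuild the contextualized form yourself, e.g. to feed a custom pipeline.
  console.log(formatChunkWithContext(c.text, c.context))
}
```
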
---

### `detectLanguage(filepath)`

Detect programming language from file extension.

**Returns:** `Language | null`

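For example, to skip files the chunker can't handle before calling `chunk()` (a minimal sketch; `files` is assumed to carry `path` and `content` as in the Reusable Chunker example):

```typescript
import { chunk, detectLanguage } from 'code-chunk'

for (const file of files) {
  if (detectLanguage(file.path) === null) continue // unsupported extension
  const chunks = await chunk(file.path, file.content)
  console.log(file.path, chunks.length)
}
```
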
---

### Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `maxChunkSize` | `number` | `1500` | Maximum chunk size in bytes |
| `contextMode` | `'none' \| 'minimal' \| 'full'` | `'full'` | How much context to include |
| `siblingDetail` | `'none' \| 'names' \| 'signatures'` | `'signatures'` | Level of sibling detail |
| `filterImports` | `boolean` | `false` | Filter out import statements |
| `language` | `Language` | auto | Override language detection |
| `overlapLines` | `number` | `10` | Lines from previous chunk to include in `contextualizedText` |

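Options can be passed per call to `chunk()` / `chunkStream()` or set once via `createChunker()`. For example (the values shown are illustrative):

```typescript
import { chunk } from 'code-chunk'

const chunks = await chunk('src/user.py', sourceCode, {
  maxChunkSize: 1024,     // bytes
  contextMode: 'minimal',
  siblingDetail: 'names',
  filterImports: true,
  overlapLines: 5,
})
```
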
---

### Supported Languages

| Language | Extensions |
|----------|------------|
| TypeScript | `.ts`, `.tsx`, `.mts`, `.cts` |
| JavaScript | `.js`, `.jsx`, `.mjs`, `.cjs` |
| Python | `.py`, `.pyi` |
| Rust | `.rs` |
| Go | `.go` |
| Java | `.java` |

---

### Errors

**`ChunkingError`**: Thrown when chunking fails (parsing error, extraction error, etc.)

**`UnsupportedLanguageError`**: Thrown when the file extension is not supported

Both errors have a `_tag` property for Effect-style error handling.

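A minimal sketch of telling them apart when using the promise-based API (the exact `_tag` string values are assumed to match the error names; check the library's typings. `'src/legacy.cob'` is just a hypothetical unsupported file):

```typescript
import { chunk } from 'code-chunk'

try {
  await chunk('src/legacy.cob', sourceCode)
} catch (err: any) {
  if (err?._tag === 'UnsupportedLanguageError') {
    console.warn('skipping unsupported file')
  } else if (err?._tag === 'ChunkingError') {
    console.error('chunking failed', err)
  } else {
    throw err
  }
}
```
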
## License

MIT