
cognigraph-chunker

Fast text chunking toolkit with fixed-size, delimiter-based, semantic, and cognition-aware strategies.

License: MIT

Features

  • Four chunking strategies -- fixed-size with delimiter-aware boundaries, delimiter/pattern splitting, embedding-based semantic chunking, and cognition-aware chunking with multi-signal boundary scoring
  • Cognition-aware chunking -- 8-signal boundary scoring (semantic similarity, entity continuity, discourse continuation, heading context, structural affinity, topic shift, orphan risk, budget pressure), proposition-aware healing, cross-chunk entity tracking, and automatic quality metrics
  • Multilingual -- automatic language detection across 70+ languages with language-specific enrichment for 14 language groups (English, German, French, Spanish, Portuguese, Italian, Dutch, Russian, Turkish, Polish, Chinese, Japanese, Korean, Arabic)
  • Four interfaces -- CLI tool, REST API (Axum), Python bindings (PyO3), and Docker
  • Five embedding providers -- OpenAI, Ollama, ONNX Runtime (local), Cloudflare Workers AI, and OAuth-authenticated OpenAI-compatible endpoints
  • Markdown-aware -- parses markdown AST to preserve tables, code blocks, headings, and lists as atomic units
  • Optional LLM enrichment -- relation triple extraction and chunk synopsis generation via OpenAI-compatible API (post-assembly, no LLM needed for core chunking)
  • Graph export -- output chunks as nodes with adjacency and shared-entity edges, ready for graph databases
  • Ambiguous boundary refinement -- optional cross-encoder reranking for precision improvement on uncertain boundaries (NVIDIA NIM, Cohere, Cloudflare Workers AI, OAuth-authenticated endpoints, or local ONNX)
  • Merge post-processing -- combine small chunks into token-budget groups across all strategies
  • Output formats -- plain text, JSON, and JSONL

Installation

CLI (from crates.io)

cargo install cognigraph-chunker

Python (via maturin)

pip install cognigraph-chunker

From source

git clone https://github.com/gedankrayze/cognigraph-chunker.git
cd cognigraph-chunker
cargo build --release

The binary is at target/release/cognigraph-chunker.

Quick Start

CLI

# Fixed-size chunks of 1024 bytes
cognigraph-chunker chunk -i document.md -s 1024

# Split on sentence boundaries, JSON output
cognigraph-chunker split -i document.md -d ".?!" -f json

# Semantic chunking with Ollama
cognigraph-chunker semantic -i document.md

# Cognition-aware chunking (preserves entity chains, discourse structure, heading context)
cognigraph-chunker cognitive -i document.md -f json

# Cognitive chunking with graph export
cognigraph-chunker cognitive -i document.md --graph

# Cognitive chunking with LLM-based relation extraction
cognigraph-chunker cognitive -i document.md --relations -f json

REST API

# Start the server
cognigraph-chunker serve --api-key my-secret --port 3000

# Fixed-size chunking
curl -X POST http://localhost:3000/api/v1/chunk \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-secret" \
  -d '{"text": "Your long document text here...", "size": 1024}'

# Cognitive chunking
curl -X POST http://localhost:3000/api/v1/cognitive \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-secret" \
  -d '{"text": "Your long document text here...", "provider": "openai"}'

Python

from cognigraph_chunker import Chunker

for chunk in Chunker("Your long document text here...", size=1024):
    print(chunk)

CLI Reference

Global Options

These flags apply to all subcommands:

Flag Default Description
--verbose off Show detailed processing information
--quiet off Suppress all informational output (conflicts with --verbose)
--stats off Print chunk statistics after output (count, avg/min/max size)
--max-input-size 52428800 (50 MiB) Maximum input size in bytes

chunk -- Fixed-size chunking

Split text into chunks of a target byte size, with delimiter-aware boundary detection.

Flag Short Default Description
--input -i - (stdin) Input file path, or - for stdin
--size -s 4096 Target chunk size in bytes
--delimiters -d none Single-byte delimiters to split on (e.g., "\n.?")
--pattern -p none Multi-byte pattern to split on
--prefix off Put delimiter at start of next chunk instead of end of current
--consecutive off Split at start of consecutive delimiter runs
--forward-fallback off Search forward if no boundary found in backward window
--format -f plain Output format: plain, json, jsonl
--merge off Post-process by merging small chunks to fit a token budget
--chunk-size 512 Target token count per merged chunk (used with --merge)

Examples:

# 2 KB chunks with newline/period boundaries
cognigraph-chunker chunk -i input.txt -s 2048 -d "\n."

# Prefix mode: delimiters go to the start of the next chunk
cognigraph-chunker chunk -i input.txt -d "\n.?" --prefix

# Pipe from stdin, JSON output
cat file.txt | cognigraph-chunker chunk -s 1024 -f json

# Chunk then merge small pieces into ~256 token groups
cognigraph-chunker chunk -i doc.md -s 512 --merge --chunk-size 256

split -- Delimiter splitting

Split text at every occurrence of specified delimiters or patterns.

Flag Short Default Description
--input -i - (stdin) Input file path, or - for stdin
--delimiters -d \n.? Single-byte delimiters to split on
--patterns -p none Multi-byte patterns, comma-separated (e.g., ". ,? ,! " for the three patterns ". ", "? ", and "! ")
--include-delim prev Where to attach delimiter: prev, next, or none
--min-chars 0 Minimum characters per segment; shorter segments are merged
--format -f plain Output format: plain, json, jsonl
--merge off Post-process by merging small chunks to fit a token budget
--chunk-size 512 Target token count per merged chunk (used with --merge)

Examples:

# Split on sentence-ending punctuation
cognigraph-chunker split -i doc.md -d ".?!"

# Multi-byte patterns, attach delimiters to next segment
cognigraph-chunker split -i doc.md -p ". ,? " --include-delim next

# Minimum 100 chars per segment, JSONL output
cognigraph-chunker split -i doc.md --min-chars 100 -f jsonl

# Split then merge into ~512 token groups
cognigraph-chunker split -i doc.md --merge --chunk-size 512

semantic -- Semantic chunking

Split text based on embedding similarity using Savitzky-Golay smoothing to detect topic boundaries.

Flag Short Default Description
--input -i - (stdin) Input file path, or - for stdin
--provider -p ollama Embedding provider: ollama, openai, onnx, cloudflare, oauth
--model -m provider default Model name (provider-specific)
--api-key none API key for OpenAI (also reads env/file)
--base-url none Base URL override for the embedding API
--model-path none Path to ONNX model directory (required for onnx provider)
--cf-auth-token none Cloudflare auth token (also reads env/.env.cloudflare)
--cf-account-id none Cloudflare account ID (also reads env/.env.cloudflare)
--cf-ai-gateway none Cloudflare AI Gateway name (optional; routes through gateway)
--oauth-token-url none OAuth token endpoint URL (also reads env/.env.oauth)
--oauth-client-id none OAuth client ID (also reads env/.env.oauth)
--oauth-client-secret none OAuth client secret (also reads env/.env.oauth)
--oauth-scope none OAuth scope (optional; also reads env/.env.oauth)
--oauth-base-url none Base URL for the OpenAI-compatible API (also reads env/.env.oauth)
--danger-accept-invalid-certs off Accept invalid TLS certificates (for corporate proxies)
--sim-window 3 Window size for cross-similarity computation (must be odd, >= 3)
--sg-window 11 Savitzky-Golay smoothing window size (must be odd)
--poly-order 3 Savitzky-Golay polynomial order
--threshold 0.5 Percentile threshold for split point filtering (0.0--1.0)
--min-distance 2 Minimum block gap between split points
--format -f plain Output format: plain, json, jsonl
--emit-distances off Emit raw and smoothed distance curves to stderr
--no-markdown off Treat input as plain text instead of markdown
--merge off Post-process by merging small chunks to fit a token budget
--chunk-size 512 Target token count per merged chunk (used with --merge)

Examples:

# Semantic chunking with Ollama (default)
cognigraph-chunker semantic -i document.md

# Use OpenAI embeddings, JSON output
cognigraph-chunker semantic -i doc.md -p openai -f json

# Tune signal processing parameters
cognigraph-chunker semantic -i doc.md --sg-window 15 --threshold 0.3

# Export distance curves for debugging
cognigraph-chunker semantic -i doc.md --emit-distances 2>distances.tsv

# Plain text mode (no markdown parsing)
cognigraph-chunker semantic -i doc.md --no-markdown

# Local ONNX model
cognigraph-chunker semantic -i doc.md -p onnx --model-path ./models/all-MiniLM-L6-v2

# Cloudflare Workers AI (reads credentials from .env.cloudflare)
cognigraph-chunker semantic -i doc.md -p cloudflare

# Cloudflare via AI Gateway
cognigraph-chunker semantic -i doc.md -p cloudflare --cf-ai-gateway my-gateway

# OAuth-authenticated endpoint (reads credentials from .env.oauth)
cognigraph-chunker semantic -i doc.md -p oauth

# OAuth with custom CA (corporate proxy)
cognigraph-chunker semantic -i doc.md -p oauth --danger-accept-invalid-certs

cognitive -- Cognition-aware chunking

Split text using multi-signal boundary scoring that preserves entity chains, discourse structure, and heading context. Extends semantic chunking with eight cognitive signals and proposition-aware healing.

Flag Short Default Description
--input -i - (stdin) Input file path, or - for stdin
--provider -p ollama Embedding provider: ollama, openai, onnx, cloudflare, oauth
--model -m provider default Model name (provider-specific)
--api-key none API key for OpenAI (also reads env/file)
--base-url none Base URL override for the embedding API
--model-path none Path to ONNX model directory (required for onnx provider)
--cf-auth-token none Cloudflare auth token (also reads env/.env.cloudflare)
--cf-account-id none Cloudflare account ID (also reads env/.env.cloudflare)
--cf-ai-gateway none Cloudflare AI Gateway name (optional)
--oauth-token-url none OAuth token endpoint URL (also reads env/.env.oauth)
--oauth-client-id none OAuth client ID (also reads env/.env.oauth)
--oauth-client-secret none OAuth client secret (also reads env/.env.oauth)
--oauth-scope none OAuth scope (optional)
--oauth-base-url none Base URL for the OpenAI-compatible API (also reads env/.env.oauth)
--danger-accept-invalid-certs off Accept invalid TLS certificates (for corporate proxies)
--soft-budget 512 Soft token budget per chunk (assembly prefers to stay under this)
--hard-budget 768 Hard token ceiling per chunk (never exceeded unless a single block is larger)
--sim-window 3 Window size for cross-similarity computation (must be odd, >= 3)
--sg-window 11 Savitzky-Golay smoothing window size (must be odd)
--poly-order 3 Savitzky-Golay polynomial order
--language auto-detect Language override (en, de, fr, es, pt, it, nl, ru, zh, ja, ko, ar, tr, pl) or auto
--reranker none Reranker for ambiguous boundary refinement: nvidia, cohere, cloudflare, oauth, onnx:<path>, or a bare path
--relations off Extract relation triples via LLM (requires OpenAI API key)
--synopsis off Generate LLM-based synopsis for each chunk (requires OpenAI API key)
--graph off Output as graph structure (nodes + edges) instead of flat chunks
--emit-signals off Emit full boundary signal diagnostics to stderr
--no-markdown off Treat input as plain text instead of markdown
--format -f plain Output format: plain, json, jsonl

Examples:

# Cognitive chunking with Ollama (default)
cognigraph-chunker cognitive -i document.md

# Use OpenAI embeddings, JSON output
cognigraph-chunker cognitive -i doc.md -p openai -f json

# Custom token budgets
cognigraph-chunker cognitive -i doc.md --soft-budget 256 --hard-budget 512

# With NVIDIA NIM reranker (reads .env.nvidia for credentials)
cognigraph-chunker cognitive -i doc.md --reranker nvidia

# With Cohere reranker (reads .env.cohere for credentials)
cognigraph-chunker cognitive -i doc.md --reranker cohere

# With Cloudflare Workers AI reranker (reads .env.cloudflare for credentials)
cognigraph-chunker cognitive -i doc.md --reranker cloudflare

# With OAuth-authenticated reranker (reads .env.oauth for credentials)
cognigraph-chunker cognitive -i doc.md --reranker oauth

# With local ONNX cross-encoder reranker
cognigraph-chunker cognitive -i doc.md --reranker onnx:./models/ms-marco-MiniLM-L-6-v2

# OpenAI embeddings + NVIDIA reranking (best quality/speed combo)
cognigraph-chunker cognitive -i doc.md -p openai --reranker nvidia

# Extract relation triples via LLM
cognigraph-chunker cognitive -i doc.md --relations -f json

# Graph export (nodes + edges with entity links)
cognigraph-chunker cognitive -i doc.md --graph

# Generate chunk synopses via LLM
cognigraph-chunker cognitive -i doc.md --synopsis -f json

# Force language (skip auto-detection)
cognigraph-chunker cognitive -i doc.md --language de

# Full diagnostics with stats
cognigraph-chunker cognitive -i doc.md --emit-signals --stats -f json

# Plain text mode (no markdown parsing)
cognigraph-chunker cognitive -i doc.md --no-markdown

serve -- REST API server

Start an HTTP server exposing all chunking operations.

Flag Short Default Description
--host 0.0.0.0 Host address to bind to
--port -p 3000 Port to listen on
--api-key none API key for bearer token authentication
--no-auth off Run without authentication (insecure)
--allow-private-urls off Allow embedding provider base URLs pointing to private/loopback IPs
--cors-origin none Allowed CORS origins (repeatable; omit for same-origin only)

Examples:

# Start with authentication
cognigraph-chunker serve --api-key my-secret

# Custom port with CORS
cognigraph-chunker serve --port 8080 --api-key my-secret --cors-origin https://example.com

# Development mode (no auth, allow private URLs)
cognigraph-chunker serve --no-auth --allow-private-urls

completions -- Shell completions

cognigraph-chunker completions bash > ~/.bash_completions/cognigraph-chunker
cognigraph-chunker completions zsh > ~/.zfunc/_cognigraph-chunker
cognigraph-chunker completions fish > ~/.config/fish/completions/cognigraph-chunker.fish

REST API Reference

All endpoints are under /api/v1. When --api-key is configured, include Authorization: Bearer <key> in all requests (except health).

Request body limit: 10 MiB. Request timeout: 120 seconds.

GET /api/v1/health

Health check. Always open (no auth required).

Response:

{ "status": "ok" }

POST /api/v1/chunk

Fixed-size chunking.

Request body:

{
  "text": "string (required)",
  "size": 4096,
  "delimiters": "\n.",
  "pattern": null,
  "prefix": false,
  "consecutive": false,
  "forward_fallback": false,
  "merge": false,
  "chunk_size": 512
}

Response:

{
  "chunks": [
    { "index": 0, "text": "...", "offset": 0, "length": 1024 },
    { "index": 1, "text": "...", "offset": 1024, "length": 980 }
  ],
  "count": 2
}
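Since chunks are returned in index order with byte offsets into the original input, a client can verify that adjacent chunks tile the document and reassemble the source. A minimal sketch, using made-up chunk text in place of real content:

```python
# Illustrative only: a response shaped like the example above,
# with made-up chunk text standing in for real content.
response = {
    "chunks": [
        {"index": 0, "text": "First chunk. ", "offset": 0, "length": 13},
        {"index": 1, "text": "Second chunk.", "offset": 13, "length": 13},
    ],
    "count": 2,
}

# Offsets are byte positions into the original input, so each chunk
# should begin exactly where the previous one ends.
chunks = sorted(response["chunks"], key=lambda c: c["index"])
for prev, cur in zip(chunks, chunks[1:]):
    assert prev["offset"] + prev["length"] == cur["offset"]

# Reassemble the original text from the chunk stream.
original = "".join(c["text"] for c in chunks)
print(original)  # "First chunk. Second chunk."
```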

POST /api/v1/split

Delimiter/pattern splitting.

Request body:

{
  "text": "string (required)",
  "delimiters": ".?!",
  "patterns": null,
  "include_delim": "prev",
  "min_chars": 0,
  "merge": false,
  "chunk_size": 512
}
  • include_delim: "prev" (default), "next", or "none"

Response: Same structure as /api/v1/chunk.

POST /api/v1/semantic

Semantic chunking with embeddings.

Request body:

{
  "text": "string (required)",
  "provider": "ollama",
  "model": null,
  "api_key": null,
  "base_url": null,
  "model_path": null,
  "cf_auth_token": null,
  "cf_account_id": null,
  "cf_ai_gateway": null,
  "oauth_token_url": null,
  "oauth_client_id": null,
  "oauth_client_secret": null,
  "oauth_scope": null,
  "oauth_base_url": null,
  "danger_accept_invalid_certs": false,
  "sim_window": 3,
  "sg_window": 11,
  "poly_order": 3,
  "threshold": 0.5,
  "min_distance": 2,
  "no_markdown": false,
  "merge": false,
  "chunk_size": 512
}
  • provider: "ollama" (default), "openai", "onnx", "cloudflare", or "oauth"
  • model_path is required when provider is "onnx"
  • cf_auth_token and cf_account_id are required for "cloudflare" (also reads env vars or .env.cloudflare)
  • cf_ai_gateway optionally routes requests through a Cloudflare AI Gateway
  • oauth_* fields are required for "oauth" (also reads env vars or .env.oauth)
  • danger_accept_invalid_certs disables TLS verification for corporate proxies with custom CAs
  • base_url is validated against SSRF (private IPs rejected unless --allow-private-urls is set)

Response: Same structure as /api/v1/chunk.

POST /api/v1/cognitive

Cognition-aware chunking with multi-signal boundary scoring.

Request body:

{
  "text": "string (required)",
  "provider": "ollama",
  "model": null,
  "api_key": null,
  "base_url": null,
  "model_path": null,
  "cf_auth_token": null,
  "cf_account_id": null,
  "cf_ai_gateway": null,
  "oauth_token_url": null,
  "oauth_client_id": null,
  "oauth_client_secret": null,
  "oauth_scope": null,
  "oauth_base_url": null,
  "danger_accept_invalid_certs": false,
  "soft_budget": 512,
  "hard_budget": 768,
  "sim_window": 3,
  "sg_window": 11,
  "poly_order": 3,
  "no_markdown": false,
  "emit_signals": false,
  "relations": false,
  "language": null,
  "reranker_path": null,
  "graph": false
}
  • soft_budget / hard_budget: token budget controls (assembly prefers soft, never exceeds hard)
  • language: override auto-detection ("en", "de", "fr", "es", "pt", "it", "nl", "ru", "zh", "ja", "ko", "ar", "tr", "pl", "auto" for explicit auto-detect)
  • reranker_path: reranker provider for ambiguous boundary refinement — "nvidia", "cohere", "cloudflare", "oauth", "onnx:<path>", or a bare path to an ONNX model directory
  • relations: extract relation triples via LLM (requires OpenAI API key)
  • graph: return graph-shaped output (nodes + edges) instead of flat chunks
  • All embedding provider fields work the same as /api/v1/semantic

Response (flat mode):

{
  "chunks": [
    {
      "index": 0,
      "text": "...",
      "offset_start": 0,
      "offset_end": 1024,
      "length": 1024,
      "heading_path": ["Architecture", "Scoring"],
      "dominant_entities": ["CogniGraph", "boundary scorer"],
      "token_estimate": 256,
      "continuity_confidence": 0.85,
      "prev_chunk": null,
      "next_chunk": 1
    }
  ],
  "count": 5,
  "block_count": 23,
  "evaluation": {
    "entity_orphan_rate": 0.0,
    "pronoun_boundary_rate": 0.0,
    "heading_attachment_rate": 1.0,
    "discourse_break_rate": 0.0,
    "triple_severance_rate": 0.0
  },
  "shared_entities": {
    "cognigraph": [0, 2, 4],
    "boundary scorer": [1, 3]
  }
}
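One way a client might consume the flat-mode response: flag low-confidence chunks for review and look up chunks by entity via the shared_entities index. The 0.5 threshold below is an arbitrary choice for illustration, not part of the API; the field values are made up to match the shape above.

```python
# Illustrative response fragment matching the flat-mode shape above.
response = {
    "chunks": [
        {"index": 0, "continuity_confidence": 0.85, "dominant_entities": ["CogniGraph"]},
        {"index": 1, "continuity_confidence": 0.42, "dominant_entities": ["boundary scorer"]},
        {"index": 2, "continuity_confidence": 0.91, "dominant_entities": ["CogniGraph"]},
    ],
    "shared_entities": {"cognigraph": [0, 2], "boundary scorer": [1]},
}

# Flag chunks whose continuity confidence falls below a review threshold
# (0.5 here is an arbitrary cutoff chosen for this example).
needs_review = [c["index"] for c in response["chunks"]
                if c["continuity_confidence"] < 0.5]

# Look up every chunk that mentions a given entity via the shared index.
cognigraph_chunks = response["shared_entities"]["cognigraph"]
print(needs_review)        # [1]
print(cognigraph_chunks)   # [0, 2]
```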

Response (graph mode, "graph": true):

{
  "nodes": [
    { "id": 0, "text": "...", "heading_path": [...], "entities": [...], "token_estimate": 256 }
  ],
  "edges": [
    { "source": 0, "target": 1, "edge_type": "adjacency" },
    { "source": 0, "target": 3, "edge_type": "entity", "entity": "CogniGraph" }
  ],
  "metadata": { "node_count": 5, "edge_count": 12 }
}
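The graph-mode output is ready to load into a graph database, but it is also easy to walk directly. A small sketch that groups neighbours by edge type, so adjacency (reading-order) links and shared-entity links can be traversed separately; the edge values are copied from the example above:

```python
from collections import defaultdict

# Illustrative graph-mode response matching the shape above.
response = {
    "nodes": [{"id": 0}, {"id": 1}, {"id": 3}],
    "edges": [
        {"source": 0, "target": 1, "edge_type": "adjacency"},
        {"source": 0, "target": 3, "edge_type": "entity", "entity": "CogniGraph"},
    ],
}

# Group neighbours by (source, edge_type) so entity links can be
# walked independently of reading-order links.
neighbours = defaultdict(list)
for edge in response["edges"]:
    neighbours[(edge["source"], edge["edge_type"])].append(edge["target"])

print(neighbours[(0, "entity")])     # [3]
print(neighbours[(0, "adjacency")])  # [1]
```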

POST /api/v1/merge

Merge pre-split chunks into token-budget groups.

Request body:

{
  "chunks": ["chunk one", "chunk two", "chunk three"],
  "chunk_size": 512
}

Response:

{
  "chunks": [
    { "index": 0, "text": "chunk one chunk two", "offset": 0, "length": 19 }
  ],
  "count": 1,
  "token_counts": [4]
}
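The merging idea can be sketched as a greedy pass that packs consecutive splits until the token budget would be exceeded, then starts a new group. This is a simplified illustration of token-budget merging, not the crate's actual algorithm, which may choose boundaries differently:

```python
def greedy_merge(splits, token_counts, chunk_size):
    """Greedily pack consecutive splits until the token budget is reached.

    Simplified illustration only; the crate's merge algorithm may differ.
    """
    merged, counts = [], []
    buf, buf_tokens = [], 0
    for text, tokens in zip(splits, token_counts):
        # Flush the current group when adding this split would exceed
        # the budget (a single oversized split still forms its own group).
        if buf and buf_tokens + tokens > chunk_size:
            merged.append(" ".join(buf))
            counts.append(buf_tokens)
            buf, buf_tokens = [], 0
        buf.append(text)
        buf_tokens += tokens
    if buf:
        merged.append(" ".join(buf))
        counts.append(buf_tokens)
    return merged, counts

merged, counts = greedy_merge(
    ["chunk one", "chunk two", "chunk three"], [2, 2, 2], chunk_size=5
)
print(merged)  # ['chunk one chunk two', 'chunk three']
print(counts)  # [4, 2]
```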

Python API Reference

Chunker

Fixed-size chunking. Iterable.

from cognigraph_chunker import Chunker

chunker = Chunker(
    text,                       # str, required
    size=4096,                  # target chunk size in bytes
    delimiters=None,            # bytes, single-byte delimiters
    pattern=None,               # bytes, multi-byte pattern
    prefix=False,               # delimiter at start of next chunk
    consecutive=False,          # split at consecutive delimiter runs
    forward_fallback=False,     # search forward if no backward boundary
)

# Iterate
for chunk in chunker:
    print(chunk)

# Or collect all at once
chunker.reset()
chunks = chunker.collect_chunks()     # list[str]
offsets = chunker.collect_offsets()    # list[tuple[int, int]]

split_at_delimiters / split_at_patterns

Delimiter and pattern splitting functions.

from cognigraph_chunker import split_at_delimiters, split_at_patterns

# Split on single-byte delimiters
offsets = split_at_delimiters(
    text,                       # str
    delimiters,                 # bytes (e.g., b".?!")
    include_delim="prev",       # "prev", "next", or "none"
    min_chars=0,                # minimum chars per segment
)
# Returns list[tuple[int, int]] -- (start, end) byte offsets

# Split on multi-byte patterns
offsets = split_at_patterns(
    text,
    patterns,                   # list[bytes] (e.g., [b". ", b"? "])
    include_delim="prev",
    min_chars=0,
)
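Because the returned offsets are byte offsets, slice the UTF-8 encoding of the text rather than the str itself; the two differ whenever the input contains multi-byte characters. The offsets below are constructed by hand to illustrate the shape of the return value (the library itself is not invoked here):

```python
text = "Héllo. Wörld?"
data = text.encode("utf-8")

# Hand-constructed (start, end) byte offsets of the kind the split
# functions return; with include_delim="prev" the delimiter stays
# with the segment before it.
offsets = [(0, 7), (7, 15)]

# Slice the bytes, then decode each segment back to str.
segments = [data[start:end].decode("utf-8") for start, end in offsets]
print(segments)  # ['Héllo.', ' Wörld?']
```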

PatternSplitter

Reusable pattern splitter (compiles patterns once).

from cognigraph_chunker import PatternSplitter

splitter = PatternSplitter(patterns=[b". ", b"? ", b"! "])
offsets = splitter.split(text, include_delim="prev", min_chars=0)

merge_splits / find_merge_indices

Merge small chunks into token-budget groups.

from cognigraph_chunker import merge_splits, find_merge_indices

result = merge_splits(
    splits=["chunk one", "chunk two", "chunk three"],
    token_counts=[2, 2, 2],
    chunk_size=5,
)
print(result.merged)         # list[str]
print(result.token_counts)   # list[int]

# Just get merge boundary indices
indices = find_merge_indices(token_counts=[2, 2, 2], chunk_size=5)

Semantic Chunking

from cognigraph_chunker import (
    OllamaProvider, OpenAiProvider, OnnxProvider,
    SemanticConfig, semantic_chunk,
)

# Choose a provider
provider = OllamaProvider(model="nomic-embed-text")
# provider = OpenAiProvider("sk-...", model="text-embedding-3-small")
# provider = OnnxProvider("/path/to/model-dir")

config = SemanticConfig(
    sim_window=3,         # cross-similarity window (odd, >= 3)
    sg_window=11,         # Savitzky-Golay window (odd)
    poly_order=3,         # polynomial order
    threshold=0.5,        # percentile threshold (0.0-1.0)
    min_distance=2,       # minimum block gap between splits
    max_blocks=10000,     # maximum blocks to process
)

result = semantic_chunk(text, provider, config, markdown=True)
for chunk_text, offset in result.chunks:
    print(f"[offset={offset}] {chunk_text[:80]}...")

# Access signal data
print(result.similarities)              # list[float] -- raw distance curve
print(result.smoothed)                  # list[float] -- smoothed curve
print(result.split_indices.indices)     # list[int] -- split point indices
print(result.split_indices.values)      # list[float] -- values at split points

Signal Processing Functions

Low-level signal processing primitives used by the semantic chunker.

from cognigraph_chunker import (
    savgol_filter,
    windowed_cross_similarity,
    find_local_minima,
    filter_split_indices,
)

# Savitzky-Golay filter
smoothed = savgol_filter(data, window_length=11, poly_order=3, deriv=0)

# Cross-similarity between embedding windows
distances = windowed_cross_similarity(embeddings, n=num_blocks, d=dim, window_size=3)

# Find local minima in the distance curve
result = find_local_minima(data, window_size=11, poly_order=3, tolerance=0.1)
print(result.indices, result.values)

# Filter split indices by threshold and minimum distance
filtered = filter_split_indices(indices, values, threshold=0.5, min_distance=2)
print(filtered.indices, filtered.values)

Configuration

Environment Variables

Variable Description
OPENAI_API_KEY OpenAI API key (used by openai provider)
OLLAMA_HOST Ollama server URL (default: http://localhost:11434)
CLOUDFLARE_AUTH_TOKEN Cloudflare API token (used by cloudflare provider)
CLOUDFLARE_ACCOUNT_ID Cloudflare account ID (used by cloudflare provider)
CLOUDFLARE_AI_GATEWAY Cloudflare AI Gateway name (optional; routes through gateway)
OAUTH_TOKEN_URL OAuth token endpoint URL (used by oauth provider)
OAUTH_CLIENT_ID OAuth client ID (used by oauth provider)
OAUTH_CLIENT_SECRET OAuth client secret (used by oauth provider)
OAUTH_SCOPE OAuth scope (optional)
OAUTH_BASE_URL Base URL for the OpenAI-compatible API (used by oauth provider)
OAUTH_MODEL Model name (used by oauth provider)
COGNIGRAPH_LLM_MODEL LLM model for relation extraction and synopsis (default: gpt-4.1-mini)
NVIDIA_API_KEY NVIDIA NIM API key (used by nvidia reranker)
NVIDIA_RERANK_MODEL NVIDIA reranker model (default: nv-rerank-qa-mistral-4b:1)
NVIDIA_RERANK_BASE_URL NVIDIA reranker base URL (default: https://ai.api.nvidia.com/v1)
COHERE_API_KEY Cohere API key (used by cohere reranker)
COHERE_RERANK_MODEL Cohere reranker model (default: rerank-v3.5)
COHERE_RERANK_BASE_URL Cohere reranker base URL (default: https://api.cohere.com/v2)
CLOUDFLARE_RERANK_MODEL Cloudflare reranker model (default: @cf/baai/bge-reranker-base)
OAUTH_RERANK_PATH Rerank endpoint path appended to OAUTH_BASE_URL (default: /rerank)
OAUTH_RERANK_MODEL Model name for OAuth reranker

.env.openai File

The OpenAI provider reads API keys from a .env.openai file in the working directory:

OPENAI_API_KEY=sk-...

Key resolution order: --api-key flag / api_key field > OPENAI_API_KEY env var > .env.openai file.

.env.cloudflare File

The Cloudflare provider reads credentials from a .env.cloudflare file in the working directory. These credentials are shared between the embedding provider and the cloudflare reranker:

CLOUDFLARE_AUTH_TOKEN=your-token
CLOUDFLARE_ACCOUNT_ID=your-account-id
CLOUDFLARE_AI_GATEWAY=your-gateway-name
CLOUDFLARE_RERANK_MODEL=@cf/baai/bge-reranker-base

Key resolution order: CLI flags / request fields > environment variables > .env.cloudflare file.

.env.oauth File

The OAuth provider reads credentials from a .env.oauth file in the working directory. These credentials are shared between the embedding provider and the oauth reranker:

OAUTH_TOKEN_URL=https://auth.example.com/api/oauth/token
OAUTH_CLIENT_ID=your-client-id
OAUTH_CLIENT_SECRET=your-client-secret
OAUTH_SCOPE=embeddings
OAUTH_BASE_URL=https://api.example.com/llm-api
OAUTH_MODEL=text-embedding-3-small
OAUTH_RERANK_PATH=/rerank
OAUTH_RERANK_MODEL=rerank-model-name

The OAUTH_RERANK_PATH is appended to OAUTH_BASE_URL to form the rerank endpoint (default: /rerank). This accommodates corporate API gateways that expose reranking at non-standard paths.

Key resolution order: CLI flags / request fields > environment variables > .env.oauth file.

.env.nvidia File

The NVIDIA reranker reads credentials from a .env.nvidia file in the working directory:

NVIDIA_API_KEY=nvapi-...
NVIDIA_RERANK_MODEL=nvidia/llama-nemotron-rerank-1b-v2
NVIDIA_RERANK_BASE_URL=https://ai.api.nvidia.com/v1

Available models include nvidia/llama-nemotron-rerank-1b-v2 (recommended — fast, high quality), nv-rerank-qa-mistral-4b:1, and nvidia/rerank-qa-mistral-4b. The endpoint path is derived automatically from the model name.

Key resolution order: environment variables > .env.nvidia file.

.env.cohere File

The Cohere reranker reads credentials from a .env.cohere file in the working directory:

COHERE_API_KEY=your-key
COHERE_RERANK_MODEL=rerank-v3.5

Available models: rerank-v3.5, rerank-english-v3.0, rerank-multilingual-v3.0.

Key resolution order: environment variables > .env.cohere file.

Embedding Provider Setup

Ollama (default) -- Install Ollama and pull a model:

ollama pull nomic-embed-text

OpenAI -- Set your API key via any of the methods above. Default model: text-embedding-3-small.

ONNX -- Download a model directory containing model.onnx and tokenizer.json. Compatible with Hugging Face ONNX exports (e.g., all-MiniLM-L6-v2).

The ONNX provider requires the ONNX Runtime shared library to be installed on the system (for example, via brew install onnxruntime); set ORT_DYLIB_PATH only when the library is not on the default system paths.

cognigraph-chunker semantic -i doc.md -p onnx --model-path ./models/all-MiniLM-L6-v2

Cloudflare Workers AI -- Uses Cloudflare's hosted embedding models (e.g., @cf/baai/bge-m3, @cf/qwen/qwen3-embedding-0.6b). Set credentials via environment variables or .env.cloudflare file. The token is verified at startup. Optionally route requests through an AI Gateway for logging and rate limiting.

cognigraph-chunker semantic -i doc.md -p cloudflare
cognigraph-chunker semantic -i doc.md -p cloudflare --cf-ai-gateway my-gateway -m @cf/qwen/qwen3-embedding-0.6b

OAuth -- For OpenAI-compatible APIs behind OAuth2 client credentials authentication (e.g., corporate API gateways). Set credentials via environment variables or .env.oauth file. The token is acquired automatically, cached, and refreshed before expiry. Use --danger-accept-invalid-certs for endpoints behind corporate proxies with custom CAs.

cognigraph-chunker semantic -i doc.md -p oauth
cognigraph-chunker semantic -i doc.md -p oauth --danger-accept-invalid-certs

Docker

Build

docker build -t cognigraph-chunker .

Run

# With API key authentication
docker run -p 3000:3000 -e API_KEY=my-secret cognigraph-chunker

# Without authentication (development)
docker run -p 3000:3000 -e NO_AUTH=1 cognigraph-chunker

# With OpenAI embeddings and CORS
docker run -p 3000:3000 \
  -e API_KEY=my-secret \
  -e OPENAI_API_KEY=sk-... \
  -e CORS_ORIGINS=https://example.com \
  cognigraph-chunker

Environment Variables

Variable Description
PORT Server port (default: 3000). Automatically set by Railway, Render, Fly.io.
API_KEY Bearer token for API authentication
NO_AUTH Set to 1 to disable authentication
CORS_ORIGINS Allowed CORS origins
OPENAI_API_KEY OpenAI API key for the openai embedding provider
CLOUDFLARE_AUTH_TOKEN Cloudflare API token for the cloudflare embedding provider
CLOUDFLARE_ACCOUNT_ID Cloudflare account ID for the cloudflare embedding provider
CLOUDFLARE_AI_GATEWAY Cloudflare AI Gateway name (optional)
OAUTH_TOKEN_URL OAuth token endpoint URL for the oauth embedding provider
OAUTH_CLIENT_ID OAuth client ID for the oauth embedding provider
OAUTH_CLIENT_SECRET OAuth client secret for the oauth embedding provider
OAUTH_SCOPE OAuth scope (optional)
OAUTH_BASE_URL Base URL for the OpenAI-compatible API
OAUTH_MODEL Model name for the oauth embedding provider
ORT_DYLIB_PATH Custom path to ONNX Runtime shared library (only used when the runtime is not on default system paths). Not bundled by this crate.
COGNIGRAPH_LLM_MODEL LLM model for --relations and --synopsis (default: gpt-4.1-mini)
NVIDIA_API_KEY NVIDIA NIM API key for the nvidia reranker
NVIDIA_RERANK_MODEL NVIDIA reranker model (default: nv-rerank-qa-mistral-4b:1)
NVIDIA_RERANK_BASE_URL NVIDIA reranker base URL
COHERE_API_KEY Cohere API key for the cohere reranker
COHERE_RERANK_MODEL Cohere reranker model (default: rerank-v3.5)

Deploy on Railway / Render / Fly.io

The Dockerfile is ready for container platforms that inject a PORT environment variable. Push to your Git repository and connect it to your platform of choice. Set API_KEY (or NO_AUTH=1) in the platform's environment variable settings.

Architecture

cognigraph-chunker/
  src/
    lib.rs              # Library root (public API)
    main.rs             # CLI entry point
    core/               # Core algorithms (chunk, split, merge, signal processing)
    embeddings/         # Embedding providers (OpenAI, Ollama, ONNX, Cloudflare, OAuth)
      reranker.rs       # Cross-encoder rerankers (NVIDIA NIM, Cohere, Cloudflare, OAuth, ONNX) for boundary refinement
    semantic/           # Semantic and cognitive chunking pipelines
      enrichment/       # Cognitive enrichment (entities, discourse, heading context, language)
      cognitive_*.rs    # Cognitive scoring, assembly, and reranking
      proposition_heal.rs # Proposition-aware chunk healing
      graph_export.rs   # Graph export format (nodes + edges)
      evaluation.rs     # Quality metrics
    llm/                # LLM integration (relation extraction, synopsis generation)
    api/                # REST API (Axum handlers, types, middleware)
    cli/                # CLI subcommands and options
    output/             # Output formatting (plain, json, jsonl)
  packages/
    python/             # Python bindings (PyO3 + maturin)

The core algorithms operate on byte slices for zero-copy performance. The semantic pipeline splits text into blocks (markdown-aware or sentence-based), computes embeddings, calculates cross-similarity distances, applies Savitzky-Golay smoothing, and detects topic boundaries at local minima.

The cognitive pipeline extends this with block-level enrichment (entity detection, discourse markers, heading context, continuation flags), weighted multi-signal boundary scoring, valley-based assembly with soft/hard token budgets, and proposition-aware healing that merges chunks with broken cross-references. Language detection runs automatically, selecting appropriate heuristics for 14 language groups.
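The boundary-detection core of the semantic pipeline can be sketched in plain Python: compute a similarity value per adjacent block pair, smooth the resulting curve, and split where the smoothed curve dips to a local minimum. The sketch below substitutes a simple moving average for the Savitzky-Golay filter and uses a toy curve; it illustrates the valley-detection idea, not the crate's implementation:

```python
def moving_average(xs, window=3):
    # Simple smoothing stand-in for the Savitzky-Golay filter.
    half = window // 2
    out = []
    for i in range(len(xs)):
        span = xs[max(0, i - half): i + half + 1]
        out.append(sum(span) / len(span))
    return out

def local_minima(xs):
    # Indices where the smoothed curve dips below both neighbours.
    return [i for i in range(1, len(xs) - 1)
            if xs[i] < xs[i - 1] and xs[i] < xs[i + 1]]

# Toy similarity curve between adjacent blocks: low values suggest a
# topic boundary between those blocks.
similarities = [0.9, 0.8, 0.2, 0.85, 0.9, 0.3, 0.88]
smoothed = moving_average(similarities)
splits = local_minima(smoothed)
print(splits)  # [2] -- split after block 2, where the curve bottoms out
```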

License

MIT