
cognigraph-chunker

Fast text chunking toolkit with fixed-size, delimiter-based, semantic, and cognition-aware strategies.

License: MIT

Features

  • Four chunking strategies -- fixed-size with delimiter-aware boundaries, delimiter/pattern splitting, embedding-based semantic chunking, and cognition-aware chunking with multi-signal boundary scoring
  • Cognition-aware chunking -- 8-signal boundary scoring (semantic similarity, entity continuity, discourse continuation, heading context, structural affinity, topic shift, orphan risk, budget pressure), proposition-aware healing, cross-chunk entity tracking, and automatic quality metrics
  • Multilingual -- automatic language detection across 70+ languages with language-specific enrichment for 14 language groups (English, German, French, Spanish, Portuguese, Italian, Dutch, Russian, Turkish, Polish, Chinese, Japanese, Korean, Arabic)
  • Four interfaces -- CLI tool, REST API (Axum), Python bindings (PyO3), and Docker
  • Five embedding providers -- OpenAI, Ollama, ONNX Runtime (local), Cloudflare Workers AI, and OAuth-authenticated OpenAI-compatible endpoints
  • Markdown-aware -- parses markdown AST to preserve tables, code blocks, headings, and lists as atomic units
  • Optional LLM enrichment -- relation triple extraction and chunk synopsis generation via OpenAI-compatible API (post-assembly, no LLM needed for core chunking)
  • Graph export -- output chunks as nodes with adjacency and shared-entity edges, ready for graph databases
  • Ambiguous boundary refinement -- optional cross-encoder reranking for precision improvement on uncertain boundaries (NVIDIA NIM, Cohere, Cloudflare Workers AI, OAuth-authenticated endpoints, or local ONNX)
  • Merge post-processing -- combine small chunks into token-budget groups across all strategies
  • Output formats -- plain text, JSON, and JSONL

Installation

CLI (from crates.io)

cargo install cognigraph-chunker

Python (via maturin)

pip install cognigraph-chunker

From source

git clone https://github.com/gedankrayze/cognigraph-chunker.git
cd cognigraph-chunker
cargo build --release

The binary is at target/release/cognigraph-chunker.

Quick Start

CLI

# Fixed-size chunks of 1024 bytes
cognigraph-chunker chunk -i document.md -s 1024

# Split on sentence boundaries, JSON output
cognigraph-chunker split -i document.md -d ".?!" -f json

# Semantic chunking with Ollama
cognigraph-chunker semantic -i document.md

# Cognition-aware chunking (preserves entity chains, discourse structure, heading context)
cognigraph-chunker cognitive -i document.md -f json

# Cognitive chunking with graph export
cognigraph-chunker cognitive -i document.md --graph

# Cognitive chunking with LLM-based relation extraction
cognigraph-chunker cognitive -i document.md --relations -f json

REST API

# Start the server
cognigraph-chunker serve --api-key my-secret --port 3000

# Fixed-size chunking
curl -X POST http://localhost:3000/api/v1/chunk \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-secret" \
  -d '{"text": "Your long document text here...", "size": 1024}'

# Cognitive chunking
curl -X POST http://localhost:3000/api/v1/cognitive \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer my-secret" \
  -d '{"text": "Your long document text here...", "provider": "openai"}'

Python

from cognigraph_chunker import Chunker

for chunk in Chunker("Your long document text here...", size=1024):
    print(chunk)

CLI Reference

Global Options

These flags apply to all subcommands:

Flag Default Description
--verbose off Show detailed processing information
--quiet off Suppress all informational output (conflicts with --verbose)
--stats off Print chunk statistics after output (count, avg/min/max size)
--max-input-size 52428800 (50 MiB) Maximum input size in bytes

chunk -- Fixed-size chunking

Split text into chunks of a target byte size, with delimiter-aware boundary detection.

Flag Short Default Description
--input -i - (stdin) Input file path, or - for stdin
--size -s 4096 Target chunk size in bytes
--delimiters -d none Single-byte delimiters to split on (e.g., "\n.?")
--pattern -p none Multi-byte pattern to split on
--prefix off Put delimiter at start of next chunk instead of end of current
--consecutive off Split at start of consecutive delimiter runs
--forward-fallback off Search forward if no boundary found in backward window
--format -f plain Output format: plain, json, jsonl
--merge off Post-process by merging small chunks to fit a token budget
--chunk-size 512 Target token count per merged chunk (used with --merge)

Examples:

# 2 KB chunks with newline/period boundaries
cognigraph-chunker chunk -i input.txt -s 2048 -d "\n."

# Prefix mode: delimiters go to the start of the next chunk
cognigraph-chunker chunk -i input.txt -d "\n.?" --prefix

# Pipe from stdin, JSON output
cat file.txt | cognigraph-chunker chunk -s 1024 -f json

# Chunk then merge small pieces into ~256 token groups
cognigraph-chunker chunk -i doc.md -s 512 --merge --chunk-size 256

split -- Delimiter splitting

Split text at every occurrence of specified delimiters or patterns.

Flag Short Default Description
--input -i - (stdin) Input file path, or - for stdin
--delimiters -d \n.? Single-byte delimiters to split on
--patterns -p none Multi-byte patterns, comma-separated (e.g., ". ,? ,! " for the three patterns ". ", "? ", and "! ")
--include-delim prev Where to attach delimiter: prev, next, or none
--min-chars 0 Minimum characters per segment; shorter segments are merged
--format -f plain Output format: plain, json, jsonl
--merge off Post-process by merging small chunks to fit a token budget
--chunk-size 512 Target token count per merged chunk (used with --merge)

Examples:

# Split on sentence-ending punctuation
cognigraph-chunker split -i doc.md -d ".?!"

# Multi-byte patterns, attach delimiters to next segment
cognigraph-chunker split -i doc.md -p ". ,? " --include-delim next

# Minimum 100 chars per segment, JSONL output
cognigraph-chunker split -i doc.md --min-chars 100 -f jsonl

# Split then merge into ~512 token groups
cognigraph-chunker split -i doc.md --merge --chunk-size 512

semantic -- Semantic chunking

Split text based on embedding similarity using Savitzky-Golay smoothing to detect topic boundaries.

Flag Short Default Description
--input -i - (stdin) Input file path, or - for stdin
--provider -p ollama Embedding provider: ollama, openai, onnx, cloudflare, oauth
--model -m provider default Model name (provider-specific)
--api-key none API key for OpenAI (also reads env/file)
--base-url none Base URL override for the embedding API
--model-path none Path to ONNX model directory (required for onnx provider)
--cf-auth-token none Cloudflare auth token (also reads env/.env.cloudflare)
--cf-account-id none Cloudflare account ID (also reads env/.env.cloudflare)
--cf-ai-gateway none Cloudflare AI Gateway name (optional; routes through gateway)
--oauth-token-url none OAuth token endpoint URL (also reads env/.env.oauth)
--oauth-client-id none OAuth client ID (also reads env/.env.oauth)
--oauth-client-secret none OAuth client secret (also reads env/.env.oauth)
--oauth-scope none OAuth scope (optional; also reads env/.env.oauth)
--oauth-base-url none Base URL for the OpenAI-compatible API (also reads env/.env.oauth)
--danger-accept-invalid-certs off Accept invalid TLS certificates (for corporate proxies)
--sim-window 3 Window size for cross-similarity computation (must be odd, >= 3)
--sg-window 11 Savitzky-Golay smoothing window size (must be odd)
--poly-order 3 Savitzky-Golay polynomial order
--threshold 0.5 Percentile threshold for split point filtering (0.0--1.0)
--min-distance 2 Minimum block gap between split points
--format -f plain Output format: plain, json, jsonl
--emit-distances off Emit raw and smoothed distance curves to stderr
--no-markdown off Treat input as plain text instead of markdown
--merge off Post-process by merging small chunks to fit a token budget
--chunk-size 512 Target token count per merged chunk (used with --merge)

Examples:

# Semantic chunking with Ollama (default)
cognigraph-chunker semantic -i document.md

# Use OpenAI embeddings, JSON output
cognigraph-chunker semantic -i doc.md -p openai -f json

# Tune signal processing parameters
cognigraph-chunker semantic -i doc.md --sg-window 15 --threshold 0.3

# Export distance curves for debugging
cognigraph-chunker semantic -i doc.md --emit-distances 2>distances.tsv

# Plain text mode (no markdown parsing)
cognigraph-chunker semantic -i doc.md --no-markdown

# Local ONNX model
cognigraph-chunker semantic -i doc.md -p onnx --model-path ./models/all-MiniLM-L6-v2

# Cloudflare Workers AI (reads credentials from .env.cloudflare)
cognigraph-chunker semantic -i doc.md -p cloudflare

# Cloudflare via AI Gateway
cognigraph-chunker semantic -i doc.md -p cloudflare --cf-ai-gateway my-gateway

# OAuth-authenticated endpoint (reads credentials from .env.oauth)
cognigraph-chunker semantic -i doc.md -p oauth

# OAuth with custom CA (corporate proxy)
cognigraph-chunker semantic -i doc.md -p oauth --danger-accept-invalid-certs

cognitive -- Cognition-aware chunking

Split text using multi-signal boundary scoring that preserves entity chains, discourse structure, and heading context. Extends semantic chunking with eight cognitive signals and proposition-aware healing.

Flag Short Default Description
--input -i - (stdin) Input file path, or - for stdin
--provider -p ollama Embedding provider: ollama, openai, onnx, cloudflare, oauth
--model -m provider default Model name (provider-specific)
--api-key none API key for OpenAI (also reads env/file)
--base-url none Base URL override for the embedding API
--model-path none Path to ONNX model directory (required for onnx provider)
--cf-auth-token none Cloudflare auth token (also reads env/.env.cloudflare)
--cf-account-id none Cloudflare account ID (also reads env/.env.cloudflare)
--cf-ai-gateway none Cloudflare AI Gateway name (optional)
--oauth-token-url none OAuth token endpoint URL (also reads env/.env.oauth)
--oauth-client-id none OAuth client ID (also reads env/.env.oauth)
--oauth-client-secret none OAuth client secret (also reads env/.env.oauth)
--oauth-scope none OAuth scope (optional)
--oauth-base-url none Base URL for the OpenAI-compatible API (also reads env/.env.oauth)
--danger-accept-invalid-certs off Accept invalid TLS certificates (for corporate proxies)
--soft-budget 512 Soft token budget per chunk (assembly prefers to stay under this)
--hard-budget 768 Hard token ceiling per chunk (never exceeded unless a single block is larger)
--sim-window 3 Window size for cross-similarity computation (must be odd, >= 3)
--sg-window 11 Savitzky-Golay smoothing window size (must be odd)
--poly-order 3 Savitzky-Golay polynomial order
--language auto-detect Language override (en, de, fr, es, pt, it, nl, ru, zh, ja, ko, ar, tr, pl) or auto
--reranker none Reranker for ambiguous boundary refinement: nvidia, cohere, cloudflare, oauth, onnx:<path>, or a bare path
--relations off Extract relation triples via LLM (requires OpenAI API key)
--synopsis off Generate LLM-based synopsis for each chunk (requires OpenAI API key)
--graph off Output as graph structure (nodes + edges) instead of flat chunks
--emit-signals off Emit full boundary signal diagnostics to stderr
--no-markdown off Treat input as plain text instead of markdown
--format -f plain Output format: plain, json, jsonl

Examples:

# Cognitive chunking with Ollama (default)
cognigraph-chunker cognitive -i document.md

# Use OpenAI embeddings, JSON output
cognigraph-chunker cognitive -i doc.md -p openai -f json

# Custom token budgets
cognigraph-chunker cognitive -i doc.md --soft-budget 256 --hard-budget 512

# With NVIDIA NIM reranker (reads .env.nvidia for credentials)
cognigraph-chunker cognitive -i doc.md --reranker nvidia

# With Cohere reranker (reads .env.cohere for credentials)
cognigraph-chunker cognitive -i doc.md --reranker cohere

# With Cloudflare Workers AI reranker (reads .env.cloudflare for credentials)
cognigraph-chunker cognitive -i doc.md --reranker cloudflare

# With OAuth-authenticated reranker (reads .env.oauth for credentials)
cognigraph-chunker cognitive -i doc.md --reranker oauth

# With local ONNX cross-encoder reranker
cognigraph-chunker cognitive -i doc.md --reranker onnx:./models/ms-marco-MiniLM-L-6-v2

# OpenAI embeddings + NVIDIA reranking (best quality/speed combo)
cognigraph-chunker cognitive -i doc.md -p openai --reranker nvidia

# Extract relation triples via LLM
cognigraph-chunker cognitive -i doc.md --relations -f json

# Graph export (nodes + edges with entity links)
cognigraph-chunker cognitive -i doc.md --graph

# Generate chunk synopses via LLM
cognigraph-chunker cognitive -i doc.md --synopsis -f json

# Force language (skip auto-detection)
cognigraph-chunker cognitive -i doc.md --language de

# Full diagnostics with stats
cognigraph-chunker cognitive -i doc.md --emit-signals --stats -f json

# Plain text mode (no markdown parsing)
cognigraph-chunker cognitive -i doc.md --no-markdown

serve -- REST API server

Start an HTTP server exposing all chunking operations.

Flag Short Default Description
--host 0.0.0.0 Host address to bind to
--port -p 3000 Port to listen on
--api-key none API key for bearer token authentication
--no-auth off Run without authentication (insecure)
--allow-private-urls off Allow embedding provider base URLs pointing to private/loopback IPs
--cors-origin none Allowed CORS origins (repeatable; omit for same-origin only)

Examples:

# Start with authentication
cognigraph-chunker serve --api-key my-secret

# Custom port with CORS
cognigraph-chunker serve --port 8080 --api-key my-secret --cors-origin https://example.com

# Development mode (no auth, allow private URLs)
cognigraph-chunker serve --no-auth --allow-private-urls

completions -- Shell completions

cognigraph-chunker completions bash > ~/.bash_completions/cognigraph-chunker
cognigraph-chunker completions zsh > ~/.zfunc/_cognigraph-chunker
cognigraph-chunker completions fish > ~/.config/fish/completions/cognigraph-chunker.fish

REST API Reference

All endpoints are under /api/v1. When --api-key is configured, include Authorization: Bearer <key> in all requests (except health).

Request body limit: 10 MiB. Request timeout: 120 seconds.

GET /api/v1/health

Health check. Always open (no auth required).

Response:

{ "status": "ok" }

POST /api/v1/chunk

Fixed-size chunking.

Request body:

{
  "text": "string (required)",
  "size": 4096,
  "delimiters": "\n.",
  "pattern": null,
  "prefix": false,
  "consecutive": false,
  "forward_fallback": false,
  "merge": false,
  "chunk_size": 512
}

Response:

{
  "chunks": [
    { "index": 0, "text": "...", "offset": 0, "length": 1024 },
    { "index": 1, "text": "...", "offset": 1024, "length": 980 }
  ],
  "count": 2
}
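Since chunks are returned in index order with byte offsets into the original input, a client can verify that adjacent chunks tile the document and reassemble the source. A minimal sketch, using made-up chunk text in place of real content:

```python
# Illustrative only: a response shaped like the example above,
# with made-up chunk text standing in for real content.
response = {
    "chunks": [
        {"index": 0, "text": "First chunk. ", "offset": 0, "length": 13},
        {"index": 1, "text": "Second chunk.", "offset": 13, "length": 13},
    ],
    "count": 2,
}

# Offsets are byte positions into the original input, so each chunk
# should begin exactly where the previous one ends.
chunks = sorted(response["chunks"], key=lambda c: c["index"])
for prev, cur in zip(chunks, chunks[1:]):
    assert prev["offset"] + prev["length"] == cur["offset"]

# Reassemble the original text from the chunk stream.
original = "".join(c["text"] for c in chunks)
print(original)  # "First chunk. Second chunk."
```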

POST /api/v1/split

Delimiter/pattern splitting.

Request body:

{
  "text": "string (required)",
  "delimiters": ".?!",
  "patterns": null,
  "include_delim": "prev",
  "min_chars": 0,
  "merge": false,
  "chunk_size": 512
}
  • include_delim: "prev" (default), "next", or "none"

Response: Same structure as /api/v1/chunk.

POST /api/v1/semantic

Semantic chunking with embeddings.

Request body:

{
  "text": "string (required)",
  "provider": "ollama",
  "model": null,
  "api_key": null,
  "base_url": null,
  "model_path": null,
  "cf_auth_token": null,
  "cf_account_id": null,
  "cf_ai_gateway": null,
  "oauth_token_url": null,
  "oauth_client_id": null,
  "oauth_client_secret": null,
  "oauth_scope": null,
  "oauth_base_url": null,
  "danger_accept_invalid_certs": false,
  "sim_window": 3,
  "sg_window": 11,
  "poly_order": 3,
  "threshold": 0.5,
  "min_distance": 2,
  "no_markdown": false,
  "merge": false,
  "chunk_size": 512
}
  • provider: "ollama" (default), "openai", "onnx", "cloudflare", or "oauth"
  • model_path is required when provider is "onnx"
  • cf_auth_token and cf_account_id are required for "cloudflare" (also reads env vars or .env.cloudflare)
  • cf_ai_gateway optionally routes requests through a Cloudflare AI Gateway
  • oauth_* fields are required for "oauth" (also reads env vars or .env.oauth)
  • danger_accept_invalid_certs disables TLS verification for corporate proxies with custom CAs
  • base_url is validated against SSRF (private IPs rejected unless --allow-private-urls is set)

Response: Same structure as /api/v1/chunk.

POST /api/v1/cognitive

Cognition-aware chunking with multi-signal boundary scoring.

Request body:

{
  "text": "string (required)",
  "provider": "ollama",
  "model": null,
  "api_key": null,
  "base_url": null,
  "model_path": null,
  "cf_auth_token": null,
  "cf_account_id": null,
  "cf_ai_gateway": null,
  "oauth_token_url": null,
  "oauth_client_id": null,
  "oauth_client_secret": null,
  "oauth_scope": null,
  "oauth_base_url": null,
  "danger_accept_invalid_certs": false,
  "soft_budget": 512,
  "hard_budget": 768,
  "sim_window": 3,
  "sg_window": 11,
  "poly_order": 3,
  "no_markdown": false,
  "emit_signals": false,
  "relations": false,
  "language": null,
  "reranker_path": null,
  "graph": false
}
  • soft_budget / hard_budget: token budget controls (assembly prefers soft, never exceeds hard)
  • language: override auto-detection ("en", "de", "fr", "es", "pt", "it", "nl", "ru", "zh", "ja", "ko", "ar", "tr", "pl", "auto" for explicit auto-detect)
  • reranker_path: reranker provider for ambiguous boundary refinement — "nvidia", "cohere", "cloudflare", "oauth", "onnx:<path>", or a bare path to an ONNX model directory
  • relations: extract relation triples via LLM (requires OpenAI API key)
  • graph: return graph-shaped output (nodes + edges) instead of flat chunks
  • All embedding provider fields work the same as /api/v1/semantic

Response (flat mode):

{
  "chunks": [
    {
      "index": 0,
      "text": "...",
      "offset_start": 0,
      "offset_end": 1024,
      "length": 1024,
      "heading_path": ["Architecture", "Scoring"],
      "dominant_entities": ["CogniGraph", "boundary scorer"],
      "token_estimate": 256,
      "continuity_confidence": 0.85,
      "prev_chunk": null,
      "next_chunk": 1
    }
  ],
  "count": 5,
  "block_count": 23,
  "evaluation": {
    "entity_orphan_rate": 0.0,
    "pronoun_boundary_rate": 0.0,
    "heading_attachment_rate": 1.0,
    "discourse_break_rate": 0.0,
    "triple_severance_rate": 0.0
  },
  "shared_entities": {
    "cognigraph": [0, 2, 4],
    "boundary scorer": [1, 3]
  }
}
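One way a client might consume the flat-mode response: flag low-confidence chunks for review and look up chunks by entity via the shared_entities index. The 0.5 threshold below is an arbitrary choice for illustration, not part of the API; the field values are made up to match the shape above.

```python
# Illustrative response fragment matching the flat-mode shape above.
response = {
    "chunks": [
        {"index": 0, "continuity_confidence": 0.85, "dominant_entities": ["CogniGraph"]},
        {"index": 1, "continuity_confidence": 0.42, "dominant_entities": ["boundary scorer"]},
        {"index": 2, "continuity_confidence": 0.91, "dominant_entities": ["CogniGraph"]},
    ],
    "shared_entities": {"cognigraph": [0, 2], "boundary scorer": [1]},
}

# Flag chunks whose continuity confidence falls below a review threshold
# (0.5 here is an arbitrary cutoff chosen for this example).
needs_review = [c["index"] for c in response["chunks"]
                if c["continuity_confidence"] < 0.5]

# Look up every chunk that mentions a given entity via the shared index.
cognigraph_chunks = response["shared_entities"]["cognigraph"]
print(needs_review)        # [1]
print(cognigraph_chunks)   # [0, 2]
```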

Response (graph mode, "graph": true):

{
  "nodes": [
    { "id": 0, "text": "...", "heading_path": [...], "entities": [...], "token_estimate": 256 }
  ],
  "edges": [
    { "source": 0, "target": 1, "edge_type": "adjacency" },
    { "source": 0, "target": 3, "edge_type": "entity", "entity": "CogniGraph" }
  ],
  "metadata": { "node_count": 5, "edge_count": 12 }
}
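The graph-mode output is ready to load into a graph database, but it is also easy to walk directly. A small sketch that groups neighbours by edge type, so adjacency (reading-order) links and shared-entity links can be traversed separately; the edge values are copied from the example above:

```python
from collections import defaultdict

# Illustrative graph-mode response matching the shape above.
response = {
    "nodes": [{"id": 0}, {"id": 1}, {"id": 3}],
    "edges": [
        {"source": 0, "target": 1, "edge_type": "adjacency"},
        {"source": 0, "target": 3, "edge_type": "entity", "entity": "CogniGraph"},
    ],
}

# Group neighbours by (source, edge_type) so entity links can be
# walked independently of reading-order links.
neighbours = defaultdict(list)
for edge in response["edges"]:
    neighbours[(edge["source"], edge["edge_type"])].append(edge["target"])

print(neighbours[(0, "entity")])     # [3]
print(neighbours[(0, "adjacency")])  # [1]
```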

POST /api/v1/merge

Merge pre-split chunks into token-budget groups.

Request body:

{
  "chunks": ["chunk one", "chunk two", "chunk three"],
  "chunk_size": 512
}

Response:

{
  "chunks": [
    { "index": 0, "text": "chunk one chunk two", "offset": 0, "length": 19 }
  ],
  "count": 1,
  "token_counts": [4]
}
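The merging idea can be sketched as a greedy pass that packs consecutive splits until the token budget would be exceeded, then starts a new group. This is a simplified illustration of token-budget merging, not the crate's actual algorithm, which may choose boundaries differently:

```python
def greedy_merge(splits, token_counts, chunk_size):
    """Greedily pack consecutive splits until the token budget is reached.

    Simplified illustration only; the crate's merge algorithm may differ.
    """
    merged, counts = [], []
    buf, buf_tokens = [], 0
    for text, tokens in zip(splits, token_counts):
        # Flush the current group when adding this split would exceed
        # the budget (a single oversized split still forms its own group).
        if buf and buf_tokens + tokens > chunk_size:
            merged.append(" ".join(buf))
            counts.append(buf_tokens)
            buf, buf_tokens = [], 0
        buf.append(text)
        buf_tokens += tokens
    if buf:
        merged.append(" ".join(buf))
        counts.append(buf_tokens)
    return merged, counts

merged, counts = greedy_merge(
    ["chunk one", "chunk two", "chunk three"], [2, 2, 2], chunk_size=5
)
print(merged)  # ['chunk one chunk two', 'chunk three']
print(counts)  # [4, 2]
```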

Python API Reference

Chunker

Fixed-size chunking. Iterable.

from cognigraph_chunker import Chunker

chunker = Chunker(
    text,                       # str, required
    size=4096,                  # target chunk size in bytes
    delimiters=None,            # bytes, single-byte delimiters
    pattern=None,               # bytes, multi-byte pattern
    prefix=False,               # delimiter at start of next chunk
    consecutive=False,          # split at consecutive delimiter runs
    forward_fallback=False,     # search forward if no backward boundary
)

# Iterate
for chunk in chunker:
    print(chunk)

# Or collect all at once
chunker.reset()
chunks = chunker.collect_chunks()     # list[str]
offsets = chunker.collect_offsets()    # list[tuple[int, int]]

split_at_delimiters / split_at_patterns

Delimiter and pattern splitting functions.

from cognigraph_chunker import split_at_delimiters, split_at_patterns

# Split on single-byte delimiters
offsets = split_at_delimiters(
    text,                       # str
    delimiters,                 # bytes (e.g., b".?!")
    include_delim="prev",       # "prev", "next", or "none"
    min_chars=0,                # minimum chars per segment
)
# Returns list[tuple[int, int]] -- (start, end) byte offsets

# Split on multi-byte patterns
offsets = split_at_patterns(
    text,
    patterns,                   # list[bytes] (e.g., [b". ", b"? "])
    include_delim="prev",
    min_chars=0,
)
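Because the returned offsets are byte offsets, slice the UTF-8 encoding of the text rather than the str itself; the two differ whenever the input contains multi-byte characters. The offsets below are constructed by hand to illustrate the shape of the return value (the library itself is not invoked here):

```python
text = "Héllo. Wörld?"
data = text.encode("utf-8")

# Hand-constructed (start, end) byte offsets of the kind the split
# functions return; with include_delim="prev" the delimiter stays
# with the segment before it.
offsets = [(0, 7), (7, 15)]

# Slice the bytes, then decode each segment back to str.
segments = [data[start:end].decode("utf-8") for start, end in offsets]
print(segments)  # ['Héllo.', ' Wörld?']
```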

PatternSplitter

Reusable pattern splitter (compiles patterns once).

from cognigraph_chunker import PatternSplitter

splitter = PatternSplitter(patterns=[b". ", b"? ", b"! "])
offsets = splitter.split(text, include_delim="prev", min_chars=0)

merge_splits / find_merge_indices

Merge small chunks into token-budget groups.

from cognigraph_chunker import merge_splits, find_merge_indices

result = merge_splits(
    splits=["chunk one", "chunk two", "chunk three"],
    token_counts=[2, 2, 2],
    chunk_size=5,
)
print(result.merged)         # list[str]
print(result.token_counts)   # list[int]

# Just get merge boundary indices
indices = find_merge_indices(token_counts=[2, 2, 2], chunk_size=5)

Semantic Chunking

from cognigraph_chunker import (
    OllamaProvider, OpenAiProvider, OnnxProvider,
    SemanticConfig, semantic_chunk,
)

# Choose a provider
provider = OllamaProvider(model="nomic-embed-text")
# provider = OpenAiProvider("sk-...", model="text-embedding-3-small")
# provider = OnnxProvider("/path/to/model-dir")

config = SemanticConfig(
    sim_window=3,         # cross-similarity window (odd, >= 3)
    sg_window=11,         # Savitzky-Golay window (odd)
    poly_order=3,         # polynomial order
    threshold=0.5,        # percentile threshold (0.0-1.0)
    min_distance=2,       # minimum block gap between splits
    max_blocks=10000,     # maximum blocks to process
)

result = semantic_chunk(text, provider, config, markdown=True)
for chunk_text, offset in result.chunks:
    print(f"[offset={offset}] {chunk_text[:80]}...")

# Access signal data
print(result.similarities)              # list[float] -- raw distance curve
print(result.smoothed)                  # list[float] -- smoothed curve
print(result.split_indices.indices)     # list[int] -- split point indices
print(result.split_indices.values)      # list[float] -- values at split points

Signal Processing Functions

Low-level signal processing primitives used by the semantic chunker.

from cognigraph_chunker import (
    savgol_filter,
    windowed_cross_similarity,
    find_local_minima,
    filter_split_indices,
)

# Savitzky-Golay filter
smoothed = savgol_filter(data, window_length=11, poly_order=3, deriv=0)

# Cross-similarity between embedding windows
distances = windowed_cross_similarity(embeddings, n=num_blocks, d=dim, window_size=3)

# Find local minima in the distance curve
result = find_local_minima(data, window_size=11, poly_order=3, tolerance=0.1)
print(result.indices, result.values)

# Filter split indices by threshold and minimum distance
filtered = filter_split_indices(indices, values, threshold=0.5, min_distance=2)
print(filtered.indices, filtered.values)

Configuration

Environment Variables

Variable Description
OPENAI_API_KEY OpenAI API key (used by openai provider)
OLLAMA_HOST Ollama server URL (default: http://localhost:11434)
CLOUDFLARE_AUTH_TOKEN Cloudflare API token (used by cloudflare provider)
CLOUDFLARE_ACCOUNT_ID Cloudflare account ID (used by cloudflare provider)
CLOUDFLARE_AI_GATEWAY Cloudflare AI Gateway name (optional; routes through gateway)
OAUTH_TOKEN_URL OAuth token endpoint URL (used by oauth provider)
OAUTH_CLIENT_ID OAuth client ID (used by oauth provider)
OAUTH_CLIENT_SECRET OAuth client secret (used by oauth provider)
OAUTH_SCOPE OAuth scope (optional)
OAUTH_BASE_URL Base URL for the OpenAI-compatible API (used by oauth provider)
OAUTH_MODEL Model name (used by oauth provider)
COGNIGRAPH_LLM_MODEL LLM model for relation extraction and synopsis (default: gpt-4.1-mini)
NVIDIA_API_KEY NVIDIA NIM API key (used by nvidia reranker)
NVIDIA_RERANK_MODEL NVIDIA reranker model (default: nv-rerank-qa-mistral-4b:1)
NVIDIA_RERANK_BASE_URL NVIDIA reranker base URL (default: https://ai.api.nvidia.com/v1)
COHERE_API_KEY Cohere API key (used by cohere reranker)
COHERE_RERANK_MODEL Cohere reranker model (default: rerank-v3.5)
COHERE_RERANK_BASE_URL Cohere reranker base URL (default: https://api.cohere.com/v2)
CLOUDFLARE_RERANK_MODEL Cloudflare reranker model (default: @cf/baai/bge-reranker-base)
OAUTH_RERANK_PATH Rerank endpoint path appended to OAUTH_BASE_URL (default: /rerank)
OAUTH_RERANK_MODEL Model name for OAuth reranker

.env.openai File

The OpenAI provider reads API keys from a .env.openai file in the working directory:

OPENAI_API_KEY=sk-...

Key resolution order: --api-key flag / api_key field > OPENAI_API_KEY env var > .env.openai file.

.env.cloudflare File

The Cloudflare provider reads credentials from a .env.cloudflare file in the working directory. These credentials are shared between the embedding provider and the cloudflare reranker:

CLOUDFLARE_AUTH_TOKEN=your-token
CLOUDFLARE_ACCOUNT_ID=your-account-id
CLOUDFLARE_AI_GATEWAY=your-gateway-name
CLOUDFLARE_RERANK_MODEL=@cf/baai/bge-reranker-base

Key resolution order: CLI flags / request fields > environment variables > .env.cloudflare file.

.env.oauth File

The OAuth provider reads credentials from a .env.oauth file in the working directory. These credentials are shared between the embedding provider and the oauth reranker:

OAUTH_TOKEN_URL=https://auth.example.com/api/oauth/token
OAUTH_CLIENT_ID=your-client-id
OAUTH_CLIENT_SECRET=your-client-secret
OAUTH_SCOPE=embeddings
OAUTH_BASE_URL=https://api.example.com/llm-api
OAUTH_MODEL=text-embedding-3-small
OAUTH_RERANK_PATH=/rerank
OAUTH_RERANK_MODEL=rerank-model-name

The OAUTH_RERANK_PATH is appended to OAUTH_BASE_URL to form the rerank endpoint (default: /rerank). This accommodates corporate API gateways that expose reranking at non-standard paths.

Key resolution order: CLI flags / request fields > environment variables > .env.oauth file.

.env.nvidia File

The NVIDIA reranker reads credentials from a .env.nvidia file in the working directory:

NVIDIA_API_KEY=nvapi-...
NVIDIA_RERANK_MODEL=nvidia/llama-nemotron-rerank-1b-v2
NVIDIA_RERANK_BASE_URL=https://ai.api.nvidia.com/v1

Available models include nvidia/llama-nemotron-rerank-1b-v2 (recommended — fast, high quality), nv-rerank-qa-mistral-4b:1, and nvidia/rerank-qa-mistral-4b. The endpoint path is derived automatically from the model name.

Key resolution order: environment variables > .env.nvidia file.

.env.cohere File

The Cohere reranker reads credentials from a .env.cohere file in the working directory:

COHERE_API_KEY=your-key
COHERE_RERANK_MODEL=rerank-v3.5

Available models: rerank-v3.5, rerank-english-v3.0, rerank-multilingual-v3.0.

Key resolution order: environment variables > .env.cohere file.

Embedding Provider Setup

Ollama (default) -- Install Ollama and pull a model:

ollama pull nomic-embed-text

OpenAI -- Set your API key via any of the methods above. Default model: text-embedding-3-small.

ONNX -- Download a model directory containing model.onnx and tokenizer.json. Compatible with Hugging Face ONNX exports (e.g., all-MiniLM-L6-v2).

The ONNX provider requires the ONNX Runtime shared library to be installed on the system (for example, via brew install onnxruntime); set ORT_DYLIB_PATH only when the library is not on the default system paths.

cognigraph-chunker semantic -i doc.md -p onnx --model-path ./models/all-MiniLM-L6-v2

Cloudflare Workers AI -- Uses Cloudflare's hosted embedding models (e.g., @cf/baai/bge-m3, @cf/qwen/qwen3-embedding-0.6b). Set credentials via environment variables or .env.cloudflare file. The token is verified at startup. Optionally route requests through an AI Gateway for logging and rate limiting.

cognigraph-chunker semantic -i doc.md -p cloudflare
cognigraph-chunker semantic -i doc.md -p cloudflare --cf-ai-gateway my-gateway -m @cf/qwen/qwen3-embedding-0.6b

OAuth -- For OpenAI-compatible APIs behind OAuth2 client credentials authentication (e.g., corporate API gateways). Set credentials via environment variables or .env.oauth file. The token is acquired automatically, cached, and refreshed before expiry. Use --danger-accept-invalid-certs for endpoints behind corporate proxies with custom CAs.

cognigraph-chunker semantic -i doc.md -p oauth
cognigraph-chunker semantic -i doc.md -p oauth --danger-accept-invalid-certs

Docker

Build

docker build -t cognigraph-chunker .

Run

# With API key authentication
docker run -p 3000:3000 -e API_KEY=my-secret cognigraph-chunker

# Without authentication (development)
docker run -p 3000:3000 -e NO_AUTH=1 cognigraph-chunker

# With OpenAI embeddings and CORS
docker run -p 3000:3000 \
  -e API_KEY=my-secret \
  -e OPENAI_API_KEY=sk-... \
  -e CORS_ORIGINS=https://example.com \
  cognigraph-chunker

Environment Variables

Variable Description
PORT Server port (default: 3000). Automatically set by Railway, Render, Fly.io.
API_KEY Bearer token for API authentication
NO_AUTH Set to 1 to disable authentication
CORS_ORIGINS Allowed CORS origins
OPENAI_API_KEY OpenAI API key for the openai embedding provider
CLOUDFLARE_AUTH_TOKEN Cloudflare API token for the cloudflare embedding provider
CLOUDFLARE_ACCOUNT_ID Cloudflare account ID for the cloudflare embedding provider
CLOUDFLARE_AI_GATEWAY Cloudflare AI Gateway name (optional)
OAUTH_TOKEN_URL OAuth token endpoint URL for the oauth embedding provider
OAUTH_CLIENT_ID OAuth client ID for the oauth embedding provider
OAUTH_CLIENT_SECRET OAuth client secret for the oauth embedding provider
OAUTH_SCOPE OAuth scope (optional)
OAUTH_BASE_URL Base URL for the OpenAI-compatible API
OAUTH_MODEL Model name for the oauth embedding provider
ORT_DYLIB_PATH Custom path to ONNX Runtime shared library (only used when the runtime is not on default system paths). Not bundled by this crate.
COGNIGRAPH_LLM_MODEL LLM model for --relations and --synopsis (default: gpt-4.1-mini)
NVIDIA_API_KEY NVIDIA NIM API key for the nvidia reranker
NVIDIA_RERANK_MODEL NVIDIA reranker model (default: nv-rerank-qa-mistral-4b:1)
NVIDIA_RERANK_BASE_URL NVIDIA reranker base URL
COHERE_API_KEY Cohere API key for the cohere reranker
COHERE_RERANK_MODEL Cohere reranker model (default: rerank-v3.5)

Deploy on Railway / Render / Fly.io

The Dockerfile is ready for container platforms that inject a PORT environment variable. Push to your Git repository and connect it to your platform of choice. Set API_KEY (or NO_AUTH=1) in the platform's environment variable settings.

Architecture

cognigraph-chunker/
  src/
    lib.rs              # Library root (public API)
    main.rs             # CLI entry point
    core/               # Core algorithms (chunk, split, merge, signal processing)
    embeddings/         # Embedding providers (OpenAI, Ollama, ONNX, Cloudflare, OAuth)
      reranker.rs       # Cross-encoder rerankers (NVIDIA NIM, Cohere, Cloudflare, OAuth, ONNX) for boundary refinement
    semantic/           # Semantic and cognitive chunking pipelines
      enrichment/       # Cognitive enrichment (entities, discourse, heading context, language)
      cognitive_*.rs    # Cognitive scoring, assembly, and reranking
      proposition_heal.rs # Proposition-aware chunk healing
      graph_export.rs   # Graph export format (nodes + edges)
      evaluation.rs     # Quality metrics
    llm/                # LLM integration (relation extraction, synopsis generation)
    api/                # REST API (Axum handlers, types, middleware)
    cli/                # CLI subcommands and options
    output/             # Output formatting (plain, json, jsonl)
  packages/
    python/             # Python bindings (PyO3 + maturin)

The core algorithms operate on byte slices for zero-copy performance. The semantic pipeline splits text into blocks (markdown-aware or sentence-based), computes embeddings, calculates cross-similarity distances, applies Savitzky-Golay smoothing, and detects topic boundaries at local minima.

The cognitive pipeline extends this with block-level enrichment (entity detection, discourse markers, heading context, continuation flags), weighted multi-signal boundary scoring, valley-based assembly with soft/hard token budgets, and proposition-aware healing that merges chunks with broken cross-references. Language detection runs automatically, selecting appropriate heuristics for 14 language groups.
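The boundary-detection core of the semantic pipeline can be sketched in plain Python: compute a similarity value per adjacent block pair, smooth the resulting curve, and split where the smoothed curve dips to a local minimum. The sketch below substitutes a simple moving average for the Savitzky-Golay filter and uses a toy curve; it illustrates the valley-detection idea, not the crate's implementation:

```python
def moving_average(xs, window=3):
    # Simple smoothing stand-in for the Savitzky-Golay filter.
    half = window // 2
    out = []
    for i in range(len(xs)):
        span = xs[max(0, i - half): i + half + 1]
        out.append(sum(span) / len(span))
    return out

def local_minima(xs):
    # Indices where the smoothed curve dips below both neighbours.
    return [i for i in range(1, len(xs) - 1)
            if xs[i] < xs[i - 1] and xs[i] < xs[i + 1]]

# Toy similarity curve between adjacent blocks: low values suggest a
# topic boundary between those blocks.
similarities = [0.9, 0.8, 0.2, 0.85, 0.9, 0.3, 0.88]
smoothed = moving_average(similarities)
splits = local_minima(smoothed)
print(splits)  # [2] -- split after block 2, where the curve bottoms out
```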

License

MIT