Deterministic Document Segmentation via Atomic Parsing

This project explores how to divide a long document into N coherent sections in a deterministic, inspectable, and extensible way, without relying on LLMs to “understand” or rewrite the content.

Rather than treating the problem as text division itself, we treat it as structured document analysis.

The following shows the 1-page summary of the workflow:

Motivation

Splitting documents sounds simple, but in practice it is ambiguous:

Where should a document be split?
Should headings always dominate?
What about lists, tables, or long paragraphs?
How do we debug or adjust a bad split?

Key design decisions

1. No LLMs for segmentation

This system does not generate or modify content.
It only divides existing content.

Because of that, we intentionally avoid LLM-based heuristics and instead use:

Deterministic parsing
Explicit rules
Reproducible outcomes

This makes the system:

predictable
debuggable
suitable for automation pipelines
cost-efficient (no token/inference/hardware cost)

If needed, we can add LLMs after segmentation for optional refinement: e.g., improving inter-segment relationships, ranking split candidates, or tuning split thresholds.

2. Atom-first representation

Instead of directly “splitting a document,” we first parse it into atomic units:

headings
pseudo-headings (e.g. **Title**)
paragraphs
list blocks
tables
code blocks
horizontal rules
blanks

Each atom carries metadata:

position (line, byte, index)
word/character counts
loose hierarchical context
boundary strength (how good a cut this atom represents)

This atom layer is the core abstraction.

3. Loose structure, not a rigid tree

Documents are not always well-formed trees.

We therefore model:

a pseudo-tree using heading depth and section paths
without enforcing strict parent/child constraints

This allows:

flat documents (e.g. novels)
semi-structured memos
messy markdown
mixed formats

All future logic (splitting, visualization, tuning) operates on atoms.

4. Visualization as a first-class tool

Understanding segmentation decisions is hard without seeing structure.

We therefore add:

Mermaid-based section diagrams
optional leaf-level visualization (paragraphs, lists, code, tables)
segment coloring for split results

Visualization serves two purposes:

debugging (why did this split happen?)
future interaction (manual tweaking, UI-based adjustment)

High-level pipeline

Raw document
   ↓
Atomization (deterministic)
   ↓
Atoms + pseudo-structure
   ↓
Candidate cut generation
   ↓
N-way partitioning (by word balance)
   ↓
Optional visualization

Atomization

The parser detects document mode (Markdown vs plain text) and emits a linear sequence of atoms.

Each atom includes:

atom_type (heading, paragraph, list, …)
weight_words, weight_chars
depth (for headings / pseudo-headings)
section_path_ids
can_cut_before + boundary_strength

This representation is intentionally over-complete to support future policies.

Think of atoms as “document pixels”: small, immutable, and composable.

Splitting into N sections

Splitting is framed as:

Choose N−1 cut points from valid boundaries to balance total word count.

Candidate cuts

Candidates are selected from atoms that are:

strong boundaries (headings, pseudo-headings, HRs)
optionally lists / tables / code blocks
optionally paragraphs (fallback)

Relaxation happens in stages if strict candidates cannot produce N segments.

Objective

The default objective minimizes deviation from equal word counts across segments, while respecting boundary preferences.

This makes the behavior:

explainable
tunable
extensible

The final output looks like the following:

{
  "N": 5,
  "objective": [0, 91, 0.0],
  "cuts": [13, 30, 40, 54],
  "segments": [
    {
      "seg_idx": 0,
      "start_atom": 0,
      "end_atom_excl": 13,
      "words": 65,
      "start_path_ids": [1],
      "start_path_titles": ["Project Atlas: Strategy Memo"]
    },
    {
      "seg_idx": 1,
      "start_atom": 13,
      "end_atom_excl": 30,
      "words": 80,
      "start_path_ids": [1,2,4],
      "start_path_titles": [
        "Project Atlas: Strategy Memo",
        "1. Executive Summary",
        "1.1 Success Metrics"
      ]
    },
    ...
}

Visualization (Mermaid)

The system can emit a Mermaid flowchart showing:

section hierarchy (from headings)
optional leaf nodes (paragraphs, lists, code, tables)
coloring by split segment

This is especially useful when tuning split parameters or validating parser behavior.

Example

Below is an example Mermaid output. For the terminal command to generate the diagram, see the Usage #3 section below. The source document is available at ./examples/example_minimal.md, and it is split into N=3 parts. The diagram includes heading nodes and leaf nodes (paragraph, list, code block, table). Colors represent split groups.

%%{init: {"flowchart": {"nodeSpacing": 12, "rankSpacing": 28}} }%%
flowchart TD
classDef sec1 fill:#E3F2FD,stroke:#1E88E5,stroke-width:1px,color:#0D47A1;
classDef sec2 fill:#E8F5E9,stroke:#43A047,stroke-width:1px,color:#1B5E20;
classDef sec3 fill:#FFF3E0,stroke:#FB8C00,stroke-width:1px,color:#E65100;
classDef leafEmpty stroke-width:1px;
classDef leafLabeled stroke-width:1px;
    S1["Mini Example: Deterministic N-Split Demo"]
    S2["1. Background"]
    S3["Constraints"]
    S4["2. Method"]
class S1 sec1;
class S2,S3 sec2;
class S4 sec3;
    A2["P"]
    S1 --> A2
    class A2 leafLabeled;
    class A2 sec1;
    A8["P"]
    S2 --> A8
    class A8 leafLabeled;
    class A8 sec2;
    A11["L"]
    S3 --> A11
    class A11 leafLabeled;
    class A11 sec2;
    A17["P"]
    S4 --> A17
    class A17 leafLabeled;
    class A17 sec3;
    A19["C"]
    S4 --> A19
    class A19 leafLabeled;
    class A19 sec3;
    S1 --> S2
    S1 --> S4
    S2 --> S3

In leaf nodes, P=paragraph, L=list, C=code, T=table

Usage

1. Parse and inspect atoms

Parse a document into atomic units (headings, paragraphs, lists, code blocks, etc.) and print a structured inspection table. This is useful for understanding how the document is interpreted internally before performing any split.

python run_atomizer.py --file ./examples/example_mistral.md

2. Split into N sections

Split the document into N sections using the deterministic split algorithm:

python run_atomizer.py --file ./examples/example_mistral.md --split 3 --split-json-out out.json

By default, --split-json-mode is plan. In this mode, out.json contains metadata only, including cut indices, word counts, and section boundaries.

If you want materialized text sections (a list of strings whose concatenation exactly reconstructs the original document), use --split-json-mode sections:

python run_atomizer.py \
  --file ./examples/example_mistral.md \
  --split 3 \
  --split-json-mode sections \
  --split-json-out out.json

Example output:

{
  "mode": "sections",
  "N": 3,
  "cuts": [25, 40],
  "sections": [
    "…exact original text for section 1…",
    "…exact original text for section 2…",
    "…exact original text for section 3…"
  ]
}

3. Export Mermaid diagram

Visualize the inferred document structure and section boundaries using Mermaid:

python run_atomizer.py \
  --file ./examples/example_mistral.md \
  --split 5 \
  --mermaid-out example_mistral.mmd \
  --mermaid-leaves

This generates a Mermaid diagram showing:

Section hierarchy
Optional leaf nodes (paragraphs, lists, code blocks)
Section coloring (when splits are enabled)

You can render the .mmd file using the Mermaid CLI (for example,
npx -y -p @mermaid-js/mermaid-cli@10 mmdc -i out.mmd -o out.png -s 3),
compatible Markdown renderers, or free online Mermaid editors such as
https://www.mermaidflow.app/editor.

When using npx, for high-resolution exports (e.g., for papers or slides), increase the scale factor using -s.

Summary

Document segmentation is ambiguous, so we made it explicit.
We avoid LLMs where generation is unnecessary.
We parse once, then reason over atoms.
Visualization is not an afterthought; it’s part of the design.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
examples		examples
.gitignore		.gitignore
README.md		README.md
atomizer.py		atomizer.py
partition.py		partition.py
render.py		render.py
run_atomizer.py		run_atomizer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Deterministic Document Segmentation via Atomic Parsing

Motivation

Key design decisions

1. No LLMs for segmentation

2. Atom-first representation

3. Loose structure, not a rigid tree

4. Visualization as a first-class tool

High-level pipeline

Atomization

Splitting into N sections

Candidate cuts

Objective

Visualization (Mermaid)

Example

Usage

1. Parse and inspect atoms

2. Split into N sections

3. Export Mermaid diagram

Summary

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Deterministic Document Segmentation via Atomic Parsing

Motivation

Key design decisions

1. No LLMs for segmentation

2. Atom-first representation

3. Loose structure, not a rigid tree

4. Visualization as a first-class tool

High-level pipeline

Atomization

Splitting into N sections

Candidate cuts

Objective

Visualization (Mermaid)

Example

Usage

1. Parse and inspect atoms

2. Split into N sections

3. Export Mermaid diagram

Summary

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages