Pack and unpack text files into structured, plain-text archives with embedded XML. Perfectly formatted for LLM context windows.
The mdbox utility bundles text-based file directories into a single plain-text archive structured with an embedded XML block. It provides a zipfile-like Python API and a tar-style CLI for creating, extracting, modifying, and deleting entries.
Every archive embeds a visual directory tree and stores file contents safely in CDATA sections. The result is an archive that is completely Git-friendly and easily parseable by any standard XML tool.
Standard archive formats (.zip, .tar.gz) are binary and invisible to Large Language Models. mdbox solves this by creating self-contained, text-only bundles.
By utilizing the preamble and epilogue features (which allow you to attach arbitrary text before and after the XML data), a single mdbox archive becomes a complete prompt + context bundle. Just attach your system instructions in the preamble, pack the relevant codebase, and feed the single file directly into your LLM pipeline.
- Human-readable XML: Archives are plain text, Git-friendly, and inspectable in any text editor.
- Embedded Directory Tree: Every archive includes a visual,
tree-style directory listing at a glance. - Preamble & Epilogue: Prepend and append arbitrary text (like Markdown prompts or system instructions) around the XML block.
- Secure Extraction: Sandboxed unpacking prevents malicious path traversal (e.g.,
../attacks). - Strict Content Validation: Built-in UTF-8 enforcement and XML 1.0 compatibility checks with precise error localization.
zipfile-compatible Python API: Familiar methods likeopen(),write(),readstr(),extractall(), andnamelist().tar-style CLI: Quick and familiar command-line interface with bundled flags (-cvf).- Async Extraction Pipeline: Concurrent file writing powered by
asyncioandaiofilefor maximum I/O performance. - Transactional Safety: Atomic repacking for add/delete operations ensures no partial writes corrupt your archive if an exception occurs.
- Lazy File Reading: Disk sources are only read when content is explicitly accessed or the archive is flushed.
Requires Python 3.12+.
# Standard pip
pip install mdbox
# With uv (Recommended)
uv add mdboxPack a directory and a specific file into a single .xml bundle.
CLI:
mdbox -cvf backup.xml src/ README.mdPython:
import mdbox
with mdbox.open("backup.xml", mode="w") as qf:
qf.write("src")
qf.write("README.md")Unpack the bundle back to your local disk.
CLI:
mdbox -xf backup.xml output/Python:
with mdbox.open("backup.xml", mode="r") as qf:
qf.extractall("output")An mdbox archive consists of three distinct sections:
- Preamble: Arbitrary text (Markdown, system prompts, prose).
<archive>XML block: The structured file data and directory tree.- Epilogue: Arbitrary trailing text (metadata, formatting closures).
Because file contents are stored in <![CDATA[...]]> blocks, all characters are preserved exactly without requiring strict entity encoding.
# Project Snapshot
> System Prompt: Review the following codebase for security vulnerabilities.
<archive version="1.0">
<directory_tree><![CDATA[
.
├── src/
│ ├── main.py
│ └── utils/
│ └── helpers.py
└── README.md
]]></directory_tree>
<file path="README.md">
<content><![CDATA[# My Project
A sample project.
]]></content>
</file>
<file path="src/main.py">
<content><![CDATA[print("hello")
]]></content>
</file>
</archive>
---
*End of context bundle.*The mdbox utility supports standard tar-style bundled flags. Note: The -f flag must always come last in a bundle.
# Basic creation
mdbox -cvf archive.xml src/ docs/ README.md
# Creation with prompt injection (Preamble/Epilogue)
mdbox -cvf archive.xml --preamble "Build 2024-01-15" --epilogue license.txt src/# Extracts to default (.) or specified output directory
mdbox -xf archive.xml output/Creates the archive if it doesn't exist, or safely merges new entries into an existing one via atomic replacement.
mdbox -avf archive.xml new_module.pyRemoves files or whole directory prefixes. Uses atomic repacking to prevent corruption.
mdbox --delete -f archive.xml old_module.py src/deprecated/| Flag | Description |
|---|---|
-c |
Create a new archive |
-x |
Extract an archive |
-a |
Add/upsert files into an archive |
--delete |
Remove files from an archive |
-f <file> |
Archive file path (required) |
-v |
Verbose output |
--debug |
Structured debug logging |
--preamble <text|file> |
Text or file content to prepend before XML |
--epilogue <text|file> |
Text or file content to append after XML |
import mdbox
# Write mode (creates or overwrites)
with mdbox.open("archive.xml", mode="w") as qf:
qf.write("src")
# Read mode (parses existing archive)
with mdbox.open("archive.xml", mode="r") as qf:
for info in qf:
print(f"File: {info.name}, Size: {info.length} bytes")with mdbox.open("archive.xml", mode="w") as qf:
qf.write("main.py") # Add single file
qf.write("src") # Add entire directory
qf.write("build/out.js", arcname="dist.js") # Override internal path
qf.writestr("virtual.txt", "hello world") # Write straight from memoryimport io
with mdbox.open("archive.xml", mode="r") as qf:
names = qf.namelist() # ['src/main.py', ...]
text_content = qf.readstr("src/main.py") # Returns decoded string
raw_bytes = qf.read("src/main.py") # Returns raw bytes
# Access injected LLM prompts
print("Prompt:", qf.preamble)
print("Trailing:", qf.epilogue)
# mdbox fully supports in-memory file-like objects
with io.BytesIO() as buffer:
with mdbox.open(buffer, mode="w") as qf:
qf.writestr("test.txt", "hello")with mdbox.open("archive.xml", mode="r") as qf:
# Extract everything
qf.extractall("output/")
# Extract conditionally
python_files = [info for info in qf if info.name.endswith(".py")]
qf.extractall("src_only/", members=python_files)The mdbox library provides strict validation. Malformed inputs or malicious extraction paths will throw explicit errors:
from mdbox import BinaryFileError, PathTraversalError
try:
with mdbox.open("archive.xml", mode="w") as qf:
qf.write("image.png")
except BinaryFileError as e:
print(f"Rejected: {e}") # Triggers if file fails UTF-8 checks
try:
with mdbox.open("archive.xml", mode="r") as qf:
qf.extractall()
except PathTraversalError as e:
print(f"Blocked malicious path: {e}") # Triggers on absolute paths or ../- Security: Extraction paths are strictly validated using
Path.relative_to(). Absolute paths and..escape attempts are blocked outright. - Data Validation: Files must pass UTF-8 decoding, and content is scanned to ensure XML 1.0 compatibility (blocking
NULLandC0/C1control characters) to guarantee parseability. - Performance: * In read mode, data is extracted via
memoryviewslicing directly from raw bytes to skip redundant parsing overhead.extractall()leverages a bounded async queue and concurrent workers.
- Transactional Safety: If a
withblock encounters an exception,__exit__safely aborts without writing, avoiding corrupt output states.
# Clone and sync dependencies
git clone [https://github.com/chgroeling/mdbox.git](https://github.com/chgroeling/mdbox.git)
cd mdbox
uv sync --all-extras
# Run full quality gate (format, lint, type-check, test)
uv run ruff format src/ tests/ && \
uv run ruff check src/ tests/ && \
uv run mypy src/ && \
uv run pytest
# Check coverage
uv run pytest --cov=mdbox --cov-report=htmlMIT — see LICENSE.