Description
Feature Request
Add a CLI subcommand (e.g. kreuzberg formats or kreuzberg supported-formats) that prints all supported input file formats/extensions.
Use Case
When kreuzberg is used as a tool inside an AI agent container, the agent needs to know which file formats kreuzberg can handle before deciding whether to use it. Currently this information is only available in the library's source code and documentation; it is not discoverable at runtime via the CLI.
A simple command like:
kreuzberg formats
# or
kreuzberg supported-formats

that outputs the list of supported extensions (grouped by category or as a flat list) would allow AI agents to programmatically query capabilities without hardcoding format lists.
Suggested Output
Something like:
PDF: .pdf
Word: .docx .odt
Spreadsheet: .xlsx .xlsm .xlsb .xls .xla .xlam .xltm .ods
Presentation: .pptx .ppt .ppsx
E-book: .epub .fb2
Image (OCR): .png .jpg .jpeg .gif .webp .bmp .tiff .tif .jp2 ...
Text/Markup: .txt .md .rst .org .html .xhtml .htm .xml .json .csv ...
Code: .py .js .ts .jsx .tsx .java .c .cpp ...
Email: .eml .msg
Rich Text: .rtf
Alternatively, a --json flag could provide machine-readable output.
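To make the request concrete, here is a rough Python sketch of how such a subcommand could behave. It is not kreuzberg's actual implementation: the FORMATS table, the render_text and render_json helpers, and the shape of the JSON (category name mapped to a list of extensions) are all assumptions made for illustration. Only the categories that are listed in full in the suggested output are included; the ones ending in "..." are left out rather than guessed.

```python
import argparse
import json

# Hypothetical registry, copied from the suggested output in this issue.
# Categories that end in "..." there are omitted instead of being filled in.
FORMATS = {
    "PDF": [".pdf"],
    "Word": [".docx", ".odt"],
    "Spreadsheet": [".xlsx", ".xlsm", ".xlsb", ".xls", ".xla", ".xlam", ".xltm", ".ods"],
    "Presentation": [".pptx", ".ppt", ".ppsx"],
    "E-book": [".epub", ".fb2"],
    "Email": [".eml", ".msg"],
    "Rich Text": [".rtf"],
}


def render_text(formats):
    """Return one aligned 'Category: .ext .ext' line per category."""
    width = max(len(name) for name in formats) + 1  # +1 for the colon
    lines = []
    for name, exts in formats.items():
        label = name + ":"
        lines.append(label.ljust(width) + " " + " ".join(exts))
    return "\n".join(lines)


def render_json(formats):
    """Return the same data as a JSON object (assumed shape)."""
    return json.dumps(formats, indent=2)


def main(argv=None):
    parser = argparse.ArgumentParser(prog="kreuzberg")
    sub = parser.add_subparsers(dest="command", required=True)
    # Both spellings proposed in this issue are accepted.
    fmt = sub.add_parser(
        "formats",
        aliases=["supported-formats"],
        help="print supported input formats",
    )
    fmt.add_argument("--json", action="store_true", help="machine-readable output")
    args = parser.parse_args(argv)
    print(render_json(FORMATS) if args.json else render_text(FORMATS))
    return 0
```

Calling main(["formats"]) prints the grouped list, and main(["supported-formats", "--json"]) prints the JSON variant. In a real implementation the table would presumably be derived from the extractors that are actually registered, so that the output cannot drift from what the tool supports.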
Context
We use kreuzberg as a document extraction + OCR tool inside a Docker-based AI agent runtime. The agent selects tools based on their declared capabilities. Having a runtime-queryable format list would eliminate the need to maintain a separate hardcoded list.
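On the agent side, the check we have in mind would look roughly like the Python sketch below. It assumes the --json output is an object mapping category names to lists of extensions, which is only our guess at a reasonable shape. SAMPLE_OUTPUT, parse_supported_extensions and can_handle are hypothetical names, and the sample payload stands in for the output of a command that does not exist yet.

```python
import json
from pathlib import Path
from typing import Set

# Stand-in for what `kreuzberg formats --json` might print (assumed shape).
SAMPLE_OUTPUT = '{"PDF": [".pdf"], "Email": [".eml", ".msg"], "Rich Text": [".rtf"]}'


def parse_supported_extensions(payload: str) -> Set[str]:
    """Flatten the category -> extensions mapping into one lowercase set."""
    data = json.loads(payload)
    return {ext.lower() for exts in data.values() for ext in exts}


def can_handle(path: str, supported: Set[str]) -> bool:
    """Decide by file extension whether the tool should be offered this file."""
    return Path(path).suffix.lower() in supported


supported = parse_supported_extensions(SAMPLE_OUTPUT)
print(can_handle("report.PDF", supported))
print(can_handle("video.mp4", supported))
```

With this in place, the agent runtime would query the list once at container start instead of shipping its own copy of it.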