Skip to content

feat: add command to list supported input formats #404

@Haoxincode

Description

@Haoxincode

Feature Request

Add a CLI subcommand (e.g. kreuzberg formats or kreuzberg supported-formats) that prints all supported input file formats/extensions.

Use Case

When kreuzberg is used as a tool inside an AI agent container, the agent needs to know which file formats kreuzberg can handle before deciding whether to use it. Currently this information is only available in the library's source code / documentation, not discoverable at runtime via the CLI.

A simple command like:

kreuzberg formats
# or
kreuzberg supported-formats

that outputs the list of supported extensions (grouped by category or as a flat list) would allow AI agents to programmatically query capabilities without hardcoding format lists.

Suggested Output

Something like:

PDF:          .pdf
Word:         .docx .odt
Spreadsheet:  .xlsx .xlsm .xlsb .xls .xla .xlam .xltm .ods
Presentation: .pptx .ppt .ppsx
E-book:       .epub .fb2
Image (OCR):  .png .jpg .jpeg .gif .webp .bmp .tiff .tif .jp2 ...
Text/Markup:  .txt .md .rst .org .html .xhtml .htm .xml .json .csv ...
Code:         .py .js .ts .jsx .tsx .java .c .cpp ...
Email:        .eml .msg
Rich Text:    .rtf

Or alternatively a --json flag for machine-readable output.

Context

We use kreuzberg as a document extraction + OCR tool inside a Docker-based AI agent runtime. The agent selects tools based on their declared capabilities. Having a runtime-queryable format list would eliminate the need to maintain a separate hardcoded list.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions