Description
Hi team,
After upgrading from 4.2.15 to 4.3.x (tested with 4.3.0 and 4.3.6), XLSX extraction with output_format="markdown" appears to have regressed from structured markdown tables to line-based text. This removes the explicit table structure and makes downstream LLM parsing harder.
Config
from kreuzberg import ExtractionConfig, extract_bytes
config = ExtractionConfig(
force_ocr=False,
output_format="markdown",
)
# Runs inside an async function; file_bytes and mime_type describe the XLSX input.
result = await extract_bytes(file_bytes, mime_type, config)
print(result.content)
Observed change (same XLSX input)
Before (4.2.15)
## Segment A
| Survey Segment A | | | | | | |
| --- | --- | --- | --- | --- | --- | --- |
| Segment A | Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | Col 6 |
| Metric X | value_a1 | 2.0 | value_a2 | 4.0 | value_a3 | 4.0 |
| Metric Y | value_b1 | 3.0 | value_b2 | 7.0 | value_b3 | 5.0 |
After (4.3.x)
Survey Segment A
Segment A Col 1 Col 2 Col 3 Col 4 Col 5 Col 6
Metric X value_a1 2.0 value_a2 4.0 value_a3 4.0
Metric Y value_b1 3.0 value_b2 7.0 value_b3 5.0
Expected
When output_format="markdown", XLSX output should preserve explicit tabular structure (headers + row/column boundaries), similar to prior behavior.
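To make the expected shape concrete, here is a minimal sketch of rendering sheet rows as a markdown table with a header row and a delimiter row. This is not Kreuzberg's implementation; the helper name `rows_to_markdown_table` and the choice to treat the first row as the header are assumptions made only for illustration.

```python
def rows_to_markdown_table(rows):
    """Render rows (lists of cell values) as a markdown table.

    Hypothetical helper for illustration only; the first row is
    treated as the header, which is an assumption.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # Escape pipes so cell text cannot break column boundaries.
        return str(cell).replace("|", "\\|")

    # Pad short rows so every row has the same number of columns.
    padded = [
        [clean(cell) for cell in row] + [""] * (width - len(row))
        for row in rows
    ]

    def render(cells):
        return "| " + " | ".join(cells) + " |"

    header, *body = padded
    lines = [render(header), render(["---"] * width)]
    lines.extend(render(row) for row in body)
    return "\n".join(lines)
```

Output of this shape keeps row and column boundaries explicit even when some cells are empty, which is what the 4.2.15 output provided.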
Actual
Output is linearized into line-based text and no longer includes markdown table syntax (| ... |), so row/column structure becomes implicit.
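The difference can be checked mechanically. Below is a rough sketch of a regression check that looks for a pipe-delimited row followed by a delimiter row. The function name `has_markdown_table` is hypothetical, and the pattern is a simplified heuristic rather than a full GFM table parser.

```python
import re

# Simplified delimiter-row pattern such as "| --- | :---: |".
# Heuristic only: it requires at least three hyphens per column.
_DELIMITER_ROW = re.compile(
    r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$"
)


def has_markdown_table(content):
    """Return True if content has a pipe row followed by a delimiter row."""
    lines = content.splitlines()
    for header, delimiter in zip(lines, lines[1:]):
        if (
            "|" in header
            and "|" in delimiter
            and _DELIMITER_ROW.match(delimiter)
        ):
            return True
    return False
```

Run against result.content, a check like this would be expected to pass for the 4.2.15 sample above and fail for the 4.3.x sample, since the latter contains no pipe characters at all.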
Impact
This is a functional regression for LLM-oriented downstream consumers that rely on markdown table structure for reliable interpretation of spreadsheet content.