Commit 0ab2f31

robodev-r2d2 and a-klos authored
feat: support of more file formats + fallbacks (#155)
This pull request introduces significant improvements to the document extraction pipeline, enhances deployment configuration for caching and permissions, and refines documentation to reflect these changes. The main focus is on a more robust, layered fallback mechanism for file extraction, expanded format support, and improved container orchestration for model caches. Additionally, environment variables and configuration maps have been streamlined for clarity and maintainability.

**Document extraction pipeline improvements:**

* The extraction pipeline now orchestrates Docling, MarkItDown, and custom extractors in a deterministic fallback chain, ensuring that if one extractor fails, the next is tried automatically. The default order is configurable, and the pipeline covers a broader range of formats including Office docs, spreadsheets, Markdown/AsciiDoc, CSV, TXT, EPUB, HTML/XML, and raster images.
* The `README.md` and `libs/extractor-api-lib/README.md` have been updated to document the new fallback logic, supported formats, and configuration options. The documentation now includes detailed tables of extractor priorities and extension mappings, as well as instructions for customizing the pipeline.

**Deployment and configuration enhancements:**

* Added support for HuggingFace and ModelScope model cache directories in the extractor deployment, with corresponding environment variables (`HF_HOME`, `HUGGINGFACE_HUB_CACHE`, `MODELSCOPE_HOME`, `XDG_CACHE_HOME`) and volume mounts. These cache paths are now configurable via `values.yaml`.
* Improved init container scripts for both admin-backend and extractor deployments: added strict error handling (`set -euo pipefail`), ensured cleanup of temporary files, and set correct permissions and ownership for NLTK data and cache directories.

**Configuration and environment variable cleanup:**

* Removed the now-obsolete `pdfextractor` configmap and related environment variables, consolidating extractor configuration and simplifying Helm templates.
* Updated Python version specification in `pyproject.toml` to use a version range instead of a caret, and added a per-file ignore for docstring warnings in `__init__.py`.

---------

Co-authored-by: Andreas Klos <[email protected]>
1 parent 144d88f commit 0ab2f31

44 files changed: +10723 -3901 lines changed

README.md

Lines changed: 3 additions & 3 deletions
@@ -51,7 +51,7 @@ Welcome to the STACKIT RAG Template! This is a basic example of how to use the R

 ## Features 🚀

-**Document Management**: Supports PDFs, DOCX, PPTX, XML, EPUB documents and websource via confluence as well as sitemaps.
+**Document Management**: Supports PDFs, Office docs (DOCX, PPTX), spreadsheets (XLSX), Markdown/AsciiDoc (MD, MDX, ADOC), EPUB/HTML/XML, CSV/TXT, and raster images, with automatic fallbacks between Docling, MarkItDown, and custom extractors; also handles Confluence spaces and sitemaps.

 **AI Integration**: Multiple LLM and embedder providers for flexibility.

@@ -109,9 +109,9 @@ All components are provided by the *admin-api-lib*. For further information on e

 #### 1.1.3 Document extractor

-The Document extractor is a component that is used to extract the content from the documents and confluence spaces.
+The Document extractor ingests uploaded files and remote sources (Confluence, sitemap) and now orchestrates multiple extractors with a deterministic fallback chain. Docling runs first for rich formats (PDF, Office, Markdown, HTML, images), MarkItDown provides lightweight markdown conversion, and specialised custom extractors (PDF, MS Office, XML, EPUB, Tesseract OCR) handle edge cases. The order and availability can be customised through the dependency-injector container.

-All components are provided by the *extractor-api-lib*. For further information on endpoints and requirements, please consult [the libs README](./libs/README.md#3-extractor-api-lib).
+All components are provided by the *extractor-api-lib*. For further information on endpoints, extractor ordering, supported formats, and configuration tips, please consult [the libs README](./libs/README.md#3-extractor-api-lib).

 #### 1.1.4 MCP Server

infrastructure/rag/templates/_admin_backend_and_extractor_helpers.tpl

Lines changed: 8 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -52,10 +52,6 @@
5252
{{- printf "%s-langfuse-configmap" .Release.Name | trunc 63 | trimSuffix "-" -}}
5353
{{- end -}}
5454

55-
{{- define "configmap.pdfextractorName" -}}
56-
{{- printf "%s-pdfextractor-configmap" .Release.Name | trunc 63 | trimSuffix "-" -}}
57-
{{- end -}}
58-
5955
{{- define "configmap.adminBackendName" -}}
6056
{{- printf "%s-admin-backend-configmap" .Release.Name | trunc 63 | trimSuffix "-" -}}
6157
{{- end -}}
@@ -81,6 +77,14 @@
8177
{{- printf "%s:%s" .Values.extractor.image.repository .Values.extractor.image.tag | trimSuffix ":" }}
8278
{{- end -}}
8379

80+
{{- define "extractor.huggingfaceCacheDir" -}}
81+
{{- default "/tmp/hf-cache" .Values.extractor.huggingfaceCacheDir -}}
82+
{{- end -}}
83+
84+
{{- define "extractor.modelscopeCacheDir" -}}
85+
{{- default "/var/modelscope" .Values.extractor.modelscopeCacheDir -}}
86+
{{- end -}}
87+
8488
# ingress
8589
{{- define "ingress.adminBackendFullname" -}}
8690
{{- printf "%s-admin-backend-ingress" .Release.Name | trunc 63 | trimSuffix "-" -}}

infrastructure/rag/templates/admin-backend/deployment.yaml

Lines changed: 7 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -36,12 +36,17 @@ spec:
3636
- sh
3737
- -c
3838
- |
39+
set -euo pipefail;
3940
touch /app/services/admin-backend/log/logfile.log && \
4041
chmod 600 /app/services/admin-backend/log/logfile.log;
42+
mkdir -p /home/nonroot/nltk_data/tokenizers && \
43+
mkdir -p /home/nonroot/nltk_data/taggers && \
4144
wget -q -O /tmp/punkt.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt_tab.zip && \
42-
unzip /tmp/punkt.zip -d /home/nonroot/nltk_data/tokenizers && \
45+
unzip -oq /tmp/punkt.zip -d /home/nonroot/nltk_data/tokenizers && \
46+
rm -f /tmp/punkt.zip && \
4347
wget -q -O /tmp/averaged_perceptron_tagger_eng.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/taggers/averaged_perceptron_tagger_eng.zip && \
44-
unzip /tmp/averaged_perceptron_tagger_eng.zip -d /home/nonroot/nltk_data/taggers;
48+
unzip -oq /tmp/averaged_perceptron_tagger_eng.zip -d /home/nonroot/nltk_data/taggers && \
49+
rm -f /tmp/averaged_perceptron_tagger_eng.zip;
4550
volumeMounts:
4651
- name: log-dir
4752
mountPath: /app/services/admin-backend/log
@@ -108,8 +113,6 @@ spec:
108113
name: {{ template "configmap.ragapiName" . }}
109114
- configMapRef:
110115
name: {{ template "configmap.stackitVllmName" . }}
111-
- configMapRef:
112-
name: {{ template "configmap.pdfextractorName" . }}
113116
- configMapRef:
114117
name: {{ template "configmap.keyValueStoreName" . }}
115118
- configMapRef:
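
The init script changes above make the extract step repeatable: `mkdir -p` tolerates existing directories, `unzip -o` overwrites earlier output, and `rm -f` removes the archive afterwards. The same pattern can be sketched in Python for local experimentation. This is only an illustration: `extract_and_remove` is a hypothetical helper that is not part of the repository, and the download step is left out.

```python
import zipfile
from pathlib import Path


def extract_and_remove(archive: Path, target_dir: Path) -> list[str]:
    """Unpack `archive` into `target_dir`, then delete the archive.

    Hypothetical helper mirroring the shell sequence
    `mkdir -p` -> `unzip -o` -> `rm -f` from the init container.
    """
    # Like `mkdir -p`: create parents, do not fail if the directory exists.
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        # extractall() overwrites files that already exist, like `unzip -o`.
        zf.extractall(target_dir)
        names = zf.namelist()
    # Like `rm -f`: do not fail if the archive has already been removed.
    archive.unlink(missing_ok=True)
    return names
```

Because the function overwrites existing files and tolerates existing directories, a restarted init step can run it again on a freshly downloaded archive without manual cleanup.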

infrastructure/rag/templates/configmap.yaml

Lines changed: 0 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -9,15 +9,6 @@ data:
99
---
1010
apiVersion: v1
1111
kind: ConfigMap
12-
metadata:
13-
name: {{ template "configmap.pdfextractorName" . }}
14-
data:
15-
{{- range $key, $value := .Values.shared.envs.pdfextractor }}
16-
{{ $key }}: {{ $value | quote }}
17-
{{- end }}
18-
---
19-
apiVersion: v1
20-
kind: ConfigMap
2112
metadata:
2213
name: {{ template "configmap.usecaseName" . }}
2314
data:

infrastructure/rag/templates/extractor/deployment.yaml

Lines changed: 26 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -25,30 +25,42 @@ spec:
2525
emptyDir: {}
2626
- name: nltk-data-dir
2727
emptyDir: {}
28+
- name: modelscope-cache
29+
emptyDir: {}
30+
{{- $msCacheDir := include "extractor.modelscopeCacheDir" . }}
2831
{{- if .Values.shared.imagePullSecret }}
2932
imagePullSecrets:
3033
- name: {{ .Values.shared.imagePullSecret.name }}
3134
{{- end }}
3235
initContainers:
3336
- name: init-permissions
3437
image: busybox
38+
securityContext:
39+
runAsUser: 0
40+
runAsGroup: 0
41+
runAsNonRoot: false
3542
command:
3643
- sh
3744
- -c
3845
- |
3946
touch /app/services/document-extractor/log/logfile.log && \
40-
chmod 600 /app/services/document-extractor/log/logfile.log;
47+
chmod 600 /app/services/document-extractor/log/logfile.log && \
48+
chown -R 10001:10001 /app/services/document-extractor/log && \
4149
wget -q -O /tmp/punkt.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt_tab.zip && \
42-
unzip /tmp/punkt.zip -d /home/nonroot/nltk_data/tokenizers && \
50+
unzip -o -q /tmp/punkt.zip -d /home/nonroot/nltk_data/tokenizers && \
4351
wget -q -O /tmp/averaged_perceptron_tagger_eng.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/taggers/averaged_perceptron_tagger_eng.zip && \
44-
unzip /tmp/averaged_perceptron_tagger_eng.zip -d /home/nonroot/nltk_data/taggers;
52+
unzip -o -q /tmp/averaged_perceptron_tagger_eng.zip -d /home/nonroot/nltk_data/taggers && \
53+
mkdir -p /tmp/hf-cache && chown -R 10001:10001 /tmp/hf-cache && \
54+
mkdir -p {{ $msCacheDir }} && chown -R 10001:10001 {{ $msCacheDir }};
4555
volumeMounts:
4656
- name: log-dir
4757
mountPath: /app/services/document-extractor/log
4858
- name: nltk-data-dir
4959
mountPath: /home/nonroot/nltk_data
5060
- name: tmp-dir
5161
mountPath: /tmp
62+
- name: modelscope-cache
63+
mountPath: {{ $msCacheDir }}
5264
containers:
5365
- name: {{ .Values.extractor.name }}
5466
securityContext:
@@ -65,6 +77,8 @@ spec:
6577
mountPath: /tmp
6678
- name: nltk-data-dir
6779
mountPath: /home/nonroot/nltk_data
80+
- name: modelscope-cache
81+
mountPath: {{ $msCacheDir }}
6882
image: {{ template "extractor.fullImageName" . }}
6983
imagePullPolicy: {{ .Values.extractor.image.pullPolicy }}
7084
{{- if not (empty .Values.extractor.command) }}
@@ -96,12 +110,19 @@ spec:
96110
envFrom:
97111
- configMapRef:
98112
name: {{ template "configmap.s3Name" . }}
99-
- configMapRef:
100-
name: {{ template "configmap.pdfextractorName" . }}
101113
- secretRef:
102114
name: {{ template "secret.s3Name" . }}
115+
{{- $hfCacheDir := include "extractor.huggingfaceCacheDir" . }}
103116
env:
104117
- name: PYTHONPATH
105118
value: {{ .Values.extractor.pythonPathEnv.PYTHONPATH }}
106119
- name: NLTK_DATA
107120
value: /home/nonroot/nltk_data
121+
- name: HF_HOME
122+
value: {{ $hfCacheDir | quote }}
123+
- name: HUGGINGFACE_HUB_CACHE
124+
value: {{ $hfCacheDir | quote }}
125+
- name: MODELSCOPE_HOME
126+
value: {{ $msCacheDir | quote }}
127+
- name: XDG_CACHE_HOME
128+
value: {{ $msCacheDir | quote }}
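
The four cache variables added above only help if the non-root container user can actually write to the directories they point at, which is what the `chown` calls in the init container are for. A minimal startup check might look like the sketch below. `check_cache_dirs` and `CACHE_ENV_VARS` are hypothetical names chosen for illustration; nothing in this commit indicates the extractor performs such a check.

```python
import os
import tempfile
from pathlib import Path

# Variable names taken from the deployment diff above.
CACHE_ENV_VARS = (
    "HF_HOME",
    "HUGGINGFACE_HUB_CACHE",
    "MODELSCOPE_HOME",
    "XDG_CACHE_HOME",
)


def check_cache_dirs(env=None):
    """Return {variable: Path} for every cache variable that is set.

    Hypothetical helper: creates each directory if needed and raises
    OSError if the current user cannot write to it.
    """
    env = os.environ if env is None else env
    resolved = {}
    for name in CACHE_ENV_VARS:
        value = env.get(name)
        if not value:
            continue  # unset or empty: nothing to verify
        path = Path(value)
        path.mkdir(parents=True, exist_ok=True)
        # Creating a temporary file exercises the real write path.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        resolved[name] = path
    return resolved
```

Called without arguments it inspects the process environment; passing a plain dictionary makes it easy to try out against throwaway directories.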

infrastructure/rag/values.yaml

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -401,6 +401,9 @@ extractor:
401401

402402
pythonPathEnv:
403403
PYTHONPATH: src
404+
huggingfaceCacheDir: /tmp/hf-cache
405+
# Directory inside the container to use as writable cache for ModelScope / OCR models
406+
modelscopeCacheDir: /var/modelscope
404407

405408
adminFrontend:
406409
name: admin-frontend
@@ -464,9 +467,6 @@ shared:
464467

465468

466469
envs:
467-
pdfExtractor:
468-
PDF_EXTRACTOR_DIAGRAMS_FOLDER_NAME: "connection_diagrams"
469-
PDF_EXTRACTOR_FOOTER_HEIGHT: 155
470470
s3:
471471
S3_ENDPOINT: http://rag-minio:9000
472472
S3_BUCKET: documents

libs/extractor-api-lib/README.md

Lines changed: 97 additions & 11 deletions
@@ -10,11 +10,89 @@ Content ingestion layer for the STACKIT RAG template. This library exposes a Fas

 ## Feature highlights

-- **Broad format coverage** – PDFs, DOCX, PPTX, XML/EPUB, Confluence spaces, and sitemap-driven websites.
+- **Layered extraction pipeline** – Docling, MarkItDown, and the custom extractors now cooperate with a deterministic fallback chain, so a failed run automatically cascades to the next extractor.
+- **Expanded format coverage** – PDFs, Office documents, EPUB, XML, Markdown/AsciiDoc, CSV/TXT, raster images, Confluence spaces, and sitemap-driven websites.
 - **Consistent output schema** – Information pieces are returned in a unified structure with content type (`TEXT`, `TABLE`, `IMAGE`) and metadata.
 - **Swappable extractors** – Dependency-injector container makes it easy to add or replace file/source extractors, table converters, etc.
 - **Production-grade plumbing** – Built-in S3-compatible file service, LangChain loaders with retry/backoff, optional PDF OCR, and throttling controls for web crawls.

+## File extractor pipeline
+
+[`GeneralFileExtractor`](src/extractor_api_lib/impl/api_endpoints/general_file_extractor.py) orchestrates file parsing. It resolves the file type from the extension, filters the extractors that declare matching `compatible_file_types`, reverses that filtered list, and then executes the extractors in sequence until one returns content or all have failed. Exceptions are logged and the next extractor takes over; only if every extractor either returns no content or raises an exception do we bubble up an error.
+
+### Default execution order
+
+The dependency container wires extractors in the following list:
+
+1. `DoclingFileExtractor`
+2. `MarkitdownFileExtractor`
+3. `PDFExtractor`
+4. `EpubExtractor`
+5. `XMLExtractor`
+6. `MSDocsExtractor`
+7. `TesseractImageExtractor`
+
+Because the orchestrator reverses the candidate list before the fallback loop, the priority for overlapping formats is the reverse of this wiring. For example, PDFs run through Docling first, then fall back to MarkItDown, and finally to the custom PDF extractor; DOCX/PPTX files follow Docling → MarkItDown → MSDocs; raster images go through Docling’s OCR pipeline before falling back to the Tesseract-only extractor.
+
+### Supported formats
+
+| Format family | Extensions | Primary extractor | Fallbacks | Notes |
+|--------------------------|----------------------------------------------------------|----------------------------|----------------------------------------------------------|-------|
+| PDF | `.pdf` | Docling | MarkItDown → Custom PDF extractor | Docling performs OCR + table extraction; the PDF extractor keeps Camelot/pdfplumber heuristics as a last resort. |
+| Microsoft Word | `.docx` | Docling | MarkItDown → MSDocs | MSDocs keeps unstructured-based table conversion for custom cases. |
+| Microsoft PowerPoint | `.pptx` | Docling | MarkItDown → MSDocs | MarkItDown splits slides by `<!-- Slide number: N -->`. |
+| Microsoft Excel | `.xlsx` | Docling || Tables returned as markdown; Docling infers sheet structure. |
+| EPUB | `.epub` | MarkItDown | EPUB extractor | MarkItDown covers simple ebooks; the LangChain-based EPUB extractor preserves metadata when MarkItDown fails. |
+| HTML | `.html` | Docling | MarkItDown | Docling keeps DOM-aware segmentation; MarkItDown is lighter-weight. |
+| Markdown | `.md`, `.markdown`, `.mdx` | Docling || MarkItDown does not currently register for Markdown. |
+| AsciiDoc | `.adoc`, `.asciidoc` | Docling || |
+| CSV | `.csv` | Docling | MarkItDown | Both produce markdown tables; Docling preserves structured metadata. |
+| Plain text | `.txt` | MarkItDown || |
+| XML | `.xml` | XML extractor || Uses the unstructured XML partitioner. |
+| Raster images | `.jpg`, `.jpeg`, `.png`, `.tiff`, `.tif`, `.bmp` | Docling (OCR) | Tesseract image extractor | Docling feeds Tesseract CLI OCR; the fallback enforces single-frame images via Pillow. |
+
+Image coverage currently excludes animated GIF, WebP, HEIC, and SVG files. These extensions are ignored by the routing logic and will surface as “No extractor found” errors until an extractor declares support.
+
+### Source extractor pipeline
+
+`GeneralSourceExtractor` wires Confluence and sitemap loaders behind a similar abstraction. Unlike files, source extractors are keyed by `ExtractionParameters.source_type` and the matching extractor is called directly (no fallback chain).
+
+## Configuring extractor order
+
+The order lives in `DependencyContainer.file_extractors`. You can override it either by subclassing the container or by overriding the provider at runtime before wiring the FastAPI app. Example:
+
+`container.py`
+
+```python
+from dependency_injector.providers import List
+
+from extractor_api_lib.dependency_container import DependencyContainer
+
+
+class CustomExtractorContainer(DependencyContainer):
+    file_extractors = List(
+        DependencyContainer.docling_extractor,
+        DependencyContainer.markitdown_extractor,
+        DependencyContainer.ms_docs_extractor,
+        DependencyContainer.pdf_extractor,
+        DependencyContainer.image_extractor,
+        DependencyContainer.xml_extractor,
+        DependencyContainer.epub_extractor,
+    )
+```
+
+`main.py`
+
+```python
+from extractor_api_lib.main import app as perfect_extractor_app, register_dependency_container
+
+from container import CustomExtractorContainer
+
+register_dependency_container(CustomExtractorContainer())
+```
+
+The last provider in the list becomes the first extractor tried for a matching file type. Keep shared singleton providers (file service, converters) in the parent class to avoid double instantiation.
+
 ## Installation

 ```bash
@@ -45,11 +123,11 @@ Both endpoints stream their results back to `admin-api-lib`, which takes care of

 ## How the file extraction endpoint works

-1. Download the file from S3
-2. Chose suitable file extractor based on the filename ending
-3. Extract the content from the file
-4. Map the internal representation to the external schema
-5. Return the final output
+1. Download the file from S3.
+2. Derive the file type from the extension (normalizing common image/Markdown/AsciiDoc aliases).
+3. Select extractors that declare support for the resolved `FileType`.
+4. Run the extractors in priority order (highest priority first); stop at the first non-empty result or keep falling back if an extractor raises.
+5. Map the internal representation to the external schema and return the final output.

 ## How the source extraction endpoint works
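
Steps 2 to 4 of the new endpoint description can be sketched as follows. This is a minimal illustration rather than the library's code: `FileType` here is a small stand-in enum, and `ALIASES`, `FakeExtractor`, `resolve_file_type`, and `extract_with_fallback` are hypothetical names. Only the behaviour stated in the README text is modelled: filter by `compatible_file_types`, reverse the filtered list, stop at the first non-empty result, and continue when an extractor raises. The README text says the candidate list is reversed, yet its examples list Docling first both in the wiring and in the resulting priority; the sketch follows the reversal statement literally and does not settle which reading matches the implementation.

```python
import logging
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


class FileType(Enum):  # hypothetical stand-in for the library's enum
    PDF = "pdf"
    MD = "md"
    IMAGE = "image"


# Hypothetical alias table: several extensions map to one FileType.
ALIASES = {
    "markdown": FileType.MD,
    "mdx": FileType.MD,
    "jpg": FileType.IMAGE,
    "jpeg": FileType.IMAGE,
    "png": FileType.IMAGE,
}


def resolve_file_type(path: str) -> FileType:
    """Step 2: derive the file type from the (case-insensitive) extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in ALIASES:
        return ALIASES[ext]
    return FileType(ext)  # raises ValueError for unknown extensions


@dataclass
class FakeExtractor:  # illustrative only, not the library's interface
    name: str
    compatible_file_types: list[FileType]
    run: Callable[[str], list[str]]


def extract_with_fallback(path: str, extractors: list[FakeExtractor]) -> list[str]:
    file_type = resolve_file_type(path)
    # Step 3: keep only extractors that declare support for this type.
    candidates = [e for e in extractors if file_type in e.compatible_file_types]
    if not candidates:
        raise ValueError(f"No extractor found for {path}")
    # Step 4: reversed wiring order; fall back on exceptions or empty output.
    for extractor in reversed(candidates):
        try:
            pieces = extractor.run(path)
        except Exception:
            logger.exception("%s failed; trying next extractor", extractor.name)
            continue
        if pieces:
            return pieces
    raise RuntimeError(f"All extractors failed for {path}")


def _boom(path: str) -> list[str]:
    raise RuntimeError("simulated extractor failure")


extractors = [
    FakeExtractor("last_resort", [FileType.PDF], lambda p: ["fallback text"]),
    FakeExtractor("empty", [FileType.PDF], lambda p: []),
    FakeExtractor("preferred", [FileType.PDF], _boom),
]
result = extract_with_fallback("report.PDF", extractors)
```

In this run the last-wired extractor raises, the middle one returns nothing, and the first-wired one supplies the result, so both fallback triggers are exercised.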

@@ -64,7 +142,6 @@ Both endpoints stream their results back to `admin-api-lib`, which takes care of
 Two `pydantic-settings` models ship with this package:

 - **S3 storage** (`S3Settings`) – configure the built-in file service with `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, `S3_ENDPOINT`, and `S3_BUCKET`.
-- **PDF extraction** (`PDFExtractorSettings`) – adjust footer trimming or diagram export via `PDF_EXTRACTOR_FOOTER_HEIGHT` and `PDF_EXTRACTOR_DIAGRAMS_FOLDER_NAME`.

 Other extractors accept their parameters at runtime through the request payload (`ExtractionParameters`). For example, the admin backend forwards Confluence credentials, sitemap URLs, or custom headers when it calls `/extract_from_source`. This keeps the library stateless and makes it easy to plug in additional sources without redeploying.

@@ -80,10 +157,19 @@ from extractor_api_lib.main import app as perfect_extractor_app

 ## Extending the library

-1. Implement `InformationFileExtractor` or `InformationExtractor` for your new format/source.
-2. Register it in `dependency_container.py` (append to `file_extractors` list or `source_extractors` dict).
-3. Update mapper or metadata handling if additional fields are required.
-4. Add unit tests under `libs/extractor-api-lib/tests` using fixtures and fake storage providers.
+1. Implement `InformationFileExtractor` (for file-based inputs) or `InformationExtractor` (for remote sources).
+2. Add a provider to `DependencyContainer` (usually a `Singleton`) and wire dependencies such as the shared `FileService` or table converter.
+3. Append the provider to `file_extractors` (or to the source extractor list) in the desired position so that the fallback order is correct.
+4. Update mappers or metadata handling if additional fields are required.
+5. Cover the happy path and a failure edge case with tests under `libs/extractor-api-lib/tests`, mocking external services (OCR, network, file I/O).
+
+## Advantages and caveats
+
+- Docling-first prioritisation dramatically improves structured extraction (tables, headings) and adds OCR to formats that previously lacked it.
+- Retaining MarkItDown and the custom PDF/MS extractors provides graceful degradation when Docling fails or produces empty output.
+- Image support now goes through Docling’s OCR before falling back to pure Tesseract.
+- The configuration still requires code changes; there is no environment-variable switch to reshuffle or disable extractors at runtime.
+- Multi-frame images, animated/novel image formats, and office formats such as ODT/RTF remain unsupported.

 ## Contributing
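
Step 5 of the new "Extending the library" list asks for a happy-path test and a failure test. The sketch below shows one possible shape for such tests. `HypotheticalTxtExtractor` and its `extract` method are illustrative stand-ins: the exact `InformationFileExtractor` interface is not shown in this diff, so nothing here should be read as the library's actual signature. The test functions take a `tmp_path` argument so they could run under pytest's built-in fixture of that name.

```python
from pathlib import Path


class HypotheticalTxtExtractor:
    """Illustrative extractor; not the library's InformationFileExtractor."""

    compatible_file_types = ["txt"]

    def extract(self, path: Path) -> list[dict]:
        text = path.read_text(encoding="utf-8")
        if not text.strip():
            return []  # empty output lets a fallback chain move on
        return [
            {"type": "TEXT", "content": text, "metadata": {"document": path.name}}
        ]


def test_happy_path(tmp_path: Path) -> None:
    sample = tmp_path / "note.txt"
    sample.write_text("hello", encoding="utf-8")
    pieces = HypotheticalTxtExtractor().extract(sample)
    assert pieces == [
        {"type": "TEXT", "content": "hello", "metadata": {"document": "note.txt"}}
    ]


def test_missing_file_raises(tmp_path: Path) -> None:
    try:
        HypotheticalTxtExtractor().extract(tmp_path / "missing.txt")
    except FileNotFoundError:
        return
    raise AssertionError("expected FileNotFoundError")
```

Keeping the failure case explicit matters here because the orchestrator treats both an exception and an empty result as a reason to try the next extractor.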

0 commit comments
