Commit 0ab2f31
feat: support of more file formats + fallbacks (#155)
This pull request introduces significant improvements to the document
extraction pipeline, enhances deployment configuration for caching and
permissions, and refines documentation to reflect these changes. The
main focus is on a more robust, layered fallback mechanism for file
extraction, expanded format support, and improved container
orchestration for model caches. Additionally, environment variables and
configuration maps have been streamlined for clarity and
maintainability.
**Document extraction pipeline improvements:**
* The extraction pipeline now orchestrates Docling, MarkItDown, and
custom extractors in a deterministic fallback chain, ensuring that if
one extractor fails, the next is tried automatically. The default order
is configurable, and the pipeline covers a broader range of formats
including Office docs, spreadsheets, Markdown/AsciiDoc, CSV, TXT, EPUB,
HTML/XML, and raster images.
[[1]](diffhunk://#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5L54-R54)
[[2]](diffhunk://#diff-9879d55539dbabcfd9190ec32b1828dfe5874d5e40d32816db8208de3aeeed1aL13-R95)
[[3]](diffhunk://#diff-9879d55539dbabcfd9190ec32b1828dfe5874d5e40d32816db8208de3aeeed1aL48-R130)
* The `README.md` and `libs/extractor-api-lib/README.md` have been
updated to document the new fallback logic, supported formats, and
configuration options. The documentation now includes detailed tables of
extractor priorities and extension mappings, as well as instructions for
customizing the pipeline.
[[1]](diffhunk://#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5L112-R114)
[[2]](diffhunk://#diff-9879d55539dbabcfd9190ec32b1828dfe5874d5e40d32816db8208de3aeeed1aL13-R95)
[[3]](diffhunk://#diff-9879d55539dbabcfd9190ec32b1828dfe5874d5e40d32816db8208de3aeeed1aL48-R130)
[[4]](diffhunk://#diff-9879d55539dbabcfd9190ec32b1828dfe5874d5e40d32816db8208de3aeeed1aL83-R172)
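
The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `Extractor` dataclass, `extract_with_fallback`, `ExtractionError`, and the stub extractors are all hypothetical names, and the real Docling and MarkItDown integrations are replaced by plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ExtractionError(Exception):
    """Raised when every applicable extractor in the chain has failed."""


@dataclass(frozen=True)
class Extractor:
    name: str                      # hypothetical label, e.g. "docling"
    extensions: frozenset          # file extensions this extractor accepts
    extract: Callable[[str], str]  # takes a path, returns extracted text


def extract_with_fallback(
    path: str, extractors: Sequence[Extractor], order: Sequence[str]
) -> tuple[str, str]:
    """Try extractors in the configured order; return (extractor_name, text)."""
    suffix = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    by_name = {e.name: e for e in extractors}
    errors = []
    for name in order:
        extractor = by_name.get(name)
        if extractor is None or suffix not in extractor.extensions:
            continue  # unknown name or unsupported format: skip it
        try:
            return name, extractor.extract(path)
        except Exception as exc:  # any failure hands over to the next extractor
            errors.append(f"{name}: {exc}")
    raise ExtractionError(f"no extractor succeeded for {path!r}: {errors}")


# Stub extractors standing in for the real libraries (assumption).
def _failing(path: str) -> str:
    raise RuntimeError("model unavailable")


def _plain(path: str) -> str:
    return f"text from {path}"


EXTRACTORS = [
    Extractor("docling", frozenset({"pdf", "docx"}), _failing),
    Extractor("markitdown", frozenset({"pdf", "docx", "csv"}), _plain),
]

print(extract_with_fallback("report.pdf", EXTRACTORS, ["docling", "markitdown"]))
# → ('markitdown', 'text from report.pdf')
```

Keeping the order as a plain list of names is one way to make the chain both deterministic and configurable: the same input and the same order always walk the extractors in the same sequence.
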
**Deployment and configuration enhancements:**
* Added support for HuggingFace and ModelScope model cache directories
in the extractor deployment, with corresponding environment variables
(`HF_HOME`, `HUGGINGFACE_HUB_CACHE`, `MODELSCOPE_HOME`,
`XDG_CACHE_HOME`) and volume mounts. These cache paths are now
configurable via `values.yaml`.
[[1]](diffhunk://#diff-673dd2d3d4e66a8fd4e45f9c1c9900711313f946bf8b6a89e96c954988fc14f3R404-R406)
[[2]](diffhunk://#diff-289e7e7aa5f8a10603dafc1c094fa3487201006a7d5429a0dd9c6c80b3426fcfR28-R63)
[[3]](diffhunk://#diff-289e7e7aa5f8a10603dafc1c094fa3487201006a7d5429a0dd9c6c80b3426fcfR80-R81)
[[4]](diffhunk://#diff-289e7e7aa5f8a10603dafc1c094fa3487201006a7d5429a0dd9c6c80b3426fcfL99-R128)
[[5]](diffhunk://#diff-3ab40efdb049da16ac327c9fbaf8ec1d25f26efbeded4e0c2cfd7f50b976d3ceR80-R87)
* Improved init container scripts for both admin-backend and extractor
deployments: added strict error handling (`set -euo pipefail`), ensured
cleanup of temporary files, and set correct permissions and ownership
for NLTK data and cache directories.
[[1]](diffhunk://#diff-2b6f7f2ec4938055207faa53acf7a300e0ec235db31d1cfb6896703b97292348R39-R49)
[[2]](diffhunk://#diff-289e7e7aa5f8a10603dafc1c094fa3487201006a7d5429a0dd9c6c80b3426fcfR28-R63)
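
What the cache preparation amounts to can be sketched in Python, although the actual init containers use shell scripts. The four variable names come from the description above; `prepare_cache_dirs`, the `0o775` mode, and the write probe are assumptions made for illustration only.

```python
import os
import tempfile
from pathlib import Path

# Variable names taken from the change description; everything else is illustrative.
CACHE_ENV_VARS = ("HF_HOME", "HUGGINGFACE_HUB_CACHE", "MODELSCOPE_HOME", "XDG_CACHE_HOME")


def prepare_cache_dirs(env: dict, mode: int = 0o775) -> dict:
    """Create each configured cache directory and verify that it is writable."""
    prepared = {}
    for var in CACHE_ENV_VARS:
        value = env.get(var)
        if not value:
            continue  # variable not set: nothing to prepare
        path = Path(value)
        path.mkdir(parents=True, exist_ok=True)
        os.chmod(path, mode)
        # Write probe: the temporary file is deleted when the block exits,
        # so no stray files are left behind in the cache directory.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        prepared[var] = path
    return prepared


with tempfile.TemporaryDirectory() as root:
    env = {"HF_HOME": f"{root}/hf", "XDG_CACHE_HOME": f"{root}/xdg"}
    result = prepare_cache_dirs(env)
    print(sorted(result))
    # → ['HF_HOME', 'XDG_CACHE_HOME']
```

Failing early on an unwritable directory plays the same role as `set -euo pipefail` in the shell scripts: the pod stops at start-up instead of failing later during a model download.
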
**Configuration and environment variable cleanup:**
* Removed the now-obsolete `pdfextractor` configmap and related
environment variables, consolidating extractor configuration and
simplifying Helm templates.
[[1]](diffhunk://#diff-3ab40efdb049da16ac327c9fbaf8ec1d25f26efbeded4e0c2cfd7f50b976d3ceL55-L58)
[[2]](diffhunk://#diff-d72bec7914fc3e7d3fe01a8c0cbdb24832a26956bae5563d109bf8bb19955e0eL12-L20)
[[3]](diffhunk://#diff-673dd2d3d4e66a8fd4e45f9c1c9900711313f946bf8b6a89e96c954988fc14f3L467-L469)
[[4]](diffhunk://#diff-2b6f7f2ec4938055207faa53acf7a300e0ec235db31d1cfb6896703b97292348L111-L112)
* Updated Python version specification in `pyproject.toml` to use a
version range instead of a caret, and added a per-file ignore for
docstring warnings in `__init__.py`.
[[1]](diffhunk://#diff-dede389bcfb615c4b45cd1da7ac14cbe9535305f41f19cce09e321c91a8bb323R46)
[[2]](diffhunk://#diff-dede389bcfb615c4b45cd1da7ac14cbe9535305f41f19cce09e321c91a8bb323L79-R80)
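
To show what replacing a caret with a version range means, the sketch below expands a Poetry-style caret constraint into its explicit bounds. The commit message does not state the actual Python versions, so `^3.11` is a hypothetical input, and `caret_to_range` is an illustrative helper rather than part of any library.

```python
def caret_to_range(spec: str) -> str:
    """Expand a caret constraint such as '^3.11' into an explicit range."""
    if not spec.startswith("^"):
        raise ValueError("expected a caret constraint like '^3.11'")
    parts = [int(p) for p in spec[1:].split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError("expected one to three numeric components")

    lower = parts + [0] * (3 - len(parts))

    # The upper bound bumps the left-most non-zero component;
    # if every given component is zero, the last given one is bumped.
    upper = None
    for i, value in enumerate(parts):
        if value != 0:
            upper = parts[:i] + [value + 1]
            break
    if upper is None:
        upper = parts[:-1] + [parts[-1] + 1]
    upper = upper + [0] * (3 - len(upper))

    fmt = lambda nums: ".".join(str(n) for n in nums)
    return f">={fmt(lower)},<{fmt(upper)}"


print(caret_to_range("^3.11"))
# → >=3.11.0,<4.0.0
```

Writing the range out explicitly makes the upper bound visible and lets it be tightened independently of the lower bound, which a caret does not allow.
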
---------
Co-authored-by: Andreas Klos <[email protected]>

1 parent 144d88f commit 0ab2f31
File tree
44 files changed, +10723 -3901 lines changed
- infrastructure/rag
- templates
- admin-backend
- extractor
- libs/extractor-api-lib
- src/extractor_api_lib
- impl
- api_endpoints
- extractors/file_extractors
- settings
- types
- tests
- test_data
- services
- document-extractor
- frontend
- libs
- admin-app/feature-document
- i18n/admin