Commit a9f41b0

docs: add information extraction example (#2199)
* docs: add information exctraction example
Signed-off-by: Panos Vagenas <[email protected]>
* update README
Signed-off-by: Panos Vagenas <[email protected]>
* minor typo
Signed-off-by: Panos Vagenas <[email protected]>
* update README
Signed-off-by: Panos Vagenas <[email protected]>
---------
Signed-off-by: Panos Vagenas <[email protected]>
1 parent b3d7542 commit a9f41b0

5 files changed: +689 -6 lines changed

README.md

Lines changed: 7 additions & 3 deletions
@@ -29,17 +29,20 @@ Docling simplifies document processing, parsing diverse formats — including ad
 
 ## Features
 
-* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images (PNG, TIFF, JPEG, ...), and more
+* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, images (PNG, TIFF, JPEG, ...), and more
 * 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
 * 🧬 Unified, expressive [DoclingDocument][docling_document] representation format
-* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
+* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON
 * 🔒 Local execution capabilities for sensitive data and air-gapped environments
 * 🤖 Plug-and-play [integrations][integrations] incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
 * 🔍 Extensive OCR support for scanned PDFs and images
 * 👓 Support of several Visual Language Models ([SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview))
-* 🎙️ Support for Audio with Automatic Speech Recognition (ASR) models
+* 🎙️ Audio support with Automatic Speech Recognition (ASR) models
 * 💻 Simple and convenient CLI
 
+### What's new
+* 📤 Structured [information extraction][extraction] \[🧪 beta\]
+
 ### Coming soon
 
 * 📝 Metadata extraction, including title, authors, references & language
@@ -150,3 +153,4 @@ The project was started by the AI for knowledge team at IBM Research Zurich.
 [supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/
 [docling_document]: https://docling-project.github.io/docling/concepts/docling_document/
 [integrations]: https://docling-project.github.io/docling/integrations/
+[extraction]: https://docling-project.github.io/docling/examples/extraction/

docs/examples/dpk-ingest-chunck-tokenize.ipynb renamed to docs/examples/dpk-ingest-chunk-tokenize.ipynb

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 "id": "3f312845",
 "metadata": {},
 "source": [
-"# 🛡️ Chunking and tokenizing HTML documents using Data Prep Kit and the Docling Transforms\n",
+"# Chunking & tokenization with Data Prep Kit\n",
 "\n",
 "This notebook demonstrates how to build a sequence of <a href=https://github.com/data-prep-kit/data-prep-kit> <b>DPK transforms</b> </a> for ingesting HTML documents using Docling2Parquet transforms and chunking them using Doc_Chunk transform. Both transforms are based on the <a href=https://docling-project.github.io/docling/> Docling library</a>. \n",
 "\n",
