
Commit 271b9b8

feat: docx converter
Implemented DocxConverter for converting DOCX files to ContextGem Document objects. Updated docs and README, including new usage examples. Fixed LLM config serialization. Updated tests and cassettes.
Parent: d659e98

68 files changed (71482 additions, 33385 deletions)


.gitignore

Lines changed: 3 additions & 0 deletions
````diff
@@ -8,10 +8,13 @@ venv
 .venv
 .coverage
 .cz.msg
+.vscode
 ~$*
 *.tmp
 
 notebooks
+htmlcov
+coverage_annotate
 !dev/notebooks
 docs/build
 dist
````

CITATION.cff

Lines changed: 1 addition & 1 deletion
````diff
@@ -4,6 +4,6 @@ authors:
 - family-names: Shcherbak
   given-names: Sergii
   email: sergii@shcherbak.ai
-title: "ContextGem: Easier and faster way to build LLM extraction workflows through powerful abstractions"
+title: "ContextGem: Effortless LLM extraction from documents"
 date-released: 2025-04-02
 url: "https://github.com/shcherbak-ai/contextgem"
````

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -197,7 +197,7 @@ You **must** re-record cassettes if:
 ```
 4. Run your tests, which will create new cassette files
 
-**Important**: Re-recording cassettes will use your OpenAI API key and may incur charges to your account based on the number and type of API calls made during testing. Please be aware of these potential costs before re-recording. (Re-running the whole test suite with the current set of OpenAI LLMs and making actual LLM API requests currently incurs up to $0.50 USD, based on the default OpenAI API pricing.)
+**Important**: Re-recording cassettes will use your OpenAI API key and may incur charges to your account based on the number and type of API calls made during testing. Please be aware of these potential costs before re-recording.
 
 Note that our VCR configuration is set up to automatically strip API keys and other personal data from the cassettes by default.
 
````

NOTICE

Lines changed: 2 additions & 2 deletions
````diff
@@ -1,5 +1,5 @@
-ContextGem - Easier and faster way to build LLM extraction workflows through powerful abstractions
-=========================================================================================================
+ContextGem - Effortless LLM extraction from documents
+======================================================
 
 Copyright (c) 2025 Shcherbak AI AS
 All rights reserved
````

README.md

Lines changed: 96 additions & 49 deletions
````diff
@@ -1,6 +1,6 @@
-![ContextGem](https://contextgem.dev/_static/contextgem_poster.png "ContextGem - Easier and faster way to build LLM extraction workflows through powerful abstractions")
+![ContextGem](https://contextgem.dev/_static/contextgem_poster.png "ContextGem - Effortless LLM extraction from documents")
 
-# ContextGem: Easier and faster way to build LLM extraction workflows
+# ContextGem: Effortless LLM extraction from documents
 
 [![tests](https://github.com/shcherbak-ai/contextgem/actions/workflows/ci-tests.yml/badge.svg?branch=main)](https://github.com/shcherbak-ai/contextgem/actions/workflows/ci-tests.yml)
 [![Coverage](https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/SergiiShcherbak/daaee00e1dfff7a29ca10a922ec3becd/raw/coverage.json)](https://github.com/shcherbak-ai/contextgem/actions)
````
````diff
@@ -13,13 +13,14 @@
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat)](https://pycqa.github.io/isort/)
 [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
+[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
 [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-blue?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
 [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](CODE_OF_CONDUCT.md)
 
 <img src="https://contextgem.dev/_static/tab_solid.png" alt="ContextGem: 2nd Product of the week" width="250">
 <br/><br/>
 
-ContextGem is a free, open-source LLM framework for easier, faster extraction of structured data and insights from documents through powerful abstractions.
+ContextGem is a free, open-source LLM framework that makes it radically easier to extract structured data and insights from documents — with minimal code.
 
 
 ## 💎 Why ContextGem?
````
````diff
@@ -46,113 +47,113 @@ Read more on the project [motivation](https://contextgem.dev/motivation.html) in the documentation.
 <td>
 Automated dynamic prompts
 </td>
-<td></td>
-<td></td>
+<td>🟢</td>
+<td></td>
 </tr>
 <tr>
 <td>
 Automated data modelling and validators
 </td>
-<td></td>
-<td></td>
+<td>🟢</td>
+<td></td>
 </tr>
 <tr>
 <td>
 Precise granular reference mapping (paragraphs & sentences)
 </td>
-<td></td>
-<td></td>
+<td>🟢</td>
+<td></td>
 </tr>
 <tr>
 <td>
 Justifications (reasoning backing the extraction)
 </td>
-<td></td>
-<td></td>
+<td>🟢</td>
+<td></td>
 </tr>
 <tr>
 <td>
 Neural segmentation (SaT)
 </td>
-<td></td>
-<td></td>
+<td>🟢</td>
+<td></td>
 </tr>
 <tr>
 <td>
 Multilingual support (I/O without prompting)
 </td>
-<td></td>
-<td></td>
+<td>🟢</td>
+<td></td>
 </tr>
 <tr>
 <td>
 Single, unified extraction pipeline (declarative, reusable, fully serializable)
 </td>
-<td></td>
-<td>⚠️</td>
+<td>🟢</td>
+<td>🟡</td>
 </tr>
 <tr>
 <td>
 Grouped LLMs with role-specific tasks
 </td>
-<td></td>
-<td>⚠️</td>
+<td>🟢</td>
+<td>🟡</td>
 </tr>
 <tr>
 <td>
 Nested context extraction
 </td>
-<td></td>
-<td>⚠️</td>
+<td>🟢</td>
+<td>🟡</td>
 </tr>
 <tr>
 <td>
 Unified, fully serializable results storage model (document)
 </td>
-<td></td>
-<td>⚠️</td>
+<td>🟢</td>
+<td>🟡</td>
 </tr>
 <tr>
 <td>
 Extraction task calibration with examples
 </td>
-<td></td>
-<td>⚠️</td>
+<td>🟢</td>
+<td>🟡</td>
 </tr>
 <tr>
 <td>
 Built-in concurrent I/O processing
 </td>
-<td></td>
-<td>⚠️</td>
+<td>🟢</td>
+<td>🟡</td>
 </tr>
 <tr>
 <td>
 Automated usage & costs tracking
 </td>
-<td></td>
-<td>⚠️</td>
+<td>🟢</td>
+<td>🟡</td>
 </tr>
 <tr>
 <td>
 Fallback and retry logic
 </td>
-<td></td>
-<td></td>
+<td>🟢</td>
+<td>🟢</td>
 </tr>
 <tr>
 <td>
 Multiple LLM providers
 </td>
-<td></td>
-<td></td>
+<td>🟢</td>
+<td>🟢</td>
 </tr>
 </tbody>
 </table>
 
-- fully supported - no additional setup required<br>
-⚠️ - partially supported - requires additional setup<br>
-- not supported - requires custom logic
+🟢 - fully supported - no additional setup required<br>
+🟡 - partially supported - requires additional setup<br>
+- not supported - requires custom logic
 
 \* See [descriptions](https://contextgem.dev/motivation.html#the-contextgem-solution) of ContextGem abstractions and [comparisons](https://contextgem.dev/vs_other_frameworks.html) of specific implementation examples using ContextGem and other popular open-source LLM frameworks.
 
````
````diff
@@ -181,6 +182,8 @@ pip install -U contextgem
 
 ### Aspect extraction
 
+> **Aspect is a defined area or topic** within a document (or another aspect). For example, "Payment terms" clauses in a contract.
+
 ```python
 # Quick Start Example - Extracting payment terms from a document
 
````

````diff
@@ -238,6 +241,8 @@ for item in doc.aspects[0].extracted_items:
 
 ### Concept extraction
 
+> **Concept is a unit of information or an entity**, derived from an aspect or the broader document context. Concepts represent a wide range of data points and insights, from simple entities (names, dates, monetary values) to complex evaluations, conclusions, and answers to specific questions.
+
 ```python
 # Quick Start Example - Extracting anomalies from a document, with source references and justifications
 
````

````diff
@@ -313,6 +318,42 @@ See more examples in the documentation:
 - [Using a Multi-LLM Pipeline to Extract Data from Several Documents](https://contextgem.dev/advanced_usage.html#using-a-multi-llm-pipeline-to-extract-data-from-several-documents)
 
 
+## 🔄 Document converters
+
+To create a document for analysis, you can either pass raw text directly, or use ContextGem's built-in document converters that handle various file formats.
+
+### 📄 DOCX converter
+
+ContextGem provides a built-in converter to easily transform DOCX files into ContextGem document objects.
+
+- Extracts information that other open-source tools often do not capture: misaligned tables, comments, footnotes, textboxes, headers/footers, and embedded images
+- Preserves document structure with rich metadata for improved LLM analysis
+
+```python
+# Using ContextGem's DocxConverter
+
+from contextgem import DocxConverter
+
+converter = DocxConverter()
+
+# Convert a DOCX file to a ContextGem Document
+# from path
+document = converter.convert("path/to/document.docx")
+# or from file object
+with open("path/to/document.docx", "rb") as docx_file_object:
+    document = converter.convert(docx_file_object)
+
+# You can also use it as a standalone text extractor
+docx_text = converter.convert_to_text_format(
+    "path/to/document.docx",
+    output_format="markdown",  # or "raw"
+)
+```
+
+Learn more about [DOCX converter features](https://contextgem.dev/converters/docx.html) in the documentation.
+
+
 ## 🎯 Focused document analysis
 
 ContextGem leverages LLMs' long context windows to deliver superior extraction accuracy from individual documents. Unlike RAG approaches that often [struggle with complex concepts and nuanced insights](https://www.linkedin.com/pulse/raging-contracts-pitfalls-rag-contract-review-shcherbak-ai-ptg3f), ContextGem capitalizes on [continuously expanding context capacity](https://arxiv.org/abs/2502.12962), evolving LLM capabilities, and decreasing costs. This focused approach enables direct information extraction from complete documents, eliminating retrieval inconsistencies while optimizing for in-depth single-document analysis. While this delivers higher accuracy for individual documents, ContextGem does not currently support cross-document querying or corpus-wide retrieval - for these use cases, modern RAG systems (e.g., LlamaIndex, Haystack) remain more appropriate.
````
````diff
@@ -325,6 +366,7 @@ Read more on [how ContextGem works](https://contextgem.dev/how_it_works.html) in the documentation.
 ContextGem supports both cloud-based and local LLMs through [LiteLLM](https://github.com/BerriAI/litellm) integration:
 - **Cloud LLMs**: OpenAI, Anthropic, Google, Azure OpenAI, and more
 - **Local LLMs**: Run models locally using providers like Ollama, LM Studio, etc.
+- **Model Architectures**: Works with both reasoning/CoT-capable (e.g. o4-mini) and non-reasoning models (e.g. gpt-4.1)
 - **Simple API**: Unified interface for all LLMs with easy provider switching
 
 
````

````diff
@@ -339,14 +381,25 @@ ContextGem documentation offers guidance on optimization strategies to maximize
 - [Choosing the Right LLM(s)](https://contextgem.dev/optimizations/optimization_choosing_llm.html)
 
 
+## 💾 Serializing results
+
+ContextGem allows you to save and load Document objects, pipelines, and LLM configurations with built-in serialization methods:
+
+- Save processed documents to avoid repeating expensive LLM calls
+- Transfer extraction results between systems
+- Persist pipeline and LLM configurations for later reuse
+
+Learn more about [serialization options](https://contextgem.dev/serialization.html) in the documentation.
+
+
 ## 📚 Documentation
 
 Full documentation is available at [contextgem.dev](https://contextgem.dev).
 
 A raw text version of the full documentation is available at [`docs/docs-raw-for-llm.txt`](https://github.com/shcherbak-ai/contextgem/blob/main/docs/docs-raw-for-llm.txt). This file is automatically generated and contains all documentation in a format optimized for LLM ingestion (e.g. for Q&A).
 
 
-## 🗨️ Community
+## 💬 Community
 
 If you have a feature request or a bug report, feel free to [open an issue](https://github.com/shcherbak-ai/contextgem/issues/new) on GitHub. If you'd like to discuss a topic or get general advice on using ContextGem for your project, start a thread in [GitHub Discussions](https://github.com/shcherbak-ai/contextgem/discussions/new/).
 
````

````diff
@@ -356,25 +409,14 @@ If you have a feature request or a bug report, feel free to [open an issue](http
 We welcome contributions from the community - whether it's fixing a typo or developing a completely new feature! To get started, please check out our [Contributor Guidelines](https://github.com/shcherbak-ai/contextgem/blob/main/CONTRIBUTING.md).
 
 
-## 🗺️ Roadmap
-
-ContextGem is at an early stage. Our development roadmap includes:
-
-- **Enhanced Analytical Abstractions**: Building more sophisticated analytical layers on top of the core extraction workflow to enable deeper insights and more complex document understanding
-- **API Simplification**: Continuing to refine and streamline the API surface to make document analysis more intuitive and accessible
-- **Terminology Refinement**: Improving consistency and clarity of terminology throughout the framework to enhance developer experience
-
-We are committed to making ContextGem the most effective tool for extracting structured information from documents.
-
-
 ## 🔐 Security
 
 This project is automatically scanned for security vulnerabilities using [CodeQL](https://codeql.github.com/). We also use [Snyk](https://snyk.io) as needed for supplementary dependency checks.
 
 See [SECURITY](https://github.com/shcherbak-ai/contextgem/blob/main/SECURITY.md) file for details.
 
 
-## 🙏 Acknowledgements
+## 💖 Acknowledgements
 
 ContextGem relies on these excellent open-source packages:
 
````
````diff
@@ -388,6 +430,11 @@ ContextGem relies on these excellent open-source packages:
 - [aiolimiter](https://github.com/mjpieters/aiolimiter): Powerful rate limiting for async operations
 
 
+## 🌱 Support the project
+
+ContextGem is just getting started, and your support means the world to us! If you find ContextGem useful, the best way to help is by sharing it with others and giving the project a ⭐. Your feedback and contributions are what make this project grow!
+
+
 ## 📄 License & Contact
 
 This project is licensed under the Apache 2.0 License - see the [LICENSE](https://github.com/shcherbak-ai/contextgem/blob/main/LICENSE) and [NOTICE](https://github.com/shcherbak-ai/contextgem/blob/main/NOTICE) files for details.
````

contextgem/__init__.py

Lines changed: 4 additions & 1 deletion
````diff
@@ -17,7 +17,7 @@
 #
 
 """
-ContextGem - Easier and faster way to build LLM extraction workflows through powerful abstractions
+ContextGem - Effortless LLM extraction from documents
 """
 
 __version__ = "0.1.2"
@@ -31,6 +31,7 @@
     DocumentLLM,
     DocumentLLMGroup,
     DocumentPipeline,
+    DocxConverter,
     Image,
     JsonObjectConcept,
     JsonObjectExample,
@@ -78,4 +79,6 @@
     # Utils
     "image_to_base64",
     "reload_logger_settings",
+    # Converters
+    "DocxConverter",
 ]
````
