Implemented `DocxConverter` for converting DOCX files to ContextGem `Document` objects. Updated docs and README, including new usage examples. Fixed LLM config serialization. Updated tests and cassettes.
## `CONTRIBUTING.md` (+1 −1)

````diff
@@ -197,7 +197,7 @@ You **must** re-record cassettes if:
 ```
 4. Run your tests, which will create new cassette files
 
-**Important**: Re-recording cassettes will use your OpenAI API key and may incur charges to your account based on the number and type of API calls made during testing. Please be aware of these potential costs before re-recording. (Re-running the whole test suite with the current set of OpenAI LLMs and making actual LLM API requests currently incurs up to $0.50 USD, based on the default OpenAI API pricing.)
+**Important**: Re-recording cassettes will use your OpenAI API key and may incur charges to your account based on the number and type of API calls made during testing. Please be aware of these potential costs before re-recording.
 
 Note that our VCR configuration is set up to automatically strip API keys and other personal data from the cassettes by default.
````
## `README.md` (+96 −49)
```diff
@@ -1,6 +1,6 @@
-# ContextGem: Easier and faster way to build LLM extraction workflows
+# ContextGem: Effortless LLM extraction from documents
 <img src="https://contextgem.dev/_static/tab_solid.png" alt="ContextGem: 2nd Product of the week" width="250">
```
```diff
 <br/><br/>
 
-ContextGem is a free, open-source LLM framework for easier, faster extraction of structured data and insights from documents through powerful abstractions.
+ContextGem is a free, open-source LLM framework that makes it radically easier to extract structured data and insights from documents — with minimal code.
```
````diff
@@ -181,6 +182,8 @@ pip install -U contextgem
 ### Aspect extraction
 
+> **An aspect is a defined area or topic** within a document (or another aspect). For example, "Payment terms" clauses in a contract.
+
 ```python
 # Quick Start Example - Extracting payment terms from a document
````
````diff
@@ -238,6 +241,8 @@ for item in doc.aspects[0].extracted_items:
 ### Concept extraction
 
+> **A concept is a unit of information or an entity**, derived from an aspect or the broader document context. Concepts represent a wide range of data points and insights, from simple entities (names, dates, monetary values) to complex evaluations, conclusions, and answers to specific questions.
+
 ```python
 # Quick Start Example - Extracting anomalies from a document, with source references and justifications
````
Added section:

## 🔄 Document converters

To create a document for analysis, you can either pass raw text directly, or use ContextGem's built-in document converters that handle various file formats.

### 📄 DOCX converter

ContextGem provides a built-in converter to easily transform DOCX files into ContextGem document objects.

- Extracts information that other open-source tools often do not capture: misaligned tables, comments, footnotes, textboxes, headers/footers, and embedded images
- Preserves document structure with rich metadata for improved LLM analysis

```python
from contextgem import DocxConverter

converter = DocxConverter()

# Convert a DOCX file to a ContextGem Document object
with open("path/to/document.docx", "rb") as docx_file_object:
    document = converter.convert(docx_file_object)

# You can also use it as a standalone text extractor
docx_text = converter.convert_to_text_format(
    "path/to/document.docx",
    output_format="markdown",  # or "raw"
)
```

Learn more about [DOCX converter features](https://contextgem.dev/converters/docx.html) in the documentation.
## 🎯 Focused document analysis

ContextGem leverages LLMs' long context windows to deliver superior extraction accuracy from individual documents. Unlike RAG approaches that often [struggle with complex concepts and nuanced insights](https://www.linkedin.com/pulse/raging-contracts-pitfalls-rag-contract-review-shcherbak-ai-ptg3f), ContextGem capitalizes on [continuously expanding context capacity](https://arxiv.org/abs/2502.12962), evolving LLM capabilities, and decreasing costs. This focused approach enables direct information extraction from complete documents, eliminating retrieval inconsistencies while optimizing for in-depth single-document analysis. While this delivers higher accuracy for individual documents, ContextGem does not currently support cross-document querying or corpus-wide retrieval - for these use cases, modern RAG systems (e.g., LlamaIndex, Haystack) remain more appropriate.
```diff
@@ -325,6 +366,7 @@ Read more on [how ContextGem works](https://contextgem.dev/how_it_works.html) in
 ContextGem supports both cloud-based and local LLMs through [LiteLLM](https://github.com/BerriAI/litellm) integration:
 
 - **Cloud LLMs**: OpenAI, Anthropic, Google, Azure OpenAI, and more
 - **Local LLMs**: Run models locally using providers like Ollama, LM Studio, etc.
+- **Model Architectures**: Works with both reasoning/CoT-capable (e.g. o4-mini) and non-reasoning models (e.g. gpt-4.1)
 - **Simple API**: Unified interface for all LLMs with easy provider switching
```
Added section:

## 💾 Serializing results

ContextGem allows you to save and load Document objects, pipelines, and LLM configurations with built-in serialization methods:

- Save processed documents to avoid repeating expensive LLM calls
- Transfer extraction results between systems
- Persist pipeline and LLM configurations for later reuse

Learn more about [serialization options](https://contextgem.dev/serialization.html) in the documentation.
## 📚 Documentation

Full documentation is available at [contextgem.dev](https://contextgem.dev).

A raw text version of the full documentation is available at [`docs/docs-raw-for-llm.txt`](https://github.com/shcherbak-ai/contextgem/blob/main/docs/docs-raw-for-llm.txt). This file is automatically generated and contains all documentation in a format optimized for LLM ingestion (e.g. for Q&A).
401
349
-
## 🗨️ Community
402
+
## 💬 Community
350
403
351
404
If you have a feature request or a bug report, feel free to [open an issue](https://github.com/shcherbak-ai/contextgem/issues/new) on GitHub. If you'd like to discuss a topic or get general advice on using ContextGem for your project, start a thread in [GitHub Discussions](https://github.com/shcherbak-ai/contextgem/discussions/new/).
352
405
```diff
@@ -356,25 +409,14 @@
 We welcome contributions from the community - whether it's fixing a typo or developing a completely new feature! To get started, please check out our [Contributor Guidelines](https://github.com/shcherbak-ai/contextgem/blob/main/CONTRIBUTING.md).
 
-## 🗺️ Roadmap
-
-ContextGem is at an early stage. Our development roadmap includes:
-
-- **Enhanced Analytical Abstractions**: Building more sophisticated analytical layers on top of the core extraction workflow to enable deeper insights and more complex document understanding
-- **API Simplification**: Continuing to refine and streamline the API surface to make document analysis more intuitive and accessible
-- **Terminology Refinement**: Improving consistency and clarity of terminology throughout the framework to enhance developer experience
-
-We are committed to making ContextGem the most effective tool for extracting structured information from documents.
```
## 🔐 Security

This project is automatically scanned for security vulnerabilities using [CodeQL](https://codeql.github.com/). We also use [Snyk](https://snyk.io) as needed for supplementary dependency checks.

See the [SECURITY](https://github.com/shcherbak-ai/contextgem/blob/main/SECURITY.md) file for details.
```diff
-## 🙏 Acknowledgements
+## 💖 Acknowledgements
```

ContextGem relies on these excellent open-source packages:
- [aiolimiter](https://github.com/mjpieters/aiolimiter): Powerful rate limiting for async operations

Added section:

## 🌱 Support the project

ContextGem is just getting started, and your support means the world to us! If you find ContextGem useful, the best way to help is by sharing it with others and giving the project a ⭐. Your feedback and contributions are what make this project grow!

## 📄 License & Contact

This project is licensed under the Apache 2.0 License - see the [LICENSE](https://github.com/shcherbak-ai/contextgem/blob/main/LICENSE) and [NOTICE](https://github.com/shcherbak-ai/contextgem/blob/main/NOTICE) files for details.