Commit c1c0d9b

Merge pull request #1 from shcherbak-ai/dev
Merge with Dev
2 parents af18f23 + 04da166 commit c1c0d9b

10 files changed: +628 -393 lines changed

.github/workflows/README.md

Lines changed: 31 additions & 0 deletions
@@ -2,6 +2,7 @@
 
 This directory contains GitHub Actions workflow configurations for continuous integration (CI) of the ContextGem project.
 
+
 ## Available Workflows
 
 ### tests (`ci-tests.yml`)
@@ -22,6 +23,33 @@ This directory contains GitHub Actions workflow configurations for continuous in
 - `CONTEXTGEM_OPENAI_API_KEY`: Secret OpenAI API key
 - `GIST_SECRET`: Secret token to upload coverage results to a gist for badge generation
 
+### CodeQL Analysis (`codeql.yml`)
+
+This workflow performs code security scanning using GitHub's CodeQL analysis engine.
+
+**Features:**
+- Scans Python codebase for security vulnerabilities and coding errors
+- Analyzes code quality and identifies potential issues
+- Results are available in the Security tab of the repository
+
+**Trigger:**
+- Automatically runs on push and pull request events on the main and dev branches
+- Scheduled to run weekly
+- Can be triggered manually through the GitHub Actions UI
+
+### Documentation Build (`docs.yml`)
+
+This workflow builds and deploys the project documentation to GitHub Pages.
+
+**Features:**
+- Builds documentation using Sphinx
+- Deploys documentation to GitHub Pages when merged to main
+- Creates preview builds on pull requests
+
+**Trigger:**
+- Automatically runs on push and pull request events on the main branch
+- Can be triggered manually through the GitHub Actions UI
+
 ### Check Contributor Agreement (`contributor-agreement-check.yml`)
 
 This workflow ensures all contributors have signed the Contributor Agreement by checking for properly filled agreement files.
@@ -35,7 +63,10 @@ This workflow ensures all contributors have signed the Contributor Agreement by
 **Trigger:**
 - Automatically runs on all pull request events (opened, synchronized, reopened)
 
+
 ## Running Workflows
 
 - **tests:** These run automatically on push/PR to the main branch
+- **CodeQL Analysis:** Runs automatically on push/PR to main/dev, weekly, and manually
+- **Documentation Build:** Runs automatically on push/PR to main and manually
 - **Check Contributor Agreement:** Runs automatically on all PRs

.github/workflows/ci-tests.yml

Lines changed: 3 additions & 2 deletions
@@ -2,9 +2,9 @@ name: tests
 
 on:
   push:
-    branches: [ main ]
+    branches: [ main, dev ]
   pull_request:
-    branches: [ main ]
+    branches: [ main, dev ]
   workflow_dispatch:
 
 jobs:
@@ -92,6 +92,7 @@ jobs:
   update-badge:
     needs: tests-with-vcr
     runs-on: ubuntu-latest
+    if: github.ref == 'refs/heads/main'
     steps:
       - name: Download coverage artifact
         uses: actions/download-artifact@v4
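
The new `if:` condition on `update-badge` can be read as a simple predicate on the ref that triggered the run. The sketch below is a minimal Python illustration, assuming only the standard ref formats (`refs/heads/<branch>` for pushes, `refs/pull/<number>/merge` for pull requests). The function name `should_update_badge` is hypothetical and not part of the workflow.

```python
def should_update_badge(github_ref: str) -> bool:
    """Mirror the job condition `github.ref == 'refs/heads/main'`.

    Hypothetical helper for illustration only; the real check is
    evaluated by GitHub Actions, not by Python code.
    """
    return github_ref == "refs/heads/main"


# Refs that the updated triggers (push/PR on main and dev) can produce.
sample_refs = [
    "refs/heads/main",     # push to main
    "refs/heads/dev",      # push to dev
    "refs/pull/1/merge",   # pull request run
]

for ref in sample_refs:
    action = "update badge" if should_update_badge(ref) else "skip badge job"
    print(f"{ref}: {action}")
```

Under these assumptions, only pushes to `main` refresh the coverage badge, so runs on `dev` and on pull requests leave it untouched.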

.github/workflows/codeql.yml

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+name: "CodeQL"
+
+on:
+  push:
+    branches: [ main, dev ]
+  pull_request:
+    branches: [ main, dev ]
+  schedule:
+    - cron: '0 0 * * 0' # Run once per week at midnight on Sunday
+  workflow_dispatch:
+
+jobs:
+  analyze:
+    name: Analyze
+    runs-on: ubuntu-latest
+    permissions:
+      actions: read
+      contents: read
+      security-events: write
+
+    strategy:
+      fail-fast: false
+      matrix:
+        language: [ 'python' ]
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Initialize CodeQL
+        uses: github/codeql-action/init@v3
+        with:
+          languages: ${{ matrix.language }}
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.13'
+
+      - name: Install Poetry
+        uses: snok/install-poetry@v1
+        with:
+          virtualenvs-create: true
+          virtualenvs-in-project: true
+          installer-parallel: true
+
+      - name: Load cached pip wheels
+        id: cached-pip-wheels
+        uses: actions/cache@v4
+        with:
+          path: |
+            ~/.cache/pip
+            ~/Library/Caches/pip
+            ~\AppData\Local\pip\Cache
+          key: pip-${{ runner.os }}-python-${{ hashFiles('**/poetry.lock') }}
+
+      - name: Install dependencies
+        run: poetry install --no-interaction --with dev --no-root
+
+      - name: Perform CodeQL Analysis
+        uses: github/codeql-action/analyze@v3
+        with:
+          category: "/language:${{matrix.language}}"
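
The `schedule` trigger above uses the cron expression `0 0 * * 0`, which means minute 0, hour 0, on day-of-week 0 (Sunday). GitHub Actions evaluates schedules in UTC. As a rough illustration of when the weekly scan becomes due, here is a minimal Python sketch; `next_weekly_run` is a hypothetical helper, not something the workflow uses, and actual scheduled runs may start later than the nominal time.

```python
from datetime import datetime, timedelta, timezone


def next_weekly_run(now: datetime) -> datetime:
    """Return the next Sunday 00:00 UTC strictly after `now`.

    Sketch of the cron expression '0 0 * * 0'. `now` must be a
    timezone-aware datetime. Hypothetical helper for illustration.
    """
    now_utc = now.astimezone(timezone.utc)
    midnight = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    # datetime.weekday(): Monday == 0 ... Sunday == 6
    days_ahead = (6 - midnight.weekday()) % 7
    candidate = midnight + timedelta(days=days_ahead)
    if candidate <= now_utc:
        # Already at or past this Sunday's midnight: move to next week.
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    print(next_weekly_run(datetime.now(timezone.utc)).isoformat())
```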

.github/workflows/contributor-agreement-check.yml

Lines changed: 34 additions & 2 deletions
@@ -12,13 +12,44 @@ jobs:
   check-contributor-agreement:
     runs-on: ubuntu-latest
     steps:
+      - name: Check if user is a maintainer
+        id: check-maintainer
+        uses: actions/github-script@v7
+        with:
+          github-token: ${{ secrets.GITHUB_TOKEN }}
+          script: |
+            const { owner, repo } = context.repo;
+            const username = context.payload.pull_request.user.login;
+
+            try {
+              const { data: permission } = await github.rest.repos.getCollaboratorPermissionLevel({
+                owner,
+                repo,
+                username,
+              });
+
+              // Skip check for users with admin or write permissions
+              if (['admin', 'write'].includes(permission.permission)) {
+                console.log(`User ${username} is a maintainer with ${permission.permission} permissions. Skipping check.`);
+                return true;
+              }
+
+              console.log(`User ${username} has ${permission.permission} permissions. Continuing with check.`);
+              return false;
+            } catch (error) {
+              console.log(`Error checking permissions: ${error}`);
+              return false;
+            }
+
       - name: Checkout code
+        if: steps.check-maintainer.outputs.result != 'true'
         uses: actions/checkout@v4
         with:
           ref: ${{ github.event.pull_request.head.sha }}
           fetch-depth: 0
 
       - name: Check for contributor agreement
+        if: steps.check-maintainer.outputs.result != 'true'
         id: check-agreement
         run: |
           # Get the PR author's username
@@ -50,6 +81,7 @@ jobs:
           fi
 
       - name: Check for deleted contributor agreements
+        if: steps.check-maintainer.outputs.result != 'true'
         id: check-deleted
         run: |
           # Set proper base ref
@@ -68,8 +100,8 @@ jobs:
           fi
 
       - name: Comment on PR if checks fail
-        if: ${{ failure() }}
-        uses: actions/github-script@v6
+        if: ${{ failure() && steps.check-maintainer.outputs.result != 'true' }}
+        uses: actions/github-script@v7
         with:
           github-token: ${{ secrets.GITHUB_TOKEN }}
           script: |
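
The maintainer gate added in this file reduces to a small decision: the script's return value becomes the step's `result` output (JSON-encoded by `actions/github-script` by default), and each later step runs only when that output is not the string `'true'`. The Python sketch below models that flow under those assumptions. The names `maintainer_result` and `should_run_agreement_steps` are hypothetical, and a `None` permission stands in for the `catch` branch where the permission lookup fails.

```python
import json
from typing import Optional

# Permission levels the script treats as "maintainer" (from the diff above).
MAINTAINER_LEVELS = {"admin", "write"}


def maintainer_result(permission: Optional[str]) -> str:
    """Model the step's `result` output as a JSON-encoded boolean.

    `permission` is the collaborator permission level, or None when the
    lookup raised an error (the script returns false in that case).
    """
    return json.dumps(permission in MAINTAINER_LEVELS)


def should_run_agreement_steps(result: str) -> bool:
    """Model `if: steps.check-maintainer.outputs.result != 'true'`."""
    return result != "true"


for level in ("admin", "write", "read", None):
    result = maintainer_result(level)
    print(level, result, should_run_agreement_steps(result))
```

One consequence of this design is that a failed permission lookup falls back to running the agreement checks rather than skipping them.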

README.md

Lines changed: 38 additions & 6 deletions
@@ -7,14 +7,16 @@
 [![docs](https://github.com/shcherbak-ai/contextgem/actions/workflows/docs.yml/badge.svg?branch=main)](https://github.com/shcherbak-ai/contextgem/actions/workflows/docs.yml)
 [![documentation](https://img.shields.io/badge/docs-latest-blue.svg)](https://shcherbak-ai.github.io/contextgem/)
 [![License](https://img.shields.io/badge/License-Apache_2.0-bright.svg)](https://opensource.org/licenses/Apache-2.0)
+![PyPI](https://img.shields.io/pypi/v/contextgem)
 [![Python Versions](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue)](https://www.python.org/downloads/)
+[![Code Security](https://github.com/shcherbak-ai/contextgem/actions/workflows/codeql.yml/badge.svg?branch=main)](https://github.com/shcherbak-ai/contextgem/actions/workflows/codeql.yml)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat)](https://pycqa.github.io/isort/)
 [![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
 [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-blue?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
 [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](CODE_OF_CONDUCT.md)
 
-ContextGem is an LLM framework for easier, faster extraction of structured data and insights from documents through powerful abstractions.
+ContextGem is a free, open-source LLM framework for easier, faster extraction of structured data and insights from documents through powerful abstractions.
 
 
 ## 💎 Why ContextGem?
@@ -26,6 +28,16 @@ ContextGem addresses this challenge by providing a flexible, intuitive framework
 Read more on the project [motivation](https://contextgem.dev/motivation.html) in the documentation.
 
 
+## 💡 What can you do with ContextGem?
+
+With ContextGem, you can:
+- **Extract structured data** from documents (text, images) with minimal code
+- **Identify and analyze key aspects** (topics, themes, categories) within documents
+- **Extract specific concepts** (entities, facts, conclusions, assessments) from documents
+- **Build complex extraction workflows** through a simple, intuitive API
+- **Create multi-level extraction pipelines** (aspects containing concepts, hierarchical aspects)
+
+
 ## ⭐ Key features
 
 <table>
@@ -177,7 +189,7 @@ doc = Document(
         "The term of the agreement is 1 year from the Effective Date...\n"
         "The Supplier shall provide consultancy services as described in Annex 2...\n"
         "The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\n"
-        "The purple elephant danced gracefully on the moon while eating ice cream.\n" # out-of-context / anomaly
+        "The purple elephant danced gracefully on the moon while eating ice cream.\n" # 💎 anomaly
         "This agreement is governed by the laws of Norway...\n"
     ),
 )
@@ -191,8 +203,9 @@ doc.concepts = [
         reference_depth="sentences",
         add_justifications=True,
         justification_depth="brief",
-        # add more concepts to the document, if needed
     )
+    # add more concepts to the document, if needed
+    # see the docs for available concepts: StringConcept, JsonObjectConcept, etc.
 ]
 # Or use doc.add_concepts([...])
 
@@ -201,15 +214,17 @@ llm = DocumentLLM(
     model="openai/gpt-4o-mini", # or any other LLM from e.g. Anthropic, etc.
     api_key=os.environ.get(
         "CONTEXTGEM_OPENAI_API_KEY"
-    ), # your API key for the LLM provider
+    ), # your API key for the LLM provider, e.g. OpenAI, Anthropic, etc.
     # see the docs for more configuration options
 )
 
 # Extract information from the document
 doc = llm.extract_all(doc) # or use async version llm.extract_all_async(doc)
 
 # Access extracted information in the document object
-print(doc.concepts[0].extracted_items) # extracted items with references justifications
+print(
+    doc.concepts[0].extracted_items
+) # extracted items with references & justifications
 # or doc.get_concept_by_name("Anomalies").extracted_items
 
 ```
@@ -236,6 +251,14 @@ ContextGem leverages LLMs' long context windows to deliver superior extraction a
 Read more on [how it works](https://contextgem.dev/how_it_works.html) in the documentation.
 
 
+## 🤖 Supported LLMs
+
+ContextGem supports both cloud-based and local LLMs through [LiteLLM](https://github.com/BerriAI/litellm) integration:
+- **Cloud LLMs**: OpenAI, Anthropic, Google, Azure OpenAI, and more
+- **Local LLMs**: Run models locally using providers like Ollama, LM Studio, etc.
+- **Simple API**: Unified interface for all LLMs with easy provider switching
+
+
 ## ⚡ Optimizations
 
 ContextGem documentation offers guidance on optimization strategies to maximize performance, minimize costs, and enhance extraction accuracy:
@@ -275,11 +298,20 @@ ContextGem is at an early stage. Our development roadmap includes:
 We are committed to making ContextGem the most effective tool for extracting structured information from documents.
 
 
+## 🔐 Security
+
+This project is automatically scanned for security vulnerabilities using [CodeQL](https://codeql.github.com/). We also use [Snyk](https://snyk.io) as needed for supplementary dependency checks.
+
+See [SECURITY](https://github.com/shcherbak-ai/contextgem/blob/main/SECURITY.md) file for details.
+
+
 ## 📄 License & Contact
 
 This project is licensed under the Apache 2.0 License - see the [LICENSE](https://github.com/shcherbak-ai/contextgem/blob/main/LICENSE) and [NOTICE](https://github.com/shcherbak-ai/contextgem/blob/main/NOTICE) files for details.
 
-Copyright © 2025 [Shcherbak AI AS](https://shcherbak.ai) - AI engineering company developing tools for AI/ML/NLP developers.
+Copyright © 2025 [Shcherbak AI AS](https://shcherbak.ai), an AI engineering company building tools for AI/ML/NLP developers.
+
+Shcherbak AI is now part of Microsoft for Startups.
 
 [Connect with us on LinkedIn](https://www.linkedin.com/in/sergii-shcherbak-10068866/) for questions or collaboration ideas.

SECURITY.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+# Security Policy
+
+
+## Supported Versions
+
+We maintain security practices for the latest release of this library. Older versions may not receive security updates.
+
+
+## Security Testing
+
+This project is automatically tested for security issues using [CodeQL](https://codeql.github.com/) static analysis (run via GitHub Actions).
+
+We also use [Snyk](https://snyk.io) as needed for supplementary dependency vulnerability monitoring.
+
+
+## Data Privacy
+
+This library uses LiteLLM as a local Python package to communicate with LLM providers using unified interface. No data or telemetry is transmitted to LiteLLM servers, as the SDK is run entirely within the user's environment. According to LiteLLM's documentation, self-hosted or local SDK use involves no data storage and no telemetry. For details, see [LiteLLM's documentation](https://docs.litellm.ai/docs/data_security).
+
+
+## Reporting a Vulnerability
+
+We value the security community's role in protecting our users. If you discover a potential security issue in this project, please report it as follows:
+
+📧 **Email**: `sergii@shcherbak.ai`
+
+When reporting, please include:
+- A detailed description of the issue
+- Steps to reproduce the vulnerability
+- Any relevant logs, context, or configurations
+
+We aim to respond promptly to all valid reports. Please note that we do not currently offer a bug bounty program.
+
+
+## Questions?
+
+If you’re unsure whether something is a vulnerability or just a bug, feel free to reach out via the email above before submitting a full report.
