Skip to content

Commit b7fa3f9

Browse files
authored
Merge pull request #15 from future-agi/dev
[Chore] Release 1.0.0
2 parents d26f024 + 28d1bc2 commit b7fa3f9

File tree

433 files changed

+125360
-3813
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

433 files changed

+125360
-3813
lines changed

.gitignore

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -52,12 +52,17 @@ output/
5252
#csv files
5353
*.csv
5454
#virtual environments
55-
**/venv
55+
**/.venv
5656
**/env
5757
**/.env
5858
*.DS_Store
5959
*.pdf
60+
**/.fi
61+
**/.pytest_cache
62+
planning/
63+
manual-testing/
64+
6065

6166
# typescript
6267
node_modules/
63-
package-lock.json
68+
package-lock.json

CHANGELOG.md

Lines changed: 29 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,33 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
66
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
77

88

9+
## [1.0.0] - 2026-02-27
10+
11+
### Python
12+
13+
#### Added
14+
- **`evaluate()` unified API** — single entrypoint for local, cloud, and LLM-as-Judge evaluations with automatic engine routing
15+
- **Multimodal LLM Judge** — pass `image_url`, `audio_url`, `input_image_url`, `output_image_url` for vision/audio evaluation with Gemini, GPT-4o, etc.
16+
- **Auto-generate grading criteria**`generate_prompt=True` converts a short description into a detailed rubric via LLM
17+
- **LLM augmentation**`augment=True` runs local heuristic first, then LLM refines the score (faithfulness, hallucination_score, task_completion, etc.)
18+
- **Feedback loop system** — submit corrections, retrieve as few-shot examples via ChromaDB, calibrate pass/fail thresholds
19+
- **72+ local metrics** — string checks, JSON validation, similarity, NLI-based hallucination detection, RAG evaluation, function calling, agent trajectory, structured output, security guardrails
20+
- **OpenTelemetry integration**`enable_auto_enrichment()` emits `gen_ai.evaluation.*` spans for Jaeger/Datadog/Grafana
21+
- **Streaming evaluation** — token-by-token monitoring with configurable early stopping
22+
- **9 cookbooks** — local metrics, LLM judge, RAG evaluation, guardrails, streaming, autoeval, OTEL tracing, feedback loop, multimodal judge
23+
24+
#### Changed
25+
- **Poetry to uv** — migrated build system for 10x faster dependency resolution
26+
- Widened LLM provider type signatures from `Dict[str, str]` to `Dict[str, Any]` for multimodal content parts
27+
28+
#### Fixed
29+
- 6 code security scanner bugs (Phase 2)
30+
- Guardrails ensemble scoring and scanner edge cases
31+
- NLI consolidation and empty-input handling in RAG metrics
32+
- K8s backend JSON log parsing
33+
- Temporal Docker healthcheck and DB config
34+
- Celery serialization for closures
35+
936
## [0.2.2] - 2025-10-27
1037

1138
- Introducing LLM As A Judge
@@ -68,7 +95,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
6895

6996
---
7097

71-
[Unreleased]: https://github.com/future-agi/ai-evaluation/compare/v0.2.2...HEAD
98+
[Unreleased]: https://github.com/future-agi/ai-evaluation/compare/v1.0.0...HEAD
99+
[1.0.0]: https://github.com/future-agi/ai-evaluation/compare/v0.2.2...v1.0.0
72100
[0.2.2]: https://github.com/future-agi/ai-evaluation/compare/v0.2.1...v0.2.2
73101
[0.2.1]: https://github.com/future-agi/ai-evaluation/compare/v0.1.0...v0.2.1
74102
[0.1.0]: https://github.com/future-agi/ai-evaluation/releases/tag/v0.1.0

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ Thanks for your interest in contributing! 🎉
99
git clone https://github.com/YOUR_USERNAME/ai-evaluation.git
1010

1111
# Python
12-
cd python && poetry install && poetry run pytest
12+
cd python && uv sync --dev && uv run pytest
1313

1414
# TypeScript
1515
cd typescript/ai-evaluation && pnpm install && pnpm test

0 commit comments

Comments
 (0)