feat: OpenAI batch api #2245
Conversation
1a7eb06 to bd05971

/claude-review
Claude finished @anistark's task — View job

🔍 PR Code Review: OpenAI Batch API Implementation

Review Progress:

📋 Change Type Classification

✨ New Feature: This PR adds comprehensive OpenAI Batch API support to Ragas, enabling cost-effective evaluation with up to 50% savings for large-scale evaluations.

🚀 Strengths & Excellent Implementation

- ✅ Outstanding Architecture Design
- ✅ Robust Error Handling & Validation
- ✅ Intelligent Response Parsing: the score extraction logic in …
- ✅ Smart Faithfulness Integration
- ✅ Comprehensive Documentation
- ✅ Strong Test Coverage

🔄 Areas for Enhancement
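For context on the feature under review, the raw OpenAI Batch API workflow that an integration like this typically wraps looks roughly like the sketch below. This is a minimal illustration against the OpenAI Python SDK, not the Ragas wrapper added in this PR; the file names, model, and prompt are placeholders.

```python
# Minimal sketch of the OpenAI Batch API flow (illustrative only).
import json
from openai import OpenAI

client = OpenAI()

# 1. Write evaluation prompts as JSONL requests against /v1/chat/completions.
requests = [
    {
        "custom_id": f"sample-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",  # placeholder model
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    for i, prompt in enumerate(["Evaluate the faithfulness of ..."])
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

# 2. Upload the file and create the batch (processed asynchronously at reduced cost).
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll for completion and download the results file.
status = client.batches.retrieve(batch.id)
if status.status == "completed":
    results = client.files.content(status.output_file_id).text
```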
## Issue Link / Problem Description
- contd #2246

---------

Co-authored-by: Nabegon <[email protected]>
Ragas is a great tool, but the contents of the website can still be improved. I corrected typos and unified some names (like langchain, Langchain -> LangChain) in the Markdown documentation files. It seems that most of the docs were written by a single author, and many common typos were repeatedly introduced. There are also other issues with the docs, such as readability, categorization, and clarity. I had planned to polish them one by one, but I may not have enough time to polish all of them, because I only realized how many documentation pages there are after correcting the typos. XD This pull request only corrects typos, and I recommend that the contributor in charge of the docs review the modifications to avoid reintroducing these typos in the future. You may also consider using the LTeX extension from the VS Code marketplace to catch them (though it is not perfect). The matching ipynb files are not corrected, since I have not found an easy way to identify typos in ipynb files. You may consider correcting the ipynb files as well so that the Markdown files can be regenerated from them.

---------

Co-authored-by: Siddharth Sahu <[email protected]>
Deprecation warnings for LLMs and Prompts
## Changes Made
- **Added comprehensive RAG evaluation guide**: Created `docs/howtos/applications/evaluate-and-improve-rag.md` with step-by-step instructions for evaluating and improving RAG apps with Ragas
- **Created `ragas_examples.improve_rag` module** with complete working examples:
  - `simple_rag.py`: Basic RAG implementation using a BM25 retriever and an OpenAI LLM with MLflow tracing
  - `agentic_rag.py`: Advanced agentic RAG using the OpenAI Agents SDK that can iteratively search and refine queries
  - `evals.py`: Complete evaluation pipeline with Ragas experiments
  - `data_utils.py`: Shared utilities for BM25 retriever setup and document processing
- **Added new dependency group**: `improverag` extra in `examples/pyproject.toml` including MLflow, BM25, LangChain, and the OpenAI Agents SDK
- **Updated navigation**: Added the new guide to mkdocs.yml and the howtos applications index
- **Added MLflow tracing screenshot**: `docs/_static/imgs/howto_improve_rag_mlflow.png` showing evaluation traces

## Testing
### How to Test
- [x] Manual testing steps:
  1. Install dependencies: `uv pip install "ragas-examples[improverag]"`
  2. Set the OpenAI API key: `export OPENAI_API_KEY="your_key"`
  3. Run simple RAG: `uv run python -m ragas_examples.improve_rag.simple_rag`
  4. Run agentic RAG: `uv run python -m ragas_examples.improve_rag.agentic_rag`
  5. Run evaluation (test mode): `uv run python -m ragas_examples.improve_rag.evals --test`
  6. Run evaluation with agentic RAG: `uv run python -m ragas_examples.improve_rag.evals --agentic-rag --test`
  7. Start the MLflow UI: `uv run mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5000`
  8. Verify traces appear in the MLflow dashboard

---------

Co-authored-by: Kumar Anirudha <[email protected]>
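As a rough illustration of the BM25-backed retrieval that `simple_rag.py` and `data_utils.py` are described as setting up, a minimal sketch using LangChain's community BM25 retriever could look like the following. This is an assumption-laden sketch (document contents and `k` are placeholders), not the code from the example module.

```python
# Minimal sketch of a BM25 retriever via langchain_community (assumes the
# rank_bm25 package is installed); not the actual ragas_examples code.
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document

docs = [
    Document(page_content="Ragas evaluates RAG pipelines with LLM-based metrics."),
    Document(page_content="BM25 ranks documents by lexical term-frequency statistics."),
]

retriever = BM25Retriever.from_documents(docs)
retriever.k = 2  # number of documents to return per query

# Retrieve the top-k documents for a query.
for doc in retriever.invoke("How does Ragas evaluate RAG?"):
    print(doc.page_content)
```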
## Issue Link / Problem Description
- contd #2241

---------

Co-authored-by: sahusiddharth <[email protected]>
## Issue Link / Problem Description
- Fixes warnings in tests.
Description: This PR addresses an issue in the generate_multiple method where enabling caching caused it to return duplicate responses instead of generating multiple distinct outputs.

Problem: When generate_multiple is used with caching turned on, all calls used the same cache key internally. As a result:
- The first generated output is cached.
- Subsequent calls to generate_multiple fetch the same cached output instead of generating new ones.
- This breaks the intended functionality of producing multiple diverse outputs.

Root Cause: The cache key was not uniquely generated per input and per requested number of outputs. Instead, it reused the same key across all calls.

Solution:
- Modified the cache key generation to include all parameters that affect output (input text, number of outputs, and any relevant generation options).
- Ensured that each call to generate_multiple with the same input but requesting multiple outputs generates distinct results and caches them appropriately.
- Added safeguards to prevent cache collisions.

Testing Performed:
- Unit Tests: Verified that multiple outputs are returned for the same input when caching is enabled. Confirmed that caching works correctly without returning duplicates.
- Manual Testing: Tested with different inputs and multiple output requests. Ensured outputs are unique across multiple calls and cached appropriately.
- Regression Check: Confirmed that previous generate calls without multiple outputs continue to work as expected.

Impact: Restores correct functionality of generate_multiple when caching is enabled. Ensures caching still improves performance without affecting output correctness.

Additional Notes: This fix is backward-compatible and does not affect other parts of the codebase. All relevant formatting and linting checks have passed.
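A minimal sketch of the kind of cache-key construction described above; the function and parameter names are illustrative placeholders, not Ragas internals. The key hashes every input that influences the output, including the requested number of generations, so repeated calls with a different `n` no longer collide.

```python
# Illustrative sketch: derive the cache key from every parameter that affects
# the generated output, so generate_multiple(n=4) never reuses the n=1 entry.
import hashlib
import json
from typing import Optional

def make_cache_key(prompt: str, n: int, temperature: float,
                   stop: Optional[list] = None) -> str:
    payload = json.dumps(
        {"prompt": prompt, "n": n, "temperature": temperature, "stop": stop},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Same prompt, different generation counts -> distinct cache entries.
assert make_cache_key("query", 1, 0.7) != make_cache_key("query", 4, 0.7)
```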
…om class-instantiated metrics (#2316)

**Primary Motivation**: This PR fixes a fundamental inheritance pattern issue in the metrics system where factory-created metrics (via `@discrete_metric`, `@numeric_metric`, etc.) and class-instantiated metrics (via `DiscreteMetric()`, `NumericMetric()`, etc.) should have different base classes but were incorrectly sharing the same inheritance hierarchy.

**The Problem**:
- Factory-created metrics should inherit from `SimpleBaseMetric` (lightweight, decorator-based)
- Class-instantiated metrics should inherit from `SimpleLLMMetric` (LLM-enabled, full-featured)
- Previously, both paths incorrectly inherited from the same base classes, creating confusion and incorrect behavior

**The Solution**:
- **Separated base classes**: Created `SimpleBaseMetric` (for factory) and `SimpleLLMMetric` (for class instantiation) as distinct, unrelated base classes
- **Removed `llm_based.py`**: Consolidated `BaseLLMMetric` and `LLMMetric` into `base.py` as `SimpleBaseMetric` and `SimpleLLMMetric`
- **Fixed decorator inheritance**: Factory methods now create metrics that inherit from `SimpleBaseMetric + ValidatorMixin` only
- **Fixed class inheritance**: Class-based metrics like `DiscreteMetric` now inherit from `SimpleLLMMetric + ValidatorMixin`
- **Added validator system**: Introduced modular validation mixins that work with both inheritance patterns
- **Maintained backward compatibility**: Added aliases `BaseMetric = SimpleBaseMetric` and `LLMMetric = SimpleLLMMetric`

**Exact Steps Taken**:
1. `7d6de2a` - Updated gitignore for experimental directories
2. `c6101f8` - Renamed classes and established proper naming convention
3. `46450d8` - Refactored decorator and class-based inheritance patterns
4. `a464c37` - Simplified validator system with proper mixins
5. `fe996f6` - Removed `llm_based.py` after consolidation

- [ ] Verify factory-created metrics (`@discrete_metric`) inherit from `SimpleBaseMetric` only
- [ ] Verify class-instantiated metrics (`DiscreteMetric()`) inherit from `SimpleLLMMetric`
- [ ] Test that both patterns work correctly with their respective validation mixins
- [ ] Ensure backward compatibility with existing metric imports
- [ ] Validate all metric functionality (scoring, async operations, alignment)
- [ ] Run full test suite to ensure no regressions

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

---------

Co-authored-by: Ani <[email protected]>
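A schematic sketch of the inheritance split described above, using simplified stand-in classes to show the intent; this is not the actual Ragas implementation (the real base classes carry prompts, LLM handles, and validation logic).

```python
# Stand-in classes illustrating the two separate inheritance paths
# (schematic only; not the real Ragas base classes).
class ValidatorMixin:
    def validate_result(self, value):
        return value  # shared validation hook

class SimpleBaseMetric:
    """Lightweight base for factory/decorator-created metrics."""
    def __init__(self, name: str):
        self.name = name

class SimpleLLMMetric:
    """LLM-enabled base for class-instantiated metrics."""
    def __init__(self, name: str, prompt: str):
        self.name = name
        self.prompt = prompt

# Factory path: decorator-created metrics inherit from SimpleBaseMetric only.
class DecoratedMetric(SimpleBaseMetric, ValidatorMixin):
    pass

# Class path: instantiated metrics like DiscreteMetric inherit from SimpleLLMMetric.
class DiscreteMetric(SimpleLLMMetric, ValidatorMixin):
    pass
```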
## Issue Link / Problem Description
- Fixes #2324

## Changes Made
- Fixed an IndexError in generate_multiple when the LLM returns fewer generations than requested
## Issue Link / Problem Description
- Fixes #2326

## Changes Made
- Fixed answer_relevancy scoring logic to prevent false zero scores for valid answers
This PR adds direct support for Oracle Cloud Infrastructure (OCI) Generative AI models in Ragas, enabling evaluation without requiring LangChain or LlamaIndex dependencies. Currently, users who want to use OCI Gen AI models must go through LangChain or LlamaIndex wrappers, which adds unnecessary complexity and dependencies.

**Problem**: No direct OCI Gen AI integration exists in Ragas, forcing users to use indirect approaches through LangChain/LlamaIndex.

**Solution**: Implement a native OCI Gen AI wrapper that uses the OCI Python SDK directly.

- **Add `OCIGenAIWrapper`** - New LLM wrapper class extending `BaseRagasLLM`
- **Direct OCI SDK Integration** - Uses `oci.generative_ai.GenerativeAiClient` directly
- **Factory Function** - `oci_genai_factory()` for easy initialization
- **Async Support** - Full async/await implementation with proper error handling
- [Pretrained Foundational Models supported by OCI Generative AI](https://docs.oracle.com/en-us/iaas/Content/generative-ai/pretrained-models.htm)
- **Optional Dependency** - Added `oci>=2.160.1` as an optional dependency
- **Import Safety** - Graceful handling when the OCI SDK is not installed
- **Configuration Options** - Support for OCI CLI config, environment variables, or manual config
- **Comprehensive Test Suite** - 15+ test cases with mocking
- **Error Handling Tests** - Tests for authentication, model-not-found, and permission errors
- **Async Testing** - Full async operation testing
- **Factory Testing** - Factory function validation
- **Complete Integration Guide** - Step-by-step setup and usage
- **Working Example Script** - `examples/oci_genai_example.py`
- **Authentication Guide** - Multiple OCI auth methods
- **Troubleshooting Section** - Common issues and solutions
- **Updated Integration Index** - Added to the main integrations page
- **Usage Tracking** - Built-in analytics with `LLMUsageEvent`
- **Error Logging** - Comprehensive error logging and debugging
- **Performance Monitoring** - Request tracking and metrics

- [x] Automated tests added/updated
- [x] Manual testing steps:
  1. **Install OCI dependency**: `pip install ragas[oci]`
  2. **Configure OCI authentication**: Set up an OCI config file or environment variables
  3. **Run the example script**: `python examples/oci_genai_example.py`
  4. **Test with different models**: Try Cohere, Meta, and xAI models
  5. **Test async operations**: Verify async generation works correctly
  6. **Test error handling**: Verify proper error messages for auth/model issues

```bash
pytest tests/unit/test_oci_genai_wrapper.py -v
python -c "import ast; ast.parse(open('src/ragas/llms/oci_genai_wrapper.py').read())"
```

- **OCI Python SDK**: https://docs.oracle.com/en-us/iaas/tools/python/2.160.1/api/generative_ai.html
- **OCI Gen AI Documentation**: https://docs.oracle.com/en-us/iaas/Content/generative-ai/
- **Ragas LLM Patterns**: Follows existing `BaseRagasLLM` patterns
- **Related Issues**: Direct OCI Gen AI support request

```python
from ragas.llms import oci_genai_factory
from ragas import evaluate

llm = oci_genai_factory(
    model_id="cohere.command",
    compartment_id="ocid1.compartment.oc1..example"
)
result = evaluate(dataset, llm=llm)
```

```python
config = {
    "user": "ocid1.user.oc1..example",
    "key_file": "~/.oci/private_key.pem",
    "fingerprint": "your_fingerprint",
    "tenancy": "ocid1.tenancy.oc1..example",
    "region": "us-ashburn-1"
}
llm = oci_genai_factory(
    model_id="cohere.command",
    compartment_id="ocid1.compartment.oc1..example",
    config=config,
    endpoint_id="ocid1.endpoint.oc1..example"  # Optional
)
```

- `src/ragas/llms/oci_genai_wrapper.py` - Main implementation
- `src/ragas/llms/__init__.py` - Export new classes
- `pyproject.toml` - Add OCI optional dependency
- `tests/unit/test_oci_genai_wrapper.py` - Comprehensive tests
- `docs/howtos/integrations/oci_genai.md` - Complete documentation
- `docs/howtos/integrations/index.md` - Updated integration index
- `examples/oci_genai_example.py` - Working example script

None - This is a purely additive feature with no breaking changes.
- **New Optional Dependency**: `oci>=2.160.1`
- **No Breaking Changes**: Existing functionality unchanged
- **Backward Compatible**: All existing code continues to work
…rics (#2320)

## Summary
This PR adds persistence capabilities and better string representations for LLM-based metrics, making them easier to save, share, and debug.

## Changes

### 1. Save/Load Functionality
- Added `save()` and `load()` methods to `SimpleLLMMetric` and its subclasses (`DiscreteMetric`, `NumericMetric`, `RankingMetric`)
- Supports JSON format with optional gzip compression
- Handles all prompt types including `Prompt` and `DynamicFewShotPrompt`
- Smart defaults: `metric.save()` saves to `./metric_name.json`

### 2. Improved `__repr__` Methods
- Clean, informative string representations for both LLM-based and decorator-based metrics
- Removed implementation details (memory addresses, `<locals>`, internal attributes)
- Smart prompt truncation (80 chars max)
- Function signature display for decorator-based metrics

**Before:**
```python
create_metric_decorator.<locals>.decorator_factory.<locals>.decorator.<locals>.CustomMetric(name='summary_accuracy', _func=<function summary_accuracy at 0x151ffdf80>, ...)
```

**After:**
```python
# LLM-based metrics
DiscreteMetric(name='response_quality', allowed_values=['correct', 'incorrect'], prompt='Evaluate if the response...')

# Decorator-based metrics
summary_accuracy(user_input, response) -> DiscreteMetric[['pass', 'fail']]
```

### 3. Response Model Handling
- Added `create_auto_response_model()` factory to mark auto-generated models
- Only warns about custom response models during save, not standard ones

## Usage Examples
```python
# Save metric with default path
metric.save()  # → ./response_quality.json

# Save with custom path
metric.save("custom.json")
metric.save("/path/to/metrics/")  # → /path/to/metrics/response_quality.json
metric.save("compressed.json.gz")  # Compressed

# Load metric
loaded_metric = DiscreteMetric.load("response_quality.json")

# For DynamicFewShotPrompt metrics
loaded_metric = DiscreteMetric.load("metric.json", embedding_model=embeddings)
```

## Testing
- Comprehensive test suite with 8 tests covering all save/load scenarios
- Tests for default paths, directory handling, compression
- Tests for all prompt types and metric subclasses

## Dependencies
**Note:** This PR builds on #2316 (Fix metric inheritance patterns) and requires it to be merged first. The changes here depend on the cleaned-up metric inheritance structure from that PR.

## Checklist
- [x] Tests added
- [x] Documentation in docstrings
- [x] Backwards compatible (new functionality only)
- [x] Follows TDD practices
#### Python 3.13 on macOS ARM: NumPy fails to install (builds from source)
- Symptom: `make install` on Python 3.13 tries to build `numpy==2.0.x` from source on macOS ARM and fails with C/C++ errors.
- Status: Ragas CI currently targets Python 3.9–3.12; Python 3.13 is best-effort until upstream wheels are broadly available.

Workarounds:

1) Recommended: use Python 3.12
```bash
uv python install 3.12
uv venv -p 3.12 .venv-3.12
source .venv-3.12/bin/activate
uv sync --group dev
make check
```

2) Stay on Python 3.13 (best effort):
- Minimal install first to avoid heavy transitive pins:
```bash
uv venv -p 3.13 .venv-3.13
source .venv-3.13/bin/activate
uv pip install -e ".[dev-minimal]"
make check
```
- If you need extras, add gradually:
```bash
uv pip install "ragas[tracing,gdrive,ai-frameworks]"
```
- Prefer a prebuilt NumPy wheel (if available):
```bash
uv pip install "numpy>=2.1" --only-binary=:all:
```
If the resolver still pins to 2.0.x via transitive deps, temporarily set `numpy>=2.1` locally and re-run `uv sync --group dev`.

3) Last resort: build NumPy locally
```bash
xcode-select --install
export SDKROOT="$(xcrun --sdk macosx --show-sdk-path)"
export CC=clang
uv pip install numpy
```

Safe alternate-venv tip:
- Keep your project `.venv` untouched and use `.venv-3.12` / `.venv-3.13`. Avoid `make install` in alt envs; prefer `uv` commands directly. `make check` respects the active env via `uv run --active`.
## Issue Link / Problem Description
- Fixes typing

## Changes Made
- Add type hints, docstrings, fix typos, and remove duplicate create_nano_id
Summary: Remove Reo and RB2B analytics tracking from the documentation site.

This PR removes third-party tracking scripts from the documentation:
- Deleted docs/_static/js/reo.js - Reo analytics tracking script
- Deleted docs/_static/js/rb2b.js - RB2B analytics tracking script
- Removed references to both scripts from mkdocs.yml

Changes:
- mkdocs.yml: Removed two entries from the extra_javascript configuration
- docs/_static/js/reo.js: Deleted file
- docs/_static/js/rb2b.js: Deleted file

Motivation: Streamlining the analytics vendors used in the documentation site. Octolane and CommonRoom analytics remain active.
## Issue Link / Problem Description
- Makes `default_query_distribution` robust when a `KnowledgeGraph` is
missing data or certain synthesizers are incompatible.
- Skips incompatible MultiHop synthesizers instead of failing when they
cannot produce clusters.
## Changes Made
- In `src/ragas/testset/synthesizers/__init__.py`, updated
`default_query_distribution` to:
- Build a default set of synthesizers:
`SingleHopSpecificQuerySynthesizer`, `MultiHopAbstractQuerySynthesizer`,
`MultiHopSpecificQuerySynthesizer`.
- When a `KnowledgeGraph` is provided, probe each synthesizer via
`get_node_clusters(kg)` and include only those that can operate on the
given KG.
- Catch and log unexpected errors per synthesizer, skipping the failing
one instead of aborting the whole pipeline (a minimal sketch of this
logic follows below).
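A minimal sketch of the probing behaviour described above. Import paths and exact signatures are assumptions made for illustration; the actual change lives in `src/ragas/testset/synthesizers/__init__.py`.

```python
# Illustrative sketch: keep only synthesizers that can operate on the given
# KnowledgeGraph, logging and skipping any that fail while probing.
import logging

from ragas.testset.synthesizers.multi_hop import (
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)
from ragas.testset.synthesizers.single_hop.specific import (
    SingleHopSpecificQuerySynthesizer,
)

logger = logging.getLogger(__name__)

def default_query_distribution(llm, kg=None):
    synthesizers = [
        SingleHopSpecificQuerySynthesizer(llm=llm),
        MultiHopAbstractQuerySynthesizer(llm=llm),
        MultiHopSpecificQuerySynthesizer(llm=llm),
    ]
    if kg is not None:
        usable = []
        for synth in synthesizers:
            try:
                # Probe the synthesizer; skip it if it cannot form clusters on this KG.
                if not hasattr(synth, "get_node_clusters") or synth.get_node_clusters(kg):
                    usable.append(synth)
            except Exception as exc:
                logger.warning("Skipping %s due to unexpected error: %s", synth.name, exc)
        synthesizers = usable
    weight = 1.0 / len(synthesizers)
    return [(synth, weight) for synth in synthesizers]
```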
## Test Plan
### How to Test
Manual testing steps:
1. Run generate_with_langchain_docs with default settings
2. Confirm the run completes successfully:
```
Applying SummaryExtractor: 100%|█| 1/1 [00:02<00
Applying CustomNodeFilter: 100%|█| 1/1 [00:02<00
Applying EmbeddingExtractor: 100%|█| 1/1 [00:02<
Applying ThemesExtractor: 100%|█| 1/1 [00:02<00:
Applying NERExtractor: 100%|█| 1/1 [00:02<00:00,
Applying CosineSimilarityBuilder: 100%|█| 1/1 [0
Applying OverlapScoreBuilder: 100%|█| 1/1 [00:00
Skipping multi_hop_abstract_query_synthesizer due to unexpected error: No relationships match the provided condition. Cannot form clusters.
Generating personas: 100%|█| 1/1 [00:03<00:00,
Generating Scenarios: 100%|█| 1/1 [00:02<00:00,
Generating Samples: 100%|█| 1/1 [00:02<00:00, 2
```
Test code:
```python
import os

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings

from ragas.testset import TestsetGenerator

os.environ["OPENAI_API_KEY"] = "xxxxxxxxx"

loader = DirectoryLoader(
    "data/",
    glob=["**/*.md", "**/*.mdx"],
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
)
documents = loader.load()
for document in documents:
    document.metadata["filename"] = document.metadata["source"]

llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

generator = TestsetGenerator.from_langchain(
    llm=llm,
    embedding_model=embeddings,
)
testset = generator.generate_with_langchain_docs(
    documents,
    testset_size=1,
    with_debugging_logs=True,
)
```
## References
- Related issues: N/A (robustness improvement)
- Documentation: N/A
- External references: N/A
---------
Co-authored-by: ken-sevenseas <[email protected]>
PR Description:

## Issue Link / Problem Description
- Addresses the complexity barrier in Ragas examples that was making it difficult for beginners to understand and use evaluation workflows
- Removes overly complex async patterns, abstractions, and infrastructure code from examples that obscured the core evaluation concepts

## Changes Made
- **Text2SQL Examples Simplification**: Streamlined all text2sql evaluation components by removing unnecessary async patterns, timing infrastructure, and complex abstractions
- **Database and Data Utils Cleanup**: Simplified `db_utils.py` and `data_utils.py` to focus on core functionality while removing batch processing and concurrency complexity
- **Agent Evaluation Streamlining**: Simplified agent evaluation examples by removing indirection layers and factory patterns
- **Benchmark LLM Simplification**: Converted async patterns to a simpler synchronous approach and removed unnecessary abstractions
- **Improve RAG Examples**: Streamlined evaluation code by removing indirection layers and complex patterns
- **Documentation Updates**: Updated text2sql and benchmark_llm documentation to reflect simplified examples and remove obsolete parameters
- **Core Library Improvements**: Minor fixes to validation, evaluation, and utility modules for better code quality

## Testing
### How to Test
- [ ] Automated tests added/updated
- [ ] Manual testing steps:
  1. Run the simplified text2sql evaluation examples to ensure functionality is preserved
  2. Verify benchmark_llm examples work with the simplified codebase
  3. Test improve_rag examples to confirm streamlined evaluation flows
  4. Check that documentation accurately reflects the simplified examples
  5. Ensure core ragas evaluation functionality remains intact after utility changes

- Significantly reduced line count: 2,105 deletions vs 818 additions
- Issue: The `iterate_prompt.md` guide was incorrectly placed under Customizations → General, when it actually demonstrates a complete application workflow for evaluating and improving prompts.
- The navigation structure didn't logically group related prompt evaluation guides together.

- Moved `iterate_prompt.md` from `docs/howtos/customizations/` to `docs/howtos/applications/`
- Created a new "Prompt Evaluation" subsection under Applications in `mkdocs.yml`
- Grouped both prompt evaluation guides together:
  - Iterate and Improve Prompts
  - Systematic Prompt Optimization
- Updated `docs/howtos/applications/index.md` with the new Prompt Evaluation section
- Removed misplaced entries from the Customizations section in `mkdocs.yml`
- Fixed a broken link (missing .md extension) for prompt_optimization in applications/index.md

- [x] Automated tests added/updated: Pre-commit hooks passed (formatting checks)
- [x] Manual testing steps:
  1. Build the documentation locally: `make build-docs`
  2. Verify the navigation structure shows "Prompt Evaluation" under Applications
  3. Confirm both guides are accessible and properly linked
  4. Verify no broken links in the navigation
#2391)

## Issue Link / Problem Description
No related issue. There was a formatting problem in the documentation: the step title and its description were merged together due to a missing line break.

## Changes Made
Added a line break between the step title and description in the documentation to improve readability.

## Testing
Manual testing steps: Open the documentation file and check that Step 3's title and description are separated. Visually confirm the formatting is now correct.

## References
Screenshot

## Screenshots/Examples
<img width="789" height="388" alt="Screenshot_57" src="https://github.com/user-attachments/assets/da31faef-d111-453c-b599-d768112c7ed6" />
## Issue Link / Problem Description
- Fixes #2385 - Testset generator not preserving persona and scenario metadata
- Improves synthetic data generation traceability by adding metadata fields to track query generation parameters
- Currently there's no way to trace which persona, style, and length settings were used for synthetic queries

## Changes Made
- Added metadata fields to `dataset_schema.py`:
  - `persona_name: Optional[str]`
  - `query_style: Optional[str]`
  - `query_length: Optional[str]`
- Updated `single_hop/base.py` to populate these fields during synthetic data generation:
```python
return SingleTurnSample(
    user_input=response.query,
    reference=response.answer,
    reference_contexts=[reference_context],
    persona_name=getattr(scenario.persona, "name", None),
    query_style=getattr(scenario.style, "name", None),
    query_length=getattr(scenario.length, "name", None),
)
```
- Updated class documentation with descriptions for the new fields

## Testing
### How to Test
- [x] Manual testing steps:
  1. Run synthetic data generation using SingleHopQuerySynthesizer
  2. Verify metadata fields are properly populated in generated samples
  3. Confirm values match the scenario settings (persona, style, length)
  4. Check backwards compatibility with existing code

## References
- Fixes Issue: #2385
- Documentation: Updated in `dataset_schema.py` docstring
- Implementation: Updated in `single_hop/base.py` for field population

## Screenshots/Examples
```python
# Example of a generated sample with metadata:
{
    "user_input": "What are the key features of Python?",
    "reference": "Python is a versatile programming language...",
    "persona_name": "Student",
    "query_style": "POOR_GRAMMAR",
    "query_length": "MEDIUM"
}
```
)

## Changes Made

### Documentation (docs/getstarted/evals.md):
- Added the `quickstart` command so the guide can be followed with an example project.
- Updated "Custom Evaluation with LLMs" to reference DiscreteMetric from the generated code
- Replaced static examples with the modern `llm_factory` API
- Changed "Using Pre-Built Metrics" to AspectCritic with modern async/await syntax
- Updated "Evaluating on a Dataset" to use the ragas.Dataset API

### Build Configuration (mkdocs.yml):
- Made the social plugin conditional: `enabled: !ENV [MKDOCS_CI, true]`

### Makefile:
- Added explicit `MKDOCS_CI=false` to the serve-docs target. This avoids a social plugin error on macOS when `cairosvg` is not found.
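For reference, the async AspectCritic style the updated guide points to might look roughly like the sketch below. This is a hedged illustration (the model name, metric definition, and sample content are placeholders), not the exact snippet added to the docs.

```python
# Sketch of AspectCritic usage with async/await (illustrative only).
import asyncio

from ragas.dataset_schema import SingleTurnSample
from ragas.llms import llm_factory
from ragas.metrics import AspectCritic

async def main():
    llm = llm_factory("gpt-4o-mini")  # assumes OPENAI_API_KEY is set
    metric = AspectCritic(
        name="helpfulness",
        definition="Is the response helpful and relevant to the question?",
        llm=llm,
    )
    sample = SingleTurnSample(
        user_input="What is Ragas?",
        response="Ragas is a library for evaluating LLM applications.",
    )
    print(await metric.single_turn_ascore(sample))

asyncio.run(main())
```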
…ete metric examples (#2399)
## Summary
Fixes #2404. The formula for recall was incorrectly labeled as "Precision" in the SQL metrics documentation.

## Changes
- Fixed line 20 in
- Changed the label from "Precision" to "Recall" for the formula using reference rows as the denominator

## Explanation
The two formulas represent different metrics:

**Precision** (line 16):
- Numerator: Number of matching rows
- **Denominator: Total rows in response**
- Measures: "Of all rows we returned, how many were correct?"

**Recall** (line 20):
- Numerator: Number of matching rows
- **Denominator: Total rows in reference**
- Measures: "Of all correct rows, how many did we return?"

The formula on line 20 was mislabeled as Precision but is actually the Recall formula.

## Testing
- Verified formula definitions match standard precision/recall definitions
- Confirmed the fix aligns with the explanation in line 23 about the F1 score being the harmonic mean of precision and recall
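For clarity, the standard definitions being distinguished here are (restated below, not quoted verbatim from the docs):

```latex
\text{Precision} = \frac{|\text{matching rows}|}{|\text{rows in response}|}, \qquad
\text{Recall} = \frac{|\text{matching rows}|}{|\text{rows in reference}|}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```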
## Summary
This PR updates the documentation for metrics to showcase the new `ragas.metrics.collections` API as the primary recommended approach, while preserving legacy API documentation for backward compatibility.

## Changes
### Metrics
- [x] AnswerAccuracy
- [x] AnswerCorrectness
- [x] AnswerRelevancy
- [x] AnswerSimilarity
- [x] BleuScore
- [x] ContextEntityRecall
- [x] ContextPrecision
- [x] ContextUtilization
- [x] Faithfulness
- [x] ContextRelevance
- [x] NoiseSensitivity
- [x] RougeScore
- [x] SemanticSimilarity
- [x] String metrics (ExactMatch, StringPresence, NonLLMStringSimilarity, DistanceMeasure)
- [x] SummaryScore

## Documentation Pattern
Each metric's documentation follows this structure:
1. **Primary Example**: Collections-based API (modern, recommended)
2. **Concepts/How It's Calculated**: Conceptual explanation (implementation-agnostic)
3. **Legacy Section**: Original API for backward compatibility

## Test Plan
- [x] Build docs locally to verify formatting
- [x] Test code examples to ensure they work
- [x] Verify all metrics from collections are documented
…adigm (#2394)

## Problem Description
Documentation structure was outdated and didn't reflect the current library focus on experiments and custom metrics. The library has evolved to emphasize systematic experimentation and custom metrics for evaluating any AI application.

## Changes Made
- **Home page (`docs/index.md`)**: Added a "Why Ragas?" section explaining the value proposition and key features. Updated to reflect the experiments-first approach, custom metrics, and broader AI application evaluation. Removed the FAQ section and improved card descriptions to be more actionable.
- **Get Started (`docs/getstarted/index.md`)**: Reorganized to reflect the experiments quickstart as the main entry point. Removed links to outdated RAG-focused tutorials (evals.md, rag_eval.md, rag_testset_generation.md). Added a Discord community link and organized tutorials into clearer sections.
- **Core Concepts (`docs/concepts/index.md`)**: Reordered sections to prioritize Experimentation and Datasets at the top with separate cards. Updated the metrics description to reflect both the available metrics library and creating custom metrics. Removed the Feedback Intelligence card. Moved Components to the end.
- **Navigation (`mkdocs.yml`)**: Updated the navigation structure to match the current library organization. Removed outdated tutorials from Get Started navigation. Flattened the Experiments section (Experimentation and Datasets as direct children). Removed Feedback Intelligence from navigation. Reorganized to reflect the experiments-first paradigm.

## Testing
### How to Test
- [ ] Automated tests added/updated: N/A (documentation changes only)
- [ ] Manual testing steps:
  1. Build docs locally: make serve-docs
  2. Navigate through the home page and verify the "Why Ragas?" section appears
  3. Check the Get Started section - the experiments quickstart should be prominent
  4. Verify Core Concepts has Experimentation and Datasets at the top
  5. Confirm outdated RAG tutorials are no longer in navigation
  6. Test all internal links work correctly
  7. Verify the navigation structure matches the current library organization

## References
- Related issues:
- Documentation: Updated to reflect current library capabilities and focus
- External references: N/A

## Screenshots/Examples (if applicable)
Navigation structure changes and content reorganization are visible in a local docs build.
Co-authored-by: Ani <[email protected]>
## Issue Link / Problem Description
- Fixes #2411
--------- Co-authored-by: Kumar Anirudha <[email protected]>
…2420)

## Issue Link / Problem Description
- Follow-up to PR #2407
- Completes the migration to collections-based API documentation for metrics that were not covered in the initial PR

## Changes Made
- Updated **ContextRecall** documentation to showcase `ragas.metrics.collections.ContextRecall` as the primary example
- Updated **FactualCorrectness** documentation to showcase `ragas.metrics.collections.FactualCorrectness` with configuration options (mode, atomicity, coverage)
- Updated **ResponseGroundedness** documentation in nvidia_metrics.md to showcase `ragas.metrics.collections.ResponseGroundedness` as the primary example
- Moved all legacy API examples to "Legacy Metrics API" sections with deprecation warnings
- Added synchronous usage notes (`.score()` method) for all three metrics
- Preserved all conceptual explanations and "How It's Calculated" sections

## Testing
### How to Test
- [x] Automated tests added/updated: N/A (documentation only)
- [x] Manual testing steps:
  1. Verified `make build-docs` succeeds without errors ✓
  2. Tested all new code examples to ensure they work as documented
  3. Confirmed output values match expected results
  4. Verified consistency with PR #2407 documentation style

## References
- Related issues: Follow-up to PR #2407
- Documentation:
  - Updated: `docs/concepts/metrics/available_metrics/context_recall.md`
  - Updated: `docs/concepts/metrics/available_metrics/factual_correctness.md`
  - Updated: `docs/concepts/metrics/available_metrics/nvidia_metrics.md` (ResponseGroundedness section)
  - Pattern reference: PR #2407 (faithfulness.md, context_precision.md, answer_correctness.md)

## Screenshots/Examples (if applicable)
All three metrics now follow the consistent pattern:
1. **Primary Example**: Collections-based API (modern, recommended)
2. **Concepts**: Implementation-agnostic explanation
3. **Synchronous Usage Note**: `.score()` method alternative
4. **Legacy Section**: Original API with deprecation timeline warnings
…and `top_p` constraint handling (#2418)
…ort (#2424) A fix to support the latest instructor release, as they removed the `from_anthropic` and `from_gemini` methods in favor of the more standard `from_provider`. Ref: [PR 1898](567-labs/instructor#1898) Also adds support for #2422
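For context, the upstream change means provider clients are now created roughly as in the sketch below (a minimal illustration of instructor's provider-string API; the model string and response model are placeholders):

```python
# Sketch of instructor's provider-agnostic client creation, which replaces the
# removed instructor.from_anthropic / instructor.from_gemini helpers.
import instructor
from pydantic import BaseModel

class Verdict(BaseModel):
    faithful: bool

# "provider/model" string is a placeholder; any supported provider works.
client = instructor.from_provider("openai/gpt-4o-mini")

verdict = client.chat.completions.create(
    response_model=Verdict,
    messages=[{"role": "user", "content": "Is the statement supported by the context?"}],
)
print(verdict.faithful)
```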
Will work on a different approach. Closing this.
## Issue Link / Problem Description

## Changes Made

## Testing

### How to Test
- `uv run pytest tests/ -k "batch" -v` to test batch functionality.
- `uv run pytest tests/ -k "metric" -k "batch" -v`, like: `uv run pytest tests/ -k "faithfulness" -k "batch" -v`
- `OPENAI_API_KEY=your_key uv run pytest tests/integration -k "batch" -v` with an OpenAI key.