
Conversation

@anistark (Member) commented Sep 2, 2025

Issue Link / Problem Description

Changes Made

  • Implement additional batch API
  • Add faithfulness metric for starters
  • Add docs on batch API usage and cutting costs

Testing

How to Test

  • Automated tests added/updated
  • Manual testing steps:
    1. Run `uv run pytest tests/ -k "batch" -v` to test batch functionality.
    2. Run metric-specific batch tests, e.g. `uv run pytest tests/ -k "faithfulness" -k "batch" -v`.
    3. Run the integration tests with an OpenAI key: `OPENAI_API_KEY=your_key uv run pytest tests/integration -k "batch" -v`.

@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Sep 2, 2025
@anistark anistark changed the title Feat/OpenAI batch api Feat: OpenAI batch api Sep 2, 2025
@anistark anistark changed the title Feat: OpenAI batch api feat: OpenAI batch api Sep 2, 2025
@anistark anistark requested a review from jjmachan September 2, 2025 15:19
@anistark anistark force-pushed the feat/openai-batch-api branch from 1a7eb06 to bd05971 Compare September 24, 2025 11:32
@anistark (Member, Author)

/claude-review

@claude (bot) commented Sep 24, 2025

Claude finished @anistark's task.


🔍 PR Code Review: OpenAI Batch API Implementation

Review Progress:

  • Analyze PR scope and change type classification
  • Review core batch API implementation (src/ragas/batch_evaluation.py)
  • Review LLM base changes and batch API integration (src/ragas/llms/)
  • Review faithfulness metric batch implementation (src/ragas/metrics/_faithfulness.py)
  • Review metrics base changes (src/ragas/metrics/base.py)
  • Review test coverage and quality (tests/unit/llms/)
  • Review documentation (docs/howtos/customizations/metrics/)
  • Check configuration changes (pyproject.toml, mkdocs.yml)
  • Provide overall assessment and recommendations

📋 Change Type Classification

✨ New Feature: This PR adds comprehensive OpenAI Batch API support to Ragas, enabling cost-effective evaluation with up to 50% savings for large-scale evaluations.
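
For context, the underlying OpenAI Batch API workflow that such a wrapper builds on looks roughly like the sketch below. This uses the OpenAI Python SDK directly, not the new Ragas wrapper; file names, prompts, and the model are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. Write one JSON request per line; custom_id ties responses back to inputs.
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": prompt}]},
    }
    for i, prompt in enumerate(["prompt one", "prompt two"])
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

# 2. Upload the file and create the batch job (completes within 24 hours, at reduced cost).
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll until done, then download and parse the results file.
status = client.batches.retrieve(batch.id).status  # e.g. "validating", "in_progress", "completed"
```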


🚀 Strengths & Excellent Implementation

Outstanding Architecture Design

  • Clean separation of concerns with dedicated modules (batch_api.py, batch_evaluation.py)
  • Comprehensive type hints and proper error handling throughout
  • Async/sync dual support with proper runtime checks using _check_async_client()
  • Factory pattern usage for clean instantiation via create_batch_api() and create_batch_evaluator()

Robust Error Handling & Validation

  • Thorough request validation including duplicate custom_id detection (src/ragas/llms/batch_api.py:256-259)
  • File size limits and batch size enforcement with clear error messages (src/ragas/llms/batch_api.py:264-267)
  • Timeout handling with configurable polling intervals and meaningful timeout errors
  • Comprehensive status tracking with well-defined BatchStatus enum

Intelligent Response Parsing

The score extraction logic in BatchEvaluationResult is particularly well-designed:

  • Multiple parsing strategies for JSON, structured output, and raw text formats
  • Graceful degradation when parsing fails with detailed logging
  • Metric-specific patterns for common score fields (src/ragas/batch_evaluation.py:332-341)
  • Range validation ensuring scores are in reasonable bounds
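
As a rough illustration of the layered strategy listed above (not the PR's actual code; the field names and regex here are assumptions):

```python
import json
import re
from typing import Optional

SCORE_FIELDS = ("score", "faithfulness_score", "value")  # assumed metric-specific fields


def extract_score(raw: str) -> Optional[float]:
    """Try structured JSON first, then a regex over raw text; validate range or give up."""
    try:
        payload = json.loads(raw)
        if isinstance(payload, dict):
            for field in SCORE_FIELDS:
                if field in payload:
                    return _validate(float(payload[field]))
    except (json.JSONDecodeError, TypeError, ValueError):
        pass
    match = re.search(r"(?:score|rating)\s*[:=]\s*([01](?:\.\d+)?)", raw, re.IGNORECASE)
    return _validate(float(match.group(1))) if match else None  # graceful degradation


def _validate(score: float) -> Optional[float]:
    return score if 0.0 <= score <= 1.0 else None
```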

Smart Faithfulness Integration

  • Two-phase batch processing with statement generation followed by NLI verification
  • Complete orchestration via CompleteFaithfulnessBatchJob class
  • Proper dependency handling between batch job phases
  • Error recovery with appropriate fallbacks to NaN values

Comprehensive Documentation

  • Clear usage examples with practical cost estimation utilities
  • Step-by-step guides covering sync/async usage patterns
  • Cost optimization guidance with concrete savings calculations
  • Well-integrated into existing documentation structure (mkdocs.yml:106)

Strong Test Coverage

  • Unit tests covering all major components with mock-based testing
  • Validation scenarios testing error conditions and edge cases
  • Batch support detection for different LLM types

🔄 Areas for Enhancement

⚠️ Minor Implementation Improvements

  1. Prompt Conversion Simplification (src/ragas/metrics/_faithfulness.py:234-242)

    • Current implementation uses simplified ChatPromptTemplate fallback
    • Consider integrating with actual PydanticPrompt conversion methods for consistency
  2. Context Utilization

    • Current faithfulness batch implementation doesn't fully leverage retrieved_contexts
    • Could enhance NLI verification accuracy by using complete context information
  3. Integration Test Gap

    • Consider adding optional end-to-end tests with actual OpenAI API
    • Large dataset batch splitting scenarios could benefit from integration testing

💡 Suggestions for Future Iterations

  1. Batch Size Optimization: Consider implementing adaptive batch sizing based on estimated token counts
  2. Progress Tracking: Could add progress callbacks for long-running batch jobs
  3. Retry Mechanisms: Enhanced retry logic for failed individual requests within batches
  4. Cost Tracking: Runtime cost calculation based on actual token usage from responses

🎯 Overall Assessment

Verdict: EXCELLENT Implementation ⭐⭐⭐⭐⭐

This PR demonstrates exceptional software engineering practices:

  • Production-ready code quality with comprehensive error handling and validation
  • Thoughtful API design balancing ease of use with flexibility
  • Excellent documentation making the feature immediately usable
  • Strong test coverage providing confidence in reliability
  • Smart integration with existing Ragas architecture without breaking changes

Impact & Value

  • Significant cost savings (up to 50%) for large-scale evaluations
  • Scalability improvement supporting up to 50K requests per batch
  • Non-blocking evaluation capabilities for better resource utilization
  • Maintains Ragas quality standards while adding powerful new functionality

Recommendation: APPROVE

This PR successfully addresses issue #2136 with a high-quality implementation that:

  • Follows established patterns and conventions
  • Provides comprehensive batch API support
  • Includes excellent documentation and examples
  • Maintains backward compatibility
  • Offers substantial value to users

The minor suggestions above are enhancements for future iterations rather than blocking issues. The implementation is solid and ready for production use.


anistark and others added 20 commits November 17, 2025 11:15
## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- contd #2246

---------

Co-authored-by: Nabegon <[email protected]>
Ragas is a great tool, but the contents of the website can still be
improved.

I corrected typos and unified some names (like langchain, Langchain ->
LangChain) in the Markdown documentation files. It seems that most of the
docs were written by a single author, and many common typos were
repeatedly introduced. There are also other problems with the docs, such
as readability, categorization, and clarity. I had planned to polish them
one by one, but I may not have enough time to polish all of them, because
I realized just how many documentation files there are after correcting
typos in them. XD

This pull request only corrects typos, and I recommend that the
contributor in charge of the docs review these modifications to avoid
reintroducing the typos in the future. You may also consider using the
LTeX extension from the VS Code marketplace to catch them (though it is
not perfect).
The matching ipynb files are not corrected, since I have not found an
easy way to identify those typos in ipynb files. You may consider
correcting the ipynb files as well so that the Markdown files can be
regenerated from them.

---------

Co-authored-by: Siddharth Sahu <[email protected]>
Deprecation warnings for LLMs and Prompts
## Changes Made
<!-- Describe what you changed and why -->
- **Added comprehensive RAG evaluation guide**: Created
`docs/howtos/applications/evaluate-and-improve-rag.md` with step-by-step
instructions for evaluating and improving RAG apps with Ragas
- **Created `ragas_examples.improve_rag` module** with complete working
examples:
- `simple_rag.py`: Basic RAG implementation using BM25 retriever and
OpenAI LLM with MLflow tracing
- `agentic_rag.py`: Advanced agentic RAG using OpenAI Agents SDK that
can iteratively search and refine queries
  - `evals.py`: Complete evaluation pipeline with Ragas experiments
- `data_utils.py`: Shared utilities for BM25 retriever setup and
document processing
- **Added new dependency group**: `improverag` extra in
`examples/pyproject.toml` including MLflow, BM25, LangChain, and OpenAI
Agents SDK
- **Updated navigation**: Added new guide to mkdocs.yml and howtos
applications index
- **Added MLflow tracing screenshot**:
`docs/_static/imgs/howto_improve_rag_mlflow.png` showing evaluation
traces

## Testing
<!-- Describe how this should be tested -->
### How to Test
- [x] Manual testing steps:
  1. Install dependencies: `uv pip install "ragas-examples[improverag]"`
  2. Set OpenAI API key: `export OPENAI_API_KEY="your_key"`
3. Run simple RAG: `uv run python -m
ragas_examples.improve_rag.simple_rag`
4. Run agentic RAG: `uv run python -m
ragas_examples.improve_rag.agentic_rag`
5. Run evaluation (test mode): `uv run python -m
ragas_examples.improve_rag.evals --test`
6. Run evaluation with agentic RAG: `uv run python -m
ragas_examples.improve_rag.evals --agentic-rag --test`
7. Start MLflow UI: `uv run mlflow ui --backend-store-uri
sqlite:///mlflow.db --port 5000`
  8. Verify traces appear in MLflow dashboard

---------

Co-authored-by: Kumar Anirudha <[email protected]>
## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- contd #2241

---------

Co-authored-by: sahusiddharth <[email protected]>
## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- fixes warnings in tests.
Description:
This PR addresses an issue in the generate_multiple method where
enabling caching caused it to return duplicate responses instead of
generating multiple distinct outputs.

Problem:
When generate_multiple is used with caching turned on, all calls used
the same cache key internally. As a result:

The first generated output is cached.

Subsequent calls to generate_multiple fetch the same cached output
instead of generating new ones.

This breaks the intended functionality of producing multiple diverse
outputs.

Root Cause:
The cache key was not uniquely generated per input and per requested
number of outputs. Instead, it reused the same key across all calls.

Solution:

Modified the cache key generation to include all parameters that affect
output (input text, number of outputs, and any relevant generation
options).

Ensured that each call to generate_multiple with the same input but
requesting multiple outputs generates distinct results and caches them
appropriately.

Added safeguards to prevent cache collisions (a minimal sketch of the idea follows below).
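
Illustrative only; the real implementation lives in Ragas' caching layer and the helper name here is hypothetical:

```python
import hashlib
import json
from typing import List, Optional


def make_cache_key(prompt: str, n: int, temperature: float, stop: Optional[List[str]] = None) -> str:
    """Hash every parameter that affects the output so distinct calls get distinct entries."""
    payload = {"prompt": prompt, "n": n, "temperature": temperature, "stop": stop}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


# Same prompt, different n -> different cache keys, so generate_multiple no longer
# returns the single cached completion for every requested output.
assert make_cache_key("hi", 1, 0.7) != make_cache_key("hi", 3, 0.7)
```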

Testing Performed:

Unit Tests:

Verified that multiple outputs are returned for the same input when
caching is enabled.

Confirmed that caching works correctly without returning duplicates.

Manual Testing:

Tested with different inputs and multiple output requests.

Ensured outputs are unique across multiple calls and cached
appropriately.

Regression Check:

Confirmed that previous generate calls without multiple outputs continue
to work as expected.

Impact:

Restores correct functionality of generate_multiple when caching is
enabled.

Ensures caching still improves performance without affecting output
correctness.

Additional Notes:

This fix is backward-compatible and does not affect other parts of the
codebase.

All relevant formatting and linting checks have passed.
…om class-instantiated metrics (#2316)

**Primary Motivation**: This PR fixes a fundamental inheritance pattern
issue in the metrics system where factory-created metrics (via
`@discrete_metric`, `@numeric_metric`, etc.) and class-instantiated
metrics (via `DiscreteMetric()`, `NumericMetric()`, etc.) should have
different base classes but were incorrectly sharing the same inheritance
hierarchy.

**The Problem**:
- Factory-created metrics should inherit from `SimpleBaseMetric`
(lightweight, decorator-based)
- Class-instantiated metrics should inherit from `SimpleLLMMetric`
(LLM-enabled, full-featured)
- Previously, both paths incorrectly inherited from the same base
classes, creating confusion and incorrect behavior

**The Solution**:
• **Separated base classes**: Created `SimpleBaseMetric` (for factory)
and `SimpleLLMMetric` (for class instantiation) as distinct, unrelated
base classes
• **Removed `llm_based.py`**: Consolidated `BaseLLMMetric` and
`LLMMetric` into `base.py` as `SimpleBaseMetric` and `SimpleLLMMetric`
• **Fixed decorator inheritance**: Factory methods now create metrics
that inherit from `SimpleBaseMetric + ValidatorMixin` only
• **Fixed class inheritance**: Class-based metrics like `DiscreteMetric`
now inherit from `SimpleLLMMetric + ValidatorMixin` (see the usage sketch below)
• **Added validator system**: Introduced modular validation mixins that
work with both inheritance patterns
• **Maintained backward compatibility**: Added aliases `BaseMetric =
SimpleBaseMetric` and `LLMMetric = SimpleLLMMetric`
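
A usage-level sketch of the two patterns described above; the import path and the decorator signature are assumptions based on this description, not verified against the codebase:

```python
from ragas.metrics import DiscreteMetric, discrete_metric  # assumed import path

# Class-instantiated metric -> inherits from SimpleLLMMetric (LLM-enabled).
response_quality = DiscreteMetric(
    name="response_quality",
    allowed_values=["correct", "incorrect"],
    prompt="Evaluate if the response answers the question: {user_input} / {response}",
)


# Factory-created metric -> inherits from SimpleBaseMetric + ValidatorMixin only.
@discrete_metric(name="summary_accuracy", allowed_values=["pass", "fail"])
def summary_accuracy(user_input: str, response: str) -> str:
    # Purely illustrative scoring function.
    return "pass" if response else "fail"
```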

**Exact Steps Taken**:
1. `7d6de2a` - Updated gitignore for experimental directories
2. `c6101f8` - Renamed classes and established proper naming convention
3. `46450d8` - Refactored decorator and class-based inheritance patterns
4. `a464c37` - Simplified validator system with proper mixins
5. `fe996f6` - Removed `llm_based.py` after consolidation

- [ ] Verify factory-created metrics (`@discrete_metric`) inherit from
`SimpleBaseMetric` only
- [ ] Verify class-instantiated metrics (`DiscreteMetric()`) inherit
from `SimpleLLMMetric`
- [ ] Test that both patterns work correctly with their respective
validation mixins
- [ ] Ensure backward compatibility with existing metric imports
- [ ] Validate all metric functionality (scoring, async operations,
alignment)
- [ ] Run full test suite to ensure no regressions

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

---------

Co-authored-by: Ani <[email protected]>
## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- Fixes #2324 

## Changes Made
<!-- Describe what you changed and why -->
- Fixed an IndexError in `generate_multiple` when the LLM returns fewer
generations than requested (defensive handling sketched below)
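
A minimal sketch of the defensive pattern, not the actual fix in the codebase:

```python
from typing import List


def pad_generations(generations: List[str], n: int, pad_value: str = "") -> List[str]:
    """Return exactly n completions: truncate extras, pad shortfalls instead of indexing past the end."""
    if len(generations) >= n:
        return generations[:n]
    return generations + [pad_value] * (n - len(generations))


# e.g. the LLM returned 2 completions but 4 were requested:
assert len(pad_generations(["a", "b"], 4)) == 4
```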
## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- Fixes #2326 

## Changes Made
<!-- Describe what you changed and why -->
- fixed answer_relevancy scoring logic to prevent false zero scores for
valid answers
This PR adds direct support for Oracle Cloud Infrastructure (OCI)
Generative AI models in Ragas, enabling evaluation without requiring
LangChain or LlamaIndex dependencies. Currently, users who want to use
OCI Gen AI models must go through LangChain or LlamaIndex wrappers,
which adds unnecessary complexity and dependencies.

**Problem**: No direct OCI Gen AI integration exists in Ragas, forcing
users to use indirect approaches through LangChain/LlamaIndex.

**Solution**: Implement a native OCI Gen AI wrapper that uses the OCI
Python SDK directly.

- **Add `OCIGenAIWrapper`** - New LLM wrapper class extending
`BaseRagasLLM`
- **Direct OCI SDK Integration** - Uses
`oci.generative_ai.GenerativeAiClient` directly
- **Factory Function** - `oci_genai_factory()` for easy initialization
- **Async Support** - Full async/await implementation with proper error
handling

[Pretrained Foundational Models supported by OCI Generative
AI](https://docs.oracle.com/en-us/iaas/Content/generative-ai/pretrained-models.htm)

- **Optional Dependency** - Added `oci>=2.160.1` as optional dependency
- **Import Safety** - Graceful handling when OCI SDK is not installed
- **Configuration Options** - Support for OCI CLI config, environment
variables, or manual config

- **Comprehensive Test Suite** - 15+ test cases with mocking
- **Error Handling Tests** - Tests for authentication, model not found,
permission errors
- **Async Testing** - Full async operation testing
- **Factory Testing** - Factory function validation

- **Complete Integration Guide** - Step-by-step setup and usage
- **Working Example Script** - `examples/oci_genai_example.py`
- **Authentication Guide** - Multiple OCI auth methods
- **Troubleshooting Section** - Common issues and solutions
- **Updated Integration Index** - Added to main integrations page

- **Usage Tracking** - Built-in analytics with `LLMUsageEvent`
- **Error Logging** - Comprehensive error logging and debugging
- **Performance Monitoring** - Request tracking and metrics

- [x] Automated tests added/updated
- [x] Manual testing steps:
  1. **Install OCI dependency**: `pip install ragas[oci]`
2. **Configure OCI authentication**: Set up OCI config file or
environment variables
  3. **Run example script**: `python examples/oci_genai_example.py`
  4. **Test with different models**: Try Cohere, Meta, and xAI models
  5. **Test async operations**: Verify async generation works correctly
6. **Test error handling**: Verify proper error messages for auth/model
issues

```bash
pytest tests/unit/test_oci_genai_wrapper.py -v

python -c "import ast; ast.parse(open('src/ragas/llms/oci_genai_wrapper.py').read())"
```

- **OCI Python SDK**:
https://docs.oracle.com/en-us/iaas/tools/python/2.160.1/api/generative_ai.html
- **OCI Gen AI Documentation**:
https://docs.oracle.com/en-us/iaas/Content/generative-ai/
- **Ragas LLM Patterns**: Follows existing `BaseRagasLLM` patterns
- **Related Issues**: Direct OCI Gen AI support request

```python
from ragas.llms import oci_genai_factory
from ragas import evaluate

llm = oci_genai_factory(
    model_id="cohere.command",
    compartment_id="ocid1.compartment.oc1..example"
)

result = evaluate(dataset, llm=llm)
```

```python
config = {
    "user": "ocid1.user.oc1..example",
    "key_file": "~/.oci/private_key.pem",
    "fingerprint": "your_fingerprint",
    "tenancy": "ocid1.tenancy.oc1..example",
    "region": "us-ashburn-1"
}

llm = oci_genai_factory(
    model_id="cohere.command",
    compartment_id="ocid1.compartment.oc1..example",
    config=config,
    endpoint_id="ocid1.endpoint.oc1..example"  # Optional
)
```

- `src/ragas/llms/oci_genai_wrapper.py` - Main implementation
- `src/ragas/llms/__init__.py` - Export new classes
- `pyproject.toml` - Add OCI optional dependency
- `tests/unit/test_oci_genai_wrapper.py` - Comprehensive tests
- `docs/howtos/integrations/oci_genai.md` - Complete documentation
- `docs/howtos/integrations/index.md` - Updated integration index
- `examples/oci_genai_example.py` - Working example script

None - This is a purely additive feature with no breaking changes.

- **New Optional Dependency**: `oci>=2.160.1`
- **No Breaking Changes**: Existing functionality unchanged
- **Backward Compatible**: All existing code continues to work
…rics (#2320)

## Summary

This PR adds persistence capabilities and better string representations
for LLM-based metrics, making them easier to save, share, and debug.

## Changes

### 1. Save/Load Functionality
- Added `save()` and `load()` methods to `SimpleLLMMetric` and its
subclasses (`DiscreteMetric`, `NumericMetric`, `RankingMetric`)
- Supports JSON format with optional gzip compression
- Handles all prompt types including `Prompt` and `DynamicFewShotPrompt`
- Smart defaults: `metric.save()` saves to `./metric_name.json`

### 2. Improved `__repr__` Methods
- Clean, informative string representations for both LLM-based and
decorator-based metrics
- Removed implementation details (memory addresses, `<locals>`, internal
attributes)
- Smart prompt truncation (80 chars max)
- Function signature display for decorator-based metrics

**Before:**
```python
create_metric_decorator.<locals>.decorator_factory.<locals>.decorator.<locals>.CustomMetric(name='summary_accuracy', _func=<function summary_accuracy at 0x151ffdf80>, ...)
```

**After:**
```python
# LLM-based metrics
DiscreteMetric(name='response_quality', allowed_values=['correct', 'incorrect'], prompt='Evaluate if the response...')

# Decorator-based metrics  
summary_accuracy(user_input, response) -> DiscreteMetric[['pass', 'fail']]
```

### 3. Response Model Handling
- Added `create_auto_response_model()` factory to mark auto-generated
models
- Only warns about custom response models during save, not standard ones

## Usage Examples

```python
# Save metric with default path
metric.save()  # → ./response_quality.json

# Save with custom path
metric.save("custom.json")
metric.save("/path/to/metrics/")  # → /path/to/metrics/response_quality.json
metric.save("compressed.json.gz")  # Compressed

# Load metric
loaded_metric = DiscreteMetric.load("response_quality.json")

# For DynamicFewShotPrompt metrics
loaded_metric = DiscreteMetric.load("metric.json", embedding_model=embeddings)
```

## Testing
- Comprehensive test suite with 8 tests covering all save/load scenarios
- Tests for default paths, directory handling, compression
- Tests for all prompt types and metric subclasses

## Dependencies
**Note:** This PR builds on #2316 (Fix metric inheritance patterns) and
requires it to be merged first. The changes here depend on the
cleaned-up metric inheritance structure from that PR.

## Checklist
- [x] Tests added
- [x] Documentation in docstrings
- [x] Backwards compatible (new functionality only)
- [x] Follows TDD practices
#### Python 3.13 on macOS ARM: NumPy fails to install (builds from
source)

- Symptom: `make install` on Python 3.13 tries to build `numpy==2.0.x`
from source on macOS ARM and fails with C/C++ errors.
- Status: Ragas CI currently targets Python 3.9–3.12; Python 3.13 is
best-effort until upstream wheels are broadly available.

Workarounds:

1) Recommended: use Python 3.12
```bash
uv python install 3.12
uv venv -p 3.12 .venv-3.12
source .venv-3.12/bin/activate
uv sync --group dev
make check
```

2) Stay on Python 3.13 (best effort):
- Minimal install first to avoid heavy transitive pins:
```bash
uv venv -p 3.13 .venv-3.13
source .venv-3.13/bin/activate
uv pip install -e ".[dev-minimal]"
make check
```
- If you need extras, add gradually:
```bash
uv pip install "ragas[tracing,gdrive,ai-frameworks]"
```
- Prefer a prebuilt NumPy wheel (if available):
```bash
uv pip install "numpy>=2.1" --only-binary=:all:
```
If the resolver still pins to 2.0.x via transitive deps, temporarily set
`numpy>=2.1` locally and re-run `uv sync --group dev`.

3) Last resort: build NumPy locally
```bash
xcode-select --install
export SDKROOT="$(xcrun --sdk macosx --show-sdk-path)"
export CC=clang
uv pip install numpy
```

Safe alternate-venv tip:
- Keep your project `.venv` untouched and use `.venv-3.12` /
`.venv-3.13`. Avoid `make install` in alt envs; prefer `uv` commands
directly. `make check` respects the active env via `uv run --active`.
## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- Fixes typing
## Changes Made
<!-- Describe what you changed and why -->
- add type hints, docstrings, fix typos, and remove duplicate
create_nano_id
Summary

  Remove Reo and RB2B analytics tracking from documentation site.

  This PR removes third-party tracking scripts from the documentation:
  - Deleted docs/_static/js/reo.js - Reo analytics tracking script
  - Deleted docs/_static/js/rb2b.js - RB2B analytics tracking script
  - Removed references to both scripts from mkdocs.yml

  Changes

  - mkdocs.yml: Removed two entries from extra_javascript configuration
  - docs/_static/js/reo.js: Deleted file
  - docs/_static/js/rb2b.js: Deleted file

  Motivation

Streamlining analytics vendors used in the documentation site. Octolane
and CommonRoom analytics remain active.
## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- Makes `default_query_distribution` robust when a `KnowledgeGraph` is
missing data or certain synthesizers are incompatible.
- Skips incompatible MultiHop synthesizers instead of failing when they
cannot produce clusters.

## Changes Made
<!-- Describe what you changed and why -->
- In `src/ragas/testset/synthesizers/__init__.py`, updated
`default_query_distribution` to:
- Build a default set of synthesizers:
`SingleHopSpecificQuerySynthesizer`, `MultiHopAbstractQuerySynthesizer`,
`MultiHopSpecificQuerySynthesizer`.
- When a `KnowledgeGraph` is provided, probe each synthesizer via
`get_node_clusters(kg)` and include only those that can operate on the
given KG.
- Catch and log unexpected errors per synthesizer, skipping the failing
one instead of aborting the whole pipeline (see the sketch below).
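
A simplified sketch of the probing logic; names mirror the PR description, but the exact code in `default_query_distribution` differs:

```python
import logging

# Import paths follow the PR description; adjust to the actual module layout.
from ragas.testset.synthesizers import (
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
    SingleHopSpecificQuerySynthesizer,
)

logger = logging.getLogger(__name__)


def robust_query_distribution(llm, kg=None):
    candidates = [
        SingleHopSpecificQuerySynthesizer(llm=llm),
        MultiHopAbstractQuerySynthesizer(llm=llm),
        MultiHopSpecificQuerySynthesizer(llm=llm),
    ]
    if kg is None:
        return [(s, 1 / len(candidates)) for s in candidates]

    usable = []
    for synthesizer in candidates:
        try:
            # Probe the KG; multi-hop synthesizers need clusters to operate.
            if not hasattr(synthesizer, "get_node_clusters") or synthesizer.get_node_clusters(kg):
                usable.append(synthesizer)
        except Exception as exc:  # skip the failing synthesizer, keep the pipeline alive
            logger.warning("Skipping %s due to unexpected error: %s", synthesizer.name, exc)
    usable = usable or candidates[:1]  # always keep at least the single-hop synthesizer
    return [(s, 1 / len(usable)) for s in usable]
```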

## Test Plan
### How to Test
Manual testing steps:
  1. Run `generate_with_langchain_docs` with default settings.
  2. The result shows success:

```
Applying SummaryExtractor: 100%|█| 1/1 [00:02<00
Applying CustomNodeFilter: 100%|█| 1/1 [00:02<00
Applying EmbeddingExtractor: 100%|█| 1/1 [00:02<
Applying ThemesExtractor: 100%|█| 1/1 [00:02<00:
Applying NERExtractor: 100%|█| 1/1 [00:02<00:00,
Applying CosineSimilarityBuilder: 100%|█| 1/1 [0
Applying OverlapScoreBuilder: 100%|█| 1/1 [00:00
Skipping multi_hop_abstract_query_synthesizer due to unexpected error: No relationships match the provided condition. Cannot form clusters.
Generating personas: 100%|█| 1/1 [00:03<00:00,  
Generating Scenarios: 100%|█| 1/1 [00:02<00:00, 
Generating Samples: 100%|█| 1/1 [00:02<00:00,  2
```

Test code:
```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from ragas.testset import TestsetGenerator
from langchain_openai import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings

import os
os.environ["OPENAI_API_KEY"] = "xxxxxxxxx"

loader = DirectoryLoader(
    "data/",
    glob=["**/*.md", "**/*.mdx"],
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
)

documents = loader.load()

for document in documents:
    document.metadata["filename"] = document.metadata["source"]

llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

generator = TestsetGenerator.from_langchain(
    llm=llm,
    embedding_model=embeddings,
)

testset = generator.generate_with_langchain_docs(
    documents,
    testset_size=1,
    with_debugging_logs=True,
)
```

## References
<!-- Link to related issues, discussions, forums, or external resources
-->
- Related issues: N/A (robustness improvement)
- Documentation: N/A
- External references: N/A

---
<!-- 
Thank you for contributing to Ragas! 
Please fill out the sections above as completely as possible.
The more information you provide, the faster your PR can be reviewed and
merged.
-->

---------

Co-authored-by: ken-sevenseas <[email protected]>
PR Description:

## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- Addresses the complexity barrier in Ragas examples that was making it
difficult for beginners to understand and use evaluation workflows
- Removes overly complex async patterns, abstractions, and
infrastructure code from examples that obscured the core evaluation
concepts

## Changes Made
<!-- Describe what you changed and why -->
- **Text2SQL Examples Simplification**: Streamlined all text2sql
evaluation components by removing unnecessary async patterns, timing
infrastructure, and complex abstractions
- **Database and Data Utils Cleanup**: Simplified `db_utils.py` and
`data_utils.py` to focus on core functionality while removing batch
processing and concurrency complexity
- **Agent Evaluation Streamlining**: Simplified agent evaluation
examples by removing indirection layers and factory patterns
- **Benchmark LLM Simplification**: Converted async patterns to simpler
synchronous approach and removed unnecessary abstractions
- **Improve RAG Examples**: Streamlined evaluation code by removing
indirection layers and complex patterns
- **Documentation Updates**: Updated text2sql and benchmark_llm
documentation to reflect simplified examples and remove obsolete
parameters
- **Core Library Improvements**: Minor fixes to validation, evaluation,
and utility modules for better code quality

## Testing
<!-- Describe how this should be tested -->
### How to Test
- [ ] Automated tests added/updated
- [ ] Manual testing steps:
1. Run the simplified text2sql evaluation examples to ensure
functionality is preserved
  2. Verify benchmark_llm examples work with simplified codebase
  3. Test improve_rag examples to confirm streamlined evaluation flows
4. Check that documentation accurately reflects the simplified examples
5. Ensure core ragas evaluation functionality remains intact after
utility changes

- Significantly reduced line count: 2,105 deletions vs 818 additions
- Issue: The `iterate_prompt.md` guide was incorrectly placed under
Customizations → General, when it actually demonstrates a complete
application workflow for evaluating and improving prompts.
- The navigation structure didn't logically group related prompt
evaluation guides together.

- Moved `iterate_prompt.md` from `docs/howtos/customizations/` to
`docs/howtos/applications/`
- Created new "Prompt Evaluation" subsection under Applications in
`mkdocs.yml`
- Grouped both prompt evaluation guides together:
  - Iterate and Improve Prompts
  - Systematic Prompt Optimization
- Updated `docs/howtos/applications/index.md` with the new Prompt
Evaluation section
- Removed misplaced entries from the Customizations section in
`mkdocs.yml`
- Fixed broken link (missing .md extension) for prompt_optimization in
applications/index.md

- [x] Automated tests added/updated: Pre-commit hooks passed (formatting
checks)
- [x] Manual testing steps:
  1. Build the documentation locally: `make build-docs`
2. Verify navigation structure shows "Prompt Evaluation" under
Applications
  3. Confirm both guides are accessible and properly linked
  4. Verify no broken links in the navigation
nkch1k and others added 27 commits November 17, 2025 11:18
#2391)

## Issue Link / Problem Description
No related issue.

There was a formatting problem in the documentation: the step title and
its description were merged together due to a missing line break.

## Changes Made
Added a line break between the step title and description in the
documentation to improve readability.

## Testing

 Manual testing steps:
Open the documentation file and check that Step 3's title and
description are separated.
Visually confirm the formatting is now correct.

## References
Screenshot

## Screenshots/Examples
<img width="789" height="388" alt="Screenshot_57"
src="https://github.com/user-attachments/assets/da31faef-d111-453c-b599-d768112c7ed6"
/>
## Issue Link / Problem Description
- Fixes #2385 - Testset generator not preserving persona and scenario
metadata
- Improves synthetic data generation traceability by adding metadata
fields to track query generation parameters
- Currently there's no way to trace which persona, style, and length
settings were used for synthetic queries

## Changes Made
- Added metadata fields to `dataset_schema.py`:
  - `persona_name: Optional[str]`
  - `query_style: Optional[str]` 
  - `query_length: Optional[str]`
- Updated `single_hop/base.py` to populate these fields during synthetic
data generation:
  ```python
  return SingleTurnSample(
      user_input=response.query,
      reference=response.answer,
      reference_contexts=[reference_context],
      persona_name=getattr(scenario.persona, "name", None),
      query_style=getattr(scenario.style, "name", None),
      query_length=getattr(scenario.length, "name", None),
  )
  ```
- Updated class documentation with descriptions for new fields

## Testing
### How to Test
- [x] Manual testing steps:
  1. Run synthetic data generation using SingleHopQuerySynthesizer
  2. Verify metadata fields are properly populated in generated samples
  3. Confirm values match the scenario settings (persona, style, length)
  4. Check backwards compatibility with existing code

## References
- Fixes Issue: #2385
- Documentation: Updated in `dataset_schema.py` docstring
- Implementation: Updated in `single_hop/base.py` for field population

## Screenshots/Examples
```python
# Example of generated sample with metadata:
{
    "user_input": "What are the key features of Python?",
    "reference": "Python is a versatile programming language...",
    "persona_name": "Student",
    "query_style": "POOR_GRAMMAR",
    "query_length": "MEDIUM"
}
```

## Changes Made

### Documentation (docs/getstarted/evals.md):
- Added a `quickstart` command so the guide can be followed with an
example project.
- Updated "Custom Evaluation with LLMs" to reference DiscreteMetric from
generated code
- Replaced static examples with modern `llm_factory` API
- Changed "Using Pre-Built Metrics" to AspectCritic with modern
async/await
syntax
- Updated "Evaluating on a Dataset" to use ragas.Dataset API

### Build Configuration (mkdocs.yml):
- Made social plugin conditional: `enabled: !ENV [MKDOCS_CI, true]`

### Makefile:
- Added explicit `MKDOCS_CI=false` to the serve-docs target. This avoids
a social plugin error on macOS when `cairosvg` is not found.
## Summary

Fixes #2404 

The formula for recall was incorrectly labeled as "Precision" in the SQL
metrics documentation.

## Changes

- Fixed line 20 in 
- Changed label from "Precision" to "Recall" for the formula using
reference rows as denominator

## Explanation

The two formulas represent different metrics:

**Precision** (line 16):
- Numerator: Number of matching rows
- **Denominator: Total rows in response**
- Measures: "Of all rows we returned, how many were correct?"

**Recall** (line 20):
- Numerator: Number of matching rows  
- **Denominator: Total rows in reference**
- Measures: "Of all correct rows, how many did we return?"

The formula on line 20 was mislabeled as Precision but is actually the
Recall formula.
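
In formula form, matching the corrected labels:

$$
\text{Precision} = \frac{|\text{matching rows}|}{|\text{rows in response}|},
\qquad
\text{Recall} = \frac{|\text{matching rows}|}{|\text{rows in reference}|},
\qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$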

## Testing

- Verified formula definitions match standard precision/recall
definitions
- Confirmed the fix aligns with the explanation in line 23 about F1
score being the harmonic mean of precision and recall
## Summary

This PR updates the documentation for metrics to showcase the new
`ragas.metrics.collections` API as the primary recommended approach,
while preserving legacy API documentation for backward compatibility.

## Changes

### Metrics
- [x] AnswerAccuracy 
- [x] AnswerCorrectness
- [x] AnswerRelevancy
- [x] AnswerSimilarity
- [x] BleuScore
- [x] ContextEntityRecall
- [x] ContextPrecision
- [x] ContextUtilization
- [x] Faithfulness 
- [x] ContextRelevance
- [x] NoiseSensitivity
- [x] RougeScore
- [x] SemanticSimilarity
- [x] String metrics (ExactMatch, StringPresence,
NonLLMStringSimilarity, DistanceMeasure)
- [x] SummaryScore

## Documentation Pattern

Each metric documentation follows this structure:
1. **Primary Example**: Collections-based API (modern, recommended)
2. **Concepts/How It's Calculated**: Conceptual explanation
(implementation-agnostic)
3. **Legacy Section**: Original API for backward compatibility

## Test Plan

- [x] Build docs locally to verify formatting
- [x] Test code examples to ensure they work
- [x] Verify all metrics from collections are documented
…avor of SemanticSimilarity (#2410)

## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- Fixes #2409
…adigm (#2394)

## Problem Description: 
Documentation structure was outdated and didn't reflect the current
library focus on experiments and custom metrics. The library has evolved
to emphasize systematic experimentation and custom metrics for
evaluating any AI application.

## Changes Made
- **Home page (`docs/index.md`)**: Added "Why Ragas?" section explaining
value proposition and key features. Updated to reflect experiments-first
approach, custom metrics, and broader AI application evaluation. Removed
FAQ section and improved card descriptions to be more actionable.

- **Get Started (`docs/getstarted/index.md`)**: Reorganized to reflect
experiments quickstart as the main entry point. Removed links to
outdated RAG-focused tutorials (evals.md, rag_eval.md,
rag_testset_generation.md). Added Discord community link and organized
tutorials into clearer sections.

- **Core Concepts (`docs/concepts/index.md`)**: Reordered sections to
prioritize Experimentation and Datasets at the top with separate cards.
Updated metrics description to reflect both available metrics library
and creating custom metrics. Removed Feedback Intelligence card. Moved
Components to the end.

- **Navigation (`mkdocs.yml`)**: Updated navigation structure to match
current library organization. Removed outdated tutorials from Get
Started navigation. Flattened Experiments section (Experimentation and
Datasets as direct children). Removed Feedback Intelligence from
navigation. Reorganized to reflect experiments-first paradigm.

## Testing
### How to Test
- [ ] Automated tests added/updated: N/A (documentation changes only)
- [ ] Manual testing steps:
  1. Build docs locally: make serve-docs 
  2. Navigate through home page and verify "Why Ragas?" section appears
3. Check Get Started section - experiments quickstart should be
prominent
  4. Verify Core Concepts has Experimentation and Datasets at the top
  5. Confirm outdated RAG tutorials are no longer in navigation
  6. Test all internal links work correctly
  7. Verify navigation structure matches current library organization

## References
- Related issues: 
- Documentation: Updated to reflect current library capabilities and
focus
- External references: N/A

## Screenshots/Examples (if applicable)
<!-- Navigation structure changes and content reorganization visible in
local docs build -->
## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- Fixes #2411
…re factuality mode (#2414)

## Issue Link / Problem Description
<!-- Link to related issue or describe the problem this PR solves -->
- Fixes #2408
---------

Co-authored-by: Kumar Anirudha <[email protected]>
…2420)

## Issue Link / Problem Description
- Follow-up to PR #2407
- Completes the migration to collections-based API documentation for
metrics that were not covered in the initial PR

## Changes Made
- Updated **ContextRecall** documentation to showcase
`ragas.metrics.collections.ContextRecall` as the primary example
- Updated **FactualCorrectness** documentation to showcase
`ragas.metrics.collections.FactualCorrectness` with configuration
options (mode, atomicity, coverage)
- Updated **ResponseGroundedness** documentation in nvidia_metrics.md to
showcase `ragas.metrics.collections.ResponseGroundedness` as the primary
example
- Moved all legacy API examples to "Legacy Metrics API" sections with
deprecation warnings
- Added synchronous usage notes (`.score()` method) for all three
metrics
- Preserved all conceptual explanations and "How It's Calculated"
sections

## Testing
### How to Test
- [x] Automated tests added/updated: N/A (documentation only)
- [x] Manual testing steps:
  1. Verified `make build-docs` succeeds without errors ✓
  2. Tested all new code examples to ensure they work as documented
  3. Confirmed output values match expected results
  4. Verified consistency with PR #2407 documentation style

## References
- Related issues: Follow-up to PR #2407
- Documentation: 
  - Updated: `docs/concepts/metrics/available_metrics/context_recall.md`
- Updated:
`docs/concepts/metrics/available_metrics/factual_correctness.md`
- Updated: `docs/concepts/metrics/available_metrics/nvidia_metrics.md`
(ResponseGroundedness section)
- Pattern reference: PR #2407 (faithfulness.md, context_precision.md,
answer_correctness.md)

## Screenshots/Examples (if applicable)

All three metrics now follow the consistent pattern:
1. **Primary Example**: Collections-based API (modern, recommended)
2. **Concepts**: Implementation-agnostic explanation
3. **Synchronous Usage Note**: `.score()` method alternative
4. **Legacy Section**: Original API with deprecation timeline warnings
…ort (#2424)

A fix to support the latest instructor release, which removed the
`from_anthropic` and `from_gemini` methods in favor of the more standard
`from_provider`. Ref: [PR 1898](567-labs/instructor#1898)

Also added support for #2422
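
For reference, the new instructor entry point looks roughly like this (the model string is illustrative):

```python
import instructor

# Replaces the removed provider-specific constructors such as
# instructor.from_anthropic(...) and instructor.from_gemini(...).
client = instructor.from_provider("anthropic/claude-3-5-sonnet-20241022")
```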
@anistark (Member, Author)

Will work on a different approach. Closing this.

@anistark anistark closed this Nov 17, 2025
@anistark anistark deleted the feat/openai-batch-api branch November 17, 2025 05:53
Linked issue: Support for Batch API