
CMRNLP-75 Golden Dataset - Evaluate collection embedding search with golden dataset#30

Open
cli620 wants to merge 4 commits into main from CMRNLP-75

Conversation


@cli620 cli620 commented Feb 17, 2026

Add Offline RAG Evaluation Framework

Overview

Implements a comprehensive offline evaluation framework for the Earthdata RAG system, enabling systematic quality assessment of the retrieval and generation components using multiple metrics tracked in Langfuse. Offline evaluation is intended to make A/B testing of different retrieval strategies consistent and controlled.


What's New

Core Evaluation System (rag_eval/)

EarthdataEvaluator - Main orchestrator for batch evaluation

  • Loads test cases from JSON files
  • Queries MCP server for each test case
  • Computes multiple evaluation metrics
  • Sends detailed traces and scores to Langfuse for visualization

SingleEvaluation - Reusable singleton class for scoring

  • Shared evaluation logic enabling future online evaluation
  • Lazy initialization of LLM and embeddings for efficiency
  • Returns structured scores with metadata (value, comment, data_type)
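The pattern described above can be sketched as a minimal class. This is a simplified stand-in with placeholder internals; the real class in `rag_eval/evals.py` builds Bedrock LLM and embedding clients and computes real metric values:

```python
class SingleEvaluation:
    """Sketch of a singleton scorer with lazy component initialization."""

    _llm = None         # shared across calls, created on first use
    _embeddings = None  # likewise lazily initialized

    @classmethod
    def _initialize_components(cls):
        # Expensive clients are built once and cached at class level.
        if cls._llm is None:
            cls._llm = object()         # placeholder for a Bedrock LLM client
            cls._embeddings = object()  # placeholder for Titan embeddings

    @classmethod
    def evaluate_single(cls, question, contexts, answer):
        cls._initialize_components()
        # Each metric is returned as structured metadata, not a bare float,
        # so the caller can forward value/comment/data_type to Langfuse.
        return {
            "faithfulness": {
                "value": 1.0,  # placeholder score
                "comment": "answer grounded in retrieved contexts",
                "data_type": "NUMERIC",
            }
        }
```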

Evaluation Metrics

1. Custom Dataset Relevance (LLM-based scoring)

  • Individual relevance score (0.0-1.0) for each retrieved dataset
  • Average dataset relevance across all results
  • Max dataset relevance (best match score)
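The average and max aggregates can be sketched as follows (hypothetical helper name; in the real pipeline the per-dataset values come from LLM scoring):

```python
def aggregate_relevance(individual_scores):
    """Aggregate per-dataset relevance scores (each in 0.0-1.0) into the
    average and best-match metrics described above."""
    if not individual_scores:
        # No retrieved datasets: report zero rather than failing on division.
        return {"avg_dataset_relevance": 0.0, "max_dataset_relevance": 0.0}
    return {
        "avg_dataset_relevance": sum(individual_scores) / len(individual_scores),
        "max_dataset_relevance": max(individual_scores),
    }
```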

2. Ragas Metrics (Industry-standard RAG evaluation)

  • Faithfulness: Measures factual accuracy of generated answer against retrieved contexts
  • Context Precision: Evaluates ranking quality of retrieved contexts

Supporting Infrastructure

EarthdataRAGClient

  • HTTP client for MCP server communication
  • Session management with lazy initialization
  • Handles SSE responses and JSON parsing
  • Graceful error handling for double-encoded responses
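The SSE and double-encoding handling can be sketched roughly like this (a simplified stand-in for the parsing in `rag_eval/mcp_client.py`; the real client also manages sessions and HTTP errors):

```python
import json

def parse_sse_body(body: str):
    """Parse an SSE response body, returning the last valid JSON event.

    Handles 'data: '-prefixed lines, skips the '[DONE]' sentinel, and
    unwraps double-encoded payloads (a JSON string containing JSON).
    """
    result = None
    for line in body.split("\n"):
        if line.startswith("data: "):
            data = line[len("data: "):].strip()
            if data and data != "[DONE]":
                try:
                    parsed = json.loads(data)
                except json.JSONDecodeError:
                    continue  # ignore malformed events
                if isinstance(parsed, str):
                    # Double-encoded: the event payload is itself JSON text.
                    try:
                        parsed = json.loads(parsed)
                    except json.JSONDecodeError:
                        pass
                result = parsed
    return result
```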

DatasetRelevancePrompt

  • Pydantic-based LLM prompt for structured dataset scoring
  • Input/output validation
  • Few-shot examples for consistent scoring
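A rough sketch of the structured-output idea, using a plain dataclass in place of the actual Pydantic model (field names and example texts are illustrative, not taken from the PR):

```python
from dataclasses import dataclass

@dataclass
class DatasetRelevanceOutput:
    """Structured LLM output with bounds checking, standing in for the
    validation the Pydantic model provides."""
    relevance_score: float
    reasoning: str = ""

    def __post_init__(self):
        # Enforce the 0.0-1.0 scoring contract at construction time.
        if not 0.0 <= self.relevance_score <= 1.0:
            raise ValueError("relevance_score must be in [0.0, 1.0]")

# Few-shot examples (query, dataset title, expected score) anchor the LLM
# to a consistent scoring scale across runs.
FEW_SHOT_EXAMPLES = [
    ("Find sea surface temperature data", "MODIS SST L3", 0.9),
    ("Find sea surface temperature data", "Lunar regolith samples", 0.1),
]
```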

Bedrock Integration

  • AWS Bedrock Nova Pro for evaluation scoring
  • Titan embeddings for semantic similarity

Observability

  • Full Langfuse integration for trace visualization
  • Detailed spans for retrieval, generation, and scoring phases
  • All metrics automatically logged with descriptive comments
  • Trace IDs included in evaluation results for debugging

Configuration

Environment variables:

MCP_SERVER_URL=https://cmr.sit.earthdata.nasa.gov/mcp
TESTSET_PATH=./evals/datasets/earthdata_manual_testset.json
# Standard Langfuse env vars (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_BASE_URL)

---

### Architecture Benefits

- ✅ **Modular design**: `SingleEvaluation` enables code reuse for future online evaluation
- ✅ **Dynamic scoring**: Metrics handled generically via metadata structure, easy to extend
- ✅ **Singleton pattern**: Expensive LLM/embedding initialization happens once
- ✅ **Type safety**: Pydantic models for structured LLM inputs/outputs
- ✅ **Clean separation**: Evaluation logic decoupled from Langfuse sending

---

### Files Added

- `rag_eval/evals.py` (~400 lines) - Main evaluation orchestration
- `rag_eval/mcp_client.py` (~280 lines) - MCP server HTTP client
- `rag_eval/models.py` (~70 lines) - Pydantic models for dataset relevance
- `rag_eval/ragas_utils.py` (~100 lines) - Ragas metrics utilities

---

### Next Steps

This PR establishes the foundation for:

- 🎯 Online per-request evaluation (reusing `SingleEvaluation` class)
- 📊 Additional custom and industry-standard RAG metrics
- 🔬 A/B testing different retrieval strategies
- 📈 Continuous monitoring of RAG quality in production


<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Added a RAG evaluation suite for assessing data retrieval and answer quality on Earthdata.
  * Added a synthetic test-set generator to produce evaluation datasets from Earthdata sources.
  * Added a client to query the Earthdata MCP server and end-to-end evaluation orchestration with run-level and item-level metrics.

* **Documentation**
  * Added a README with setup, usage, and customization guidance.

* **Chores**
  * Added project packaging and runtime dependencies to support the evaluation pipeline.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->


coderabbitai bot commented Feb 17, 2026

📝 Walkthrough

Adds a new rag_eval package providing end-to-end RAG evaluation for Earthdata MCP: evaluation orchestration, MCP client, Ragas metric utilities, test-set generation, models, sample dataset, Langfuse helpers, and packaging/dependency updates.

Changes

  • Project config (pyproject.toml, rag_eval/pyproject.toml): Added packaging metadata and new runtime dependencies (ragas, litellm, nest-asyncio, etc.). Review dependency pins and build config.
  • Evaluation core (rag_eval/evals.py, rag_eval/__init__.py): New EarthdataEvaluator and SingleEvaluation classes plus the package version; main orchestration, task/evaluator builders, run-level aggregation, and a CLI entry point. High logic density.
  • MCP client (rag_eval/mcp_client.py): New EarthdataRAGClient: session init, SSE parsing, discover_data JSON-RPC queries, result formatting, and lifecycle management. Review HTTP/SSE parsing and error paths.
  • Ragas & LLM utilities (rag_eval/ragas_utils.py, rag_eval/models.py): LLM/embedding factories, metric init and scoring helpers, and Pydantic prompt/models for dataset relevance scoring. Check model shapes and metric initialization.
  • Test data & generator (rag_eval/evals/datasets/earthdata_manual_sample.json, rag_eval/test_set_generator.py): Adds a 20-case sample dataset and a complex EarthdataTestSetGenerator (DB ingestion, knowledge-graph creation, enrichment, testset synthesis). Review DB fallbacks, the enrichment pipeline, and external API usage.
  • Langfuse utilities (util/langfuse.py): Added get_current_trace_id and create_score helpers that wrap Langfuse interactions with safe fallbacks.
  • Docs & packaging init (rag_eval/README.md, rag_eval/__init__.py): Documentation and package init/version added. Verify README instructions match the new dependencies and entry points.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Evaluator as EarthdataEvaluator
    participant Dataset as Dataset Storage
    participant MCPClient as EarthdataRAGClient
    participant Metrics as SingleEvaluation
    participant Langfuse as Langfuse

    User->>Evaluator: run_experiment(dataset_name)
    Evaluator->>Dataset: load_dataset(name)
    Dataset-->>Evaluator: test cases

    loop per test case
        Evaluator->>Langfuse: start_trace()
        Langfuse-->>Evaluator: trace_id

        Evaluator->>MCPClient: query(question)
        MCPClient-->>Evaluator: {collections, answer}

        Evaluator->>Metrics: evaluate_single(question, collections, contexts, answer)
        Metrics-->>Evaluator: metric scores

        Evaluator->>Langfuse: create_score(scores)
    end

    Evaluator->>Langfuse: flush()
    Evaluator-->>User: aggregated_results
sequenceDiagram
    participant User
    participant Generator as EarthdataTestSetGenerator
    participant DB as PostgreSQL
    participant KG as KnowledgeGraph
    participant LLM as Bedrock/OpenAI
    participant Output as JSON

    User->>Generator: generate_testset(size)
    Generator->>DB: load_all_entities_from_db()
    DB-->>Generator: entities

    Generator->>KG: create_knowledge_graph(entities)
    Generator->>KG: enrich_knowledge_graph()
    KG-->>Generator: enriched_kg

    Generator->>LLM: synthesize_test_cases(personas, enriched_kg)
    LLM-->>Generator: test_cases

    Generator->>Output: save_json(test_cases)
    Output-->>User: testset file

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • abbottry
  • dmistry1

Poem

🐰 I nibble through prompts and datasets bright,

I fetch the facts by day and night,
With ragas hum and traces traced,
Earthdata answers, neatly placed —
A hopping cheer for tests now right!

🚥 Pre-merge checks: ✅ 3 passed
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly and specifically references the main change: evaluating collection embedding search with a golden dataset, aligned with the CMRNLP-75 ticket and the substantial new rag_eval framework.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 96.67%, above the required 80.00% threshold.



codecov-commenter commented Feb 17, 2026

Codecov Report

❌ Patch coverage is 0.43668% with 456 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.55%. Comparing base (2b5e58a) to head (af1ce24).
⚠️ Report is 2 commits behind head on main.

Files with missing lines:
  • rag_eval/evals.py: 0.00% patch coverage, 266 lines missing ⚠️
  • rag_eval/mcp_client.py: 0.00% patch coverage, 130 lines missing ⚠️
  • rag_eval/ragas_utils.py: 0.00% patch coverage, 28 lines missing ⚠️
  • util/langfuse.py: 10.00% patch coverage, 18 lines missing ⚠️
  • rag_eval/models.py: 0.00% patch coverage, 13 lines missing ⚠️
  • rag_eval/__init__.py: 0.00% patch coverage, 1 line missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #30       +/-   ##
===========================================
- Coverage   88.18%   76.55%   -11.63%     
===========================================
  Files          50       57        +7     
  Lines        2361     2952      +591     
===========================================
+ Hits         2082     2260      +178     
- Misses        279      692      +413     

☔ View full report in Codecov by Sentry.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

🧹 Nitpick comments (12)
rag_eval/ragas_utils.py (2)

95-95: Use lazy %s formatting in logger call instead of f-string.

f-strings are evaluated eagerly regardless of log level.

Suggested fix
-        logger.info(f"Calculating {m.name}...")
+        logger.info("Calculating %s...", m.name)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/ragas_utils.py` at line 95, Replace the eager f-string logger call
with lazy %-style formatting to avoid evaluating m.name when the log level would
skip it; specifically change the logger.info(f"Calculating {m.name}...") call to
use logger.info("Calculating %s...", m.name) (referencing the same logger.info
and m.name symbols in ragas_utils.py).

57-72: init_ragas_metrics creates a new RunConfig per metric — likely intended to be shared.

A single RunConfig instance could be created outside the loop and reused for all metrics. This is a minor efficiency and clarity improvement.

Suggested fix
 def init_ragas_metrics(metrics, llm, embedding):
+    run_config = RunConfig()
     for metric in metrics:
         if isinstance(metric, MetricWithLLM):
             metric.llm = llm
         if isinstance(metric, MetricWithEmbeddings):
             metric.embeddings = embedding
-        run_config = RunConfig()
         metric.init(run_config)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/ragas_utils.py` around lines 57 - 72, The function
init_ragas_metrics currently constructs a new RunConfig() inside the loop for
every metric; instead, create one RunConfig instance before the for loop and
pass that same run_config to each metric.init(...) so all metrics share the same
RunConfig; keep the existing handling where you set metric.llm for MetricWithLLM
and metric.embeddings for MetricWithEmbeddings, but move the RunConfig
allocation out of the loop in init_ragas_metrics and call
metric.init(run_config) for each metric.
rag_eval/models.py (1)

29-38: Trailing whitespace on Line 30 inside the instruction string.

Line 30 (after "...user's question.") appears to have trailing whitespace before the blank line. While this won't affect functionality, it will be embedded in the prompt text sent to the LLM.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/models.py` around lines 29 - 38, The instruction string assigned to
variable instruction in rag_eval/models.py contains trailing whitespace after
"user's question." — remove that trailing space inside the triple-quoted string
(or trim the string when assigning, e.g., use .strip()) so the prompt text sent
to the LLM has no embedded trailing whitespace; update the instruction
assignment in the instruction variable to eliminate the extra space.
rag_eval/pyproject.toml (1)

27-28: Empty py-modules suppresses auto-discovery, but no packages entry is declared.

With py-modules = [] and no [tool.setuptools.packages] or [tool.setuptools.packages.find], setuptools may not discover any modules at all when this package is installed. If this is intended only for uv sync / editable installs it might work, but a pip install . may produce an empty package.

Suggested fix
 [tool.setuptools]
-py-modules = []
+packages = ["rag_eval"]

Or add auto-discovery:

[tool.setuptools.packages.find]
where = ["."]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/pyproject.toml` around lines 27 - 28, The build config currently
sets py-modules = [] which disables module auto-discovery and there are no
package declarations, so pip installs can produce an empty distribution; fix by
either removing the empty py-modules entry or adding a packages discovery block
(e.g. add [tool.setuptools.packages.find] with where = ["."], or explicitly
declare package names under [tool.setuptools.packages]) so setuptools can find
and include your package sources (look for the py-modules setting and update it
or add the packages.find section).
rag_eval/evals.py (5)

3-3: Unused asyncio import.

asyncio is imported but never used anywhere in this file.

Proposed fix
-import asyncio
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/evals.py` at line 3, The file imports asyncio but never uses it;
remove the unused import statement (the "import asyncio" line at the top of
rag_eval/evals.py) to eliminate the dead dependency and any linter warnings,
leaving other imports and code unchanged.

125-125: f-string without placeholders.

Static analysis flags this as F541.

-        logger.info(f"Dataset loaded successfully")
+        logger.info("Dataset loaded successfully")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/evals.py` at line 125, The log statement uses an unnecessary
f-string which triggers F541; change the call to use a normal string literal in
the logger.info invocation (replace the f-string used in the
logger.info("Dataset loaded successfully") call with a plain string) so no
f-prefix is present—locate the logger.info usage in rag_eval/evals.py and remove
the f-prefix.

106-114: _score_dataset is dead code — evaluate() delegates to SingleEvaluation.evaluate_single instead.

This method is defined on EarthdataEvaluator but never called. The actual per-dataset scoring happens inside SingleEvaluation.evaluate_single (lines 342-350), which duplicates the same logic. Consider removing this method or having SingleEvaluation delegate to it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/evals.py` around lines 106 - 114, The _score_dataset method on
EarthdataEvaluator is dead code because evaluate() uses
SingleEvaluation.evaluate_single to score datasets; remove the duplication by
either deleting _score_dataset or refactoring SingleEvaluation.evaluate_single
to call EarthdataEvaluator._score_dataset (or vice versa). Concretely, choose
one implementation of the scoring logic (the code that builds
DatasetRelevanceInput and calls relevance_prompt.generate with llm and returns
relevance_score) and update SingleEvaluation.evaluate_single or
EarthdataEvaluator._score_dataset to delegate to that single function (keep
symbols: EarthdataEvaluator, _score_dataset, SingleEvaluation.evaluate_single,
relevance_prompt.generate, llm) and remove the redundant copy.

186-208: zip() without strict=True silently drops trailing elements.

As noted in the sibling comment about misaligned scores, using strict=True here would at least surface the length mismatch as a ValueError rather than silently producing wrong mappings.

Proposed fix (once the sentinel-on-failure fix is applied)
-                    for collection_idx, (collection, score) in enumerate(
-                        zip(collections, individual_scores)
-                    ):
+                    for collection_idx, (collection, score) in enumerate(
+                        zip(collections, individual_scores, strict=True)
+                    ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/evals.py` around lines 186 - 208, The loop pairing collections and
individual_scores uses zip without strict=True which silently drops trailing
items; update the pairing in the block that enumerates zip(collections,
individual_scores) to use zip(collections, individual_scores, strict=True) so a
length mismatch raises a ValueError and surfaces the misalignment (this change
applies to the loop that creates the trace via
trace.start_as_current_observation and calls scoring_span.score); if Python
versions in CI must support this, alternatively assert
len(collections)==len(individual_scores) before the loop to explicitly fail on
mismatch.

281-312: Singleton SingleEvaluation is not thread-safe and duplicates EarthdataEvaluator initialization.

The class-level singleton pattern (_llm, _embeddings, etc.) has no synchronization, so concurrent callers could race on _initialize_components. Additionally, when used from EarthdataEvaluator.evaluate(), the evaluator creates its own LLM/embeddings (lines 54-59) while SingleEvaluation lazily creates a separate set with potentially different max_tokens (10000 vs 4096 on line 302). This means two Bedrock LLM instances exist with different configs, which may cause confusion or unexpected scoring differences.

Consider either:

  • Having SingleEvaluation accept injected components, or
  • Removing the duplicate initialization in EarthdataEvaluator
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/evals.py` around lines 281 - 312, The singleton init in
SingleEvaluation (_initialize_components) is racy and duplicates
EarthdataEvaluator.evaluate's LLM/embeddings (different max_tokens: 4096 vs
10000); make the singleton thread-safe and stop duplicate creation by 1) adding
a class-level threading.Lock and surrounding the body of
SingleEvaluation._initialize_components with lock acquisition to ensure only one
init runs, 2) allow injection of components into SingleEvaluation (add optional
parameters llm, embeddings, relevance_prompt, metrics to the initializer or a
new classmethod) and prefer injected components over creating new ones, and 3)
update EarthdataEvaluator.evaluate to pass its created LLM/embeddings into
SingleEvaluation instead of creating separate instances (and consolidate the
create_bedrock_llm max_tokens setting into one place so both use the same
value).
rag_eval/test_set_generator.py (2)

16-18: sys is imported but never used.

 import json
 import os
-import sys
 from datetime import datetime
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/test_set_generator.py` around lines 16 - 18, The file imports the
unused module sys; remove the unused "import sys" statement from the top of the
file (in the import block alongside json and os) to clean up the namespace and
satisfy linting, or if sys was intended to be used, replace the removal by
actually using sys in the relevant function (e.g., within test set generation
code) and reference the symbol sys where needed.

428-463: Multiple unused variable assignments flagged by static analysis.

headline_extractor, headline_splitter, keyphrase_extractor, and keyphrase_similarity_builder are assigned but never used (all commented out in transforms). Remove the assignments or prefix with _ to signal intentional non-use.

Proposed cleanup — comment out the unused assignments as well
-        headline_extractor = HeadlinesExtractor(llm=self.generator_llm, max_num=20)
-        headline_splitter = HeadlineSplitter(max_tokens=1500)
+        # headline_extractor = HeadlinesExtractor(llm=self.generator_llm, max_num=20)
+        # headline_splitter = HeadlineSplitter(max_tokens=1500)
         topic_extractor = TopicDescriptionExtractor(llm=self.generator_llm)
         # ner_extractor = NERExtractor(llm=self.generator_llm)
-        keyphrase_extractor = KeyphrasesExtractor(llm=self.generator_llm)
+        # keyphrase_extractor = KeyphrasesExtractor(llm=self.generator_llm)
 
         # Step 2: Build relationships between nodes based on extracted properties
         topic_similarity_builder = JaccardSimilarityBuilder(
@@ ...
-        keyphrase_similarity_builder = JaccardSimilarityBuilder(
-            property_name="keyphrases",
-            new_property_name="keyphrase_similarity",
-            threshold=0.1,
-        )
+        # keyphrase_similarity_builder = JaccardSimilarityBuilder(
+        #     property_name="keyphrases",
+        #     new_property_name="keyphrase_similarity",
+        #     threshold=0.1,
+        # )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/test_set_generator.py` around lines 428 - 463, Several variables are
assigned but never used—headline_extractor, headline_splitter,
keyphrase_extractor, and keyphrase_similarity_builder—triggering static analysis
warnings; either remove these unused assignments or rename them with a leading
underscore (e.g., _headline_extractor) to mark intentional non-use, and keep the
active transforms list intact (referencing HeadlinesExtractor, HeadlineSplitter,
KeyphrasesExtractor, and JaccardSimilarityBuilder instances) so the linter no
longer reports unused-variable issues.
rag_eval/mcp_client.py (1)

186-195: SSE parsing silently takes the last data: line, discarding earlier ones.

The loop overwrites result_data on each successful parse, so only the final SSE data: line is used. If the server sends multiple JSON events (e.g., progress + final result), intermediate data is silently dropped. This may be intentional, but if the server sends the actual result in an earlier event and a status/summary in the last, you'd lose the result.

Consider either documenting this "last-wins" behavior or accumulating events:

-            result_data = None
-            for line in response.text.split("\n"):
-                if line.startswith("data: "):
-                    data = line[6:].strip()  # Remove 'data: ' prefix
-                    if data and data != "[DONE]":
-                        try:
-                            result_data = json.loads(data)
-                        except json.JSONDecodeError:
-                            continue
+            # SSE may contain multiple data lines; use the last valid JSON event
+            result_data = None
+            for line in response.text.split("\n"):
+                if line.startswith("data: "):
+                    data = line[6:].strip()
+                    if data and data != "[DONE]":
+                        try:
+                            result_data = json.loads(data)
+                        except json.JSONDecodeError:
+                            continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/mcp_client.py` around lines 186 - 195, The SSE parsing loop
currently overwrites result_data for each parsed "data: " line (using
response.text and json.loads), losing earlier events; change this to accumulate
parsed events (e.g., create a list like parsed_events and append each successful
json.loads(data)) instead of assigning to result_data, then decide after the
loop whether to use the first/last/merged event or return the list—update any
downstream use of result_data accordingly (or, if intended behavior is to keep
the first event, set result_data only when it is None).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Around line 24-25: The dependencies "ragas" and "litellm" are evaluation-only
and should be removed from the top-level runtime dependencies in pyproject.toml;
instead add them under an optional dependency group (e.g.,
project.optional-dependencies or tool.poetry.extras) as an "rag_eval" or
"evaluation" extra so only users who need the rag_eval/ package install them;
update references to the dependency names "ragas" and "litellm" accordingly and
ensure the main MCP runtime dependency list no longer contains these two
entries.

In `@rag_eval/__init__.py`:
- Around line 1-3: The package version constant is inconsistent: update the
__version__ variable in rag_eval.__init__ to match pyproject.toml by changing
__version__ = "1.0.0" to __version__ = "0.1.0"; verify the value matches
pyproject.toml and keep them synchronized going forward (update whichever file
should be authoritative if different).

In `@rag_eval/evals.py`:
- Around line 314-321: The return type hint for evaluate_single is wrong: it
currently declares dict[str, float] but actually returns a mapping whose values
are dicts with keys "value", "comment", and "data_type". Fix the signature of
evaluate_single to reflect that structure (e.g., dict[str, dict[str, Any]] or
define a small TypedDict/NamedTuple for the result shape and use dict[str,
YourTypedDict]) and update any related imports; ensure the chosen type
explicitly documents the inner keys "value", "comment", and "data_type".
- Around line 340-360: The current loop building collection_scores skips
appending when an exception occurs during cls._relevance_prompt.generate (or in
DatasetRelevanceInput creation), which makes collection_scores shorter than
collections and causes misalignment when evaluate() zips collections with
individual_scores; change the except block inside the loop to append a sentinel
value (e.g., None or a designated numeric sentinel like -1/NaN) to
collection_scores and include the error details in the logger warning so lengths
remain matched; ensure the stored all_scores["individual_dataset_scores"]
contains the sentinel entries so downstream code in evaluate() can correctly zip
collections with individual_dataset_scores and handle the sentinel cases
appropriately.
- Around line 121-142: The evaluate method calls
langfuse.start_as_current_observation without checking that get_langfuse()
returned a non-None object; if get_langfuse() returns None this raises an
AttributeError. Fix by guarding the trace usage: after calling get_langfuse()
check if langfuse is truthy and only use langfuse.start_as_current_observation
inside that conditional (or provide a no-op context when None), updating the
evaluate function (where langfuse is defined and where
start_as_current_observation is invoked) to safely handle a None return.

In `@rag_eval/evals/datasets/earthdata_manual_sample.json`:
- Around line 37-41: Update the metadata for the record with id "q002" so its
"topic" value is "forestry" instead of "agriculture"; locate the JSON object
containing "q002" (the entry whose metadata includes "query_type":"semantic" and
expected_keywords ["forest","forestry","trees","forest management"]) and change
the "topic" field to "forestry" to align with the question and keywords.

In `@rag_eval/mcp_client.py`:
- Around line 81-83: The session initialization currently swallows exceptions
(logger.error in the except) leaving session_id as None and the client in a
broken state; update the initialization flow (in the method that sets session_id
/ handles session creation and the class that owns query(), e.g., the MCPClient
initializer or initialize_session) to either re-raise the caught exception after
logging or record the failure in a reachable attribute (e.g., self.init_error or
self.initialized = False) so callers can detect failure before calling query();
ensure logger.error still logs exc_info=True and that query() checks the new
failure flag or raises a clear exception if session_id is None or init failed.
- Around line 255-268: The format string f"...{score:.3f}" can raise if
collection["similarity_score"] is present but None; in the block building
answer_parts, coerce score to a numeric default before formatting (e.g., replace
score = collection.get("similarity_score", 0.0) with something like score =
collection.get("similarity_score") or 0.0 or score =
float(collection.get("similarity_score") or 0.0) so that the f-string in the
answer_parts.append (the line containing {score:.3f}) always receives a float
and will not raise TypeError.
- Line 7: The project imports httpx in rag_eval/mcp_client.py but httpx is not
declared as a direct dependency; add "httpx" (with a suitable version
constraint) to the dependencies section of rag_eval/pyproject.toml and/or the
top-level pyproject.toml so the package is explicitly required, then update your
lockfile/install (e.g., poetry lock / poetry install or pip-tools) to ensure
consistent installs; verify imports in rag_eval/mcp_client.py continue to
resolve after updating dependencies.
- Around line 54-79: The SSE fallback never assigns a session ID and
_list_tools() is skipped when the header path returns early; update the
initialization logic in the method that handles the response so that after
parsing SSE JSON (variable result) you set self.session_id =
result.get("sessionId") (or the correct key used by your MCP payload) if
present, and ensure _list_tools() is always called regardless of whether the
session ID came from response.headers or the parsed SSE (remove the early return
or call self._list_tools() before returning); keep existing JSONDecodeError
handling and debug logs.

In `@rag_eval/models.py`:
- Around line 10-12: The relevance_score Field currently allows any float;
enforce the 0.0–1.0 contract by adding bounds validation to the relevance_score
declaration (e.g., use Field(..., ge=0.0, le=1.0) or replace the type with
pydantic.confloat(ge=0.0, le=1.0)) so out-of-range values are rejected at model
validation in rag_eval.models.Relevance (or whatever model/class contains
relevance_score).

In `@rag_eval/README.md`:
- Around line 7-20: The quick-start API key instructions are incorrect because
the code (ragas_utils.py) uses AWS Bedrock via litellm with model
"bedrock/amazon.nova-pro-v1:0" rather than OpenAI/Anthropic/Google; update the
README section to instruct users to configure AWS credentials (e.g.,
AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY or use an AWS CLI
profile) and mention any litellm-specific env vars or credential providers
required for Bedrock access, and replace the OpenAI/ANTHROPIC/GOOGLE examples
with Bedrock-specific setup and a short note referencing ragas_utils.py and the
Bedrock model string so readers know which provider is actually used.
- Around line 52-63: Update the README's directory tree fenced code block to
include a language specifier (e.g., ```text) and correct the file list: replace
the non-existent rag.py entry with the actual files present in the PR
(mcp_client.py, models.py, ragas_utils.py) so the tree accurately reflects the
package; ensure the closing fence remains and the comments (e.g., "# MCP client
for RAG queries") match the new filenames to help readers locate mcp_client.py,
models.py, and ragas_utils.py.
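
The Bedrock credential setup the first README fix calls for might look like the following; the region and credential values are placeholders, not values from the PR:

```shell
# Hypothetical AWS setup for Bedrock access via litellm
# (ragas_utils.py uses model "bedrock/amazon.nova-pro-v1:0" per the review).
# All values below are placeholders — substitute your own.
export AWS_DEFAULT_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
echo "Bedrock region: ${AWS_DEFAULT_REGION}"
```

An AWS CLI profile (via `AWS_PROFILE`) works as an alternative to static keys, since litellm defers to the standard boto3 credential chain.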

In `@rag_eval/test_set_generator.py`:
- Around line 343-367: The catch-all except that prints an error and silently
replaces real DB results with hardcoded `entities` must be changed to avoid
producing fake data unnoticed: either re-raise the caught exception (so callers
fail fast) or make the placeholder fallback explicit via an opt-in flag (e.g.,
`allow_placeholder_fallback=False`) so the caller must opt into using the
hardcoded `entities`; update the error handling in the except block (the one
assigning `entities` in rag_eval/test_set_generator.py) to log the error and
then either `raise` the exception or consult the new opt-in parameter before
populating the hardcoded `entities`, ensuring callers are aware when fallback is
used.
- Around line 608-609: The inline comment says "Generate 20 test questions" but
the call to generator.generate_testset passes testset_size=5; update them to be
consistent by changing the argument to testset_size=20 (or alternatively update
the comment to reflect 5) so that the call to generator.generate_testset and the
comment match; locate the call to generator.generate_testset and the variable
test_data to apply the fix.
- Around line 70-83: The module currently performs provider selection and may
call exit(1) at import time (USE_OPENAI check and OPENAI_API_KEY validation);
move this logic into a runtime initializer such as
EarthdataTestSetGenerator.__init__ or a main() function so importing the module
has no side effects, replace print/exit calls with raising a clear exception or
use logger.error inside the initializer, and ensure the code references the same
symbols (USE_OPENAI flag read inside EarthdataTestSetGenerator.__init__ or
main(), and BEDROCK_CONFIG region assignment moved there) so tests can import
the class without triggering prints or process termination.
- Around line 526-538: The generated metadata in the test_data dict incorrectly
always references BEDROCK_CONFIG for "llm_model" and "embedding_model"; update
the assignment to select the correct configuration based on the active provider
(e.g., use OPENAI_CONFIG when USE_OPENAI is True, otherwise BEDROCK_CONFIG).
Locate the test_data construction in rag_eval/test_set_generator.py and change
the "llm_model" and "embedding_model" values to pull from the appropriate config
variable (reference BEDROCK_CONFIG, OPENAI_CONFIG and the USE_OPENAI flag or the
class/instance provider field) so provenance reflects the actual provider used.
- Around line 226-370: The code calls get_db_connection() before the try and
always calls conn.close() in the finally, which can raise NameError if
connection creation fails; fix by initializing conn = None before calling
get_db_connection() (or moving the get_db_connection() call inside the try) and
update the finally block to only call conn.close() if conn is not None (or
truthy). Specifically update references around get_db_connection(), conn, and
the finally block that calls conn.close() so the connection is safely closed
only when it was successfully created.
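
The safe-close pattern described in the last item can be sketched in isolation; here `sqlite3` stands in for the PR's Postgres connection, and the function name is illustrative:

```python
import sqlite3


def load_entities() -> list[dict]:
    """Sketch of the safe-close pattern; sqlite3 stands in for get_db_connection()."""
    conn = None
    try:
        conn = sqlite3.connect(":memory:")  # may raise before conn is ever bound
        cur = conn.cursor()
        cur.execute("SELECT 1")
        return [{"id": row[0]} for row in cur.fetchall()]
    finally:
        if conn is not None:  # skip close when connection creation itself failed
            conn.close()
```

Because `conn` is initialized to `None` before the `try`, the `finally` block can never raise `NameError`, regardless of where connection creation fails.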

---

Nitpick comments:
In `@rag_eval/evals.py`:
- Line 3: The file imports asyncio but never uses it; remove the unused import
statement (the "import asyncio" line at the top of rag_eval/evals.py) to
eliminate the dead dependency and any linter warnings, leaving other imports and
code unchanged.
- Line 125: The log statement uses an unnecessary f-string which triggers F541;
change the call to use a normal string literal in the logger.info invocation
(replace the f-string used in the logger.info("Dataset loaded successfully")
call with a plain string) so no f-prefix is present—locate the logger.info usage
in rag_eval/evals.py and remove the f-prefix.
- Around line 106-114: The _score_dataset method on EarthdataEvaluator is dead
code because evaluate() uses SingleEvaluation.evaluate_single to score datasets;
remove the duplication by either deleting _score_dataset or refactoring
SingleEvaluation.evaluate_single to call EarthdataEvaluator._score_dataset (or
vice versa). Concretely, choose one implementation of the scoring logic (the
code that builds DatasetRelevanceInput and calls relevance_prompt.generate with
llm and returns relevance_score) and update SingleEvaluation.evaluate_single or
EarthdataEvaluator._score_dataset to delegate to that single function (keep
symbols: EarthdataEvaluator, _score_dataset, SingleEvaluation.evaluate_single,
relevance_prompt.generate, llm) and remove the redundant copy.
- Around line 186-208: The loop pairing collections and individual_scores uses
zip without strict=True which silently drops trailing items; update the pairing
in the block that enumerates zip(collections, individual_scores) to use
zip(collections, individual_scores, strict=True) so a length mismatch raises a
ValueError and surfaces the misalignment (this change applies to the loop that
creates the trace via trace.start_as_current_observation and calls
scoring_span.score); if Python versions in CI must support this, alternatively
assert len(collections)==len(individual_scores) before the loop to explicitly
fail on mismatch.
- Around line 281-312: The singleton init in SingleEvaluation
(_initialize_components) is racy and duplicates EarthdataEvaluator.evaluate's
LLM/embeddings (different max_tokens: 4096 vs 10000); make the singleton
thread-safe and stop duplicate creation by 1) adding a class-level
threading.Lock and surrounding the body of
SingleEvaluation._initialize_components with lock acquisition to ensure only one
init runs, 2) allow injection of components into SingleEvaluation (add optional
parameters llm, embeddings, relevance_prompt, metrics to the initializer or a
new classmethod) and prefer injected components over creating new ones, and 3)
update EarthdataEvaluator.evaluate to pass its created LLM/embeddings into
SingleEvaluation instead of creating separate instances (and consolidate the
create_bedrock_llm max_tokens setting into one place so both use the same
value).

In `@rag_eval/mcp_client.py`:
- Around line 186-195: The SSE parsing loop currently overwrites result_data for
each parsed "data: " line (using response.text and json.loads), losing earlier
events; change this to accumulate parsed events (e.g., create a list like
parsed_events and append each successful json.loads(data)) instead of assigning
to result_data, then decide after the loop whether to use the first/last/merged
event or return the list—update any downstream use of result_data accordingly
(or, if intended behavior is to keep the first event, set result_data only when
it is None).
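
The accumulate-instead-of-overwrite approach suggested above might look like this; the function name is hypothetical and the first/last/merge policy remains the caller's decision:

```python
import json


def parse_sse_events(body: str) -> list[dict]:
    """Collect every JSON payload from 'data: ' lines instead of keeping only the last."""
    parsed_events: list[dict] = []
    for line in body.splitlines():
        if line.startswith("data: "):
            try:
                parsed_events.append(json.loads(line[len("data: "):]))
            except json.JSONDecodeError:
                continue  # skip keep-alive or non-JSON events
    return parsed_events
```

Downstream code could then pick `events[-1]` (current behavior), `events[0]`, or merge the dicts, but the information is no longer silently discarded during parsing.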

In `@rag_eval/models.py`:
- Around line 29-38: The instruction string assigned to variable instruction in
rag_eval/models.py contains trailing whitespace after "user's question." —
remove that trailing space inside the triple-quoted string (or trim the string
when assigning, e.g., use .strip()) so the prompt text sent to the LLM has no
embedded trailing whitespace; update the instruction assignment in the
instruction variable to eliminate the extra space.

In `@rag_eval/pyproject.toml`:
- Around line 27-28: The build config currently sets py-modules = [] which
disables module auto-discovery and there are no package declarations, so pip
installs can produce an empty distribution; fix by either removing the empty
py-modules entry or adding a packages discovery block (e.g. add
[tool.setuptools.packages.find] with where = ["."], or explicitly declare
package names under [tool.setuptools.packages]) so setuptools can find and
include your package sources (look for the py-modules setting and update it or
add the packages.find section).
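
A minimal discovery block for rag_eval/pyproject.toml might look like this, assuming the package sources live alongside the file:

```toml
# Sketch — replaces the empty py-modules entry so setuptools finds the sources.
[tool.setuptools.packages.find]
where = ["."]
include = ["rag_eval*"]
```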

In `@rag_eval/ragas_utils.py`:
- Line 95: Replace the eager f-string logger call with lazy %-style formatting
to avoid evaluating m.name when the log level would skip it; specifically change
the logger.info(f"Calculating {m.name}...") call to use logger.info("Calculating
%s...", m.name) (referencing the same logger.info and m.name symbols in
ragas_utils.py).
- Around line 57-72: The function init_ragas_metrics currently constructs a new
RunConfig() inside the loop for every metric; instead, create one RunConfig
instance before the for loop and pass that same run_config to each
metric.init(...) so all metrics share the same RunConfig; keep the existing
handling where you set metric.llm for MetricWithLLM and metric.embeddings for
MetricWithEmbeddings, but move the RunConfig allocation out of the loop in
init_ragas_metrics and call metric.init(run_config) for each metric.

In `@rag_eval/test_set_generator.py`:
- Around line 16-18: The file imports the unused module sys; remove the unused
"import sys" statement from the top of the file (in the import block alongside
json and os) to clean up the namespace and satisfy linting, or if sys was
intended to be used, replace the removal by actually using sys in the relevant
function (e.g., within test set generation code) and reference the symbol sys
where needed.
- Around line 428-463: Several variables are assigned but never
used—headline_extractor, headline_splitter, keyphrase_extractor, and
keyphrase_similarity_builder—triggering static analysis warnings; either remove
these unused assignments or rename them with a leading underscore (e.g.,
_headline_extractor) to mark intentional non-use, and keep the active transforms
list intact (referencing HeadlinesExtractor, HeadlineSplitter,
KeyphrasesExtractor, and JaccardSimilarityBuilder instances) so the linter no
longer reports unused-variable issues.

Comment on lines +24 to +25
"ragas>=0.4.3",
"litellm>=1.81.12",

⚠️ Potential issue | 🟠 Major

Evaluation-only dependencies added to the main project's runtime dependencies.

ragas and litellm are only used by the rag_eval/ evaluation package, not by the MCP server itself. Adding them as top-level runtime dependencies unnecessarily increases the install footprint for all deployments (including production Lambda/container).

Move these to an optional dependency group:

Suggested fix
     "instructor>=1.13.0",
     "jsonschema>=4.17.3",
-    "ragas>=0.4.3",
-    "litellm>=1.81.12",
 ]

 [project.optional-dependencies]
+eval = [
+    "ragas>=0.4.3",
+    "litellm>=1.81.12",
+]
 dev = [
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 24 - 25, The dependencies "ragas" and "litellm"
are evaluation-only and should be removed from the top-level runtime
dependencies in pyproject.toml; instead add them under an optional dependency
group (e.g., project.optional-dependencies or tool.poetry.extras) as a
"rag_eval" or "evaluation" extra so only users who need the rag_eval/ package
install them; update references to the dependency names "ragas" and "litellm"
accordingly and ensure the main MCP runtime dependency list no longer contains
these two entries.

Comment on lines +1 to +3
"""RAG evaluation package for Earthdata MCP server."""

__version__ = "1.0.0"

⚠️ Potential issue | 🟡 Minor

Version mismatch: __version__ is "1.0.0" but pyproject.toml declares "0.1.0".

These should be consistent. Since this is a new package, 0.1.0 is the more appropriate value.

Suggested fix
-__version__ = "1.0.0"
+__version__ = "0.1.0"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/__init__.py` around lines 1 - 3, The package version constant is
inconsistent: update the __version__ variable in rag_eval.__init__ to match
pyproject.toml by changing __version__ = "1.0.0" to __version__ = "0.1.0";
verify the value matches pyproject.toml and keep them synchronized going forward
(update whichever file should be authoritative if different).

Comment on lines 121 to 142
    def evaluate(self, testset_path: str):
        """Run full evaluation on a test dataset."""
        langfuse = get_langfuse()
        dataset = self.load_dataset(testset_path)
        logger.info(f"Dataset loaded successfully")

        results = []

        for idx, row in enumerate(dataset):
            logger.info("=" * 60)
            logger.info(
                f"Evaluating test case {idx + 1}/{len(dataset)}: {row['question_id']}"
            )
            logger.info(f"Question: {row['user_input']}")

            question = row["user_input"]

            try:
                # Start a new trace for this evaluation
                with langfuse.start_as_current_observation(
                    as_type="span", name="rag"
                ) as trace:

⚠️ Potential issue | 🔴 Critical

get_langfuse() can return None, causing AttributeError on line 140.

get_langfuse() (from util/langfuse.py line 34-46) explicitly returns None when Langfuse fails to initialize. Line 140 unconditionally calls langfuse.start_as_current_observation(...), which will raise AttributeError: 'NoneType' object has no attribute 'start_as_current_observation'.

Proposed fix — guard against None
     def evaluate(self, testset_path: str):
         """Run full evaluation on a test dataset."""
         langfuse = get_langfuse()
+        if langfuse is None:
+            raise RuntimeError(
+                "Langfuse client failed to initialize. Check credentials and configuration."
+            )
         dataset = self.load_dataset(testset_path)
🧰 Tools
🪛 Ruff (0.15.1)

[error] 125-125: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/evals.py` around lines 121 - 142, The evaluate method calls
langfuse.start_as_current_observation without checking that get_langfuse()
returned a non-None object; if get_langfuse() returns None this raises an
AttributeError. Fix by guarding the trace usage: after calling get_langfuse()
check if langfuse is truthy and only use langfuse.start_as_current_observation
inside that conditional (or provide a no-op context when None), updating the
evaluate function (where langfuse is defined and where
start_as_current_observation is invoked) to safely handle a None return.

Comment on lines 314 to 321
    @classmethod
    def evaluate_single(
        cls,
        question: str,
        collections: list[dict],
        contexts: list[str],
        answer: str,
    ) -> dict[str, float]:

⚠️ Potential issue | 🟡 Minor

Return type hint is incorrect.

The hint says dict[str, float] but the method returns dict[str, dict] — each value is a dict with value, comment, and data_type keys.

Proposed fix
-    ) -> dict[str, float]:
+    ) -> dict[str, dict]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/evals.py` around lines 314 - 321, The return type hint for
evaluate_single is wrong: it currently declares dict[str, float] but actually
returns a mapping whose values are dicts with keys "value", "comment", and
"data_type". Fix the signature of evaluate_single to reflect that structure
(e.g., dict[str, dict[str, Any]] or define a small TypedDict/NamedTuple for the
result shape and use dict[str, YourTypedDict]) and update any related imports;
ensure the chosen type explicitly documents the inner keys "value", "comment",
and "data_type".
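
A TypedDict capturing that result shape might look like this; the type name and stub function are illustrative, not the PR's actual code:

```python
from typing import Any, TypedDict


class ScoreResult(TypedDict):
    """Shape of each value in the mapping evaluate_single returns (hypothetical name)."""

    value: Any  # float for scalar metrics, list for per-dataset scores
    comment: str
    data_type: str


def evaluate_single_stub(question: str) -> dict[str, ScoreResult]:
    # Illustrative only — real scoring lives in SingleEvaluation.evaluate_single
    return {
        "avg_dataset_relevance": {
            "value": 0.72,
            "comment": "Average relevance across retrieved datasets",
            "data_type": "NUMERIC",
        }
    }
```

Annotating the return as `dict[str, ScoreResult]` documents the inner keys in one place, so callers and type checkers both know what to expect.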

Comment on lines 340 to 360
        # Score individual collections
        collection_scores = []
        for collection in collections:
            try:
                prompt_input = DatasetRelevanceInput(
                    question=question,
                    dataset_title=collection.get("title", ""),
                    dataset_abstract=collection.get("abstract", "") or "No description",
                )
                result = cls._relevance_prompt.generate(data=prompt_input, llm=cls._llm)
                collection_scores.append(result.relevance_score)
            except Exception as e:
                logger.warning(f"Error scoring collection: {e}")
                continue

        # Store individual scores
        all_scores["individual_dataset_scores"] = {
            "value": collection_scores,
            "comment": "Individual relevance scores for each dataset",
            "data_type": "LIST",
        }

⚠️ Potential issue | 🔴 Critical

Skipped collections on error cause misaligned scores in the caller's zip.

When _score_dataset fails for a collection (line 351-353), the continue skips appending a score, so collection_scores can be shorter than collections. In evaluate() at line 190-191, zip(collections, individual_scores) pairs scores with the wrong collections — the first error shifts all subsequent mappings.

Proposed fix — append a sentinel on failure
             except Exception as e:
                 logger.warning(f"Error scoring collection: {e}")
-                continue
+                collection_scores.append(0.0)  # Default score on failure
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 351-351: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/evals.py` around lines 340 - 360, The current loop building
collection_scores skips appending when an exception occurs during
cls._relevance_prompt.generate (or in DatasetRelevanceInput creation), which
makes collection_scores shorter than collections and causes misalignment when
evaluate() zips collections with individual_scores; change the except block
inside the loop to append a sentinel value (e.g., None or a designated numeric
sentinel like -1/NaN) to collection_scores and include the error details in the
logger warning so lengths remain matched; ensure the stored
all_scores["individual_dataset_scores"] contains the sentinel entries so
downstream code in evaluate() can correctly zip collections with
individual_dataset_scores and handle the sentinel cases appropriately.

Comment on lines +70 to +83
# Choose which provider to use
USE_OPENAI = os.getenv("USE_OPENAI", "false").lower() == "true"

if USE_OPENAI:
    print("🔧 Using OpenAI models")
    if not os.getenv("OPENAI_API_KEY"):
        print("❌ Error: OPENAI_API_KEY environment variable is not set.")
        print("Please set your OpenAI API key:")
        print("export OPENAI_API_KEY='your_openai_api_key'")
        exit(1)
else:
    print("🔧 Using AWS Bedrock models")
    # Set AWS region
    os.environ["AWS_REGION_NAME"] = BEDROCK_CONFIG["region_name"]

⚠️ Potential issue | 🟠 Major

Module-level side effects including exit(1) run on import.

Lines 73-83 execute unconditionally when this module is imported — printing to stdout and potentially calling exit(1). This prevents importing the class in tests or from other modules without triggering these side effects. Move this logic into main() or __init__.

Proposed fix — move provider setup into the class or main()
-# Choose which provider to use
-USE_OPENAI = os.getenv("USE_OPENAI", "false").lower() == "true"
-
-if USE_OPENAI:
-    print("🔧 Using OpenAI models")
-    if not os.getenv("OPENAI_API_KEY"):
-        print("❌ Error: OPENAI_API_KEY environment variable is not set.")
-        print("Please set your OpenAI API key:")
-        print("export OPENAI_API_KEY='your_openai_api_key'")
-        exit(1)
-else:
-    print("🔧 Using AWS Bedrock models")
-    # Set AWS region
-    os.environ["AWS_REGION_NAME"] = BEDROCK_CONFIG["region_name"]
+USE_OPENAI = os.getenv("USE_OPENAI", "false").lower() == "true"

Then validate the API key inside EarthdataTestSetGenerator.__init__ or main().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/test_set_generator.py` around lines 70 - 83, The module currently
performs provider selection and may call exit(1) at import time (USE_OPENAI
check and OPENAI_API_KEY validation); move this logic into a runtime initializer
such as EarthdataTestSetGenerator.__init__ or a main() function so importing the
module has no side effects, replace print/exit calls with raising a clear
exception or use logger.error inside the initializer, and ensure the code
references the same symbols (USE_OPENAI flag read inside
EarthdataTestSetGenerator.__init__ or main(), and BEDROCK_CONFIG region
assignment moved there) so tests can import the class without triggering prints
or process termination.
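
A minimal sketch of moving that validation into the constructor, assuming the class keeps the same environment variables (only the exception type and message are invented here):

```python
import os


class EarthdataTestSetGenerator:
    """Sketch: provider selection happens at construction time, not import time."""

    def __init__(self) -> None:
        self.use_openai = os.getenv("USE_OPENAI", "false").lower() == "true"
        if self.use_openai and not os.getenv("OPENAI_API_KEY"):
            # Raise instead of print + exit(1) so callers and tests can handle it
            raise RuntimeError(
                "OPENAI_API_KEY is not set; export it or unset USE_OPENAI"
            )
```

Importing the module now has no side effects; a failed configuration surfaces as a catchable exception only when the class is actually instantiated.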

Comment on lines +226 to +370
entities = []
conn = get_db_connection()

try:
    with conn.cursor() as cur:
        # Query 1: Collections (from collections table with enriched metadata)
        cur.execute(
            """
            SELECT concept_id, metadata, enriched_metadata
            FROM collections
            ORDER BY concept_id
            LIMIT %s
            """,
            (limit_per_type,),
        )

        for row in cur.fetchall():
            concept_id = row[0]
            metadata = (
                row[1] if isinstance(row[1], dict) else json.loads(row[1])
            )
            enriched = (
                row[2]
                if isinstance(row[2], dict)
                else json.loads(row[2] or "{}")
            )

            title = (
                enriched.get("title")
                or metadata.get("EntryTitle")
                or metadata.get("ShortName", "")
            )
            abstract = enriched.get("abstract") or metadata.get("Abstract", "")

            if title and abstract:
                entities.append(
                    {
                        "id": concept_id,
                        "type": "collection",
                        "title": title,
                        "content": abstract,
                    }
                )

        # Query 2: All other entity types from embeddings table
        # This matches what search_all_entity_types() queries
        # Note: Multiple rows per entity (title, abstract, etc.) so we limit total rows
        cur.execute(
            """
            SELECT type, external_id, attribute, text_content
            FROM embeddings
            WHERE type IN ('variable', 'citation', 'instruments', 'platforms', 'sciencekeywords')
            ORDER BY type, external_id, attribute
            LIMIT %s
            """,
            (
                limit_per_type * 10,
            ),  # Multiply to account for multiple rows per entity
        )

        # Group rows by (type, external_id) since title and abstract are in different rows
        entity_map = {}
        for row in cur.fetchall():
            entity_type = row[0]
            external_id = row[1]
            attribute = row[2]
            text_content = row[3]

            key = (entity_type, external_id)
            if key not in entity_map:
                entity_map[key] = {
                    "type": entity_type,
                    "id": external_id,
                    "attributes": {},
                }

            if text_content:
                entity_map[key]["attributes"][attribute] = text_content

        # Convert grouped entities to list format
        for (entity_type, external_id), data in entity_map.items():
            attrs = data["attributes"]

            # Try to get title from 'title' attribute, fallback to first available
            title = (
                attrs.get("title")
                or attrs.get("name")
                or next(iter(attrs.values()), "")[:100]
            )

            # Try to get content from 'abstract' attribute, fallback to combining all attributes
            content = attrs.get("abstract") or attrs.get("description")
            if not content:
                # Combine all attributes as content if no abstract
                content = "\n\n".join(
                    f"{k}: {v}" for k, v in attrs.items() if k != "title"
                )

            if title and content:
                entities.append(
                    {
                        "id": external_id,
                        "type": entity_type,
                        "title": title,
                        "content": content,
                    }
                )

    print(f"Loaded {len(entities)} total entities from database")

    # Print breakdown by type
    type_counts = {}
    for e in entities:
        t = e["type"]
        type_counts[t] = type_counts.get(t, 0) + 1
    print(f"Entity breakdown: {type_counts}")

except Exception as e:
    print(f"Error loading entities from database: {e}")
    print("Falling back to placeholder data...")

    # Fallback to placeholder data
    entities = [
        {
            "id": "C1234567890-ORNL_DAAC",
            "type": "collection",
            "title": "MODIS/Terra Sea Surface Temperature (SST) Daily L3 Global 4km",
            "content": (
                "This dataset provides daily sea surface temperature measurements "
                "from the MODIS instrument aboard NASA's Terra satellite. Data is "
                "provided at 4km spatial resolution globally. Temporal coverage spans "
                "from 2000 to present, making it ideal for climate studies and ocean "
                "monitoring applications."
            ),
        },
        {
            "id": "V1234567890-VARIABLE",
            "type": "variable",
            "title": "Sea Surface Temperature",
            "content": "Sea surface temperature measured in Celsius from satellite thermal infrared sensors.",
        },
    ]

finally:
    conn.close()

⚠️ Potential issue | 🔴 Critical

get_db_connection() outside try block — finally: conn.close() will raise NameError if the connection fails.

If get_db_connection() on line 227 throws, conn is never assigned, and line 370 (conn.close()) raises NameError.

Proposed fix
         entities = []
-        conn = get_db_connection()
-
-        try:
+        conn = None
+        try:
+            conn = get_db_connection()
             with conn.cursor() as cur:

and update the finally block:

         finally:
-            conn.close()
+            if conn is not None:
+                conn.close()
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 343-343: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/test_set_generator.py` around lines 226 - 370, The code calls
get_db_connection() before the try and always calls conn.close() in the finally,
which can raise NameError if connection creation fails; fix by initializing conn
= None before calling get_db_connection() (or moving the get_db_connection()
call inside the try) and update the finally block to only call conn.close() if
conn is not None (or truthy). Specifically update references around
get_db_connection(), conn, and the finally block that calls conn.close() so the
connection is safely closed only when it was successfully created.

Comment on lines +343 to +367
except Exception as e:
    print(f"Error loading entities from database: {e}")
    print("Falling back to placeholder data...")

    # Fallback to placeholder data
    entities = [
        {
            "id": "C1234567890-ORNL_DAAC",
            "type": "collection",
            "title": "MODIS/Terra Sea Surface Temperature (SST) Daily L3 Global 4km",
            "content": (
                "This dataset provides daily sea surface temperature measurements "
                "from the MODIS instrument aboard NASA's Terra satellite. Data is "
                "provided at 4km spatial resolution globally. Temporal coverage spans "
                "from 2000 to present, making it ideal for climate studies and ocean "
                "monitoring applications."
            ),
        },
        {
            "id": "V1234567890-VARIABLE",
            "type": "variable",
            "title": "Sea Surface Temperature",
            "content": "Sea surface temperature measured in Celsius from satellite thermal infrared sensors.",
        },
    ]

⚠️ Potential issue | 🟠 Major

Silent fallback to hardcoded placeholder data hides database failures.

When the DB query fails, the except block (line 343) prints a message and substitutes two fake entities. A caller generating a test set would unknowingly produce questions from fabricated data, leading to meaningless evaluations. At minimum, re-raise or make the fallback opt-in.

🧰 Tools
🪛 Ruff (0.15.1)

[warning] 343-343: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/test_set_generator.py` around lines 343 - 367, The catch-all except
that prints an error and silently replaces real DB results with hardcoded
`entities` must be changed to avoid producing fake data unnoticed: either
re-raise the caught exception (so callers fail fast) or make the placeholder
fallback explicit via an opt-in flag (e.g., `allow_placeholder_fallback=False`)
so the caller must opt into using the hardcoded `entities`; update the error
handling in the except block (the one assigning `entities` in
rag_eval/test_set_generator.py) to log the error and then either `raise` the
exception or consult the new opt-in parameter before populating the hardcoded
`entities`, ensuring callers are aware when fallback is used.

Comment on lines +526 to +538
        # Convert to our standard format
        test_data = {
            "version": "1.0",
            "generated_at": datetime.now().isoformat(),
            "generator": "ragas-bedrock-nova",
            "config": {
                "llm_model": BEDROCK_CONFIG["llm"],
                "embedding_model": BEDROCK_CONFIG["embeddings"],
                "testset_size": testset_size,
                "num_personas": len(self.personas),
            },
            "test_cases": [],
        }

⚠️ Potential issue | 🟡 Minor

Config metadata always references BEDROCK_CONFIG regardless of provider.

Lines 532-533 hardcode BEDROCK_CONFIG["llm"] and BEDROCK_CONFIG["embeddings"] in the output metadata even when USE_OPENAI is True, producing inaccurate provenance for the generated test set.

Proposed fix
             "config": {
-                "llm_model": BEDROCK_CONFIG["llm"],
-                "embedding_model": BEDROCK_CONFIG["embeddings"],
+                "llm_model": OPENAI_CONFIG["llm"] if USE_OPENAI else BEDROCK_CONFIG["llm"],
+                "embedding_model": OPENAI_CONFIG["embeddings"] if USE_OPENAI else BEDROCK_CONFIG["embeddings"],
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Convert to our standard format
test_data = {
"version": "1.0",
"generated_at": datetime.now().isoformat(),
"generator": "ragas-bedrock-nova",
"config": {
"llm_model": BEDROCK_CONFIG["llm"],
"embedding_model": BEDROCK_CONFIG["embeddings"],
"testset_size": testset_size,
"num_personas": len(self.personas),
},
"test_cases": [],
}
# Convert to our standard format
test_data = {
"version": "1.0",
"generated_at": datetime.now().isoformat(),
"generator": "ragas-bedrock-nova",
"config": {
"llm_model": OPENAI_CONFIG["llm"] if USE_OPENAI else BEDROCK_CONFIG["llm"],
"embedding_model": OPENAI_CONFIG["embeddings"] if USE_OPENAI else BEDROCK_CONFIG["embeddings"],
"testset_size": testset_size,
"num_personas": len(self.personas),
},
"test_cases": [],
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/test_set_generator.py` around lines 526 - 538, The generated
metadata in the test_data dict incorrectly always references BEDROCK_CONFIG for
"llm_model" and "embedding_model"; update the assignment to select the correct
configuration based on the active provider (e.g., use OPENAI_CONFIG when
USE_OPENAI is True, otherwise BEDROCK_CONFIG). Locate the test_data construction
in rag_eval/test_set_generator.py and change the "llm_model" and
"embedding_model" values to pull from the appropriate config variable (reference
BEDROCK_CONFIG, OPENAI_CONFIG and the USE_OPENAI flag or the class/instance
provider field) so provenance reflects the actual provider used.

Comment on lines +608 to +609
test_data = generator.generate_testset(
testset_size=5, # Generate 20 test questions

⚠️ Potential issue | 🟡 Minor

Comment says "Generate 20 test questions" but testset_size=5.

     test_data = generator.generate_testset(
-        testset_size=5,  # Generate 20 test questions
+        testset_size=5,  # Generate 5 test questions
📝 Committable suggestion


Suggested change
test_data = generator.generate_testset(
testset_size=5, # Generate 20 test questions
test_data = generator.generate_testset(
testset_size=5, # Generate 5 test questions
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/test_set_generator.py` around lines 608 - 609, The inline comment
says "Generate 20 test questions" but the call to generator.generate_testset
passes testset_size=5; update them to be consistent by changing the argument to
testset_size=20 (or alternatively update the comment to reflect 5) so that the
call to generator.generate_testset and the comment match; locate the call to
generator.generate_testset and the variable test_data to apply the fix.

…just creating scores as done before. This allows us to track and version golden datasets to evaluate the search system.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (5)
rag_eval/evals.py (2)

896-902: Unused assignment: result is never read.

Static analysis flags result = evaluator.run_experiment(...) as unused (F841). Either use the result or drop the assignment.

Proposed fix
-        result = evaluator.run_experiment(
+        evaluator.run_experiment(
             dataset_name=dataset_name,
             experiment_name=experiment_name,
             max_concurrency=max_concurrency,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/evals.py` around lines 896 - 902, The assignment to result from
evaluator.run_experiment(...) is unused; remove the unused variable or use it.
Either call evaluator.run_experiment(...) without assigning its return value
(drop "result =") if the return is unneeded, or capture and use the returned
value (e.g., log it or return it from the surrounding function) so that result
is referenced; update the try block around evaluator.run_experiment(...)
accordingly to eliminate the F841 unused-assignment warning.

682-691: Fragile positional indexing for metrics (cls._metrics[0], cls._metrics[1]).

compute_faithfulness and compute_context_precision access metrics by hard-coded list index. If the order in _initialize_components (line 597-600) changes or a metric is added before them, these will silently use the wrong metric. Store metrics in a dict keyed by name instead.

Proposed approach

In _initialize_components:

-            cls._metrics = [
-                Faithfulness(),
-                LLMContextPrecisionWithoutReference(),
-            ]
+            cls._metrics = {
+                "faithfulness": Faithfulness(),
+                "context_precision": LLMContextPrecisionWithoutReference(),
+            }
-            init_ragas_metrics(cls._metrics, cls._llm, cls._embeddings)
+            init_ragas_metrics(list(cls._metrics.values()), cls._llm, cls._embeddings)

Then in compute_faithfulness:

-            faithfulness_metric = cls._metrics[0]
+            faithfulness_metric = cls._metrics["faithfulness"]

And in compute_context_precision:

-            context_precision_metric = cls._metrics[1]
+            context_precision_metric = cls._metrics["context_precision"]

Also applies to: 716-725

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/evals.py` around lines 682 - 691, compute_faithfulness and
compute_context_precision currently pick metrics by positional index
(cls._metrics[0], cls._metrics[1]) which is fragile; change
_initialize_components to store metrics in a dict keyed by metric name (e.g.,
{"faithfulness": metric_obj, "context_precision": metric_obj}) and update
compute_faithfulness and compute_context_precision to lookup the appropriate
metric by name from that dict before calling score_with_ragas (and similarly
update the other occurrences around lines 716-725), so the code uses the named
metric entries instead of relying on list order.
rag_eval/mcp_client.py (3)

130-145: Hardcoded "id": 1 reuses the same JSON-RPC request ID as _list_tools.

Both _list_tools (line 88) and query (line 145) use "id": 1. While not a functional bug since these are sequential calls, using distinct or auto-incrementing IDs would improve debuggability of server-side logs and is a JSON-RPC best practice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/mcp_client.py` around lines 130 - 145, The JSON-RPC request ID is
hardcoded to 1 in query (and also used in _list_tools), which can confuse server
logs; modify query (and optionally _list_tools) to generate unique IDs per
request (e.g., use an incrementing counter, UUID, or a timestamp) instead of the
literal "id": 1 so each JSON-RPC payload from query and _list_tools has a
distinct "id" value for better traceability.
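The incrementing-counter option can be sketched with a small mixin — illustrative only, assuming the client builds its own JSON-RPC payload dicts:

```python
import itertools


class RequestIdMixin:
    """Sketch: unique per-request JSON-RPC IDs instead of a hardcoded "id": 1."""

    def __init__(self):
        self._id_counter = itertools.count(1)

    def _next_id(self):
        return next(self._id_counter)

    def build_request(self, method, params=None):
        return {
            "jsonrpc": "2.0",
            "id": self._next_id(),  # distinct per request, easier to trace in server logs
            "method": method,
            "params": params or {},
        }
```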

15-19: Constructor triggers network I/O, preventing unit testing and error recovery.

_initialize_session() is called unconditionally in __init__, which means the constructor itself performs network calls. This makes it impossible to construct the client without a live server, complicating testing. Consider lazy initialization or an explicit connect() method.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/mcp_client.py` around lines 15 - 19, The constructor currently calls
_initialize_session() which performs network I/O and prevents creating MCPClient
instances in tests; remove the unconditional call to _initialize_session() from
__init__ (keep self.session_id: Optional[str] = None and client initialization)
and implement an explicit connect() method (or lazy ensure_session() helper)
that performs the session initialization and sets self.session_id by calling
_initialize_session(); update public methods that need a session (e.g., any
send/ request methods) to either call connect()/ensure_session() at the start or
raise a clear error if session is missing, so tests can construct MCPClient
without a live server and control when network I/O happens.
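The lazy `ensure_session()` pattern can be sketched like this — `initialize_session` below is a stand-in for the real `_initialize_session()`, injected so the example stays self-contained:

```python
class LazyClient:
    """Sketch: defer session setup so construction does no network I/O."""

    def __init__(self, server_url, initialize_session):
        self.server_url = server_url
        self._initialize_session = initialize_session
        self.session_id = None  # no network call in __init__

    def ensure_session(self):
        """Open the session on first use; safe to call repeatedly."""
        if self.session_id is None:
            self.session_id = self._initialize_session()
        return self.session_id

    def query(self, payload):
        self.ensure_session()  # network happens here, not at construction
        return {"session": self.session_id, "payload": payload}
```

Tests can now construct the client with a fake initializer and control exactly when (and whether) the session handshake runs.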

184-192: Last SSE data line wins — silent overwrite of earlier results.

The loop overwrites result_data on every valid data: line. If the server sends multiple JSON-RPC results (e.g., progress notifications followed by a final result), earlier data is silently discarded. This works if the server always sends the final result last, but the intent isn't obvious. A brief comment or logging the overwrite count would clarify the contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/mcp_client.py` around lines 184 - 192, The SSE loop currently
overwrites result_data for every "data: " line which silently discards earlier
JSON payloads; update the loop in rag_eval/mcp_client.py (the
response.iter_lines() handling and result_data variable) to either (a) only set
result_data if it is currently None to preserve the first result, or (b) keep
overwriting but add a counter and emit a warning via the existing logger when an
overwrite occurs (include the count and a brief context string), or (c) merge
expected fields if multiple partial JSON messages are sent — pick the strategy
that matches the RPC contract and add a short comment above the loop describing
the chosen contract (first wins / last wins / merged). Ensure JSON decoding
errors still skip lines as before.
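Option (b) — keep last-wins but log overwrites — can be sketched over a plain list of SSE lines (a simplification of the real `response.iter_lines()` handling):

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_sse_lines(lines):
    """Sketch: last data line wins, but overwrites are counted and logged."""
    result_data = None
    overwrites = 0
    # Contract: the server sends the final JSON-RPC result last ("last wins").
    for line in lines:
        if not line.startswith("data: "):
            continue
        try:
            payload = json.loads(line[len("data: "):])
        except json.JSONDecodeError:
            continue  # skip malformed lines, as before
        if result_data is not None:
            overwrites += 1
        result_data = payload
    if overwrites:
        logger.warning("SSE result overwritten %d time(s); kept last payload", overwrites)
    return result_data
```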
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rag_eval/evals.py`:
- Around line 127-132: The constructor accepts llm_model and temperature but
never uses them; store these values on self in the class __init__ and pass them
into SingleEvaluation (or into SingleEvaluation._initialize_components) instead
of the hardcoded "amazon.nova-pro-v1:0" and 0.01. Concretely: save
self.llm_model and self.temperature in the constructor, and when creating or
initializing SingleEvaluation (refer to SingleEvaluation._initialize_components
and the code path that currently hardcodes the model/temperature), replace the
literals with the stored self.llm_model and self.temperature so the configured
values are actually applied.
- Around line 29-38: The module currently executes side effects at import
(logging.basicConfig, load_dotenv(override=True), and nest_asyncio.apply())
which will affect any importer (e.g., when using SingleEvaluation); move these
calls out of the top level and into a guarded entrypoint such as a main()
function (define main() if missing) and call them from if __name__ ==
"__main__": main() so importing rag_eval.evals no longer reconfigures logging,
overrides env vars, or patches the event loop; update references to
logging.basicConfig, load_dotenv, and nest_asyncio.apply to be only inside that
main()/entrypoint.
- Around line 210-257: Both evaluators currently duplicate the expensive call to
SingleEvaluation.compute_dataset_relevance_scores (in
avg_dataset_relevance_evaluator and max_dataset_relevance_evaluator); change to
a single combined evaluator that calls
SingleEvaluation.compute_dataset_relevance_scores once (using question and
collections), extract avg_relevance and max_relevance from the returned scores,
and return a list containing two Evaluation objects
(name="avg_dataset_relevance" and name="max_dataset_relevance") with appropriate
value and comment fields; ensure you preserve the existing None checks and
logging behavior and that run_experiment will accept the returned list of
Evaluations.

---

Duplicate comments:
In `@rag_eval/evals.py`:
- Around line 522-526: The code assumes get_langfuse() always returns a client
but it can return None; update the block around langfuse = get_langfuse() to
check for a None return before calling langfuse.get_dataset(dataset_name), and
handle the failure by logging an error (including any context) and either
raising an exception or exiting gracefully; reference get_langfuse and the
langfuse variable so the guard immediately after initialization prevents the
AttributeError when calling langfuse.get_dataset.
- Around line 628-647: The try/except in the scoring loop causes
collection_scores to be shorter than collections when an exception occurs,
breaking alignment; in the except block for the loop that builds
collection_scores (which uses DatasetRelevanceInput and
cls._relevance_prompt.generate()) append a sentinel (e.g., None or float('nan'))
to collection_scores before continuing and keep the logger.warning call so the
index positions remain aligned with collections for any downstream zips or
lookups.
- Around line 730-738: The return type hint for evaluate_single is wrong: update
the signature of classmethod evaluate_single to reflect it returns a mapping of
metric names to objects with keys "value", "comment", and "data_type" (e.g.,
dict[str, dict] or a more precise typing like dict[str, dict[str, Any]] or a
TypedDict); change the return annotation from dict[str, float] to the chosen
correct type and adjust any related docstring/comments and imports
(typing/TypedDict) as needed so callers and static checkers reflect the actual
dict[str, dict] structure produced by evaluate_single.

In `@rag_eval/mcp_client.py`:
- Around line 80-82: The except block that catches exceptions during session
initialization currently only logs (logger.error in rag_eval.mcp_client.py)
which leaves the client in a broken state; modify the handler to either re-raise
the caught exception (e.g., raise) after logging or set an explicit failure
flag/attribute (e.g., self._init_failed = True and store the exception in
self._init_exception) and then update the query() method to check that flag (or
re-raise the stored exception) before proceeding; locate the session init logic
(the try/except around session creation and the query() method) and implement
one of these two fixes so callers get an immediate, clear failure instead of
confusing downstream errors.
- Around line 253-265: The code assumes collection.get("similarity_score", 0.0)
yields a number but if the key exists with value None the format f"{score:.3f}"
will raise TypeError; fix by normalizing the value before formatting (e.g.,
retrieve score = collection.get("similarity_score") and then set score = 0.0 if
score is None or not a number, or coerce via try: score = float(score) except:
score = 0.0) so the f"{score:.3f}" call never receives None; update the block
that sets score and any use in answer_parts formatting to use the normalized
numeric score.

In `@rag_eval/models.py`:
- Around line 7-13: The relevance_score field on DatasetRelevanceScore currently
accepts any float; add validation to enforce the expected 0.0–1.0 range so
out-of-range LLM outputs cannot corrupt downstream logic. Update the
relevance_score Field declaration on the DatasetRelevanceScore model to include
bounds (e.g., ge=0.0 and le=1.0) or add a Pydantic validator on relevance_score
that raises a ValidationError if the value is <0.0 or >1.0; keep the same
description text when making this change.

---

Nitpick comments:
In `@rag_eval/evals.py`:
- Around line 896-902: The assignment to result from
evaluator.run_experiment(...) is unused; remove the unused variable or use it.
Either call evaluator.run_experiment(...) without assigning its return value
(drop "result =") if the return is unneeded, or capture and use the returned
value (e.g., log it or return it from the surrounding function) so that result
is referenced; update the try block around evaluator.run_experiment(...)
accordingly to eliminate the F841 unused-assignment warning.
- Around line 682-691: compute_faithfulness and compute_context_precision
currently pick metrics by positional index (cls._metrics[0], cls._metrics[1])
which is fragile; change _initialize_components to store metrics in a dict keyed
by metric name (e.g., {"faithfulness": metric_obj, "context_precision":
metric_obj}) and update compute_faithfulness and compute_context_precision to
lookup the appropriate metric by name from that dict before calling
score_with_ragas (and similarly update the other occurrences around lines
716-725), so the code uses the named metric entries instead of relying on list
order.

In `@rag_eval/mcp_client.py`:
- Around line 130-145: The JSON-RPC request ID is hardcoded to 1 in query (and
also used in _list_tools), which can confuse server logs; modify query (and
optionally _list_tools) to generate unique IDs per request (e.g., use an
incrementing counter, UUID, or a timestamp) instead of the literal "id": 1 so
each JSON-RPC payload from query and _list_tools has a distinct "id" value for
better traceability.
- Around line 15-19: The constructor currently calls _initialize_session() which
performs network I/O and prevents creating MCPClient instances in tests; remove
the unconditional call to _initialize_session() from __init__ (keep
self.session_id: Optional[str] = None and client initialization) and implement
an explicit connect() method (or lazy ensure_session() helper) that performs the
session initialization and sets self.session_id by calling
_initialize_session(); update public methods that need a session (e.g., any
send/ request methods) to either call connect()/ensure_session() at the start or
raise a clear error if session is missing, so tests can construct MCPClient
without a live server and control when network I/O happens.
- Around line 184-192: The SSE loop currently overwrites result_data for every
"data: " line which silently discards earlier JSON payloads; update the loop in
rag_eval/mcp_client.py (the response.iter_lines() handling and result_data
variable) to either (a) only set result_data if it is currently None to preserve
the first result, or (b) keep overwriting but add a counter and emit a warning
via the existing logger when an overwrite occurs (include the count and a brief
context string), or (c) merge expected fields if multiple partial JSON messages
are sent — pick the strategy that matches the RPC contract and add a short
comment above the loop describing the chosen contract (first wins / last wins /
merged). Ensure JSON decoding errors still skip lines as before.
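For the `DatasetRelevanceScore` bound mentioned above, the PR's model uses Pydantic `Field(ge=0.0, le=1.0)`; the same guard can be sketched with only the standard library — a stand-in, not the actual model:

```python
from dataclasses import dataclass


@dataclass
class DatasetRelevanceScore:
    """Stdlib sketch of the bound Pydantic's Field(ge=0.0, le=1.0) enforces."""

    dataset_id: str
    relevance_score: float

    def __post_init__(self):
        # Reject out-of-range LLM outputs before they corrupt downstream averages.
        if not 0.0 <= self.relevance_score <= 1.0:
            raise ValueError(
                f"relevance_score must be in [0.0, 1.0], got {self.relevance_score}"
            )
```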

Comment on lines +29 to +38
logging.basicConfig(
level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Load environment variables from .env file (override existing ones)
load_dotenv(override=True)

# Apply nest_asyncio to allow nested event loops (for PydanticPrompt.generate in evaluators)
nest_asyncio.apply()

⚠️ Potential issue | 🟠 Major

Module-level side effects will fire on import.

logging.basicConfig(), load_dotenv(override=True), and nest_asyncio.apply() execute at import time. Any module importing rag_eval.evals (e.g., for SingleEvaluation) will have its logging reconfigured, env vars overridden, and the event loop patched. These should be guarded behind if __name__ == "__main__" or moved into main().

Proposed fix — move side effects into main()
-# Configure logging
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
 logger = logging.getLogger(__name__)
 
-# Load environment variables from .env file (override existing ones)
-load_dotenv(override=True)
-
-# Apply nest_asyncio to allow nested event loops (for PydanticPrompt.generate in evaluators)
-nest_asyncio.apply()

Then in main():

def main():
    """Main entry point for running evaluations."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )
    load_dotenv(override=True)
    nest_asyncio.apply()
    # ... rest of main
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/evals.py` around lines 29 - 38, The module currently executes side
effects at import (logging.basicConfig, load_dotenv(override=True), and
nest_asyncio.apply()) which will affect any importer (e.g., when using
SingleEvaluation); move these calls out of the top level and into a guarded
entrypoint such as a main() function (define main() if missing) and call them
from if __name__ == "__main__": main() so importing rag_eval.evals no longer
reconfigures logging, overrides env vars, or patches the event loop; update
references to logging.basicConfig, load_dotenv, and nest_asyncio.apply to be
only inside that main()/entrypoint.

Comment on lines +127 to +132
def __init__(
self,
mcp_server_url: str,
llm_model: str = "amazon.nova-pro-v1:0",
temperature: float = 0.01,
trace_name: str = "rag",

⚠️ Potential issue | 🟡 Minor

llm_model and temperature constructor parameters are accepted but never used.

These parameters suggest the intent to configure the evaluation LLM, but they are not passed to SingleEvaluation or any other component. SingleEvaluation._initialize_components hardcodes "amazon.nova-pro-v1:0" and temperature=0.01 (line 588-591). Either wire these through or remove them from the constructor signature.

Proposed fix — remove unused params
     def __init__(
         self,
         mcp_server_url: str,
-        llm_model: str = "amazon.nova-pro-v1:0",
-        temperature: float = 0.01,
         trace_name: str = "rag",
     ):
📝 Committable suggestion


Suggested change
def __init__(
self,
mcp_server_url: str,
llm_model: str = "amazon.nova-pro-v1:0",
temperature: float = 0.01,
trace_name: str = "rag",
def __init__(
self,
mcp_server_url: str,
trace_name: str = "rag",
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 130-130: Unused method argument: llm_model

(ARG002)


[warning] 131-131: Unused method argument: temperature

(ARG002)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/evals.py` around lines 127 - 132, The constructor accepts llm_model
and temperature but never uses them; store these values on self in the class
__init__ and pass them into SingleEvaluation (or into
SingleEvaluation._initialize_components) instead of the hardcoded
"amazon.nova-pro-v1:0" and 0.01. Concretely: save self.llm_model and
self.temperature in the constructor, and when creating or initializing
SingleEvaluation (refer to SingleEvaluation._initialize_components and the code
path that currently hardcodes the model/temperature), replace the literals with
the stored self.llm_model and self.temperature so the configured values are
actually applied.
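The wiring the prompt asks for can be sketched as follows — `init_components` and its `initialize` callable are stand-ins for the real path through `SingleEvaluation._initialize_components`:

```python
class EarthdataEvaluator:
    """Sketch: store constructor params and hand them to the scoring component."""

    def __init__(self, mcp_server_url, llm_model="amazon.nova-pro-v1:0",
                 temperature=0.01, trace_name="rag"):
        self.mcp_server_url = mcp_server_url
        self.llm_model = llm_model        # previously accepted but ignored
        self.temperature = temperature    # previously accepted but ignored
        self.trace_name = trace_name

    def init_components(self, initialize):
        # The stored values replace the hardcoded "amazon.nova-pro-v1:0" / 0.01
        # literals when the evaluation LLM is created.
        return initialize(model=self.llm_model, temperature=self.temperature)
```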

Comment on lines +210 to +257
def avg_dataset_relevance_evaluator(*, output, **kwargs):
"""Evaluate average dataset relevance."""
try:
question = output.get("question", "")
collections = output.get("collections", [])

logger.info(
f"Computing avg relevance for {len(collections)} collections"
)

scores = SingleEvaluation.compute_dataset_relevance_scores(
question=question,
collections=collections,
)

avg_value = scores.get("avg_relevance")
logger.info(f"Avg relevance computed: {avg_value}")

if avg_value is None:
logger.warning("avg_relevance is None, skipping evaluation")
return None

return Evaluation(
name="avg_dataset_relevance",
value=avg_value,
comment=f"Average relevance: {avg_value:.3f}",
)
except Exception as e:
logger.error(
f"Error in avg_dataset_relevance_evaluator: {e}", exc_info=True
)
return None

# Evaluator for max dataset relevance
def max_dataset_relevance_evaluator(*, output, **kwargs):
"""Evaluate max dataset relevance."""
try:
question = output.get("question", "")
collections = output.get("collections", [])

logger.info(
f"Computing max relevance for {len(collections)} collections"
)

scores = SingleEvaluation.compute_dataset_relevance_scores(
question=question,
collections=collections,
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's understand the file structure and locate the relevant code
find . -type f -name "*.py" | grep -E "(evals|test)" | head -20

Repository: nasa/earthdata-mcp

Length of output: 770


🏁 Script executed:

# Check the evals.py file to understand the context and how evaluators are used
wc -l rag_eval/evals.py

Repository: nasa/earthdata-mcp

Length of output: 83


🏁 Script executed:

# Look for Langfuse imports and usage patterns
rg "run_experiment|Langfuse" --type py -A 3 -B 3 | head -100

Repository: nasa/earthdata-mcp

Length of output: 4170


🏁 Script executed:

# Search for how evaluators are registered/passed to run_experiment
rg "avg_dataset_relevance_evaluator|max_dataset_relevance_evaluator" --type py -A 2 -B 2

Repository: nasa/earthdata-mcp

Length of output: 1520


🏁 Script executed:

# Check if there are any evaluators that return lists of Evaluation objects
rg "return.*\[.*Evaluation" --type py -A 1 -B 1

Repository: nasa/earthdata-mcp

Length of output: 44


🌐 Web query:

Langfuse run_experiment evaluator return type multiple Evaluation objects list

💡 Result:

In langfuse.run_experiment(...), an item-level evaluator is allowed to return either:

  • a single Evaluation (or a single dict shaped like an Evaluation), or
  • a list of Evaluation objects (or list[dict]), or
  • an async/awaitable that resolves to one of the above.

Internally, Langfuse normalizes whatever you return into a List[Evaluation] like this:

  • if you return a single Evaluation/dict → it wraps it as [result]
  • if you return a list → it uses it as-is
  • otherwise → it treats it as [] (no evaluations) [1]

So yes: returning multiple evaluations is supported—just return [Evaluation(...), Evaluation(...), ...]. The docs even include a “Multi-metric evaluator” example that returns a list. [1][2]


Combine evaluators to eliminate duplicate scoring calls — Langfuse supports returning multiple Evaluations.

avg_dataset_relevance_evaluator and max_dataset_relevance_evaluator both independently call SingleEvaluation.compute_dataset_relevance_scores() with identical question and collections. This doubles the cost and latency per test case when compute_dataset_relevance_scores performs LLM-based scoring.

Merge them into a single evaluator returning both scores as a list—Langfuse's run_experiment() normalizes single Evaluation objects and lists of evaluations identically, so returning [avg_evaluation, max_evaluation] is fully supported.

Suggested approach — single evaluator returning both scores
-        # Evaluator for average dataset relevance
-        def avg_dataset_relevance_evaluator(*, output, **kwargs):
-            ...
-
-        # Evaluator for max dataset relevance
-        def max_dataset_relevance_evaluator(*, output, **kwargs):
-            ...
+        # Combined evaluator for dataset relevance (avg + max)
+        def dataset_relevance_evaluator(*, output, **kwargs):
+            """Evaluate dataset relevance (avg and max)."""
+            try:
+                question = output.get("question", "")
+                collections = output.get("collections", [])
+
+                scores = SingleEvaluation.compute_dataset_relevance_scores(
+                    question=question,
+                    collections=collections,
+                )
+
+                results = []
+                avg_value = scores.get("avg_relevance")
+                if avg_value is not None:
+                    results.append(Evaluation(
+                        name="avg_dataset_relevance",
+                        value=avg_value,
+                        comment=f"Average relevance: {avg_value:.3f}",
+                    ))
+
+                max_value = scores.get("max_relevance")
+                if max_value is not None:
+                    results.append(Evaluation(
+                        name="max_dataset_relevance",
+                        value=max_value,
+                        comment=f"Max relevance: {max_value:.3f}",
+                    ))
+
+                return results if results else None
+            except Exception as e:
+                logger.error(f"Error in dataset_relevance_evaluator: {e}", exc_info=True)
+                return None
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 210-210: Unused function argument: kwargs

(ARG001)


[warning] 244-244: Unused function argument: kwargs

(ARG001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rag_eval/evals.py` around lines 210 - 257, Both evaluators currently
duplicate the expensive call to
SingleEvaluation.compute_dataset_relevance_scores (in
avg_dataset_relevance_evaluator and max_dataset_relevance_evaluator); change to
a single combined evaluator that calls
SingleEvaluation.compute_dataset_relevance_scores once (using question and
collections), extract avg_relevance and max_relevance from the returned scores,
and return a list containing two Evaluation objects
(name="avg_dataset_relevance" and name="max_dataset_relevance") with appropriate
value and comment fields; ensure you preserve the existing None checks and
logging behavior and that run_experiment will accept the returned list of
Evaluations.


```
rag_eval/
├── README.md # This file
Contributor

These directory listings are nice but sort of hard to maintain. It looks like these are already out of date 😅

# Choose which provider to use
USE_OPENAI = os.getenv("USE_OPENAI", "false").lower() == "true"

if USE_OPENAI:
Contributor

Is this code useful at all if we're not using OpenAI?

Contributor

I might need another run-through of this file. Not sure if we covered this when we paired originally. Just want to make sure we all have an understanding of how this generation is working and what it's being used for.
