Conversation
📝 Walkthrough

Adds a new rag_eval package providing end-to-end RAG evaluation for Earthdata MCP: evaluation orchestration, an MCP client, Ragas metric utilities, test-set generation, models, a sample dataset, Langfuse helpers, and packaging/dependency updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Evaluator as EarthdataEvaluator
    participant Dataset as Dataset Storage
    participant MCPClient as EarthdataRAGClient
    participant Metrics as SingleEvaluation
    participant Langfuse as Langfuse
    User->>Evaluator: run_experiment(dataset_name)
    Evaluator->>Dataset: load_dataset(name)
    Dataset-->>Evaluator: test cases
    loop per test case
        Evaluator->>Langfuse: start_trace()
        Langfuse-->>Evaluator: trace_id
        Evaluator->>MCPClient: query(question)
        MCPClient-->>Evaluator: {collections, answer}
        Evaluator->>Metrics: evaluate_single(question, collections, contexts, answer)
        Metrics-->>Evaluator: metric scores
        Evaluator->>Langfuse: create_score(scores)
    end
    Evaluator->>Langfuse: flush()
    Evaluator-->>User: aggregated_results
```
```mermaid
sequenceDiagram
    participant User
    participant Generator as EarthdataTestSetGenerator
    participant DB as PostgreSQL
    participant KG as KnowledgeGraph
    participant LLM as Bedrock/OpenAI
    participant Output as JSON
    User->>Generator: generate_testset(size)
    Generator->>DB: load_all_entities_from_db()
    DB-->>Generator: entities
    Generator->>KG: create_knowledge_graph(entities)
    Generator->>KG: enrich_knowledge_graph()
    KG-->>Generator: enriched_kg
    Generator->>LLM: synthesize_test_cases(personas, enriched_kg)
    LLM-->>Generator: test_cases
    Generator->>Output: save_json(test_cases)
    Output-->>User: testset file
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```diff
@@            Coverage Diff            @@
##             main      #30      +/-  ##
==========================================
- Coverage   88.18%   76.55%   -11.63%
==========================================
  Files          50       57        +7
  Lines        2361     2952      +591
==========================================
+ Hits         2082     2260      +178
- Misses        279      692      +413
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 18
🧹 Nitpick comments (12)
rag_eval/ragas_utils.py (2)
95-95: Use lazy `%s` formatting in logger calls instead of f-strings. f-strings are evaluated eagerly regardless of log level.
Suggested fix

```diff
- logger.info(f"Calculating {m.name}...")
+ logger.info("Calculating %s...", m.name)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/ragas_utils.py` at line 95, Replace the eager f-string logger call with lazy %-style formatting to avoid evaluating m.name when the log level would skip it; specifically change the logger.info(f"Calculating {m.name}...") call to use logger.info("Calculating %s...", m.name) (referencing the same logger.info and m.name symbols in ragas_utils.py).
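To see why this matters, here is a minimal standalone sketch (not the project's logger setup) showing that the f-string pays the formatting cost even when the record is discarded, while the `%s` form defers formatting until the level check passes:

```python
import logging

logging.basicConfig(level=logging.WARNING)  # INFO records will be discarded
logger = logging.getLogger("demo")

class Expensive:
    """Tracks whether __str__ was ever evaluated."""
    rendered = False

    def __str__(self):
        Expensive.rendered = True
        return "expensive"

obj = Expensive()

# f-string: Python formats the message *before* calling the logger,
# so __str__ runs even though the INFO record is dropped.
logger.info(f"Calculating {obj}...")
assert Expensive.rendered

# Lazy %s style: logging checks the level first and never formats.
Expensive.rendered = False
logger.info("Calculating %s...", obj)
assert not Expensive.rendered
```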
57-72: `init_ragas_metrics` creates a new `RunConfig` per metric — likely intended to be shared.

A single `RunConfig` instance could be created outside the loop and reused for all metrics. This is a minor efficiency and clarity improvement.

Suggested fix

```diff
 def init_ragas_metrics(metrics, llm, embedding):
+    run_config = RunConfig()
     for metric in metrics:
         if isinstance(metric, MetricWithLLM):
             metric.llm = llm
         if isinstance(metric, MetricWithEmbeddings):
             metric.embeddings = embedding
-        run_config = RunConfig()
         metric.init(run_config)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/ragas_utils.py` around lines 57 - 72, The function init_ragas_metrics currently constructs a new RunConfig() inside the loop for every metric; instead, create one RunConfig instance before the for loop and pass that same run_config to each metric.init(...) so all metrics share the same RunConfig; keep the existing handling where you set metric.llm for MetricWithLLM and metric.embeddings for MetricWithEmbeddings, but move the RunConfig allocation out of the loop in init_ragas_metrics and call metric.init(run_config) for each metric.

rag_eval/models.py (1)
29-38: Trailing whitespace on line 30 inside the instruction string.

Line 30 (after `"...user's question."`) appears to have trailing whitespace before the blank line. While this won't affect functionality, it will be embedded in the prompt text sent to the LLM.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/models.py` around lines 29 - 38, The instruction string assigned to variable instruction in rag_eval/models.py contains trailing whitespace after "user's question." — remove that trailing space inside the triple-quoted string (or trim the string when assigning, e.g., use .strip()) so the prompt text sent to the LLM has no embedded trailing whitespace; update the instruction assignment in the instruction variable to eliminate the extra space.

rag_eval/pyproject.toml (1)
27-28: Empty `py-modules` suppresses auto-discovery, but no `packages` is declared.

With `py-modules = []` and no `[tool.setuptools.packages]` or `[tool.setuptools.packages.find]`, setuptools may not discover any modules at all when this package is installed. If this is intended only for `uv sync` / editable installs it might work, but a `pip install .` may produce an empty package.

Suggested fix

```diff
 [tool.setuptools]
-py-modules = []
+packages = ["rag_eval"]
```

Or add auto-discovery:

```toml
[tool.setuptools.packages.find]
where = ["."]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/pyproject.toml` around lines 27 - 28, The build config currently sets py-modules = [] which disables module auto-discovery and there are no package declarations, so pip installs can produce an empty distribution; fix by either removing the empty py-modules entry or adding a packages discovery block (e.g. add [tool.setuptools.packages.find] with where = ["."], or explicitly declare package names under [tool.setuptools.packages]) so setuptools can find and include your package sources (look for the py-modules setting and update it or add the packages.find section).

rag_eval/evals.py (5)
3-3: Unused `asyncio` import.

`asyncio` is imported but never used anywhere in this file.

Proposed fix

```diff
-import asyncio
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/evals.py` at line 3, The file imports asyncio but never uses it; remove the unused import statement (the "import asyncio" line at the top of rag_eval/evals.py) to eliminate the dead dependency and any linter warnings, leaving other imports and code unchanged.
125-125: f-string without placeholders. Static analysis flags this as F541.

```diff
- logger.info(f"Dataset loaded successfully")
+ logger.info("Dataset loaded successfully")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/evals.py` at line 125, The log statement uses an unnecessary f-string which triggers F541; change the call to use a normal string literal in the logger.info invocation (replace the f-string used in the logger.info("Dataset loaded successfully") call with a plain string) so no f-prefix is present—locate the logger.info usage in rag_eval/evals.py and remove the f-prefix.
106-114: `_score_dataset` is dead code — `evaluate()` delegates to `SingleEvaluation.evaluate_single` instead.

This method is defined on `EarthdataEvaluator` but never called. The actual per-dataset scoring happens inside `SingleEvaluation.evaluate_single` (lines 342-350), which duplicates the same logic. Consider removing this method or having `SingleEvaluation` delegate to it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/evals.py` around lines 106 - 114, The _score_dataset method on EarthdataEvaluator is dead code because evaluate() uses SingleEvaluation.evaluate_single to score datasets; remove the duplication by either deleting _score_dataset or refactoring SingleEvaluation.evaluate_single to call EarthdataEvaluator._score_dataset (or vice versa). Concretely, choose one implementation of the scoring logic (the code that builds DatasetRelevanceInput and calls relevance_prompt.generate with llm and returns relevance_score) and update SingleEvaluation.evaluate_single or EarthdataEvaluator._score_dataset to delegate to that single function (keep symbols: EarthdataEvaluator, _score_dataset, SingleEvaluation.evaluate_single, relevance_prompt.generate, llm) and remove the redundant copy.
186-208: `zip()` without `strict=True` silently drops trailing elements.

As noted in the sibling comment about misaligned scores, using `strict=True` here would at least surface the length mismatch as a `ValueError` rather than silently producing wrong mappings.

Proposed fix (once the sentinel-on-failure fix is applied)

```diff
-        for collection_idx, (collection, score) in enumerate(
-            zip(collections, individual_scores)
-        ):
+        for collection_idx, (collection, score) in enumerate(
+            zip(collections, individual_scores, strict=True)
+        ):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/evals.py` around lines 186 - 208, The loop pairing collections and individual_scores uses zip without strict=True which silently drops trailing items; update the pairing in the block that enumerates zip(collections, individual_scores) to use zip(collections, individual_scores, strict=True) so a length mismatch raises a ValueError and surfaces the misalignment (this change applies to the loop that creates the trace via trace.start_as_current_observation and calls scoring_span.score); if Python versions in CI must support this, alternatively assert len(collections)==len(individual_scores) before the loop to explicitly fail on mismatch.
281-312: Singleton `SingleEvaluation` is not thread-safe and duplicates `EarthdataEvaluator` initialization.

The class-level singleton pattern (`_llm`, `_embeddings`, etc.) has no synchronization, so concurrent callers could race on `_initialize_components`. Additionally, when used from `EarthdataEvaluator.evaluate()`, the evaluator creates its own LLM/embeddings (lines 54-59) while `SingleEvaluation` lazily creates a separate set with potentially different `max_tokens` (10000 vs 4096 on line 302). This means two Bedrock LLM instances exist with different configs, which may cause confusion or unexpected scoring differences.

Consider either:

- Having `SingleEvaluation` accept injected components, or
- Removing the duplicate initialization in `EarthdataEvaluator`

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/evals.py` around lines 281 - 312, The singleton init in SingleEvaluation (_initialize_components) is racy and duplicates EarthdataEvaluator.evaluate's LLM/embeddings (different max_tokens: 4096 vs 10000); make the singleton thread-safe and stop duplicate creation by 1) adding a class-level threading.Lock and surrounding the body of SingleEvaluation._initialize_components with lock acquisition to ensure only one init runs, 2) allow injection of components into SingleEvaluation (add optional parameters llm, embeddings, relevance_prompt, metrics to the initializer or a new classmethod) and prefer injected components over creating new ones, and 3) update EarthdataEvaluator.evaluate to pass its created LLM/embeddings into SingleEvaluation instead of creating separate instances (and consolidate the create_bedrock_llm max_tokens setting into one place so both use the same value).

rag_eval/test_set_generator.py (2)
16-18: `sys` is imported but never used.

```diff
 import json
 import os
-import sys
 from datetime import datetime
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/test_set_generator.py` around lines 16 - 18, The file imports the unused module sys; remove the unused "import sys" statement from the top of the file (in the import block alongside json and os) to clean up the namespace and satisfy linting, or if sys was intended to be used, replace the removal by actually using sys in the relevant function (e.g., within test set generation code) and reference the symbol sys where needed.
428-463: Multiple unused variable assignments flagged by static analysis.

`headline_extractor`, `headline_splitter`, `keyphrase_extractor`, and `keyphrase_similarity_builder` are assigned but never used (all commented out in `transforms`). Remove the assignments or prefix with `_` to signal intentional non-use.

Proposed cleanup — comment out the unused assignments as well

```diff
-        headline_extractor = HeadlinesExtractor(llm=self.generator_llm, max_num=20)
-        headline_splitter = HeadlineSplitter(max_tokens=1500)
+        # headline_extractor = HeadlinesExtractor(llm=self.generator_llm, max_num=20)
+        # headline_splitter = HeadlineSplitter(max_tokens=1500)
         topic_extractor = TopicDescriptionExtractor(llm=self.generator_llm)
         # ner_extractor = NERExtractor(llm=self.generator_llm)
-        keyphrase_extractor = KeyphrasesExtractor(llm=self.generator_llm)
+        # keyphrase_extractor = KeyphrasesExtractor(llm=self.generator_llm)

         # Step 2: Build relationships between nodes based on extracted properties
         topic_similarity_builder = JaccardSimilarityBuilder(
@@ ...
-        keyphrase_similarity_builder = JaccardSimilarityBuilder(
-            property_name="keyphrases",
-            new_property_name="keyphrase_similarity",
-            threshold=0.1,
-        )
+        # keyphrase_similarity_builder = JaccardSimilarityBuilder(
+        #     property_name="keyphrases",
+        #     new_property_name="keyphrase_similarity",
+        #     threshold=0.1,
+        # )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/test_set_generator.py` around lines 428 - 463, Several variables are assigned but never used—headline_extractor, headline_splitter, keyphrase_extractor, and keyphrase_similarity_builder—triggering static analysis warnings; either remove these unused assignments or rename them with a leading underscore (e.g., _headline_extractor) to mark intentional non-use, and keep the active transforms list intact (referencing HeadlinesExtractor, HeadlineSplitter, KeyphrasesExtractor, and JaccardSimilarityBuilder instances) so the linter no longer reports unused-variable issues.

rag_eval/mcp_client.py (1)
186-195: SSE parsing silently takes the last `data:` line, discarding earlier ones.

The loop overwrites `result_data` on each successful parse, so only the final SSE `data:` line is used. If the server sends multiple JSON events (e.g., progress + final result), intermediate data is silently dropped. This may be intentional, but if the server sends the actual result in an earlier event and a status/summary in the last, you'd lose the result.

Consider either documenting this "last-wins" behavior or accumulating events:

```diff
-        result_data = None
-        for line in response.text.split("\n"):
-            if line.startswith("data: "):
-                data = line[6:].strip()  # Remove 'data: ' prefix
-                if data and data != "[DONE]":
-                    try:
-                        result_data = json.loads(data)
-                    except json.JSONDecodeError:
-                        continue
+        # SSE may contain multiple data lines; use the last valid JSON event
+        result_data = None
+        for line in response.text.split("\n"):
+            if line.startswith("data: "):
+                data = line[6:].strip()
+                if data and data != "[DONE]":
+                    try:
+                        result_data = json.loads(data)
+                    except json.JSONDecodeError:
+                        continue
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/mcp_client.py` around lines 186 - 195, The SSE parsing loop currently overwrites result_data for each parsed "data: " line (using response.text and json.loads), losing earlier events; change this to accumulate parsed events (e.g., create a list like parsed_events and append each successful json.loads(data)) instead of assigning to result_data, then decide after the loop whether to use the first/last/merged event or return the list—update any downstream use of result_data accordingly (or, if intended behavior is to keep the first event, set result_data only when it is None).
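For reference, a self-contained sketch of the accumulate-then-choose approach suggested above (the `parse_sse_events` helper name is hypothetical, not the project's API):

```python
import json

def parse_sse_events(body: str) -> list:
    """Collect every JSON payload from an SSE response body
    instead of keeping only the last one."""
    events = []
    for line in body.split("\n"):
        if line.startswith("data: "):
            data = line[6:].strip()  # Remove 'data: ' prefix
            if data and data != "[DONE]":
                try:
                    events.append(json.loads(data))
                except json.JSONDecodeError:
                    continue  # skip malformed events
    return events

body = (
    'data: {"status": "progress"}\n'
    'data: {"answer": "final result"}\n'
    "data: [DONE]\n"
)
events = parse_sse_events(body)
assert events == [{"status": "progress"}, {"answer": "final result"}]
# The caller can now choose events[-1] for last-wins, events[0], or a merge.
```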
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Around line 24-25: The dependencies "ragas" and "litellm" are evaluation-only
and should be removed from the top-level runtime dependencies in pyproject.toml;
instead add them under an optional dependency group (e.g.,
project.optional-dependencies or tool.poetry.extras) as an "rag_eval" or
"evaluation" extra so only users who need the rag_eval/ package install them;
update references to the dependency names "ragas" and "litellm" accordingly and
ensure the main MCP runtime dependency list no longer contains these two
entries.
In `@rag_eval/__init__.py`:
- Around line 1-3: The package version constant is inconsistent: update the
__version__ variable in rag_eval.__init__ to match pyproject.toml by changing
__version__ = "1.0.0" to __version__ = "0.1.0"; verify the value matches
pyproject.toml and keep them synchronized going forward (update whichever file
should be authoritative if different).
In `@rag_eval/evals.py`:
- Around line 314-321: The return type hint for evaluate_single is wrong: it
currently declares dict[str, float] but actually returns a mapping whose values
are dicts with keys "value", "comment", and "data_type". Fix the signature of
evaluate_single to reflect that structure (e.g., dict[str, dict[str, Any]] or
define a small TypedDict/NamedTuple for the result shape and use dict[str,
YourTypedDict]) and update any related imports; ensure the chosen type
explicitly documents the inner keys "value", "comment", and "data_type".
- Around line 340-360: The current loop building collection_scores skips
appending when an exception occurs during cls._relevance_prompt.generate (or in
DatasetRelevanceInput creation), which makes collection_scores shorter than
collections and causes misalignment when evaluate() zips collections with
individual_scores; change the except block inside the loop to append a sentinel
value (e.g., None or a designated numeric sentinel like -1/NaN) to
collection_scores and include the error details in the logger warning so lengths
remain matched; ensure the stored all_scores["individual_dataset_scores"]
contains the sentinel entries so downstream code in evaluate() can correctly zip
collections with individual_dataset_scores and handle the sentinel cases
appropriately.
- Around line 121-142: The evaluate method calls
langfuse.start_as_current_observation without checking that get_langfuse()
returned a non-None object; if get_langfuse() returns None this raises an
AttributeError. Fix by guarding the trace usage: after calling get_langfuse()
check if langfuse is truthy and only use langfuse.start_as_current_observation
inside that conditional (or provide a no-op context when None), updating the
evaluate function (where langfuse is defined and where
start_as_current_observation is invoked) to safely handle a None return.
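One way to implement the suggested guard is `contextlib.nullcontext`, sketched here with a stub `get_langfuse` (the real client and observation API may differ):

```python
from contextlib import nullcontext

def get_langfuse():
    """Stub standing in for the real helper; returns None when
    Langfuse is not configured."""
    return None

langfuse = get_langfuse()

# Only call start_as_current_observation when a client exists;
# otherwise substitute a no-op context so the loop still runs.
span_ctx = (
    langfuse.start_as_current_observation(name="eval")
    if langfuse is not None
    else nullcontext()
)

with span_ctx:
    result = "evaluated without a Langfuse client"

assert result == "evaluated without a Langfuse client"
```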
In `@rag_eval/evals/datasets/earthdata_manual_sample.json`:
- Around line 37-41: Update the metadata for the record with id "q002" so its
"topic" value is "forestry" instead of "agriculture"; locate the JSON object
containing "q002" (the entry whose metadata includes "query_type":"semantic" and
expected_keywords ["forest","forestry","trees","forest management"]) and change
the "topic" field to "forestry" to align with the question and keywords.
In `@rag_eval/mcp_client.py`:
- Around line 81-83: The session initialization currently swallows exceptions
(logger.error in the except) leaving session_id as None and the client in a
broken state; update the initialization flow (in the method that sets session_id
/ handles session creation and the class that owns query(), e.g., the MCPClient
initializer or initialize_session) to either re-raise the caught exception after
logging or record the failure in a reachable attribute (e.g., self.init_error or
self.initialized = False) so callers can detect failure before calling query();
ensure logger.error still logs exc_info=True and that query() checks the new
failure flag or raises a clear exception if session_id is None or init failed.
- Around line 255-268: The format string f"...{score:.3f}" can raise if
collection["similarity_score"] is present but None; in the block building
answer_parts, coerce score to a numeric default before formatting (e.g., replace
score = collection.get("similarity_score", 0.0) with something like score =
collection.get("similarity_score") or 0.0 or score =
float(collection.get("similarity_score") or 0.0) so that the f-string in the
answer_parts.append (the line containing {score:.3f}) always receives a float
and will not raise TypeError.
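A standalone demonstration of why `dict.get` with a default does not protect against an explicit `None`, and how the suggested coercion behaves (the dict contents here are illustrative):

```python
collection = {"short_name": "MOD13Q1", "similarity_score": None}

# .get() with a default does NOT help when the key exists with value None:
score = collection.get("similarity_score", 0.0)
assert score is None  # f"{score:.3f}" here would raise TypeError

# `or` replaces both a missing key and an explicit None with the fallback.
# (Caveat: it also replaces a legitimate 0.0, which is harmless here.)
score = float(collection.get("similarity_score") or 0.0)
assert f"Score: {score:.3f}" == "Score: 0.000"
```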
- Line 7: The project imports httpx in rag_eval/mcp_client.py but httpx is not
declared as a direct dependency; add "httpx" (with a suitable version
constraint) to the dependencies section of rag_eval/pyproject.toml and/or the
top-level pyproject.toml so the package is explicitly required, then update your
lockfile/install (e.g., poetry lock / poetry install or pip-tools) to ensure
consistent installs; verify imports in rag_eval/mcp_client.py continue to
resolve after updating dependencies.
- Around line 54-79: The SSE fallback never assigns a session ID and
_list_tools() is skipped when the header path returns early; update the
initialization logic in the method that handles the response so that after
parsing SSE JSON (variable result) you set self.session_id =
result.get("sessionId") (or the correct key used by your MCP payload) if
present, and ensure _list_tools() is always called regardless of whether the
session ID came from response.headers or the parsed SSE (remove the early return
or call self._list_tools() before returning); keep existing JSONDecodeError
handling and debug logs.
In `@rag_eval/models.py`:
- Around line 10-12: The relevance_score Field currently allows any float;
enforce the 0.0–1.0 contract by adding bounds validation to the relevance_score
declaration (e.g., use Field(..., ge=0.0, le=1.0) or replace the type with
pydantic.confloat(ge=0.0, le=1.0)) so out-of-range values are rejected at model
validation in rag_eval.models.Relevance (or whatever model/class contains
relevance_score).
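The bounds-validation suggestion can be sketched as follows; `DatasetRelevance` is a hypothetical stand-in for the actual model in `rag_eval/models.py`:

```python
from pydantic import BaseModel, Field, ValidationError

class DatasetRelevance(BaseModel):
    """Hypothetical stand-in for the model holding relevance_score."""
    # ge/le attach bounds checks that run at model validation time
    relevance_score: float = Field(..., ge=0.0, le=1.0)

# In-range values pass through unchanged
assert DatasetRelevance(relevance_score=0.75).relevance_score == 0.75

# Out-of-range values are rejected with a ValidationError
rejected = False
try:
    DatasetRelevance(relevance_score=1.5)
except ValidationError:
    rejected = True
assert rejected
```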
In `@rag_eval/README.md`:
- Around line 7-20: The quick-start API key instructions are incorrect because
the code (ragas_utils.py) uses AWS Bedrock via litellm with model
"bedrock/amazon.nova-pro-v1:0" rather than OpenAI/Anthropic/Google; update the
README section to instruct users to configure AWS credentials (e.g.,
AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY or use an AWS CLI
profile) and mention any litellm-specific env vars or credential providers
required for Bedrock access, and replace the OpenAI/ANTHROPIC/GOOGLE examples
with Bedrock-specific setup and a short note referencing ragas_utils.py and the
Bedrock model string so readers know which provider is actually used.
- Around line 52-63: Update the README's directory tree fenced code block to
include a language specifier (e.g., ```text) and correct the file list: replace
the non-existent rag.py entry with the actual files present in the PR
(mcp_client.py, models.py, ragas_utils.py) so the tree accurately reflects the
package; ensure the closing fence remains and the comments (e.g., "# MCP client
for RAG queries") match the new filenames to help readers locate mcp_client.py,
models.py, and ragas_utils.py.
In `@rag_eval/test_set_generator.py`:
- Around line 343-367: The catch-all except that prints an error and silently
replaces real DB results with hardcoded `entities` must be changed to avoid
producing fake data unnoticed: either re-raise the caught exception (so callers
fail fast) or make the placeholder fallback explicit via an opt-in flag (e.g.,
`allow_placeholder_fallback=False`) so the caller must opt into using the
hardcoded `entities`; update the error handling in the except block (the one
assigning `entities` in rag_eval/test_set_generator.py) to log the error and
then either `raise` the exception or consult the new opt-in parameter before
populating the hardcoded `entities`, ensuring callers are aware when fallback is
used.
- Around line 608-609: The inline comment says "Generate 20 test questions" but
the call to generator.generate_testset passes testset_size=5; update them to be
consistent by changing the argument to testset_size=20 (or alternatively update
the comment to reflect 5) so that the call to generator.generate_testset and the
comment match; locate the call to generator.generate_testset and the variable
test_data to apply the fix.
- Around line 70-83: The module currently performs provider selection and may
call exit(1) at import time (USE_OPENAI check and OPENAI_API_KEY validation);
move this logic into a runtime initializer such as
EarthdataTestSetGenerator.__init__ or a main() function so importing the module
has no side effects, replace print/exit calls with raising a clear exception or
use logger.error inside the initializer, and ensure the code references the same
symbols (USE_OPENAI flag read inside EarthdataTestSetGenerator.__init__ or
main(), and BEDROCK_CONFIG region assignment moved there) so tests can import
the class without triggering prints or process termination.
- Around line 526-538: The generated metadata in the test_data dict incorrectly
always references BEDROCK_CONFIG for "llm_model" and "embedding_model"; update
the assignment to select the correct configuration based on the active provider
(e.g., use OPENAI_CONFIG when USE_OPENAI is True, otherwise BEDROCK_CONFIG).
Locate the test_data construction in rag_eval/test_set_generator.py and change
the "llm_model" and "embedding_model" values to pull from the appropriate config
variable (reference BEDROCK_CONFIG, OPENAI_CONFIG and the USE_OPENAI flag or the
class/instance provider field) so provenance reflects the actual provider used.
- Around line 226-370: The code calls get_db_connection() before the try and
always calls conn.close() in the finally, which can raise NameError if
connection creation fails; fix by initializing conn = None before calling
get_db_connection() (or moving the get_db_connection() call inside the try) and
update the finally block to only call conn.close() if conn is not None (or
truthy). Specifically update references around get_db_connection(), conn, and
the finally block that calls conn.close() so the connection is safely closed
only when it was successfully created.
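The safe-close pattern described above, sketched with a fake connection so it runs standalone (`get_db_connection` and `load_entities` here are stubs, not the project's functions):

```python
class FakeConnection:
    """Stand-in for a database connection."""
    closed = False

    def close(self):
        FakeConnection.closed = True

def get_db_connection(fail: bool):
    if fail:
        raise RuntimeError("connection refused")
    return FakeConnection()

def load_entities(fail: bool):
    conn = None  # defined before the try, so finally can always see it
    try:
        conn = get_db_connection(fail)
        return ["entity"]
    finally:
        if conn is not None:  # skip close() when creation itself failed
            conn.close()

assert load_entities(fail=False) == ["entity"]
assert FakeConnection.closed

# When connection creation fails, the original error propagates and
# finally does not raise a NameError on the missing conn.
failed = False
try:
    load_entities(fail=True)
except RuntimeError:
    failed = True
assert failed
```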
---
Nitpick comments:
In `@rag_eval/evals.py`:
- Line 3: The file imports asyncio but never uses it; remove the unused import
statement (the "import asyncio" line at the top of rag_eval/evals.py) to
eliminate the dead dependency and any linter warnings, leaving other imports and
code unchanged.
- Line 125: The log statement uses an unnecessary f-string which triggers F541;
change the call to use a normal string literal in the logger.info invocation
(replace the f-string used in the logger.info("Dataset loaded successfully")
call with a plain string) so no f-prefix is present—locate the logger.info usage
in rag_eval/evals.py and remove the f-prefix.
- Around line 106-114: The _score_dataset method on EarthdataEvaluator is dead
code because evaluate() uses SingleEvaluation.evaluate_single to score datasets;
remove the duplication by either deleting _score_dataset or refactoring
SingleEvaluation.evaluate_single to call EarthdataEvaluator._score_dataset (or
vice versa). Concretely, choose one implementation of the scoring logic (the
code that builds DatasetRelevanceInput and calls relevance_prompt.generate with
llm and returns relevance_score) and update SingleEvaluation.evaluate_single or
EarthdataEvaluator._score_dataset to delegate to that single function (keep
symbols: EarthdataEvaluator, _score_dataset, SingleEvaluation.evaluate_single,
relevance_prompt.generate, llm) and remove the redundant copy.
- Around line 186-208: The loop pairing collections and individual_scores uses
zip without strict=True which silently drops trailing items; update the pairing
in the block that enumerates zip(collections, individual_scores) to use
zip(collections, individual_scores, strict=True) so a length mismatch raises a
ValueError and surfaces the misalignment (this change applies to the loop that
creates the trace via trace.start_as_current_observation and calls
scoring_span.score); if Python versions in CI must support this, alternatively
assert len(collections)==len(individual_scores) before the loop to explicitly
fail on mismatch.
- Around line 281-312: The singleton init in SingleEvaluation
(_initialize_components) is racy and duplicates EarthdataEvaluator.evaluate's
LLM/embeddings (different max_tokens: 4096 vs 10000); make the singleton
thread-safe and stop duplicate creation by 1) adding a class-level
threading.Lock and surrounding the body of
SingleEvaluation._initialize_components with lock acquisition to ensure only one
init runs, 2) allow injection of components into SingleEvaluation (add optional
parameters llm, embeddings, relevance_prompt, metrics to the initializer or a
new classmethod) and prefer injected components over creating new ones, and 3)
update EarthdataEvaluator.evaluate to pass its created LLM/embeddings into
SingleEvaluation instead of creating separate instances (and consolidate the
create_bedrock_llm max_tokens setting into one place so both use the same
value).
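A sketch of the thread-safety and injection suggestions above: a lock-guarded lazy init that prefers injected components. Class and method names are illustrative, not the actual `SingleEvaluation` API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class SingleEvaluationSketch:
    """Illustrative singleton: lock-guarded lazy init with optional
    component injection."""
    _lock = threading.Lock()
    _llm = None
    _embeddings = None
    _init_count = 0  # instrumentation for this demo only

    @classmethod
    def initialize(cls, llm=None, embeddings=None):
        # Double-checked locking: cheap fast path once initialized,
        # lock held only while the first caller builds components.
        if cls._llm is None:
            with cls._lock:
                if cls._llm is None:
                    # Prefer injected components over creating new ones
                    # (object() stands in for create_bedrock_llm(...)).
                    cls._llm = llm if llm is not None else object()
                    cls._embeddings = embeddings if embeddings is not None else object()
                    cls._init_count += 1
        return cls._llm, cls._embeddings

# 32 concurrent callers still trigger exactly one initialization.
with ThreadPoolExecutor(max_workers=8) as ex:
    list(ex.map(lambda _: SingleEvaluationSketch.initialize(), range(32)))
assert SingleEvaluationSketch._init_count == 1
```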
In `@rag_eval/mcp_client.py`:
- Around line 186-195: The SSE parsing loop currently overwrites result_data for
each parsed "data: " line (using response.text and json.loads), losing earlier
events; change this to accumulate parsed events (e.g., create a list like
parsed_events and append each successful json.loads(data)) instead of assigning
to result_data, then decide after the loop whether to use the first/last/merged
event or return the list—update any downstream use of result_data accordingly
(or, if intended behavior is to keep the first event, set result_data only when
it is None).
In `@rag_eval/models.py`:
- Around line 29-38: The instruction string assigned to variable instruction in
rag_eval/models.py contains trailing whitespace after "user's question." —
remove that trailing space inside the triple-quoted string (or trim the string
when assigning, e.g., use .strip()) so the prompt text sent to the LLM has no
embedded trailing whitespace; update the instruction assignment in the
instruction variable to eliminate the extra space.
In `@rag_eval/pyproject.toml`:
- Around line 27-28: The build config currently sets py-modules = [] which
disables module auto-discovery and there are no package declarations, so pip
installs can produce an empty distribution; fix by either removing the empty
py-modules entry or adding a packages discovery block (e.g. add
[tool.setuptools.packages.find] with where = ["."], or explicitly declare
package names under [tool.setuptools.packages]) so setuptools can find and
include your package sources (look for the py-modules setting and update it or
add the packages.find section).
In `@rag_eval/ragas_utils.py`:
- Line 95: Replace the eager f-string logger call with lazy %-style formatting
to avoid evaluating m.name when the log level would skip it; specifically change
the logger.info(f"Calculating {m.name}...") call to use logger.info("Calculating
%s...", m.name) (referencing the same logger.info and m.name symbols in
ragas_utils.py).
- Around line 57-72: The function init_ragas_metrics currently constructs a new
RunConfig() inside the loop for every metric; instead, create one RunConfig
instance before the for loop and pass that same run_config to each
metric.init(...) so all metrics share the same RunConfig; keep the existing
handling where you set metric.llm for MetricWithLLM and metric.embeddings for
MetricWithEmbeddings, but move the RunConfig allocation out of the loop in
init_ragas_metrics and call metric.init(run_config) for each metric.
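The hoisted-RunConfig shape can be sketched with local stubs (RunConfig, MetricWithLLM, MetricWithEmbeddings, and FakeMetric below are stand-ins, not the real ragas types):

```python
class RunConfig:
    """Stub for ragas.run_config.RunConfig."""

class MetricWithLLM:
    """Stub marker class."""

class MetricWithEmbeddings:
    """Stub marker class."""

class FakeMetric(MetricWithLLM):
    def __init__(self, name):
        self.name = name
        self.llm = None
        self.run_config = None
    def init(self, run_config):
        self.run_config = run_config

def init_ragas_metrics(metrics, llm=None, embeddings=None):
    run_config = RunConfig()  # allocated once, outside the loop
    for metric in metrics:
        if isinstance(metric, MetricWithLLM):
            metric.llm = llm
        if isinstance(metric, MetricWithEmbeddings):
            metric.embeddings = embeddings
        metric.init(run_config)  # every metric shares the same config
    return run_config

metrics = [FakeMetric("faithfulness"), FakeMetric("context_precision")]
shared = init_ragas_metrics(metrics, llm="llm-handle")
assert all(m.run_config is shared for m in metrics)
```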
In `@rag_eval/test_set_generator.py`:
- Around line 16-18: The file imports the unused module sys; remove the unused
"import sys" statement from the top of the file (in the import block alongside
json and os) to clean up the namespace and satisfy linting, or if sys was
intended to be used, replace the removal by actually using sys in the relevant
function (e.g., within test set generation code) and reference the symbol sys
where needed.
- Around line 428-463: Several variables are assigned but never
used—headline_extractor, headline_splitter, keyphrase_extractor, and
keyphrase_similarity_builder—triggering static analysis warnings; either remove
these unused assignments or rename them with a leading underscore (e.g.,
_headline_extractor) to mark intentional non-use, and keep the active transforms
list intact (referencing HeadlinesExtractor, HeadlineSplitter,
KeyphrasesExtractor, and JaccardSimilarityBuilder instances) so the linter no
longer reports unused-variable issues.
"ragas>=0.4.3",
"litellm>=1.81.12",
Evaluation-only dependencies added to the main project's runtime dependencies.
ragas and litellm are only used by the rag_eval/ evaluation package, not by the MCP server itself. Adding them as top-level runtime dependencies unnecessarily increases the install footprint for all deployments (including production Lambda/container).
Move these to an optional dependency group:
Suggested fix
     "instructor>=1.13.0",
     "jsonschema>=4.17.3",
-    "ragas>=0.4.3",
-    "litellm>=1.81.12",
 ]

 [project.optional-dependencies]
+eval = [
+    "ragas>=0.4.3",
+    "litellm>=1.81.12",
+]
 dev = [

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 24 - 25, The dependencies "ragas" and "litellm"
are evaluation-only and should be removed from the top-level runtime
dependencies in pyproject.toml; instead add them under an optional dependency
group (e.g., project.optional-dependencies or tool.poetry.extras) as an
"rag_eval" or "evaluation" extra so only users who need the rag_eval/ package
install them; update references to the dependency names "ragas" and "litellm"
accordingly and ensure the main MCP runtime dependency list no longer contains
these two entries.
| """RAG evaluation package for Earthdata MCP server.""" | ||
|
|
||
| __version__ = "1.0.0" |
Version mismatch: __version__ is "1.0.0" but pyproject.toml declares "0.1.0".
These should be consistent. Since this is a new package, 0.1.0 is the more appropriate value.
Suggested fix
-__version__ = "1.0.0"
+__version__ = "0.1.0"📝 Committable suggestion
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag_eval/__init__.py` around lines 1 - 3, The package version constant is
inconsistent: update the __version__ variable in rag_eval.__init__ to match
pyproject.toml by changing __version__ = "1.0.0" to __version__ = "0.1.0";
verify the value matches pyproject.toml and keep them synchronized going forward
(update whichever file should be authoritative if different).
rag_eval/evals.py
Outdated
def evaluate(self, testset_path: str):
    """Run full evaluation on a test dataset."""
    langfuse = get_langfuse()
    dataset = self.load_dataset(testset_path)
    logger.info(f"Dataset loaded successfully")

    results = []

    for idx, row in enumerate(dataset):
        logger.info("=" * 60)
        logger.info(
            f"Evaluating test case {idx + 1}/{len(dataset)}: {row['question_id']}"
        )
        logger.info(f"Question: {row['user_input']}")

        question = row["user_input"]

        try:
            # Start a new trace for this evaluation
            with langfuse.start_as_current_observation(
                as_type="span", name="rag"
            ) as trace:
get_langfuse() can return None, causing AttributeError on line 140.
get_langfuse() (from util/langfuse.py line 34-46) explicitly returns None when Langfuse fails to initialize. Line 140 unconditionally calls langfuse.start_as_current_observation(...), which will raise AttributeError: 'NoneType' object has no attribute 'start_as_current_observation'.
Proposed fix — guard against None
 def evaluate(self, testset_path: str):
     """Run full evaluation on a test dataset."""
     langfuse = get_langfuse()
+    if langfuse is None:
+        raise RuntimeError(
+            "Langfuse client failed to initialize. Check credentials and configuration."
+        )
     dataset = self.load_dataset(testset_path)
🧰 Tools
🪛 Ruff (0.15.1)
[error] 125-125: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag_eval/evals.py` around lines 121 - 142, The evaluate method calls
langfuse.start_as_current_observation without checking that get_langfuse()
returned a non-None object; if get_langfuse() returns None this raises an
AttributeError. Fix by guarding the trace usage: after calling get_langfuse()
check if langfuse is truthy and only use langfuse.start_as_current_observation
inside that conditional (or provide a no-op context when None), updating the
evaluate function (where langfuse is defined and where
start_as_current_observation is invoked) to safely handle a None return.
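The no-op-context variant can be sketched as follows (get_langfuse and start_span here are local stand-ins for the real helpers in util/langfuse.py and evals.py):

```python
from contextlib import nullcontext

def get_langfuse():
    """Stand-in for util.langfuse.get_langfuse(), which may return None
    when the client fails to initialize."""
    return None

def start_span(langfuse, name: str):
    """Use a real Langfuse span when the client exists, otherwise a
    no-op context manager so evaluation still runs without tracing."""
    if langfuse is None:
        return nullcontext()  # enters with None instead of raising AttributeError
    return langfuse.start_as_current_observation(as_type="span", name=name)

langfuse = get_langfuse()
with start_span(langfuse, "rag") as trace:
    result = {"trace": trace}  # per-test-case evaluation work would go here
```

contextlib.nullcontext enters with None by default, so the `as trace` binding stays well-defined in both branches.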
rag_eval/evals.py
Outdated
@classmethod
def evaluate_single(
    cls,
    question: str,
    collections: list[dict],
    contexts: list[str],
    answer: str,
) -> dict[str, float]:
Return type hint is incorrect.
The hint says dict[str, float] but the method returns dict[str, dict] — each value is a dict with value, comment, and data_type keys.
Proposed fix
-    ) -> dict[str, float]:
+    ) -> dict[str, dict]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag_eval/evals.py` around lines 314 - 321, The return type hint for
evaluate_single is wrong: it currently declares dict[str, float] but actually
returns a mapping whose values are dicts with keys "value", "comment", and
"data_type". Fix the signature of evaluate_single to reflect that structure
(e.g., dict[str, dict[str, Any]] or define a small TypedDict/NamedTuple for the
result shape and use dict[str, YourTypedDict]) and update any related imports;
ensure the chosen type explicitly documents the inner keys "value", "comment",
and "data_type".
rag_eval/evals.py
Outdated
# Score individual collections
collection_scores = []
for collection in collections:
    try:
        prompt_input = DatasetRelevanceInput(
            question=question,
            dataset_title=collection.get("title", ""),
            dataset_abstract=collection.get("abstract", "") or "No description",
        )
        result = cls._relevance_prompt.generate(data=prompt_input, llm=cls._llm)
        collection_scores.append(result.relevance_score)
    except Exception as e:
        logger.warning(f"Error scoring collection: {e}")
        continue

# Store individual scores
all_scores["individual_dataset_scores"] = {
    "value": collection_scores,
    "comment": "Individual relevance scores for each dataset",
    "data_type": "LIST",
}
Skipped collections on error cause misaligned scores in the caller's zip.
When _score_dataset fails for a collection (line 351-353), the continue skips appending a score, so collection_scores can be shorter than collections. In evaluate() at line 190-191, zip(collections, individual_scores) pairs scores with the wrong collections — the first error shifts all subsequent mappings.
Proposed fix — append a sentinel on failure
     except Exception as e:
         logger.warning(f"Error scoring collection: {e}")
-        continue
+        collection_scores.append(0.0)  # Default score on failure

🧰 Tools
🪛 Ruff (0.15.1)
[warning] 351-351: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag_eval/evals.py` around lines 340 - 360, The current loop building
collection_scores skips appending when an exception occurs during
cls._relevance_prompt.generate (or in DatasetRelevanceInput creation), which
makes collection_scores shorter than collections and causes misalignment when
evaluate() zips collections with individual_scores; change the except block
inside the loop to append a sentinel value (e.g., None or a designated numeric
sentinel like -1/NaN) to collection_scores and include the error details in the
logger warning so lengths remain matched; ensure the stored
all_scores["individual_dataset_scores"] contains the sentinel entries so
downstream code in evaluate() can correctly zip collections with
individual_dataset_scores and handle the sentinel cases appropriately.
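The alignment issue and the sentinel fix can be sketched in isolation (score_collections and flaky_scorer are illustrative, not the real relevance-prompt code):

```python
def score_collections(collections, scorer):
    """Append a sentinel on failure so len(scores) == len(collections)
    and a later zip() stays aligned."""
    scores = []
    for collection in collections:
        try:
            scores.append(scorer(collection))
        except Exception:
            scores.append(0.0)  # sentinel instead of `continue`
    return scores

def flaky_scorer(c):
    if c["title"] == "bad":
        raise ValueError("LLM call failed")
    return 1.0

collections = [{"title": "ok"}, {"title": "bad"}, {"title": "ok2"}]
scores = score_collections(collections, flaky_scorer)
assert len(scores) == len(collections)
# zip() now pairs each collection with its own score:
pairs = list(zip([c["title"] for c in collections], scores))
```

With `continue` instead of the sentinel, "ok2" would have been paired with no score at all and every score after the first failure would shift left by one.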
# Choose which provider to use
USE_OPENAI = os.getenv("USE_OPENAI", "false").lower() == "true"

if USE_OPENAI:
    print("🔧 Using OpenAI models")
    if not os.getenv("OPENAI_API_KEY"):
        print("❌ Error: OPENAI_API_KEY environment variable is not set.")
        print("Please set your OpenAI API key:")
        print("export OPENAI_API_KEY='your_openai_api_key'")
        exit(1)
else:
    print("🔧 Using AWS Bedrock models")
    # Set AWS region
    os.environ["AWS_REGION_NAME"] = BEDROCK_CONFIG["region_name"]
Module-level side effects including exit(1) run on import.
Lines 73-83 execute unconditionally when this module is imported — printing to stdout and potentially calling exit(1). This prevents importing the class in tests or from other modules without triggering these side effects. Move this logic into main() or __init__.
Proposed fix — move provider setup into the class or main()
-# Choose which provider to use
-USE_OPENAI = os.getenv("USE_OPENAI", "false").lower() == "true"
-
-if USE_OPENAI:
-    print("🔧 Using OpenAI models")
-    if not os.getenv("OPENAI_API_KEY"):
-        print("❌ Error: OPENAI_API_KEY environment variable is not set.")
-        print("Please set your OpenAI API key:")
-        print("export OPENAI_API_KEY='your_openai_api_key'")
-        exit(1)
-else:
-    print("🔧 Using AWS Bedrock models")
-    # Set AWS region
-    os.environ["AWS_REGION_NAME"] = BEDROCK_CONFIG["region_name"]
+USE_OPENAI = os.getenv("USE_OPENAI", "false").lower() == "true"

Then validate the API key inside EarthdataTestSetGenerator.__init__ or main().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag_eval/test_set_generator.py` around lines 70 - 83, The module currently
performs provider selection and may call exit(1) at import time (USE_OPENAI
check and OPENAI_API_KEY validation); move this logic into a runtime initializer
such as EarthdataTestSetGenerator.__init__ or a main() function so importing the
module has no side effects, replace print/exit calls with raising a clear
exception or use logger.error inside the initializer, and ensure the code
references the same symbols (USE_OPENAI flag read inside
EarthdataTestSetGenerator.__init__ or main(), and BEDROCK_CONFIG region
assignment moved there) so tests can import the class without triggering prints
or process termination.
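The import-safe shape can be sketched as follows (configure_provider is an illustrative name; the BEDROCK_CONFIG value shown is a placeholder):

```python
import os

BEDROCK_CONFIG = {"region_name": "us-east-1"}  # illustrative value

def configure_provider(env=os.environ) -> str:
    """Runs at call time (e.g. from __init__ or main()), not at import,
    and raises instead of calling exit(1)."""
    use_openai = env.get("USE_OPENAI", "false").lower() == "true"
    if use_openai:
        if not env.get("OPENAI_API_KEY"):
            raise RuntimeError(
                "OPENAI_API_KEY is not set; export it before running."
            )
        return "openai"
    env["AWS_REGION_NAME"] = BEDROCK_CONFIG["region_name"]
    return "bedrock"

if __name__ == "__main__":
    # Side effects only happen when the module is run as a script.
    print(f"Using provider: {configure_provider()}")
```

Tests can now pass a plain dict as `env` and assert on the raised error without the interpreter exiting.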
entities = []
conn = get_db_connection()

try:
    with conn.cursor() as cur:
        # Query 1: Collections (from collections table with enriched metadata)
        cur.execute(
            """
            SELECT concept_id, metadata, enriched_metadata
            FROM collections
            ORDER BY concept_id
            LIMIT %s
            """,
            (limit_per_type,),
        )

        for row in cur.fetchall():
            concept_id = row[0]
            metadata = (
                row[1] if isinstance(row[1], dict) else json.loads(row[1])
            )
            enriched = (
                row[2]
                if isinstance(row[2], dict)
                else json.loads(row[2] or "{}")
            )

            title = (
                enriched.get("title")
                or metadata.get("EntryTitle")
                or metadata.get("ShortName", "")
            )
            abstract = enriched.get("abstract") or metadata.get("Abstract", "")

            if title and abstract:
                entities.append(
                    {
                        "id": concept_id,
                        "type": "collection",
                        "title": title,
                        "content": abstract,
                    }
                )

        # Query 2: All other entity types from embeddings table
        # This matches what search_all_entity_types() queries
        # Note: Multiple rows per entity (title, abstract, etc.) so we limit total rows
        cur.execute(
            """
            SELECT type, external_id, attribute, text_content
            FROM embeddings
            WHERE type IN ('variable', 'citation', 'instruments', 'platforms', 'sciencekeywords')
            ORDER BY type, external_id, attribute
            LIMIT %s
            """,
            (
                limit_per_type * 10,
            ),  # Multiply to account for multiple rows per entity
        )

        # Group rows by (type, external_id) since title and abstract are in different rows
        entity_map = {}
        for row in cur.fetchall():
            entity_type = row[0]
            external_id = row[1]
            attribute = row[2]
            text_content = row[3]

            key = (entity_type, external_id)
            if key not in entity_map:
                entity_map[key] = {
                    "type": entity_type,
                    "id": external_id,
                    "attributes": {},
                }

            if text_content:
                entity_map[key]["attributes"][attribute] = text_content

        # Convert grouped entities to list format
        for (entity_type, external_id), data in entity_map.items():
            attrs = data["attributes"]

            # Try to get title from 'title' attribute, fallback to first available
            title = (
                attrs.get("title")
                or attrs.get("name")
                or next(iter(attrs.values()), "")[:100]
            )

            # Try to get content from 'abstract' attribute, fallback to combining all attributes
            content = attrs.get("abstract") or attrs.get("description")
            if not content:
                # Combine all attributes as content if no abstract
                content = "\n\n".join(
                    f"{k}: {v}" for k, v in attrs.items() if k != "title"
                )

            if title and content:
                entities.append(
                    {
                        "id": external_id,
                        "type": entity_type,
                        "title": title,
                        "content": content,
                    }
                )

    print(f"Loaded {len(entities)} total entities from database")

    # Print breakdown by type
    type_counts = {}
    for e in entities:
        t = e["type"]
        type_counts[t] = type_counts.get(t, 0) + 1
    print(f"Entity breakdown: {type_counts}")

except Exception as e:
    print(f"Error loading entities from database: {e}")
    print("Falling back to placeholder data...")

    # Fallback to placeholder data
    entities = [
        {
            "id": "C1234567890-ORNL_DAAC",
            "type": "collection",
            "title": "MODIS/Terra Sea Surface Temperature (SST) Daily L3 Global 4km",
            "content": (
                "This dataset provides daily sea surface temperature measurements "
                "from the MODIS instrument aboard NASA's Terra satellite. Data is "
                "provided at 4km spatial resolution globally. Temporal coverage spans "
                "from 2000 to present, making it ideal for climate studies and ocean "
                "monitoring applications."
            ),
        },
        {
            "id": "V1234567890-VARIABLE",
            "type": "variable",
            "title": "Sea Surface Temperature",
            "content": "Sea surface temperature measured in Celsius from satellite thermal infrared sensors.",
        },
    ]

finally:
    conn.close()
get_db_connection() outside try block — finally: conn.close() will raise NameError if the connection fails.
If get_db_connection() on line 227 throws, conn is never assigned, and line 370 (conn.close()) raises NameError.
Proposed fix
     entities = []
-    conn = get_db_connection()
-
-    try:
+    conn = None
+    try:
+        conn = get_db_connection()
         with conn.cursor() as cur:

and update the finally block:

     finally:
-        conn.close()
+        if conn is not None:
+            conn.close()

🧰 Tools
🪛 Ruff (0.15.1)
[warning] 343-343: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag_eval/test_set_generator.py` around lines 226 - 370, The code calls
get_db_connection() before the try and always calls conn.close() in the finally,
which can raise NameError if connection creation fails; fix by initializing conn
= None before calling get_db_connection() (or moving the get_db_connection()
call inside the try) and update the finally block to only call conn.close() if
conn is not None (or truthy). Specifically update references around
get_db_connection(), conn, and the finally block that calls conn.close() so the
connection is safely closed only when it was successfully created.
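The pattern can be verified in isolation (FakeConnection and load_entities are stand-ins; the real code uses a psycopg connection):

```python
class FakeConnection:
    """Stand-in for the connection returned by get_db_connection()."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

def get_db_connection(fail: bool = False):
    if fail:
        raise ConnectionError("database unreachable")
    return FakeConnection()

def load_entities(fail: bool = False):
    conn = None
    try:
        conn = get_db_connection(fail)
        return ["entity"]  # query work would happen here
    except ConnectionError:
        return []
    finally:
        if conn is not None:  # safe even when the connect itself failed
            conn.close()
```

With `conn = get_db_connection()` outside the try, the failing case would die with a NameError in the finally block instead of reaching the except handler's return value.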
except Exception as e:
    print(f"Error loading entities from database: {e}")
    print("Falling back to placeholder data...")

    # Fallback to placeholder data
    entities = [
        {
            "id": "C1234567890-ORNL_DAAC",
            "type": "collection",
            "title": "MODIS/Terra Sea Surface Temperature (SST) Daily L3 Global 4km",
            "content": (
                "This dataset provides daily sea surface temperature measurements "
                "from the MODIS instrument aboard NASA's Terra satellite. Data is "
                "provided at 4km spatial resolution globally. Temporal coverage spans "
                "from 2000 to present, making it ideal for climate studies and ocean "
                "monitoring applications."
            ),
        },
        {
            "id": "V1234567890-VARIABLE",
            "type": "variable",
            "title": "Sea Surface Temperature",
            "content": "Sea surface temperature measured in Celsius from satellite thermal infrared sensors.",
        },
    ]
Silent fallback to hardcoded placeholder data hides database failures.
When the DB query fails, the except block (line 343) prints a message and substitutes two fake entities. A caller generating a test set would unknowingly produce questions from fabricated data, leading to meaningless evaluations. At minimum, re-raise or make the fallback opt-in.
🧰 Tools
🪛 Ruff (0.15.1)
[warning] 343-343: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag_eval/test_set_generator.py` around lines 343 - 367, The catch-all except
that prints an error and silently replaces real DB results with hardcoded
`entities` must be changed to avoid producing fake data unnoticed: either
re-raise the caught exception (so callers fail fast) or make the placeholder
fallback explicit via an opt-in flag (e.g., `allow_placeholder_fallback=False`)
so the caller must opt into using the hardcoded `entities`; update the error
handling in the except block (the one assigning `entities` in
rag_eval/test_set_generator.py) to log the error and then either `raise` the
exception or consult the new opt-in parameter before populating the hardcoded
`entities`, ensuring callers are aware when fallback is used.
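The opt-in shape can be sketched as follows (load_all_entities, query_fn, and PLACEHOLDER_ENTITIES are illustrative names, not the existing generator API):

```python
PLACEHOLDER_ENTITIES = [
    {"id": "C0000000000-EXAMPLE", "type": "collection", "title": "Placeholder"},
]

def load_all_entities(query_fn, allow_placeholder_fallback: bool = False):
    """Re-raise DB errors by default; substitute placeholder data only
    when the caller explicitly opts in."""
    try:
        return query_fn()
    except Exception:
        if not allow_placeholder_fallback:
            raise  # fail fast so test sets are never built on fake data
        return PLACEHOLDER_ENTITIES

def broken_query():
    raise RuntimeError("db down")

try:
    load_all_entities(broken_query)
    raised = False
except RuntimeError:
    raised = True
```

Defaulting the flag to False means a broken database surfaces as an error instead of a test set silently generated from two hardcoded entities.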
# Convert to our standard format
test_data = {
    "version": "1.0",
    "generated_at": datetime.now().isoformat(),
    "generator": "ragas-bedrock-nova",
    "config": {
        "llm_model": BEDROCK_CONFIG["llm"],
        "embedding_model": BEDROCK_CONFIG["embeddings"],
        "testset_size": testset_size,
        "num_personas": len(self.personas),
    },
    "test_cases": [],
}
Config metadata always references BEDROCK_CONFIG regardless of provider.
Lines 532-533 hardcode BEDROCK_CONFIG["llm"] and BEDROCK_CONFIG["embeddings"] in the output metadata even when USE_OPENAI is True, producing inaccurate provenance for the generated test set.
Proposed fix
"config": {
- "llm_model": BEDROCK_CONFIG["llm"],
- "embedding_model": BEDROCK_CONFIG["embeddings"],
+ "llm_model": OPENAI_CONFIG["llm"] if USE_OPENAI else BEDROCK_CONFIG["llm"],
+ "embedding_model": OPENAI_CONFIG["embeddings"] if USE_OPENAI else BEDROCK_CONFIG["embeddings"],📝 Committable suggestion
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag_eval/test_set_generator.py` around lines 526 - 538, The generated
metadata in the test_data dict incorrectly always references BEDROCK_CONFIG for
"llm_model" and "embedding_model"; update the assignment to select the correct
configuration based on the active provider (e.g., use OPENAI_CONFIG when
USE_OPENAI is True, otherwise BEDROCK_CONFIG). Locate the test_data construction
in rag_eval/test_set_generator.py and change the "llm_model" and
"embedding_model" values to pull from the appropriate config variable (reference
BEDROCK_CONFIG, OPENAI_CONFIG and the USE_OPENAI flag or the class/instance
provider field) so provenance reflects the actual provider used.
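The provider-aware selection can be sketched as follows (build_config_metadata is an illustrative helper; the Bedrock model id comes from this review, while the OPENAI_CONFIG values are assumed placeholders):

```python
USE_OPENAI = False  # mirrors the module-level flag
BEDROCK_CONFIG = {
    "llm": "amazon.nova-pro-v1:0",
    "embeddings": "bedrock-embedding-model",  # placeholder id
}
OPENAI_CONFIG = {
    "llm": "gpt-4o-mini",                     # placeholder id
    "embeddings": "text-embedding-3-small",   # placeholder id
}

def build_config_metadata(use_openai: bool) -> dict:
    """Record provenance from whichever provider actually generated the set."""
    cfg = OPENAI_CONFIG if use_openai else BEDROCK_CONFIG
    return {"llm_model": cfg["llm"], "embedding_model": cfg["embeddings"]}
```

Routing both metadata fields through one `cfg` lookup keeps the provenance and the provider flag from drifting apart again.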
test_data = generator.generate_testset(
    testset_size=5,  # Generate 20 test questions
Comment says "Generate 20 test questions" but testset_size=5.
 test_data = generator.generate_testset(
-    testset_size=5,  # Generate 20 test questions
+    testset_size=5,  # Generate 5 test questions
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag_eval/test_set_generator.py` around lines 608 - 609, The inline comment
says "Generate 20 test questions" but the call to generator.generate_testset
passes testset_size=5; update them to be consistent by changing the argument to
testset_size=20 (or alternatively update the comment to reflect 5) so that the
call to generator.generate_testset and the comment match; locate the call to
generator.generate_testset and the variable test_data to apply the fix.
…just creating scores as done before. This allows us to track and version golden datasets to evaluate the search system.
Actionable comments posted: 3
🧹 Nitpick comments (5)
rag_eval/evals.py (2)
896-902: Unused assignment: result is never read. Static analysis flags
result = evaluator.run_experiment(...) as unused (F841). Either use the result or drop the assignment.

Proposed fix
-    result = evaluator.run_experiment(
+    evaluator.run_experiment(
         dataset_name=dataset_name,
         experiment_name=experiment_name,
         max_concurrency=max_concurrency,
     )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/evals.py` around lines 896 - 902, The assignment to result from evaluator.run_experiment(...) is unused; remove the unused variable or use it. Either call evaluator.run_experiment(...) without assigning its return value (drop "result =") if the return is unneeded, or capture and use the returned value (e.g., log it or return it from the surrounding function) so that result is referenced; update the try block around evaluator.run_experiment(...) accordingly to eliminate the F841 unused-assignment warning.
682-691: Fragile positional indexing for metrics (cls._metrics[0], cls._metrics[1]).
compute_faithfulness and compute_context_precision access metrics by hard-coded list index. If the order in _initialize_components (line 597-600) changes or a metric is added before them, these will silently use the wrong metric. Store metrics in a dict keyed by name instead.

Proposed approach
In _initialize_components:

-        cls._metrics = [
-            Faithfulness(),
-            LLMContextPrecisionWithoutReference(),
-        ]
+        cls._metrics = {
+            "faithfulness": Faithfulness(),
+            "context_precision": LLMContextPrecisionWithoutReference(),
+        }

-        init_ragas_metrics(cls._metrics, cls._llm, cls._embeddings)
+        init_ragas_metrics(list(cls._metrics.values()), cls._llm, cls._embeddings)

Then in
compute_faithfulness:

-        faithfulness_metric = cls._metrics[0]
+        faithfulness_metric = cls._metrics["faithfulness"]

And in
compute_context_precision:

-        context_precision_metric = cls._metrics[1]
+        context_precision_metric = cls._metrics["context_precision"]

Also applies to: 716-725
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/evals.py` around lines 682 - 691, compute_faithfulness and compute_context_precision currently pick metrics by positional index (cls._metrics[0], cls._metrics[1]) which is fragile; change _initialize_components to store metrics in a dict keyed by metric name (e.g., {"faithfulness": metric_obj, "context_precision": metric_obj}) and update compute_faithfulness and compute_context_precision to look up the appropriate metric by name from that dict before calling score_with_ragas (and similarly update the other occurrences around lines 716-725), so the code uses the named metric entries instead of relying on list order.

rag_eval/mcp_client.py (3)
130-145: Hardcoded"id": 1reuses the same JSON-RPC request ID as_list_tools.Both
_list_tools (line 88) and query (line 145) use "id": 1. While not a functional bug since these are sequential calls, using distinct or auto-incrementing IDs would improve the debuggability of server-side logs and is JSON-RPC best practice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/mcp_client.py` around lines 130 - 145, The JSON-RPC request ID is hardcoded to 1 in query (and also used in _list_tools), which can confuse server logs; modify query (and optionally _list_tools) to generate unique IDs per request (e.g., use an incrementing counter, UUID, or a timestamp) instead of the literal "id": 1 so each JSON-RPC payload from query and _list_tools has a distinct "id" value for better traceability.
15-19: Constructor triggers network I/O, preventing unit testing and error recovery.
_initialize_session() is called unconditionally in __init__, which means the constructor itself performs network calls. This makes it impossible to construct the client without a live server, complicating testing. Consider lazy initialization or an explicit connect() method.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/mcp_client.py` around lines 15 - 19, The constructor currently calls _initialize_session() which performs network I/O and prevents creating MCPClient instances in tests; remove the unconditional call to _initialize_session() from __init__ (keep self.session_id: Optional[str] = None and client initialization) and implement an explicit connect() method (or lazy ensure_session() helper) that performs the session initialization and sets self.session_id by calling _initialize_session(); update public methods that need a session (e.g., any send/ request methods) to either call connect()/ensure_session() at the start or raise a clear error if session is missing, so tests can construct MCPClient without a live server and control when network I/O happens.
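The lazy-session shape can be sketched as follows (MCPClientSketch, ensure_session, and the injected connector are illustrative, standing in for the real _initialize_session call):

```python
class MCPClientSketch:
    """Sketch of lazy session init: no network I/O in __init__."""
    def __init__(self, connector):
        self.session_id = None
        self._connector = connector  # stands in for _initialize_session()

    def ensure_session(self):
        if self.session_id is None:
            self.session_id = self._connector()
        return self.session_id

    def query(self, question: str) -> dict:
        self.ensure_session()  # connect only when actually needed
        return {"session": self.session_id, "question": question}

calls = []
def fake_connect():
    calls.append(1)
    return "sess-123"

client = MCPClientSketch(fake_connect)
assert calls == []          # constructing did no I/O
client.query("a")
client.query("b")
assert calls == [1]         # session initialized exactly once
```

Tests can now construct the client with a fake connector and decide exactly when the first network call happens.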
184-192: Last SSE data line wins — silent overwrite of earlier results.
The loop overwrites result_data on every valid data: line. If the server sends multiple JSON-RPC results (e.g., progress notifications followed by a final result), earlier data is silently discarded. This works if the server always sends the final result last, but the intent isn't obvious. A brief comment or logging the overwrite count would clarify the contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rag_eval/mcp_client.py` around lines 184 - 192, The SSE loop currently overwrites result_data for every "data: " line which silently discards earlier JSON payloads; update the loop in rag_eval/mcp_client.py (the response.iter_lines() handling and result_data variable) to either (a) only set result_data if it is currently None to preserve the first result, or (b) keep overwriting but add a counter and emit a warning via the existing logger when an overwrite occurs (include the count and a brief context string), or (c) merge expected fields if multiple partial JSON messages are sent — pick the strategy that matches the RPC contract and add a short comment above the loop describing the chosen contract (first wins / last wins / merged). Ensure JSON decoding errors still skip lines as before.
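One way to make the SSE contract explicit, sketched as a standalone parser. The `parse_sse_result` name and `contract` parameter are assumptions for illustration; `contract="last"` matches the current behavior but counts and logs overwrites:

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_sse_result(lines, contract="last"):
    """Parse `data:` lines from an SSE stream.

    contract="last": last event wins (current behavior, with overwrites logged).
    contract="first": the first decodable event is preserved.
    """
    result_data = None
    overwrites = 0
    for line in lines:
        if not line.startswith("data: "):
            continue
        try:
            payload = json.loads(line[len("data: "):])
        except json.JSONDecodeError:
            continue  # skip malformed lines, as the original loop does
        if result_data is not None:
            overwrites += 1
            if contract == "first":
                continue
        result_data = payload
    if overwrites:
        logger.warning(
            "SSE stream sent %d extra data events (contract=%s)", overwrites, contract
        )
    return result_data
```

Whichever contract matches the RPC semantics, stating it in one comment above the loop is the cheap part of the fix.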
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rag_eval/evals.py`:
- Around line 127-132: The constructor accepts llm_model and temperature but
never uses them; store these values on self in the class __init__ and pass them
into SingleEvaluation (or into SingleEvaluation._initialize_components) instead
of the hardcoded "amazon.nova-pro-v1:0" and 0.01. Concretely: save
self.llm_model and self.temperature in the constructor, and when creating or
initializing SingleEvaluation (refer to SingleEvaluation._initialize_components
and the code path that currently hardcodes the model/temperature), replace the
literals with the stored self.llm_model and self.temperature so the configured
values are actually applied.
- Around line 29-38: The module currently executes side effects at import
(logging.basicConfig, load_dotenv(override=True), and nest_asyncio.apply())
which will affect any importer (e.g., when using SingleEvaluation); move these
calls out of the top level and into a guarded entrypoint such as a main()
function (define main() if missing) and call them from if __name__ ==
"__main__": main() so importing rag_eval.evals no longer reconfigures logging,
overrides env vars, or patches the event loop; update references to
logging.basicConfig, load_dotenv, and nest_asyncio.apply to be only inside that
main()/entrypoint.
- Around line 210-257: Both evaluators currently duplicate the expensive call to
SingleEvaluation.compute_dataset_relevance_scores (in
avg_dataset_relevance_evaluator and max_dataset_relevance_evaluator); change to
a single combined evaluator that calls
SingleEvaluation.compute_dataset_relevance_scores once (using question and
collections), extract avg_relevance and max_relevance from the returned scores,
and return a list containing two Evaluation objects
(name="avg_dataset_relevance" and name="max_dataset_relevance") with appropriate
value and comment fields; ensure you preserve the existing None checks and
logging behavior and that run_experiment will accept the returned list of
Evaluations.
---
Duplicate comments:
In `@rag_eval/evals.py`:
- Around line 522-526: The code assumes get_langfuse() always returns a client
but it can return None; update the block around langfuse = get_langfuse() to
check for a None return before calling langfuse.get_dataset(dataset_name), and
handle the failure by logging an error (including any context) and either
raising an exception or exiting gracefully; reference get_langfuse and the
langfuse variable so the guard immediately after initialization prevents the
AttributeError when calling langfuse.get_dataset.
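The guard might look like this; `load_dataset_safely` and the injected `get_client` callable are illustrative stand-ins, not the module's actual API:

```python
import logging

logger = logging.getLogger(__name__)


def load_dataset_safely(get_client, dataset_name: str):
    """Guard against a None Langfuse client before dereferencing it."""
    client = get_client()
    if client is None:
        logger.error("Langfuse client unavailable; cannot load dataset %r", dataset_name)
        raise RuntimeError("Langfuse client not initialized")
    return client.get_dataset(dataset_name)
```

Raising immediately after the `None` check gives a clear message instead of the `AttributeError` deep in `get_dataset`.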
- Around line 628-647: The try/except in the scoring loop causes
collection_scores to be shorter than collections when an exception occurs,
breaking alignment; in the except block for the loop that builds
collection_scores (which uses DatasetRelevanceInput and
cls._relevance_prompt.generate()) append a sentinel (e.g., None or float('nan'))
to collection_scores before continuing and keep the logger.warning call so the
index positions remain aligned with collections for any downstream zips or
lookups.
- Around line 730-738: The return type hint for evaluate_single is wrong: update
the signature of classmethod evaluate_single to reflect it returns a mapping of
metric names to objects with keys "value", "comment", and "data_type" (e.g.,
dict[str, dict] or a more precise typing like dict[str, dict[str, Any]] or a
TypedDict); change the return annotation from dict[str, float] to the chosen
correct type and adjust any related docstring/comments and imports
(typing/TypedDict) as needed so callers and static checkers reflect the actual
dict[str, dict] structure produced by evaluate_single.
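A `TypedDict` capturing the described shape might look like this; `MetricResult` and the stub function are illustrative, not the actual `evaluate_single` signature:

```python
from typing import TypedDict


class MetricResult(TypedDict):
    """Shape of one metric entry in the evaluate_single result."""
    value: float
    comment: str
    data_type: str


def evaluate_single_stub(question: str, answer: str) -> dict[str, MetricResult]:
    """Illustrative stand-in showing the dict-of-dicts return shape."""
    return {
        "faithfulness": {"value": 0.9, "comment": "grounded", "data_type": "NUMERIC"},
        "context_precision": {"value": 0.8, "comment": "relevant", "data_type": "NUMERIC"},
    }
```

With the `TypedDict`, static checkers flag callers that still treat the result as `dict[str, float]`.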
In `@rag_eval/mcp_client.py`:
- Around line 80-82: The except block that catches exceptions during session
initialization currently only logs (logger.error in rag_eval.mcp_client.py)
which leaves the client in a broken state; modify the handler to either re-raise
the caught exception (e.g., raise) after logging or set an explicit failure
flag/attribute (e.g., self._init_failed = True and store the exception in
self._init_exception) and then update the query() method to check that flag (or
re-raise the stored exception) before proceeding; locate the session init logic
(the try/except around session creation and the query() method) and implement
one of these two fixes so callers get an immediate, clear failure instead of
confusing downstream errors.
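A sketch of the fail-fast variant, with the init error deliberately stubbed so the control flow is testable without a server (class and attribute names are assumptions):

```python
class FailFastClientSketch:
    """Record an init failure in the constructor and surface it on first use."""

    def __init__(self):
        self._init_exception = None
        try:
            self._initialize_session()
        except Exception as e:
            # Constructor stays non-raising, but the cause is remembered.
            self._init_exception = e

    def _initialize_session(self):
        # Stubbed failure standing in for a real connection error.
        raise ConnectionError("server unreachable")

    def query(self, question: str) -> dict:
        if self._init_exception is not None:
            # Fail immediately with the original cause, not a confusing downstream error.
            raise RuntimeError("MCP session was never established") from self._init_exception
        return {"question": question}
```

The simpler alternative is to just re-raise in the `except` block; the stored-exception flag is only worth it if a non-raising constructor is a hard requirement.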
- Around line 253-265: The code assumes collection.get("similarity_score", 0.0)
yields a number but if the key exists with value None the format f"{score:.3f}"
will raise TypeError; fix by normalizing the value before formatting (e.g.,
retrieve score = collection.get("similarity_score") and then set score = 0.0 if
score is None or not a number, or coerce via try: score = float(score) except:
score = 0.0) so the f"{score:.3f}" call never receives None; update the block
that sets score and any use in answer_parts formatting to use the normalized
numeric score.
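A defensive normalization helper along the lines suggested (function names are hypothetical):

```python
def normalize_score(raw) -> float:
    """Coerce a possibly-None or non-numeric similarity score to a float."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return 0.0


def format_collection_line(collection: dict) -> str:
    # Safe even when "similarity_score" is present but set to None.
    score = normalize_score(collection.get("similarity_score"))
    return f"{collection.get('title', 'unknown')} (score: {score:.3f})"
```

Routing every read of `similarity_score` through one helper keeps the `:.3f` formatting from ever seeing `None`.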
In `@rag_eval/models.py`:
- Around line 7-13: The relevance_score field on DatasetRelevanceScore currently
accepts any float; add validation to enforce the expected 0.0–1.0 range so
out-of-range LLM outputs cannot corrupt downstream logic. Update the
relevance_score Field declaration on the DatasetRelevanceScore model to include
bounds (e.g., ge=0.0 and le=1.0) or add a Pydantic validator on relevance_score
that raises a ValidationError if the value is <0.0 or >1.0; keep the same
description text when making this change.
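Pydantic's `Field(ge=0.0, le=1.0)` expresses the bound directly; here is a dependency-free sketch of the same check so the behavior is concrete (the dataclass name is hypothetical):

```python
from dataclasses import dataclass


@dataclass
class DatasetRelevanceScoreSketch:
    """Stand-in for a Pydantic model with Field(ge=0.0, le=1.0) on relevance_score."""
    relevance_score: float
    reasoning: str = ""

    def __post_init__(self):
        # Mirrors ge=0.0 / le=1.0; also rejects NaN, for which the chained
        # comparison below evaluates to False.
        if not 0.0 <= self.relevance_score <= 1.0:
            raise ValueError(
                f"relevance_score must be within [0.0, 1.0], got {self.relevance_score}"
            )
```

Either route fails fast on an out-of-range LLM output instead of letting it skew the averaged metrics.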
---
Nitpick comments:
In `@rag_eval/evals.py`:
- Around line 896-902: The assignment to result from
evaluator.run_experiment(...) is unused; remove the unused variable or use it.
Either call evaluator.run_experiment(...) without assigning its return value
(drop "result =") if the return is unneeded, or capture and use the returned
value (e.g., log it or return it from the surrounding function) so that result
is referenced; update the try block around evaluator.run_experiment(...)
accordingly to eliminate the F841 unused-assignment warning.
- Around line 682-691: compute_faithfulness and compute_context_precision
currently pick metrics by positional index (cls._metrics[0], cls._metrics[1])
which is fragile; change _initialize_components to store metrics in a dict keyed
by metric name (e.g., {"faithfulness": metric_obj, "context_precision":
metric_obj}) and update compute_faithfulness and compute_context_precision to
lookup the appropriate metric by name from that dict before calling
score_with_ragas (and similarly update the other occurrences around lines
716-725), so the code uses the named metric entries instead of relying on list
order.
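The name-keyed registry can be sketched like this; the lambda metrics are placeholders for the real Ragas metric objects:

```python
class MetricRegistrySketch:
    """Sketch: look metrics up by name instead of by list position."""

    def __init__(self):
        # Placeholder callables standing in for the real Ragas metric objects.
        self._metrics = {
            "faithfulness": lambda question, answer, contexts: 0.9,
            "context_precision": lambda question, answer, contexts: 0.8,
        }

    def score(self, metric_name: str, question: str, answer: str, contexts: list) -> float:
        try:
            metric = self._metrics[metric_name]
        except KeyError:
            raise KeyError(f"Unknown metric: {metric_name!r}") from None
        return metric(question, answer, contexts)
```

Named lookups survive reordering or adding metrics, which positional `_metrics[0]` / `_metrics[1]` access does not.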
In `@rag_eval/mcp_client.py`:
- Around line 130-145: The JSON-RPC request ID is hardcoded to 1 in query (and
also used in _list_tools), which can confuse server logs; modify query (and
optionally _list_tools) to generate unique IDs per request (e.g., use an
incrementing counter, UUID, or a timestamp) instead of the literal "id": 1 so
each JSON-RPC payload from query and _list_tools has a distinct "id" value for
better traceability.
- Around line 15-19: The constructor currently calls _initialize_session() which
performs network I/O and prevents creating MCPClient instances in tests; remove
the unconditional call to _initialize_session() from __init__ (keep
self.session_id: Optional[str] = None and client initialization) and implement
an explicit connect() method (or lazy ensure_session() helper) that performs the
session initialization and sets self.session_id by calling
_initialize_session(); update public methods that need a session (e.g., any
send/ request methods) to either call connect()/ensure_session() at the start or
raise a clear error if session is missing, so tests can construct MCPClient
without a live server and control when network I/O happens.
- Around line 184-192: The SSE loop currently overwrites result_data for every
"data: " line which silently discards earlier JSON payloads; update the loop in
rag_eval/mcp_client.py (the response.iter_lines() handling and result_data
variable) to either (a) only set result_data if it is currently None to preserve
the first result, or (b) keep overwriting but add a counter and emit a warning
via the existing logger when an overwrite occurs (include the count and a brief
context string), or (c) merge expected fields if multiple partial JSON messages
are sent — pick the strategy that matches the RPC contract and add a short
comment above the loop describing the chosen contract (first wins / last wins /
merged). Ensure JSON decoding errors still skip lines as before.
```python
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)

# Load environment variables from .env file (override existing ones)
load_dotenv(override=True)

# Apply nest_asyncio to allow nested event loops (for PydanticPrompt.generate in evaluators)
nest_asyncio.apply()
```
Module-level side effects will fire on import.
logging.basicConfig(), load_dotenv(override=True), and nest_asyncio.apply() execute at import time. Any module importing rag_eval.evals (e.g., for SingleEvaluation) will have its logging reconfigured, env vars overridden, and the event loop patched. These should be guarded behind if __name__ == "__main__" or moved into main().
Proposed fix — move side effects into main()
```diff
-# Configure logging
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
 logger = logging.getLogger(__name__)

-# Load environment variables from .env file (override existing ones)
-load_dotenv(override=True)
-
-# Apply nest_asyncio to allow nested event loops (for PydanticPrompt.generate in evaluators)
-nest_asyncio.apply()
```

Then in `main()`:
```python
def main():
    """Main entry point for running evaluations."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )
    load_dotenv(override=True)
    nest_asyncio.apply()
    # ... rest of main
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag_eval/evals.py` around lines 29 - 38, The module currently executes side
effects at import (logging.basicConfig, load_dotenv(override=True), and
nest_asyncio.apply()) which will affect any importer (e.g., when using
SingleEvaluation); move these calls out of the top level and into a guarded
entrypoint such as a main() function (define main() if missing) and call them
from if __name__ == "__main__": main() so importing rag_eval.evals no longer
reconfigures logging, overrides env vars, or patches the event loop; update
references to logging.basicConfig, load_dotenv, and nest_asyncio.apply to be
only inside that main()/entrypoint.
```python
def __init__(
    self,
    mcp_server_url: str,
    llm_model: str = "amazon.nova-pro-v1:0",
    temperature: float = 0.01,
    trace_name: str = "rag",
```
llm_model and temperature constructor parameters are accepted but never used.
These parameters suggest the intent to configure the evaluation LLM, but they are not passed to `SingleEvaluation` or any other component. `SingleEvaluation._initialize_components` hardcodes `"amazon.nova-pro-v1:0"` and `temperature=0.01` (lines 588-591). Either wire these through or remove them from the constructor signature.
Proposed fix — remove unused params
```diff
 def __init__(
     self,
     mcp_server_url: str,
-    llm_model: str = "amazon.nova-pro-v1:0",
-    temperature: float = 0.01,
     trace_name: str = "rag",
 ):
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

```python
def __init__(
    self,
    mcp_server_url: str,
    llm_model: str = "amazon.nova-pro-v1:0",
    temperature: float = 0.01,
    trace_name: str = "rag",
```

After:

```python
def __init__(
    self,
    mcp_server_url: str,
    trace_name: str = "rag",
```
🧰 Tools
🪛 Ruff (0.15.1)
[warning] 130-130: Unused method argument: llm_model
(ARG002)
[warning] 131-131: Unused method argument: temperature
(ARG002)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag_eval/evals.py` around lines 127 - 132, The constructor accepts llm_model
and temperature but never uses them; store these values on self in the class
__init__ and pass them into SingleEvaluation (or into
SingleEvaluation._initialize_components) instead of the hardcoded
"amazon.nova-pro-v1:0" and 0.01. Concretely: save self.llm_model and
self.temperature in the constructor, and when creating or initializing
SingleEvaluation (refer to SingleEvaluation._initialize_components and the code
path that currently hardcodes the model/temperature), replace the literals with
the stored self.llm_model and self.temperature so the configured values are
actually applied.
```python
def avg_dataset_relevance_evaluator(*, output, **kwargs):
    """Evaluate average dataset relevance."""
    try:
        question = output.get("question", "")
        collections = output.get("collections", [])

        logger.info(
            f"Computing avg relevance for {len(collections)} collections"
        )

        scores = SingleEvaluation.compute_dataset_relevance_scores(
            question=question,
            collections=collections,
        )

        avg_value = scores.get("avg_relevance")
        logger.info(f"Avg relevance computed: {avg_value}")

        if avg_value is None:
            logger.warning("avg_relevance is None, skipping evaluation")
            return None

        return Evaluation(
            name="avg_dataset_relevance",
            value=avg_value,
            comment=f"Average relevance: {avg_value:.3f}",
        )
    except Exception as e:
        logger.error(
            f"Error in avg_dataset_relevance_evaluator: {e}", exc_info=True
        )
        return None


# Evaluator for max dataset relevance
def max_dataset_relevance_evaluator(*, output, **kwargs):
    """Evaluate max dataset relevance."""
    try:
        question = output.get("question", "")
        collections = output.get("collections", [])

        logger.info(
            f"Computing max relevance for {len(collections)} collections"
        )

        scores = SingleEvaluation.compute_dataset_relevance_scores(
            question=question,
            collections=collections,
        )
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's understand the file structure and locate the relevant code
find . -type f -name "*.py" | grep -E "(evals|test)" | head -20
```

Repository: nasa/earthdata-mcp
Length of output: 770
🏁 Script executed:
```shell
# Check the evals.py file to understand the context and how evaluators are used
wc -l rag_eval/evals.py
```

Repository: nasa/earthdata-mcp
Length of output: 83
🏁 Script executed:
```shell
# Look for Langfuse imports and usage patterns
rg "run_experiment|Langfuse" --type py -A 3 -B 3 | head -100
```

Repository: nasa/earthdata-mcp
Length of output: 4170
🏁 Script executed:
```shell
# Search for how evaluators are registered/passed to run_experiment
rg "avg_dataset_relevance_evaluator|max_dataset_relevance_evaluator" --type py -A 2 -B 2
```

Repository: nasa/earthdata-mcp
Length of output: 1520
🏁 Script executed:
```shell
# Check if there are any evaluators that return lists of Evaluation objects
rg "return.*\[.*Evaluation" --type py -A 1 -B 1
```

Repository: nasa/earthdata-mcp
Length of output: 44
🌐 Web query:
Langfuse run_experiment evaluator return type multiple Evaluation objects list
💡 Result:
In langfuse.run_experiment(...), an item-level evaluator is allowed to return either:
- a single
Evaluation(or a singledictshaped like an Evaluation), or - a
listofEvaluationobjects (orlist[dict]), or - an async/awaitable that resolves to one of the above.
Internally, Langfuse normalizes whatever you return into a List[Evaluation] like this:
- if you return a single
Evaluation/dict→ it wraps it as[result] - if you return a
list→ it uses it as-is - otherwise → it treats it as
[](no evaluations) [1]
So yes: returning multiple evaluations is supported—just return [Evaluation(...), Evaluation(...), ...]. The docs even include a “Multi-metric evaluator” example that returns a list. [1][2]
Combine evaluators to eliminate duplicate scoring calls — Langfuse supports returning multiple Evaluations.
avg_dataset_relevance_evaluator and max_dataset_relevance_evaluator both independently call SingleEvaluation.compute_dataset_relevance_scores() with identical question and collections. This doubles the cost and latency per test case when compute_dataset_relevance_scores performs LLM-based scoring.
Merge them into a single evaluator returning both scores as a list—Langfuse's run_experiment() normalizes single Evaluation objects and lists of evaluations identically, so returning [avg_evaluation, max_evaluation] is fully supported.
Suggested approach — single evaluator returning both scores
```diff
-# Evaluator for average dataset relevance
-def avg_dataset_relevance_evaluator(*, output, **kwargs):
-    ...
-
-# Evaluator for max dataset relevance
-def max_dataset_relevance_evaluator(*, output, **kwargs):
-    ...
+# Combined evaluator for dataset relevance (avg + max)
+def dataset_relevance_evaluator(*, output, **kwargs):
+    """Evaluate dataset relevance (avg and max)."""
+    try:
+        question = output.get("question", "")
+        collections = output.get("collections", [])
+
+        scores = SingleEvaluation.compute_dataset_relevance_scores(
+            question=question,
+            collections=collections,
+        )
+
+        results = []
+        avg_value = scores.get("avg_relevance")
+        if avg_value is not None:
+            results.append(Evaluation(
+                name="avg_dataset_relevance",
+                value=avg_value,
+                comment=f"Average relevance: {avg_value:.3f}",
+            ))
+
+        max_value = scores.get("max_relevance")
+        if max_value is not None:
+            results.append(Evaluation(
+                name="max_dataset_relevance",
+                value=max_value,
+                comment=f"Max relevance: {max_value:.3f}",
+            ))
+
+        return results if results else None
+    except Exception as e:
+        logger.error(f"Error in dataset_relevance_evaluator: {e}", exc_info=True)
+        return None
```

🧰 Tools
🪛 Ruff (0.15.1)
[warning] 210-210: Unused function argument: kwargs
(ARG001)
[warning] 244-244: Unused function argument: kwargs
(ARG001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rag_eval/evals.py` around lines 210 - 257, Both evaluators currently
duplicate the expensive call to
SingleEvaluation.compute_dataset_relevance_scores (in
avg_dataset_relevance_evaluator and max_dataset_relevance_evaluator); change to
a single combined evaluator that calls
SingleEvaluation.compute_dataset_relevance_scores once (using question and
collections), extract avg_relevance and max_relevance from the returned scores,
and return a list containing two Evaluation objects
(name="avg_dataset_relevance" and name="max_dataset_relevance") with appropriate
value and comment fields; ensure you preserve the existing None checks and
logging behavior and that run_experiment will accept the returned list of
Evaluations.
```
rag_eval/
├── README.md          # This file
```
These directory listings are nice but sort of hard to maintain. It looks like these are already out of date 😅
```python
# Choose which provider to use
USE_OPENAI = os.getenv("USE_OPENAI", "false").lower() == "true"

if USE_OPENAI:
```
Is this code useful at all if we're not using OpenAI at all?
I might need another run through of this file. Not sure if we covered this when we paired originally. Just want to make sure we all have an understanding of how this generation is working and what it's being used for.
Add Offline RAG Evaluation Framework
Overview
Implements a comprehensive offline evaluation framework for the Earthdata RAG system, enabling systematic quality assessment of retrieval and generation components using multiple metrics tracked in Langfuse. Offline RAG evaluation is intended to make A/B testing of different retrieval strategies consistent and controlled.
What's New
Core Evaluation System (
rag_eval/)EarthdataEvaluator- Main orchestrator for batch evaluationSingleEvaluation- Reusable singleton class for scoringEvaluation Metrics
1. Custom Dataset Relevance (LLM-based scoring)
2. Ragas Metrics (Industry-standard RAG evaluation)
Supporting Infrastructure
- `EarthdataRAGClient`
- `DatasetRelevancePrompt`

Bedrock Integration
Observability
Configuration
Environment variables: