
Commit 4ad8ffd

Address code review comments
Signed-off-by: Bill Murdock <[email protected]>
1 parent 6535df3 commit 4ad8ffd

File tree

3 files changed (+8, -8 lines changed)


notebooks/evaluation/evaluate-using-sample-questions-lls-vs-li.ipynb

Lines changed: 2 additions & 2 deletions
@@ -440,7 +440,7 @@
 "- Content is from the URLs configured in CONTENT_URLS at the top of this notebook\n",
 "- Milvus-lite inline vector IO provider\n",
 "- granite-embedding-125m embedding model\n",
-"- meta-llama/llama-3-3-70b-instruct generative model using the watsonx inference provider\n",
+"- gpt-3.5-turbo generative model\n",
 "- max_tokens for output is 4096"
 ]
 },
@@ -837,7 +837,7 @@
 "- Content is from the URLs configured in CONTENT_URLS at the top of this notebook\n",
 "- Milvus vector IO provider\n",
 "- granite-embedding-125m embedding model\n",
-"- meta-llama/llama-3-3-70b-instruct generative model using the watsonx inference provider\n",
+"- gpt-3.5-turbo generative model\n",
 "- max_tokens for output is 4096\n",
 "- number of search results to return is 5"
 ]
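Note on the updated configuration above: the evaluated setup now calls gpt-3.5-turbo instead of llama-3-3-70b on watsonx, with max_tokens capped at 4096. The notebook routes generation through the Llama Stack OpenAI provider configured in run.yaml; the snippet below is only a rough, hypothetical illustration of those two settings using the plain OpenAI client, not the notebook's actual code:

# Hypothetical illustration of the settings named above (gpt-3.5-turbo,
# max_tokens=4096); the notebook itself goes through Llama Stack, not this client.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "key-not-set"))
reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Answer using only the retrieved context: ..."}],
    max_tokens=4096,  # output cap mentioned in the notebook description
)
print(reply.choices[0].message.content)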

notebooks/evaluation/make-sample-questions.ipynb

Lines changed: 2 additions & 2 deletions
@@ -11,10 +11,10 @@
 "1. It uses the abstract description of the documents to generate a bunch of questions by calling a question generator model, which is currently set to gpt-4o. You want a very powerful and smart model for that purpose because generating a large volume of questions from an abstract description is a pretty challenging task.\n",
 "2. It builds a vector database from the content of those documents using Docling to analyze them.\n",
 "3. It uses RAG and a reference answer generator model (also gpt-4o currently) to generate reference answers. You really need a very powerful model to be the reference answer generator because you're going to be treating these reference answers as ground truth for the smaller and presumably less powerful models that you were trying to actually evaluate in the next notebook.\n",
-"4. It through each of the reference answers and asks the reference answer generator model to assess whether the answer is really answering the question or just saying that it doesn't know. This is important because often you want a separate analysis for how well each model works on those questions that have reference answers versus how well each model works on those questions where the reference behavior is do not answer because the content doesn't say.\n",
+"4. It iterates through each of the reference answers and asks the reference answer generator model to assess whether the answer is really answering the question or just saying that it doesn't know. This is important because often you want a separate analysis for how well each model works on those questions that have reference answers versus how well each model works on those questions where the reference behavior is do not answer because the content doesn't say.\n",
 "5. It stores all of this information in a file for use in the next notebook, [evaluate-using-sample-questions.ipynb](./evaluate-using-sample-questions.ipynb).\n",
 "\n",
-"If you have time, you should also get a human to vet the reference answers and improve them, but that's expensive to do at scale so I think in practice often that's not going to happen."
+"If you have time, you should also get a human to vet the reference answers and improve them, but that's expensive to do at scale."
 ]
 },
 {
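Step 4 in the description above is essentially an LLM-as-judge pass over the generated reference answers. The notebook's actual implementation is not part of this diff; the sketch below only illustrates the idea, with a hypothetical is_real_answer helper and gpt-4o as the judge (the reference answer generator model named in the text):

# Hypothetical sketch of step 4: ask the reference-answer model whether each
# reference answer actually answers its question or only says it doesn't know.
from openai import OpenAI

client = OpenAI()

def is_real_answer(question: str, reference_answer: str) -> bool:
    prompt = (
        "Does the following answer actually answer the question, or does it only "
        "say that the information is unavailable? Reply ANSWERED or NOT_ANSWERED.\n\n"
        f"Question: {question}\nAnswer: {reference_answer}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
    )
    return reply.choices[0].message.content.strip().upper().startswith("ANSWERED")

# Placeholder data; in the notebook these would come from the generated questions file.
samples = [{"question": "What is X?", "reference_answer": "The content does not say."}]
answerable = [s for s in samples if is_real_answer(s["question"], s["reference_answer"])]
unanswerable = [s for s in samples if s not in answerable]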

notebooks/evaluation/run.yaml

Lines changed: 4 additions & 4 deletions
@@ -17,17 +17,17 @@ providers:
     provider_type: remote::watsonx
     config:
       url: ${env.WATSONX_BASE_URL:https://us-south.ml.cloud.ibm.com}
-      api_key: ${env.WATSONX_API_KEY}
-      project_id: ${env.WATSONX_PROJECT_ID}
+      api_key: ${env.WATSONX_API_KEY:key-not-set}
+      project_id: ${env.WATSONX_PROJECT_ID:project-not-set}
       timeout: 1200
   - provider_id: llama-openai-compat
     provider_type: remote::llama-openai-compat
     config:
-      api_key: ${env.LLAMA_API_KEY}
+      api_key: ${env.LLAMA_API_KEY:key-not-set}
   - provider_id: openai
     provider_type: remote::openai
     config:
-      api_key: ${env.OPENAI_API_KEY}
+      api_key: ${env.OPENAI_API_KEY:key-not-set}
   - provider_id: sentence-transformers
     provider_type: inline::sentence-transformers
     config: {}
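The run.yaml change above gives each credential placeholder a fallback value, so the configuration can be loaded even when a provider's key is not set in the environment. The sketch below shows how ${env.NAME:default} placeholders of that shape could be resolved, assuming the text after the colon acts as the default; it is an illustration, not Llama Stack's actual substitution code:

# Illustration only: resolve ${env.NAME:default} placeholders in the style used
# by the run.yaml values above. Not Llama Stack's actual implementation.
import os
import re

PLACEHOLDER = re.compile(r"\$\{env\.([A-Za-z0-9_]+)(?::([^}]*))?\}")

def resolve(value: str) -> str:
    def substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        return os.environ.get(name, default if default is not None else "")
    return PLACEHOLDER.sub(substitute, value)

print(resolve("${env.OPENAI_API_KEY:key-not-set}"))  # prints key-not-set if unset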
