Tentative implementation for testing prompts in chatbot (WIP) #4253

Draft
robinpaul85 wants to merge 13 commits into llm.chatbot.prototype.2 from llm.chatbot.prototype.3
Conversation

robinpaul85 (Contributor) commented Feb 25, 2026

Description

Implementation for automatically running test prompts against the chatbot (WIP).

Checklist

Check each task that has been performed or verified to be not applicable.

  • Tests: Added and/or passed unit and integration tests, or N/A
  • Todos: Commented or documented, or N/A
  • Notable Changes: updated release.txt, prefixed a commit message with "fix:" or "feat:", added to an internal tracking document, or N/A
  • Rust: Checked to see whether Rust needs to be re-compiled because of this PR, or N/A

Copilot AI left a comment

Pull request overview

This PR adds a WIP pathway for automatically running chatbot “test prompts” against datasets, wiring a new testchat CLI flag into server startup and refactoring the existing chatbot unit-test script into a callable function.

Changes:

  • Add testchat argv handling in server/src/app.ts to iterate datasets and invoke chatbot tests.
  • Refactor server/routes/chat/test/chatUnitTests.ts to export test_chatbot_by_dataset() and add matrix-output validation.
  • Enable chatbot test data for the TermdbTest dataset and adjust server package/dependency versions.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

  • server/src/app.ts: Imports and triggers chatbot tests via a new testchat CLI argument during launch.
  • server/routes/chat/test/chatUnitTests.ts: Exposes a dataset-scoped test runner and adds validation for additional chatbot actions.
  • server/package.json: Changes server workspace version and several internal dependency versions.
  • server/dataset/termdb.test.ts: Adds queries.chat.aifiles to point to chatbot test data JSON.


Comment on lines 157 to 163
for (const genome of Object.values(genomes)) {
	for (const ds of Object.values(genome.datasets)) {
		if ((ds as any)?.queries?.chat) {
			test_chatbot_by_dataset(ds)
		}
	}
}
Copilot AI commented Feb 25, 2026

The testchat branch calls test_chatbot_by_dataset(ds) without await, and handle_argv() doesn’t return an exit object afterward. This means the server can continue to start listening while tests run in the background, and async failures can become unhandled rejections. Await the test runs, aggregate pass/fail, and return an appropriate { code } (or exit) so testchat behaves like a real CLI test command.

Suggested change
for (const genome of Object.values(genomes)) {
	for (const ds of Object.values(genome.datasets)) {
		if ((ds as any)?.queries?.chat) {
			test_chatbot_by_dataset(ds)
		}
	}
}
let hadFailure = false
for (const genome of Object.values(genomes)) {
	for (const ds of Object.values(genome.datasets)) {
		if ((ds as any)?.queries?.chat) {
			try {
				const result: any = await test_chatbot_by_dataset(ds as any)
				// If the test function returns a result object indicating failure, mark it.
				if (
					result === false ||
					(typeof result === 'object' &&
						result !== null &&
						(('success' in result && !result.success) ||
							('ok' in result && !result.ok)))
				) {
					hadFailure = true
				}
			} catch (e) {
				hadFailure = true
			}
		}
	}
}
return { code: hadFailure ? 1 : 0 }

Copilot uses AI. Check for mistakes.
Comment on lines 8 to 13
const testing = true // This causes raw LLM output to be sent by the agent
const llm = serverconfig.llm
if (!llm) throw 'serverconfig.llm is not configured'
if (llm.provider !== 'SJ' && llm.provider !== 'ollama') {
throw "llm.provider must be 'SJ' or 'ollama'"
}
Copilot AI commented Feb 25, 2026

This module runs serverconfig.llm validation at import time and throws if missing/unsupported. Once this file is imported by the server (e.g. via src/app.ts), that becomes a server-startup side effect even when not running chatbot tests. Move these checks inside test_chatbot_by_dataset() (or the dedicated CLI entrypoint) so importing the module can’t crash the server.

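One way to address this, sketched under the assumption that the config object looks like the snippet above (the helper name validateLlmConfig is illustrative, not the PR's code):

```typescript
// Sketch (not the PR's code): defer LLM config validation to call time so
// importing the module has no startup side effect. The config shape and the
// helper name validateLlmConfig are assumptions for illustration.
type LlmConfig = { provider: string }

export function validateLlmConfig(llm?: LlmConfig): LlmConfig {
  if (!llm) throw new Error('serverconfig.llm is not configured')
  if (llm.provider !== 'SJ' && llm.provider !== 'ollama') {
    throw new Error("llm.provider must be 'SJ' or 'ollama'")
  }
  return llm
}

export async function test_chatbot_by_dataset(ds: unknown, llmConfig?: LlmConfig) {
  // Validation now happens only when tests actually run, not at import time.
  const llm = validateLlmConfig(llmConfig)
  // ...run the dataset's test prompts with llm...
  return llm
}
```

With this shape, src/app.ts can import the module freely; a misconfigured serverconfig.llm only surfaces when the testchat path is actually invoked.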
Comment on lines 15 to 36
export async function test_chatbot_by_dataset(ds: any) {
	// Check to see if the dataset supports the AI chatbot
	//console.log("dataset:", dataset.aifiles)
	if (!(ds as any)?.queries?.chat.aifiles) throw 'AI dataset JSON file is missing for dataset:' + ds.label
	const aifiles = (ds as any)?.queries?.chat.aifiles
	const dataset_json = await readJSONFile(aifiles) // Read AI JSON data file
	//console.log("dataset_json:", dataset_json)
	const aiFilesDir = path.dirname(aifiles)
	for (const test_data of dataset_json.TestData) {
		const test_result = await run_chat_pipeline(
			test_data.question,
			llm,
			serverconfig.aiRoute,
			dataset_json,
			testing, // This is not needed anymore, need to be deprecated
			serverconfig.tpmasterdir + '/' + dataset_json.db,
			serverconfig.tpmasterdir + '/' + dataset_json.genedb,
			ds,
			aiFilesDir
		)
		console.log('test_result:', test_result)
		if (test_result.action == 'html') {
Copilot AI commented Feb 25, 2026

test_chatbot_by_dataset() currently only logs mismatches and continues; it never signals failure to the caller (no return value/throw), so an automated run can succeed even when prompts fail. Consider returning a boolean/result summary and throwing (or setting a non-zero exit code in the caller) when any test case fails.

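A minimal sketch of the suggested result summary (PromptResult and summarize are illustrative names, not the PR's actual types):

```typescript
// Sketch: collect a pass/fail outcome per prompt and return a summary so the
// caller can decide on an exit code, instead of only logging mismatches.
interface PromptResult {
  question: string
  passed: boolean
}

export function summarize(results: PromptResult[]): { ok: boolean; failed: string[] } {
  // Report which prompts failed so the CLI caller can log them and exit non-zero.
  const failed = results.filter(r => !r.passed).map(r => r.question)
  return { ok: failed.length === 0, failed }
}
```

The testchat branch in app.ts could then await the runner, inspect ok, and return { code: 1 } when any prompt fails.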
Comment on lines 39 to 45
console.log(
	'html resource request did not match for prompt' +
		test_data.question +
		'. LLM response :' +
		test_result.response +
		' Actual response: ' +
		test_data.answer
Copilot AI commented Feb 25, 2026

The mismatch log messages concatenate for prompt directly with the question text (missing a separating space/colon), which makes logs harder to read/search. Add a delimiter (e.g., "for prompt: ") consistently for html/summary/dge/matrix cases.

activeTracks: ['bw 1', 'bed 1']
},
chat: {}
chat: { aifiles: './proteinpaint/server/dataset/ai/termdb.test.json' }
Copilot AI commented Feb 25, 2026

The aifiles path is ./proteinpaint/server/dataset/ai/termdb.test.json, but this repo’s file lives at server/dataset/ai/termdb.test.json. Since readJSONFile() reads paths relative to process.cwd(), running from the server/ workspace will not find ./proteinpaint/... (there is no server/proteinpaint/ dir). Update this to a path that resolves correctly from the server workspace (or make it absolute via import.meta.url).

Suggested change
chat: { aifiles: './proteinpaint/server/dataset/ai/termdb.test.json' }
chat: { aifiles: './dataset/ai/termdb.test.json' }

Comment on lines 2 to 4
"name": "@sjcrh/proteinpaint-server",
"version": "2.177.1-0",
"version": "2.177.0",
"type": "module",
Copilot AI commented Feb 25, 2026

The server package version was changed from a 2.177.1-0 workspace-aligned version to 2.177.0, which makes it inconsistent with the repo/workspace versions (root/client/front/container packages remain 2.177.1-0). Unless this is an intentional coordinated version rollback across the monorepo, revert this to keep workspace versions in sync.

Comment on lines 65 to 69
"@sjcrh/proteinpaint-python": "2.174.0",
"@sjcrh/proteinpaint-r": "2.152.1-0",
"@sjcrh/proteinpaint-rust": "2.177.1-0",
"@sjcrh/proteinpaint-shared": "2.177.1-0",
"@sjcrh/proteinpaint-types": "2.177.1-0",
"@sjcrh/proteinpaint-rust": "2.177.0",
"@sjcrh/proteinpaint-shared": "2.177.0",
"@sjcrh/proteinpaint-types": "2.177.0",
Copilot AI commented Feb 25, 2026

Several internal dependencies were downgraded (e.g. @sjcrh/proteinpaint-python to 2.174.0) while other workspaces in this repo are at 2.177.1-0. In a workspaces setup, this can cause npm to pull older registry packages instead of using the local workspace versions, and will likely desync package-lock.json. Re-align these dependency versions with the monorepo release version (or use a workspace protocol) if the downgrade wasn’t deliberate.

robinpaul85 requested a review from siosonel on February 26, 2026 16:33
siosonel (Member) commented Feb 26, 2026

@robinpaul85 @compbiolover @xzhou82 In order to minimize having test/spec code in prod, and instead of importing test code in server/src/app.ts, it's better to do the following:

  1. Create a test route file like server/src/test/routes/chat.ts or similar, so there will be a /chat-tests server endpoint. The route handler will run your tests.
  2. Edit server/src/serverconfig.js to add chat.ts to the testRouteSetters array.
  3. To run the tests, just load http://localhost:3000/chat-tests. It can reply with ok if passing, and also emit the test results in the server logs.

A different way to do this is to have code similar to server/src/test/xfetch.integration.spec.js, which calls /ky-retry-test and /external-API-test?dslabel=GDC to run tests. So the /chat-tests route handler doesn't have to run the tests, it just has to respond, and the integration spec code will run the tape tests.
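The route-based suggestion above could be sketched roughly as follows, assuming an Express-style app object; runChatTests, the outcome shape, and the handler body are illustrative assumptions, not existing code:

```typescript
// Sketch of the suggested /chat-tests route. The route path and the "ok"
// reply follow the comment above; everything else is illustrative.
type TestOutcome = { passed: number; failed: number }

async function runChatTests(): Promise<TestOutcome> {
  // Placeholder: would iterate chat-enabled datasets and run their prompts.
  return { passed: 0, failed: 0 }
}

type Res = { send: (body: string) => void; status: (code: number) => Res }
type AppLike = { get: (path: string, handler: (req: unknown, res: Res) => void) => void }

export function setChatTestRoute(app: AppLike) {
  app.get('/chat-tests', async (_req, res) => {
    const { passed, failed } = await runChatTests()
    // Emit the test results in the server logs, per the suggestion.
    console.log(`chat tests: ${passed} passed, ${failed} failed`)
    if (failed > 0) res.status(500).send('failed')
    else res.send('ok') // reply "ok" when all prompts pass
  })
}
```

Registering this setter via serverconfig's testRouteSetters would keep the test code out of the prod startup path entirely.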

function validate_scatter_output(output: SampleScatterType, expected: SampleScatterType): boolean {
	if (output.plotName != expected.plotName) {
		console.log(
			'SampleScatter plotName did not match. LLM response: ' + output.plotName + ' Expected: ' + expected.plotName
Member left a comment

Should use an assertion library, such as tape's test.equal(...). It would clean up the code and make tests more familiar and more easily migrated to an integration or e2e spec file.

robinpaul85 force-pushed the llm.chatbot.prototype.3 branch from f8eeccc to a36dd29 on February 27, 2026 18:39
robinpaul85 force-pushed the llm.chatbot.prototype.2 branch from 24c96ab to db57a78 on February 27, 2026 20:23
robinpaul85 force-pushed the llm.chatbot.prototype.3 branch from 9ce675d to 3dbb925 on February 27, 2026 20:24
		ds
	)
	console.log('test_result:', test_result)
	if (test_result.type == 'html') {
xzhou82 (Collaborator) commented Feb 27, 2026

Use of detailed if-else is counter-intuitive. Just strictly require test_result to be deepEqual to the expected result object, and get rid of all these if-else branches.

siosonel (Member) commented Feb 27, 2026

Should also use the tape library to assert (e.g. tape's test.deepEqual()); the test logs would be easier to read and pass/fail counts would be tracked automatically.
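The suggested assertion style can be sketched with Node's built-in assert as a stand-in; tape's t.deepEqual(actual, expected, msg) has the same call shape. The result objects below are illustrative, not the PR's data:

```typescript
// Sketch: one deep-equality assertion replaces the manual per-field if-else
// comparisons; it throws with a readable diff on any mismatch.
import { deepStrictEqual } from 'node:assert'

const actual = { action: 'matrix', terms: ['age', 'sex'] }
const expected = { action: 'matrix', terms: ['age', 'sex'] }

deepStrictEqual(actual, expected)
```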

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I can use that. However, using deepEqual() on the entire object may raise false errors. For example, in the case of prompts with invalid dictionary terms the LLM usually hallucinates in a non-reproducible manner (especially if we change the LLM itself). In such a scenario deepEqual() will raise a false error.
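A possible middle ground that keeps deep equality while avoiding false errors from nondeterministic fields is to compare only a whitelist of stable keys; the key names below are assumptions for illustration:

```typescript
// Sketch: deep-compare only deterministic keys so free-text LLM output
// (e.g. a hallucinated "response" field) cannot trigger false failures.
import { deepStrictEqual } from 'node:assert'

function pick(obj: Record<string, unknown>, keys: readonly string[]): Record<string, unknown> {
  const out: Record<string, unknown> = {}
  for (const k of keys) if (k in obj) out[k] = obj[k]
  return out
}

// Illustrative whitelist of fields expected to be reproducible across runs.
const stableKeys = ['action', 'plotName'] as const

export function assertStableFieldsEqual(
  actual: Record<string, unknown>,
  expected: Record<string, unknown>
) {
  deepStrictEqual(pick(actual, stableKeys), pick(expected, stableKeys))
}
```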

you may temporarily uncomment below to generate runtime route checker code,
should only uncomment when a file has been added or deleted in
shared/types/src/routes and not when modified.
*/
Member left a comment

It looks like the indentations are being removed. Make sure prettier is being used to reformat staged code, as triggered by git pre-commit or pre-push hooks.
