Tentative implementation for testing prompts in chatbot (WIP) #4253

Draft
robinpaul85 wants to merge 13 commits into llm.chatbot.prototype.2 from llm.chatbot.prototype.3
Conversation

robinpaul85 (Contributor) commented Feb 25, 2026

Description

Implementation for automatically running test prompts against the chatbot (WIP).

Checklist

Check each task that has been performed or verified to be not applicable.

  • Tests: Added and/or passed unit and integration tests, or N/A
  • Todos: Commented or documented, or N/A
  • Notable Changes: updated release.txt, prefixed a commit message with "fix:" or "feat:", added to an internal tracking document, or N/A
  • Rust: Checked to see whether Rust needs to be re-compiled because of this PR, or N/A

Copilot AI left a comment

Pull request overview

This PR adds a WIP pathway for automatically running chatbot “test prompts” against datasets, wiring a new testchat CLI flag into server startup and refactoring the existing chatbot unit-test script into a callable function.

Changes:

  • Add testchat argv handling in server/src/app.ts to iterate datasets and invoke chatbot tests.
  • Refactor server/routes/chat/test/chatUnitTests.ts to export test_chatbot_by_dataset() and add matrix-output validation.
  • Enable chatbot test data for the TermdbTest dataset and adjust server package/dependency versions.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

  • server/src/app.ts: Imports and triggers chatbot tests via a new testchat CLI argument during launch.
  • server/routes/chat/test/chatUnitTests.ts: Exposes a dataset-scoped test runner and adds validation for additional chatbot actions.
  • server/package.json: Changes server workspace version and several internal dependency versions.
  • server/dataset/termdb.test.ts: Adds queries.chat.aifiles to point to chatbot test data JSON.


Comment on lines 157 to 163
for (const genome of Object.values(genomes)) {
	for (const ds of Object.values(genome.datasets)) {
		if ((ds as any)?.queries?.chat) {
			test_chatbot_by_dataset(ds)
		}
	}
}
Copilot AI commented Feb 25, 2026

The testchat branch calls test_chatbot_by_dataset(ds) without await, and handle_argv() doesn’t return an exit object afterward. This means the server can continue to start listening while tests run in the background, and async failures can become unhandled rejections. Await the test runs, aggregate pass/fail, and return an appropriate { code } (or exit) so testchat behaves like a real CLI test command.

Suggested change
for (const genome of Object.values(genomes)) {
	for (const ds of Object.values(genome.datasets)) {
		if ((ds as any)?.queries?.chat) {
			test_chatbot_by_dataset(ds)
		}
	}
}
let hadFailure = false
for (const genome of Object.values(genomes)) {
	for (const ds of Object.values(genome.datasets)) {
		if ((ds as any)?.queries?.chat) {
			try {
				const result: any = await test_chatbot_by_dataset(ds as any)
				// If the test function returns a result object indicating failure, mark it.
				if (
					result === false ||
					(typeof result === 'object' &&
						result !== null &&
						(('success' in result && !result.success) ||
							('ok' in result && !result.ok)))
				) {
					hadFailure = true
				}
			} catch (e) {
				hadFailure = true
			}
		}
	}
}
return { code: hadFailure ? 1 : 0 }

Copilot uses AI. Check for mistakes.
Comment on lines 8 to 13
const testing = true // This causes raw LLM output to be sent by the agent
const llm = serverconfig.llm
if (!llm) throw 'serverconfig.llm is not configured'
if (llm.provider !== 'SJ' && llm.provider !== 'ollama') {
throw "llm.provider must be 'SJ' or 'ollama'"
}
Copilot AI commented Feb 25, 2026

This module runs serverconfig.llm validation at import time and throws if missing/unsupported. Once this file is imported by the server (e.g. via src/app.ts), that becomes a server-startup side effect even when not running chatbot tests. Move these checks inside test_chatbot_by_dataset() (or the dedicated CLI entrypoint) so importing the module can’t crash the server.

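One way to address this, sketched under the assumption that the config object looks like the snippet above (the helper name validateLlmConfig is illustrative, not the PR's code):

```typescript
// Sketch (not the PR's code): defer LLM config validation to call time so
// importing the module has no startup side effect. The config shape and the
// helper name validateLlmConfig are assumptions for illustration.
type LlmConfig = { provider: string }

export function validateLlmConfig(llm?: LlmConfig): LlmConfig {
  if (!llm) throw new Error('serverconfig.llm is not configured')
  if (llm.provider !== 'SJ' && llm.provider !== 'ollama') {
    throw new Error("llm.provider must be 'SJ' or 'ollama'")
  }
  return llm
}

export async function test_chatbot_by_dataset(ds: unknown, llmConfig?: LlmConfig) {
  // Validation now happens only when tests actually run, not at import time.
  const llm = validateLlmConfig(llmConfig)
  // ...run the dataset's test prompts with llm...
  return llm
}
```

With this shape, src/app.ts can import the module freely; a misconfigured serverconfig.llm only surfaces when the testchat path is actually invoked.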
Comment on lines 15 to 36
export async function test_chatbot_by_dataset(ds: any) {
	// Check to see if the dataset supports the AI chatbot
	//console.log("dataset:", dataset.aifiles)
	if (!(ds as any)?.queries?.chat.aifiles) throw 'AI dataset JSON file is missing for dataset:' + ds.label
	const aifiles = (ds as any)?.queries?.chat.aifiles
	const dataset_json = await readJSONFile(aifiles) // Read AI JSON data file
	//console.log("dataset_json:", dataset_json)
	const aiFilesDir = path.dirname(aifiles)
	for (const test_data of dataset_json.TestData) {
		const test_result = await run_chat_pipeline(
			test_data.question,
			llm,
			serverconfig.aiRoute,
			dataset_json,
			testing, // This is not needed anymore, need to be deprecated
			serverconfig.tpmasterdir + '/' + dataset_json.db,
			serverconfig.tpmasterdir + '/' + dataset_json.genedb,
			ds,
			aiFilesDir
		)
		console.log('test_result:', test_result)
		if (test_result.action == 'html') {
Copilot AI commented Feb 25, 2026

test_chatbot_by_dataset() currently only logs mismatches and continues; it never signals failure to the caller (no return value/throw), so an automated run can succeed even when prompts fail. Consider returning a boolean/result summary and throwing (or setting a non-zero exit code in the caller) when any test case fails.

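A minimal sketch of the suggested result summary (PromptResult and summarize are illustrative names, not the PR's actual types):

```typescript
// Sketch: collect a pass/fail outcome per prompt and return a summary so the
// caller can decide on an exit code, instead of only logging mismatches.
interface PromptResult {
  question: string
  passed: boolean
}

export function summarize(results: PromptResult[]): { ok: boolean; failed: string[] } {
  // Report which prompts failed so the CLI caller can log them and exit non-zero.
  const failed = results.filter(r => !r.passed).map(r => r.question)
  return { ok: failed.length === 0, failed }
}
```

The testchat branch in app.ts could then await the runner, inspect ok, and return { code: 1 } when any prompt fails.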
Comment on lines 39 to 45
console.log(
	'html resource request did not match for prompt' +
		test_data.question +
		'. LLM response :' +
		test_result.response +
		' Actual response: ' +
		test_data.answer
Copilot AI commented Feb 25, 2026

The mismatch log messages concatenate for prompt directly with the question text (missing a separating space/colon), which makes logs harder to read/search. Add a delimiter (e.g., "for prompt: ") consistently for html/summary/dge/matrix cases.

activeTracks: ['bw 1', 'bed 1']
},
chat: {}
chat: { aifiles: './proteinpaint/server/dataset/ai/termdb.test.json' }
Copilot AI commented Feb 25, 2026

The aifiles path is ./proteinpaint/server/dataset/ai/termdb.test.json, but this repo’s file lives at server/dataset/ai/termdb.test.json. Since readJSONFile() reads paths relative to process.cwd(), running from the server/ workspace will not find ./proteinpaint/... (there is no server/proteinpaint/ dir). Update this to a path that resolves correctly from the server workspace (or make it absolute via import.meta.url).

Suggested change
chat: { aifiles: './proteinpaint/server/dataset/ai/termdb.test.json' }
chat: { aifiles: './dataset/ai/termdb.test.json' }

Comment on lines 2 to 4
"name": "@sjcrh/proteinpaint-server",
"version": "2.177.1-0",
"version": "2.177.0",
"type": "module",
Copilot AI commented Feb 25, 2026

The server package version was changed from a 2.177.1-0 workspace-aligned version to 2.177.0, which makes it inconsistent with the repo/workspace versions (root/client/front/container packages remain 2.177.1-0). Unless this is an intentional coordinated version rollback across the monorepo, revert this to keep workspace versions in sync.

Comment on lines 65 to 69
"@sjcrh/proteinpaint-python": "2.174.0",
"@sjcrh/proteinpaint-r": "2.152.1-0",
"@sjcrh/proteinpaint-rust": "2.177.1-0",
"@sjcrh/proteinpaint-shared": "2.177.1-0",
"@sjcrh/proteinpaint-types": "2.177.1-0",
"@sjcrh/proteinpaint-rust": "2.177.0",
"@sjcrh/proteinpaint-shared": "2.177.0",
"@sjcrh/proteinpaint-types": "2.177.0",
Copilot AI commented Feb 25, 2026

Several internal dependencies were downgraded (e.g. @sjcrh/proteinpaint-python to 2.174.0) while other workspaces in this repo are at 2.177.1-0. In a workspaces setup, this can cause npm to pull older registry packages instead of using the local workspace versions, and will likely desync package-lock.json. Re-align these dependency versions with the monorepo release version (or use a workspace protocol) if the downgrade wasn’t deliberate.

robinpaul85 requested a review from siosonel on February 26, 2026 16:33
siosonel (Member) commented Feb 26, 2026

@robinpaul85 @compbiolover @xzhou82 In order to minimize having test/spec code in prod, and instead of importing test code in server/src/app.ts, it's better to do the following:

  1. Create a test route file like server/src/test/routes/chat.ts or similar, so there will be a /chat-tests server endpoint. The route handler will run your tests.
  2. Edit server/src/serverconfig.js to add chat.ts to the testRouteSetters array.
  3. To run the tests, just load http://localhost:3000/chat-tests. It can reply with ok if passing, and also emit the test results in the server logs.

A different way to do this is to have code similar to server/src/test/xfetch.integration.spec.js, which calls /ky-retry-test and /external-API-test?dslabel=GDC to run tests. So the /chat-tests route handler doesn't have to run the tests, it just has to respond, and the integration spec code will run the tape tests.
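The route-based suggestion above could be sketched roughly as follows, assuming an Express-style app object; runChatTests, the outcome shape, and the handler body are illustrative assumptions, not existing code:

```typescript
// Sketch of the suggested /chat-tests route. The route path and the "ok"
// reply follow the comment above; everything else is illustrative.
type TestOutcome = { passed: number; failed: number }

async function runChatTests(): Promise<TestOutcome> {
  // Placeholder: would iterate chat-enabled datasets and run their prompts.
  return { passed: 0, failed: 0 }
}

type Res = { send: (body: string) => void; status: (code: number) => Res }
type AppLike = { get: (path: string, handler: (req: unknown, res: Res) => void) => void }

export function setChatTestRoute(app: AppLike) {
  app.get('/chat-tests', async (_req, res) => {
    const { passed, failed } = await runChatTests()
    // Emit the test results in the server logs, per the suggestion.
    console.log(`chat tests: ${passed} passed, ${failed} failed`)
    if (failed > 0) res.status(500).send('failed')
    else res.send('ok') // reply "ok" when all prompts pass
  })
}
```

Registering this setter via serverconfig's testRouteSetters would keep the test code out of the prod startup path entirely.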

function validate_scatter_output(output: SampleScatterType, expected: SampleScatterType): boolean {
	if (output.plotName != expected.plotName) {
		console.log(
			'SampleScatter plotName did not match. LLM response: ' + output.plotName + ' Expected: ' + expected.plotName
Member left a comment

Should use an assertion library, such as tape's test.equal(...). It would clean up the code and make tests more familiar and more easily migrated to an integration or e2e spec file.

robinpaul85 force-pushed the llm.chatbot.prototype.3 branch from f8eeccc to a36dd29 on February 27, 2026 18:39
robinpaul85 force-pushed the llm.chatbot.prototype.2 branch from 24c96ab to db57a78 on February 27, 2026 20:23
robinpaul85 force-pushed the llm.chatbot.prototype.3 branch from 9ce675d to 3dbb925 on February 27, 2026 20:24
		ds
	)
	console.log('test_result:', test_result)
	if (test_result.type == 'html') {
xzhou82 (Collaborator) commented Feb 27, 2026

Use of detailed if-else is counter-intuitive. Just strictly require test_result to be deepEqual to the expected result object, and get rid of all these if-else branches.

siosonel (Member) commented Feb 27, 2026

Should also use the tape library to assert (e.g. tape's test.deepEqual()); the test logs would be easier to read and pass/fail counts would be tracked automatically.
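The suggested assertion style can be sketched with Node's built-in assert as a stand-in; tape's t.deepEqual(actual, expected, msg) has the same call shape. The result objects below are illustrative, not the PR's data:

```typescript
// Sketch: one deep-equality assertion replaces the manual per-field if-else
// comparisons; it throws with a readable diff on any mismatch.
import { deepStrictEqual } from 'node:assert'

const actual = { action: 'matrix', terms: ['age', 'sex'] }
const expected = { action: 'matrix', terms: ['age', 'sex'] }

deepStrictEqual(actual, expected)
```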

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I can use that. However, using deepEqual() on the entire object may raise false errors. For example, in the case of prompts with invalid dictionary terms the LLM usually hallucinates in a non-reproducible manner (especially if we change the LLM itself). In such a scenario deepEqual() will raise a false error.
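A possible middle ground that keeps deep equality while avoiding false errors from nondeterministic fields is to compare only a whitelist of stable keys; the key names below are assumptions for illustration:

```typescript
// Sketch: deep-compare only deterministic keys so free-text LLM output
// (e.g. a hallucinated "response" field) cannot trigger false failures.
import { deepStrictEqual } from 'node:assert'

function pick(obj: Record<string, unknown>, keys: readonly string[]): Record<string, unknown> {
  const out: Record<string, unknown> = {}
  for (const k of keys) if (k in obj) out[k] = obj[k]
  return out
}

// Illustrative whitelist of fields expected to be reproducible across runs.
const stableKeys = ['action', 'plotName'] as const

export function assertStableFieldsEqual(
  actual: Record<string, unknown>,
  expected: Record<string, unknown>
) {
  deepStrictEqual(pick(actual, stableKeys), pick(expected, stableKeys))
}
```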

you may temporarily uncomment below to generate runtime route checker code,
should only uncomment when a file has been added or deleted in
shared/types/src/routes and not when modified.
*/
Member left a comment

It looks like the indentations are being removed. Make sure prettier is being used to reformat staged code, as triggered by git pre-commit or pre-push hooks.
