tests: Create MCP tool usage tests #1808
Closed
53 commits
542f3b1 Add initial testing code (latekvo)
35aac32 better typing, result guards (latekvo)
d7ecdf5 add allowed tools check (latekvo)
3e29cf1 add basic error reporting, cleanup (latekvo)
66aa875 register command for running tests (latekvo)
acc153d more metadata in status reports (latekvo)
7f0734c Merge branch 'main' into @latekvo/create-ai-tests (latekvo)
51d77e4 stylistic fix (latekvo)
557ccf0 Merge branch '@latekvo/create-ai-tests' of https://github.com/softwar… (latekvo)
9b8f3d8 initial file-reading impl (syncing work, branch switch) (latekvo)
4da6725 fix workaround for not awaiting completion (latekvo)
a36f7ee system-agnostic random path (latekvo)
4b6b795 add undo before new chat (latekvo)
898d595 ensure agent mode is used (latekvo)
e3eed04 remove all popups requiring human input, add todos (latekvo)
f68ddad Merge remote-tracking branch 'origin/main' into @latekvo/create-ai-tests (latekvo)
2b7df4d sync (latekvo)
e559400 prevent more popups (latekvo)
317d18a pretty result printing (latekvo)
b661880 add git restore on each run (latekvo)
df1c643 Merge remote-tracking branch 'origin/main' into @latekvo/create-ai-tests (latekvo)
106dfb7 cleanup (latekvo)
e3b278c replace sleep with known util (latekvo)
f0b8865 add initial termination implementation (latekvo)
204d571 hook up the state manager (latekvo)
c8021e0 remove resolved todo (latekvo)
b9889e2 fix: use correct status update command (latekvo)
54595ba fix invalid command name (latekvo)
45a66a9 fix type errors, fix typecasting ide as state manager (latekvo)
104548e use partial state instead (latekvo)
4bdb9f6 fix naming (latekvo)
066f377 await test state setting (latekvo)
3ceb9e0 minor comment change (latekvo)
1f25e8a prevent multiple launches of the tool tests (latekvo)
5608e80 add docstrings for the command executors (latekvo)
9ea5386 await global context setting (latekvo)
c924ef8 Merge remote-tracking branch 'origin/main' into @latekvo/create-ai-tests (latekvo)
a24bfab add more test cases (latekvo)
d606541 simplify timeout and termination code, remove unwanted test case (latekvo)
cac02bd improve output formatting, data (latekvo)
efc9fcf fix bool inversion (latekvo)
d38f74b finally implement early termination (latekvo)
ea4cf4a fix a lot of typos and minor issues (latekvo)
51bcee9 cleanup transcript directory (latekvo)
336c1c0 await final edit clear (latekvo)
097eb1c fix tests failing on success (latekvo)
5a5fa17 add basic Observable (latekvo)
a05d0f4 invoke listeners on observable set (latekvo)
9ac3b45 revert Observable after further consideration (latekvo)
179136b move radonAI global position, revert Observable again (latekvo)
5706a86 simplify state checking logic (latekvo)
28f6573 fix workspace configs (latekvo)
b53a0cf isolate types and test cases (latekvo)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,231 @@ | ||
| import { randomBytes } from "crypto"; | ||
| import { readFileSync } from "fs"; | ||
| import { mkdtemp, rm } from "fs/promises"; | ||
| import { tmpdir } from "os"; | ||
| import path from "path"; | ||
| import { window, commands, Uri, workspace, StatusBarAlignment, ThemeColor } from "vscode"; | ||
| import { Logger } from "../../Logger"; | ||
| import { exec } from "../../utilities/subprocess"; | ||
| import { Platform } from "../../utilities/platform"; | ||
| import { IDE } from "../../project/ide"; | ||
| import { testCases } from "./chatTestCases"; | ||
| import { Response, ToolCallResponse, ChatTestResult, ChatTestCase, ChatData } from "./models"; | ||
|
|
||
| export const GIT_PATH = Platform.select({ | ||
| macos: "git", | ||
| windows: "git.exe", | ||
| linux: "git", | ||
| }); | ||
|
|
||
| function isToolCallResponse(response: Response): response is ToolCallResponse { | ||
| // Smart-casting with `Exclude<string, "literal">` does not work, which is why this utility function is necessary | ||
| return response.kind === "toolInvocationSerialized"; | ||
| } | ||
|
|
||
| async function clearEdits() { | ||
| // Stop previous response - prevents pop-ups on `workbench.action.chat.newChat`. | ||
| await commands.executeCommand("workbench.action.chat.cancel"); | ||
|
|
||
| // Move cursor to input - REQUIRED for `chatEditing.acceptAllFiles`. | ||
| await commands.executeCommand("workbench.panel.chat.view.copilot.focus"); | ||
|
|
||
| // Rejection requires user confirmation, acceptance does not. | ||
| await commands.executeCommand("chatEditing.acceptAllFiles"); | ||
|
|
||
| const gitUri = workspace.workspaceFolders?.[0].uri; | ||
|
|
||
| if (!gitUri) { | ||
| // This case should never occur when a test app is loaded. | ||
| return; | ||
| } | ||
|
|
||
| // Revert all changes via git - we CANNOT use `commands.executeCommand`, as it requires user confirmation. | ||
| await exec(GIT_PATH, ["-C", gitUri.fsPath, "restore", "."]); | ||
| } | ||
|
|
||
| async function setGlobalTestsRunning(areTestsRunning: boolean) { | ||
| await commands.executeCommand("setContext", "RNIDE.MCPToolTestsRunning", areTestsRunning); | ||
| } | ||
|
|
||
| function awaitTestTerminationOrTimeout(ideInstance: IDE, testTimeout: number): Promise<boolean> { | ||
| return new Promise((resolve) => { | ||
| const disposable = ideInstance.onStateChanged(() => { | ||
| // Using partial state here is much more cumbersome and less readable. | ||
| ideInstance.getState().then((state) => { | ||
| const testsRunning = state.workspaceConfiguration.radonAI.areMCPTestsRunning; | ||
| if (testsRunning === false) { | ||
| disposable.dispose(); | ||
| clearTimeout(timeout); | ||
| resolve(false); | ||
| } | ||
| }); | ||
| }); | ||
|
|
||
| const timeout = setTimeout(() => { | ||
| disposable.dispose(); | ||
| resolve(true); | ||
| }, testTimeout); | ||
| }); | ||
| } | ||
|
|
||
| async function setTestStatus(areTestsRunning: boolean, ideInstance: IDE) { | ||
| await setGlobalTestsRunning(areTestsRunning); | ||
| await ideInstance.updateState({ | ||
| workspaceConfiguration: { | ||
| radonAI: { | ||
| areMCPTestsRunning: areTestsRunning, | ||
| }, | ||
| }, | ||
| }); | ||
| } | ||
|
|
||
| function getIdeInstance() { | ||
| const ide = IDE.getInstanceIfExists(); | ||
|
|
||
| if (!ide) { | ||
| throw new Error("IDE instance is not initialized. Ensure the Radon IDE panel is open."); | ||
| } | ||
|
|
||
| return ide; | ||
| } | ||
|
|
||
| /** | ||
| * Executor for `RNIDE.terminateChatToolTest` VSCode command. | ||
| * Terminates ongoing MCP tool tests, which were initiated by `RNIDE.testChatToolUsage` VSCode command. | ||
| */ | ||
| export async function terminateChatToolTest() { | ||
| const ideInstance = getIdeInstance(); | ||
| await setTestStatus(false, ideInstance); | ||
| } | ||
|
|
||
| /** | ||
| * Executor for `RNIDE.testChatToolUsage` VSCode command. | ||
| * Temporarily takes control over the AI chat tab, testing its responses to various prompts. | ||
| * Running this command may interfere with other VSCode functionalities as well. | ||
| */ | ||
| export async function testChatToolUsage(): Promise<void> { | ||
| const ideInstance = getIdeInstance(); | ||
| const runStatus: ChatTestResult[] = []; | ||
|
|
||
| await setTestStatus(true, ideInstance); | ||
|
|
||
| const fail = (testCase: ChatTestCase, cause: string) => { | ||
| runStatus.push({ | ||
| cause, | ||
| success: false, | ||
| prompt: testCase.prompt, | ||
| }); | ||
| }; | ||
|
|
||
| const success = (testCase: ChatTestCase) => { | ||
| runStatus.push({ | ||
| cause: null, | ||
| success: true, | ||
| prompt: testCase.prompt, | ||
| }); | ||
| }; | ||
|
|
||
| // - `showInformationMessage` cannot be programmatically dismissed | ||
| // - `showQuickPick` is a list-selection - does not look right | ||
| // - `createStatusBarItem` looks good, and can be dismissed both programmatically and by the user | ||
| const statusBar = window.createStatusBarItem(StatusBarAlignment.Left, 0); | ||
| statusBar.command = "RNIDE.terminateChatToolTest"; | ||
| statusBar.text = "$(debug-stop) MCP tests running — Terminate"; | ||
| statusBar.tooltip = "Click to terminate running E2E tests"; | ||
| statusBar.color = new ThemeColor("statusBar.foreground"); | ||
| statusBar.backgroundColor = new ThemeColor("statusBarItem.errorBackground"); | ||
| statusBar.show(); | ||
|
|
||
| const dir = await mkdtemp(path.join(tmpdir(), "radon-chat-exports-")); | ||
|
|
||
| for (const testCase of testCases) { | ||
| await clearEdits(); | ||
|
|
||
| await commands.executeCommand("workbench.action.chat.newChat"); | ||
| await commands.executeCommand("workbench.action.chat.openagent", testCase.prompt); | ||
|
|
||
| const shouldContinue = await awaitTestTerminationOrTimeout(ideInstance, 10_000); | ||
|
|
||
| if (!shouldContinue) { | ||
| fail(testCase, "User input: Test was terminated early."); | ||
| break; | ||
| } | ||
|
|
||
| const filepath = path.join(dir, randomBytes(8).toString("hex") + ".json"); | ||
|
|
||
| await commands.executeCommand("workbench.action.chat.export", Uri.file(filepath)); | ||
|
|
||
| let chatData; | ||
| try { | ||
| const exportedText = readFileSync(filepath).toString(); | ||
| chatData = JSON.parse(exportedText) as ChatData; | ||
| } catch { | ||
| fail(testCase, "Internal error: `workbench.action.chat.export` did not work."); | ||
| continue; | ||
| } | ||
|
|
||
| if (chatData.requests.length === 0) { | ||
| fail(testCase, "Internal error: `workbench.action.chat.openagent` did not work."); | ||
| continue; | ||
| } | ||
|
|
||
| if (chatData.requests.length > 1) { | ||
| fail(testCase, "Internal error: `workbench.action.chat.newChat` did not work."); | ||
| continue; | ||
| } | ||
|
|
||
| const responses = chatData.requests[0].response; | ||
|
|
||
| const toolCalls = responses.filter(isToolCallResponse); | ||
|
|
||
| if (toolCalls.length === 0) { | ||
| fail(testCase, "No tools were called."); | ||
| continue; | ||
| } | ||
|
|
||
| const otherCalledTools = []; | ||
| let wasExpectedToolCalled = false; | ||
|
|
||
| for (const toolCall of toolCalls) { | ||
| if (testCase.allowedToolIds.includes(toolCall.toolId)) { | ||
| wasExpectedToolCalled = true; | ||
| success(testCase); | ||
| break; | ||
| } | ||
|
|
||
| otherCalledTools.push(toolCall.toolId); | ||
| } | ||
|
|
||
| if (!wasExpectedToolCalled) { | ||
| const expected = `Expected: ${testCase.allowedToolIds.join(" | ")}`; | ||
| const received = `Received: ${otherCalledTools.join(", ")}`; | ||
| const cause = `${expected}. ${received}`; | ||
| fail(testCase, cause); | ||
| } | ||
| } | ||
|
|
||
| await setTestStatus(false, ideInstance); | ||
|
|
||
| statusBar.hide(); | ||
| statusBar.dispose(); | ||
|
|
||
| rm(dir, { recursive: true }).catch((_e) => { | ||
| // silence the errors, it's fine | ||
| }); | ||
|
|
||
| await clearEdits(); | ||
|
|
||
| const failReasons = runStatus | ||
| .map((v) => `${v.success ? " OK " : "FAIL"}${v.cause !== null ? ` | Error: ${v.cause}` : ""}`) | ||
| .join("\n"); | ||
|
|
||
| const correctCount = runStatus | ||
| .map((v) => (v.success ? 1 : 0) as number) | ||
| .reduce((acc, v) => v + acc, 0); | ||
|
|
||
| const totalCount = runStatus.length; | ||
| const correctPercent = ((correctCount / totalCount) * 100).toFixed(1); | ||
|
|
||
| const response = `\n=== AI TEST RESULTS ===\n${failReasons}\n# TOTAL CORRECT: ${correctCount}/${totalCount} (${correctPercent}%)`; | ||
| Logger.log(response); | ||
| } | ||
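
The result aggregation at the end of `testChatToolUsage` can be isolated into a pure helper, which makes the report format easy to check without launching VSCode. The following is a minimal sketch, not part of the PR: `summarizeResults` is a hypothetical name, and the `ChatTestResult` interface is copied here only to keep the snippet self-contained. It also guards the case where no test produced a result, which is an assumption about the desired behavior.

```typescript
// Mirrors the ChatTestResult shape declared in ./models in this PR.
interface ChatTestResult {
  prompt: string;
  success: boolean;
  cause: string | null;
}

// Hypothetical helper (not in the PR): builds the same report text that
// testChatToolUsage logs, but as a pure function of the collected results.
function summarizeResults(results: ChatTestResult[]): string {
  const lines = results.map(
    (r) => `${r.success ? " OK " : "FAIL"}${r.cause !== null ? ` | Error: ${r.cause}` : ""}`
  );

  const correctCount = results.filter((r) => r.success).length;
  const totalCount = results.length;

  // Assumption: report 0.0% instead of NaN when nothing was run.
  const correctPercent =
    totalCount === 0 ? "0.0" : ((correctCount / totalCount) * 100).toFixed(1);

  return [
    "=== AI TEST RESULTS ===",
    ...lines,
    `# TOTAL CORRECT: ${correctCount}/${totalCount} (${correctPercent}%)`,
  ].join("\n");
}
```

With such a helper, the command executor would only need to call `Logger.log(summarizeResults(runStatus))`.
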
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,94 @@ | ||
| import { ChatTestCase } from "./models"; | ||
|
|
||
| export const testCases: ChatTestCase[] = [ | ||
| { | ||
| prompt: "How to use Shared Element Transitions in Reanimated 4?", | ||
| allowedToolIds: ["query_documentation"], | ||
| }, | ||
| { | ||
| prompt: "How to use SETs in Reanimated?", | ||
| allowedToolIds: ["query_documentation"], | ||
| }, | ||
| { | ||
| prompt: "Implement an example interaction with a local LLM in my app.", | ||
| allowedToolIds: ["query_documentation"], | ||
| }, | ||
| { | ||
| prompt: "Add LLM chat to my app.", | ||
| allowedToolIds: ["query_documentation"], | ||
| }, | ||
|
|
||
| { | ||
| prompt: "My button in the center of the screen is malformed.", | ||
| allowedToolIds: ["view_component_tree", "view_screenshot"], | ||
| }, | ||
| { | ||
| prompt: "The orange button is ugly. Fix it.", | ||
| allowedToolIds: ["view_component_tree", "view_screenshot"], | ||
| }, | ||
|
|
||
| { | ||
| prompt: "Restart the app.", | ||
| allowedToolIds: ["reload_application"], | ||
| }, | ||
| { | ||
| prompt: "The app is frozen. Can you reset it?", | ||
| allowedToolIds: ["reload_application"], | ||
| }, | ||
|
|
||
| { | ||
| prompt: "Why did the app just crash?", | ||
| allowedToolIds: ["view_application_logs"], | ||
| }, | ||
| { | ||
| prompt: "Are there any errors in the logs?", | ||
| allowedToolIds: ["view_application_logs"], | ||
| }, | ||
| { | ||
| prompt: "Debug the error thrown when I clicked the login button.", | ||
| allowedToolIds: ["view_application_logs", "view_component_tree"], | ||
| }, | ||
|
|
||
| { | ||
| prompt: "Does the layout look broken to you?", | ||
| allowedToolIds: ["view_screenshot"], | ||
| }, | ||
| { | ||
| prompt: "I think the text is being cut off on the right side.", | ||
| allowedToolIds: ["view_screenshot"], | ||
| }, | ||
| { | ||
| prompt: "Verify if the dark mode colors are applied correctly.", | ||
| allowedToolIds: ["view_screenshot"], | ||
| }, | ||
| { | ||
| prompt: "Take a look at the current screen.", | ||
| allowedToolIds: ["view_screenshot"], | ||
| }, | ||
|
|
||
| { | ||
| prompt: "What is the hierarchy of the current screen?", | ||
| allowedToolIds: ["view_component_tree"], | ||
| }, | ||
| { | ||
| prompt: "Show me the props passed to the Header component.", | ||
| allowedToolIds: ["view_component_tree"], | ||
| }, | ||
| { | ||
| prompt: "Is the 'Submit' button currently inside a SafeAreaView?", | ||
| allowedToolIds: ["view_component_tree"], | ||
| }, | ||
| { | ||
| prompt: "Find the component ID for the bottom navigation bar.", | ||
| allowedToolIds: ["view_component_tree"], | ||
| }, | ||
|
|
||
| { | ||
| prompt: "Why is the banner not showing up?", | ||
| allowedToolIds: ["view_component_tree", "view_application_logs", "view_screenshot"], | ||
| }, | ||
| { | ||
| prompt: "Inspect the padding on the user profile card.", | ||
| allowedToolIds: ["view_component_tree", "view_screenshot"], | ||
| }, | ||
| ]; |
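
Each test case above passes when at least one of the tools the agent called is in `allowedToolIds`. The sketch below restates that rule as a standalone function so it can be exercised against the cases without a chat session. `evaluateToolCalls` is a hypothetical name introduced for illustration; the types are copied from `./models`, and the failure messages follow the wording used in `testChatToolUsage`.

```typescript
// Types copied from ./models in this PR to keep the sketch self-contained.
type AllowedToolId =
  | "query_documentation"
  | "view_screenshot"
  | "view_component_tree"
  | "view_application_logs"
  | "reload_application";

interface ChatTestCase {
  prompt: string;
  allowedToolIds: AllowedToolId[];
}

interface ChatTestResult {
  prompt: string;
  success: boolean;
  cause: string | null;
}

// Hypothetical helper (not in the PR): decides pass/fail for one test case
// given the IDs of the tools that were called, in call order.
function evaluateToolCalls(testCase: ChatTestCase, calledToolIds: string[]): ChatTestResult {
  const { prompt } = testCase;

  if (calledToolIds.length === 0) {
    return { prompt, success: false, cause: "No tools were called." };
  }

  // Widen to string[] so arbitrary tool IDs can be looked up.
  const allowed: readonly string[] = testCase.allowedToolIds;
  if (calledToolIds.some((id) => allowed.includes(id))) {
    return { prompt, success: true, cause: null };
  }

  const expected = `Expected: ${testCase.allowedToolIds.join(" | ")}`;
  const received = `Received: ${calledToolIds.join(", ")}`;
  return { prompt, success: false, cause: `${expected}. ${received}` };
}
```

Calling extra tools does not fail a case under this rule; only the absence of every allowed tool does.
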
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| export interface ChatData { | ||
| requests: Request[]; | ||
| } | ||
|
|
||
| export interface Request { | ||
| response: Response[]; | ||
| } | ||
|
|
||
| export type Response = ToolCallResponse | UnknownResponse; | ||
|
|
||
| export interface UnknownResponse { | ||
| // `Exclude<string, "literal">` resolves to `string` (does not work) | ||
| kind: unknown; | ||
| } | ||
|
|
||
| export type AllowedToolId = | ||
| | "query_documentation" | ||
| | "view_screenshot" | ||
| | "view_component_tree" | ||
| | "view_application_logs" | ||
| | "reload_application"; | ||
|
|
||
| export interface ToolCallResponse { | ||
| kind: "toolInvocationSerialized"; | ||
| toolId: AllowedToolId; | ||
| } | ||
|
|
||
| export interface ChatTestCase { | ||
| prompt: string; | ||
| allowedToolIds: AllowedToolId[]; | ||
| } | ||
|
|
||
| export interface ChatTestResult { | ||
| prompt: string; | ||
| success: boolean; | ||
| cause: string | null; | ||
| } |
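
The `ToolCallResponse | UnknownResponse` union relies on a type guard to narrow on `kind`, since `UnknownResponse` types `kind` as `unknown`. The sketch below shows how that narrowing could be used to pull tool IDs out of an exported chat transcript. It is an illustration only: `extractToolIds` is a hypothetical helper, `Response` and `Request` are renamed to `ChatResponse` and `ChatRequest` so the snippet does not clash with global DOM types when compiled standalone, and the JSON in the usage note is invented to match the `ChatData` shape rather than taken from a real export.

```typescript
// Types adapted from ./models in this PR (Response/Request renamed, see above).
type AllowedToolId =
  | "query_documentation"
  | "view_screenshot"
  | "view_component_tree"
  | "view_application_logs"
  | "reload_application";

interface ToolCallResponse {
  kind: "toolInvocationSerialized";
  toolId: AllowedToolId;
}

interface UnknownResponse {
  kind: unknown;
}

type ChatResponse = ToolCallResponse | UnknownResponse;

interface ChatRequest {
  response: ChatResponse[];
}

interface ChatData {
  requests: ChatRequest[];
}

// Same check as isToolCallResponse in the PR, applied to the renamed union.
function isToolCallResponse(response: ChatResponse): response is ToolCallResponse {
  return response.kind === "toolInvocationSerialized";
}

// Hypothetical helper (not in the PR): collects the tool IDs of every
// tool invocation found in an exported transcript.
function extractToolIds(exportedText: string): AllowedToolId[] {
  const chatData = JSON.parse(exportedText) as ChatData;
  const toolIds: AllowedToolId[] = [];

  for (const request of chatData.requests) {
    // Passing the guard directly lets `filter` narrow to ToolCallResponse[].
    for (const call of request.response.filter(isToolCallResponse)) {
      toolIds.push(call.toolId);
    }
  }

  return toolIds;
}
```

For example, a transcript such as `{"requests":[{"response":[{"kind":"markdownContent"},{"kind":"toolInvocationSerialized","toolId":"view_screenshot"}]}]}` would yield a single `view_screenshot` entry, because the first response does not satisfy the guard.
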