Our instruction files are becoming more important as we use them to guide developers to the right tools, example prompts, example code, and so on.
To test this, we need to build some code that runs our instructions in a test harness, lets us set expectations, and checks that the responses bear some relevance to the truth.
Even partial automation would be worthwhile. A fully automated setup will be difficult until Copilot has an official API.
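As a sketch of what "set some expectations" could mean concretely, here is a minimal expectation checker in TypeScript. The `Expectations` shape and the function names are hypothetical, not an existing API:

```typescript
// Hypothetical expectation format for a single prompt/response pair.
interface Expectations {
  mustContain?: string[];    // substrings the response should include
  mustNotContain?: string[]; // substrings that indicate a bad response
}

interface EvalResult {
  passed: boolean;
  failures: string[];
}

// Check a Copilot response against a set of expectations
// (case-insensitive substring matching as a first approximation).
function evaluateResponse(response: string, exp: Expectations): EvalResult {
  const failures: string[] = [];
  const text = response.toLowerCase();
  for (const s of exp.mustContain ?? []) {
    if (!text.includes(s.toLowerCase())) failures.push(`missing: ${s}`);
  }
  for (const s of exp.mustNotContain ?? []) {
    if (text.includes(s.toLowerCase())) failures.push(`forbidden: ${s}`);
  }
  return { passed: failures.length === 0, failures };
}
```

Substring matching is deliberately crude; a real harness could swap in an LLM-based grader or a proper eval framework behind the same interface.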
So:
- Use Playwright to automate input to VS Code Copilot on the web (i.e., just use a codespace)
- Feed it a prompt
- Use the VS Code Copilot chat export to get the chat as JSON
- Parse the JSON for our request, get the response, and use some form of eval framework to ensure the response makes sense. Some particular behaviors that I've seen:
- Instructions can sometimes loop in an unpredictable manner
- Instructions can hallucinate parameters to existing tools
- Instructions can be selectively followed
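The parsing/eval step above could be stubbed out along these lines. Note the `ChatTurn` shape is a simplified assumption, not the actual VS Code export schema, and the detector thresholds are placeholders:

```typescript
// Simplified, assumed shape of a VS Code Copilot chat export turn;
// the real export schema differs, so treat this as a parsing stub.
interface ToolCall {
  name: string;
  params: Record<string, unknown>;
}

interface ChatTurn {
  prompt: string;
  response: string;
  toolCalls: ToolCall[];
}

// Flag the looping behavior: the same tool invoked with identical
// parameters more than `limit` times within one turn.
function detectLoops(turn: ChatTurn, limit = 3): boolean {
  const counts = new Map<string, number>();
  for (const call of turn.toolCalls) {
    const key = `${call.name}:${JSON.stringify(call.params)}`;
    const n = (counts.get(key) ?? 0) + 1;
    counts.set(key, n);
    if (n > limit) return true;
  }
  return false;
}

// Flag hallucinated parameters: any parameter name not present in the
// known schema for that tool (schemas supplied by the harness).
function findHallucinatedParams(
  turn: ChatTurn,
  schemas: Record<string, string[]>,
): string[] {
  const bad: string[] = [];
  for (const call of turn.toolCalls) {
    const allowed = schemas[call.name];
    if (!allowed) continue; // unknown tools would be flagged separately
    for (const p of Object.keys(call.params)) {
      if (!allowed.includes(p)) bad.push(`${call.name}.${p}`);
    }
  }
  return bad;
}
```

Selective instruction-following is harder to detect mechanically and probably needs per-instruction expectations rather than a generic check.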
This task can be broken apart and worked on in pieces, with stubs in place. I have a demo of a very simple approach here: https://microsoft-my.sharepoint.com/:v:/p/ripark/EcddcghEhkpIjS0AkA6BOFkBWxqKz1ggUPnTrgVcCrjqgQ?e=CKf8dE
I'm sure there's also some prior art, so some research would help us as well.