Currently, there is limited automated testing to validate the agent’s extraction accuracy or detect regressions when prompts or models are updated.
We also need a curated set of accurate, well-defined examples against which the agent's output can be consistently tested and benchmarked.
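As a rough illustration of how such a golden dataset could drive regression testing, the sketch below runs each curated example through the agent and compares the extracted fields to the expected values. The `extract_vulnerability_data` entry point, the `golden_examples/` directory layout, and the JSON schema (`advisory_text`, `expected`) are all hypothetical placeholders, not the project's actual API.

```python
import json
from pathlib import Path

import pytest

# Hypothetical entry point for the extraction agent; the real call
# signature is project-specific and assumed here for illustration.
from agent import extract_vulnerability_data

GOLDEN_DIR = Path(__file__).parent / "golden_examples"


def load_golden_examples():
    """Yield one pytest param per curated example in the golden dataset."""
    for path in sorted(GOLDEN_DIR.glob("*.json")):
        record = json.loads(path.read_text())
        yield pytest.param(
            record["advisory_text"],
            record["expected"],
            id=path.stem,
        )


@pytest.mark.parametrize("advisory_text,expected", load_golden_examples())
def test_extraction_matches_golden(advisory_text, expected):
    """Fail when a prompt or model change alters the extracted fields."""
    result = extract_vulnerability_data(advisory_text)
    # Compare only the curated fields, so incidental extra keys in the
    # agent's output do not break the test; exact keys depend on the
    # project's extraction schema.
    for field, expected_value in expected.items():
        assert result.get(field) == expected_value, (
            f"Field {field!r} regressed: "
            f"{result.get(field)!r} != {expected_value!r}"
        )
```

Running this suite in CI on every prompt or model change would turn the golden examples into an automatic regression gate rather than a one-off benchmark.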
As a starting point, some data can be imported from the aboutcode-org/vulnerablecode repository, but it must be carefully reviewed and validated before inclusion in the dataset.
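One possible shape for that review step, sketched below under stated assumptions: advisory records are staged into a pending-review area with an explicit `reviewed` flag, and nothing enters the benchmark dataset until a human flips that flag. The source path, output layout, and record structure are assumptions for illustration; the actual location and format of advisory data in aboutcode-org/vulnerablecode varies by importer and must be checked against the repository.

```python
import json
from pathlib import Path

# Hypothetical paths: the real location of advisory data inside the
# vulnerablecode checkout depends on the importer and data release used.
SOURCE_DIR = Path("vulnerablecode/test_data")
CANDIDATE_DIR = Path("golden_examples/pending_review")


def stage_candidates():
    """Copy advisory records into a staging area for manual review."""
    CANDIDATE_DIR.mkdir(parents=True, exist_ok=True)
    for i, path in enumerate(sorted(SOURCE_DIR.rglob("*.json"))):
        record = json.loads(path.read_text())
        candidate = {
            "source": str(path),   # provenance, so reviewers can trace the record
            "record": record,
            "reviewed": False,     # flipped to True only after human validation
        }
        # Prefix with an index to avoid filename collisions across importers.
        out = CANDIDATE_DIR / f"{i:04d}_{path.name}"
        out.write_text(json.dumps(candidate, indent=2))


if __name__ == "__main__":
    stage_candidates()
```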