AIOps test harness and evaluation

- Agentic benchmarking
- Frontier model comparison
- Testing Llama Stack
- LangChain
- CrewAI
- Whitepaper
- Use the whitepaper and demo to bootstrap an "Eval AIOps harness"
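
As a starting point for the harness, a minimal sketch of a comparison loop might look like the following. Everything in it is an assumption made for illustration: the `Agent` callable type, the `Case` record, the keyword-match scoring, and the stub agents are all hypothetical, and none of them come from Llama Stack, LangChain, or CrewAI. Real adapters for those frameworks would replace the stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: an "agent" is any callable that maps an incident
# description to a free-text diagnosis.
Agent = Callable[[str], str]


@dataclass
class Case:
    """One benchmark case (hypothetical schema)."""
    incident: str          # text handed to the agent
    expected_keyword: str  # substring a correct diagnosis should contain


def evaluate(agents: Dict[str, Agent], cases: List[Case]) -> Dict[str, float]:
    """Return, per agent, the fraction of cases whose answer contains the expected keyword."""
    scores: Dict[str, float] = {}
    for name, agent in agents.items():
        hits = sum(
            1
            for case in cases
            if case.expected_keyword.lower() in agent(case.incident).lower()
        )
        scores[name] = hits / len(cases) if cases else 0.0
    return scores


# Stub agents standing in for real framework adapters (assumption).
def always_disk(incident: str) -> str:
    return "Likely cause: disk pressure"


def echo(incident: str) -> str:
    return incident


CASES = [
    Case("pod evicted, node reports disk pressure", "disk"),
    Case("latency spike after deploy, OOMKilled containers", "oom"),
]

if __name__ == "__main__":
    results = evaluate({"always_disk": always_disk, "echo": echo}, CASES)
    for name, score in results.items():
        print(f"{name}: {score:.2f}")
```

Keyword matching is only a placeholder metric. An actual agentic benchmark would likely also score tool-call traces, latency, and cost, but keeping the agent interface to a single callable makes it straightforward to compare frameworks and models on the same cases.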