This repo contains end-to-end examples of GenAI/LLM applications and evaluation pipelines set up using continuous-eval.
Check out the continuous-eval repo and documentation for more information.
| Example Name | App Framework | Eval Framework | Description |
|---|---|---|---|
| Simple RAG | Langchain | continuous-eval | Simple QA chatbot over select Paul Graham essays |
| Complex RAG | Langchain | continuous-eval | Complex QA chatbot over select Paul Graham essays |
| ReAct Agent | LlamaIndex | continuous-eval | QA over Uber financial dataset using agents |
| Sentiment Classification | LlamaIndex | continuous-eval | Single label classification of sentence sentiment |
| Simple RAG | Haystack | continuous-eval | Simple QA chatbot over select Paul Graham essays |
| Customer Support | OpenAI Swarm | continuous-eval | Customer support agent using tools |
To run the examples, you need Python 3.11 (recommended) and Poetry installed. Then clone this repo and install the dependencies:
```bash
git clone https://github.com/relari-ai/examples.git && cd examples
poetry env use 3.11
poetry install --with haystack --with langchain --with llama-index
```

Note that the `--with` flags are optional and only needed if you want to run the examples for the respective frameworks.
Each example lives in a subfolder: `examples/<FRAMEWORK>/<APP_NAME>/`.
Some examples have just one script to execute (e.g. Haystack's Simple RAG), while others have several:

- `pipeline.py` defines the application pipeline and the evaluation metrics / tests (see the sketch after this list).
- `app.py` contains the LLM application. Run this script to produce the outputs (saved as `results.jsonl`).
- `eval.py` runs the metrics / tests defined in `pipeline.py` (results saved as `metrics_results.json` and `test_results.json`).
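To give a sense of what `pipeline.py` contains, here is a minimal sketch based on the continuous-eval pipeline API (`Dataset`, `Module`, `ModuleOutput`, `Pipeline`). The dataset path, module names, output types, and the metric/field bindings below are illustrative and vary across the examples in this repo:

```python
from typing import Dict, List

from continuous_eval.eval import Dataset, Module, ModuleOutput, Pipeline
from continuous_eval.metrics.generation.text import DeterministicAnswerCorrectness
from continuous_eval.metrics.retrieval import PrecisionRecallF1

# Golden dataset folder containing dataset.jsonl and manifest.yaml (path is illustrative)
dataset = Dataset("examples/<FRAMEWORK>/<APP_NAME>/eval_data")

# One Module per pipeline step, each with the metrics to compute on its output
retriever = Module(
    name="retriever",
    input=dataset.question,
    output=List[Dict[str, str]],
    eval=[
        PrecisionRecallF1().use(
            retrieved_context=ModuleOutput(),
            ground_truth_context=dataset.ground_truth_context,
        ),
    ],
)

llm = Module(
    name="llm",
    input=retriever,
    output=str,
    eval=[
        DeterministicAnswerCorrectness().use(
            answer=ModuleOutput(),
            ground_truth_answers=dataset.ground_truths,
        ),
    ],
)

pipeline = Pipeline([retriever, llm], dataset=dataset)
```

`app.py` then runs the application and saves its outputs to `results.jsonl`, and `eval.py` computes the metrics / tests bound here against the golden dataset.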
To run an example application, use:

```bash
poetry run python3 -m examples.<FRAMEWORK>.<APP_NAME>.app
```

for example `poetry run python3 -m examples.swarm.customer_support.app`.
To run the evaluation metrics and tests, use:
```bash
poetry run python3 -m examples.<FRAMEWORK>.<APP_NAME>.eval
```

Depending on the application, the source data for the application (documents and embeddings in a Chroma vectorstore) and for the evaluation (the golden dataset) are also provided. Note that the evaluation golden dataset always consists of two files:

- `dataset.jsonl` contains the inputs (questions) and the reference module outputs (ground truths); a quick way to inspect it is shown after this list.
- `manifest.yaml` defines the structure of the dataset for the evaluators.
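As a quick sanity check, you can peek at a golden dataset with just the standard library. The path below is a placeholder, and the field names you will see are whatever the example's `manifest.yaml` declares:

```python
import json
from pathlib import Path

# Placeholder path: point this at the golden dataset of the example you want to inspect
dataset_path = Path("examples/<FRAMEWORK>/<APP_NAME>/eval_data/dataset.jsonl")

# Each line of dataset.jsonl is one JSON record with the inputs (e.g. a question)
# and the reference module outputs (ground truths)
with dataset_path.open() as f:
    first_record = json.loads(f.readline())

print(sorted(first_record.keys()))
```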
Tweak the metrics and tests in `pipeline.py` to try out different evaluation setups, for example by attaching a threshold test to a module, as sketched below.
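This is a hedged sketch, assuming the `GreaterOrEqualThan` test from `continuous_eval.eval.tests`; the metric name and threshold are chosen purely for illustration and must match a metric actually computed for that module:

```python
from continuous_eval.eval.tests import GreaterOrEqualThan

# Illustrative: flag the run if the retriever's aggregate context precision drops below 0.7
retrieval_precision_test = GreaterOrEqualThan(
    test_name="retrieval_precision",
    metric_name="context_precision",
    min_value=0.7,
)
```

A test like this is passed to the corresponding module (alongside its metrics) in `pipeline.py`, and `eval.py` then reports it in `test_results.json`.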