LLM Evaluation for the Bat‑Adapt Project (Open Booster Challenge — City Risks & Resilience)
Bat-Adapt is a research and evaluation project that uses large language models (LLMs) to support resilience and risk-related decision processes, presented at the Open Booster Challenge on city risks and resilience. This repository contains the evaluation code and associated materials used to benchmark LLM outputs relevant to the project.
The Bat‑Adapt project is focused on:
- Evaluating LLM performance for tasks associated with risk assessment and resilience planning in urban contexts.
- Providing scripts and tools to run LLM evaluations programmatically.
- Supporting repeatable and rigorous analysis of model outputs.
Although this repository currently has limited documentation, it includes at least:

```
📦 Bat-Adapt
├── README.md
└── llm-evaluation.py   # Python script for executing LLM evaluation logic
```

(Note: the repo layout and files are visible from GitHub's repository listing.) ([GitHub][2])
These instructions help you get set up to run and evaluate LLMs using the provided scripts.
Ensure you have the following installed:
- Python 3.9 or later
- pip (the Python package manager)
- Internet access for model APIs (if applicable)
- Optional: a virtual environment tool such as `venv` or `conda`
This project may not include a requirements.txt, but common dependencies for LLM evaluations typically include:

```shell
pip install openai transformers datasets numpy pandas
```

Modify this list based on the actual imports in `llm-evaluation.py`.
Assuming `llm-evaluation.py` drives the experiments, you can run:

```shell
python llm-evaluation.py
```

This script likely:
- Loads a set of input prompts
- Sends them to a configured LLM
- Records outputs and compares them against references
👉 You’ll want to inspect or modify this script to configure:
- Model endpoint or API keys
- Evaluation metrics (accuracy, coherence, relevance)
- Dataset or prompt files if any
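Rather than hard-coding these settings, a common pattern is to read them from environment variables. A minimal sketch, assuming hypothetical variable names (`EVAL_MODEL`, `EVAL_MAX_TOKENS`) alongside the conventional `OPENAI_API_KEY` — none of these are taken from the actual script:

```python
import os

def load_config() -> dict:
    """Read evaluation settings from the environment, with fallbacks.

    OPENAI_API_KEY is the conventional variable read by the openai SDK;
    EVAL_MODEL and EVAL_MAX_TOKENS are hypothetical names used here for
    illustration only.
    """
    return {
        "api_key": os.getenv("OPENAI_API_KEY", ""),
        "model": os.getenv("EVAL_MODEL", "gpt-4o"),
        "max_tokens": int(os.getenv("EVAL_MAX_TOKENS", "512")),
    }

config = load_config()
print(config["model"])
```

Keeping the API key out of the source file also avoids accidentally committing credentials to the repository.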
The evaluation workflow generally includes:
- Loading a dataset of tasks relevant to city resilience and risk (possibly local prompts or test cases).
- Sending these tasks to an LLM (OpenAI, Hugging Face models, etc.).
- Capturing responses from the model.
- Computing evaluation metrics like relevance, correctness, or alignment with expected output.
- Generating a report of results.
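The steps above can be sketched end-to-end with a stubbed model call. No real API is contacted here; the exact-match scoring, field names, and example prompts are illustrative assumptions, not taken from the repository:

```python
import json

def send_to_llm(prompt: str) -> str:
    # Stub standing in for a real model API call (step 2).
    return "elevated flood risk" if "flood" in prompt else "no data"

def run_pipeline(cases: list) -> dict:
    results = []
    for case in cases:
        response = send_to_llm(case["prompt"])                 # steps 2-3
        score = 1.0 if response == case["expected"] else 0.0   # step 4: exact match
        results.append({"prompt": case["prompt"],
                        "response": response,
                        "score": score})
    # Step 5: aggregate into a simple report.
    return {"cases": results,
            "mean_score": sum(r["score"] for r in results) / len(results)}

cases = [
    {"prompt": "Assess flood exposure for district A",
     "expected": "elevated flood risk"},
    {"prompt": "Assess heatwave exposure for district B",
     "expected": "high heat stress"},
]
report = run_pipeline(cases)
print(json.dumps(report, indent=2))
```

Swapping the stub for a real client call and the exact-match check for a richer metric turns this skeleton into a usable harness.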
You can customize this workflow to measure fluency, coherence, factuality, or task‑specific performance using established frameworks. ([GitHub Docs][3])
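As one framework-free example of such a metric, token overlap (Jaccard similarity) between a model answer and a reference gives a rough relevance proxy — a simple heuristic for illustration, not an established metric from this repository:

```python
def token_jaccard(answer: str, reference: str) -> float:
    """Jaccard similarity over lowercased word sets; 1.0 means identical vocabulary."""
    a, b = set(answer.lower().split()), set(reference.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Shares 4 of 5 distinct words with the reference.
print(token_jaccard("urban flood risk is high", "flood risk is high"))  # → 0.8
```

Bag-of-words overlap ignores word order and synonyms, so for fluency or factuality you would reach for embedding-based or LLM-as-judge metrics instead.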
Below is a template of what such an evaluation script might look like internally. You should tailor it to your repository’s code.
```python
import json
import os

from openai import OpenAI

# The client reads the API key from the environment; avoid hard-coding keys.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def evaluate(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-2024-05-13",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return response.choices[0].message.content

def run_evaluation(prompts_file: str) -> None:
    with open(prompts_file) as f:
        prompts = json.load(f)
    results = {}
    for idx, prompt in enumerate(prompts):
        results[f"case_{idx}"] = evaluate(prompt)
    with open("results.json", "w") as out:
        json.dump(results, out, indent=2)

if __name__ == "__main__":
    run_evaluation("prompts.json")
```

Replace the model name, dataset, and prompt schema to match your actual setup (this template uses the `openai` v1 client interface).
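The template above expects a `prompts.json` file holding a flat JSON array of prompt strings. A small helper to generate one — the example prompts are placeholders, and the flat-list schema is an assumption to be checked against `llm-evaluation.py`:

```python
import json

# Hypothetical example prompts matching the flat-list schema read by
# run_evaluation(); adapt these to your actual evaluation tasks.
prompts = [
    "List three flood-mitigation measures for a dense urban district.",
    "Summarize key heatwave risks for elderly residents.",
]

with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```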
Contributions are welcome! You can help by:
- Adding a requirements.txt
- Expanding evaluation metrics and analysis
- Adding example datasets and prompts
- Improving documentation and notebooks demonstrating usage
- Creating visualizations of evaluation results
This repository does not list a specific license; consider adding one (e.g., the MIT License) to enable open reuse.
- The repository currently has no stars or forks but contains evaluation logic for an LLM focused on resilience tasks. ([GitHub][2])
- The README above assumes the script `llm-evaluation.py` is central to the project's purpose; adjust as needed if additional content is present.