Experiment platform for running safety benchmark pipelines with:

- a web control center (Next.js)
- a Python research runtime (sequence, maze, decision, evaluation)
This guide is written for people who need to clone the repo, run it immediately, and understand where to add code safely.
```
<repo-root>/
  safety-not-found-404-codebase/
    apps/
      dashboard/          # Next.js UI + /api bridge for Python runners
    services/
      research-engine/    # Python experiment runtime
  index.html              # static landing
  static/                 # static assets
  paper/                  # paper and figures
  test-results/           # local UI test artifacts
```
- OS: macOS or Linux
- Node.js: 20+ (latest LTS recommended)
- npm: 10+
- Python: 3.11+ (3.12 recommended)
- Git LFS: required for large legacy media files
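A runner can fail fast when started on an interpreter older than the documented minimum. This is a hedged sketch, not code from the repo; `check_python_version` and `MIN_VERSION` are illustrative names.

```python
import sys

# Minimum interpreter documented in the prerequisites (3.11+).
MIN_VERSION = (3, 11)

def check_python_version(current=None):
    """Return True if `current` (defaults to the running interpreter's
    major/minor version) meets the documented minimum."""
    current = current or sys.version_info[:2]
    return tuple(current) >= MIN_VERSION
```

Calling this at the top of a script turns a confusing mid-run syntax or import error into an immediate, explicit failure.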
Install Git LFS once:
```bash
git lfs install
```

Clone the repository and pull LFS objects:

```bash
git clone https://github.com/cmubig/safety-not-found-404.git
cd safety-not-found-404
git lfs pull
```

Install dashboard dependencies:
```bash
cd safety-not-found-404-codebase/apps/dashboard
npm install
```

Create a Python virtual environment for the research engine:

```bash
cd ../../services/research-engine
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Start the dashboard:
```bash
cd safety-not-found-404-codebase/apps/dashboard
npm run dev -- -p 1455
```

Then open http://localhost:1455 in your browser.
The dashboard calls Python scripts through /api/run.
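An API bridge like `/api/run` ultimately spawns a Python process and reports its output back. This is a hedged sketch of that pattern in Python; `run_python_script` is a hypothetical helper, and the dashboard's actual route code (which is TypeScript) will differ.

```python
import subprocess
import sys

def run_python_script(script_args, timeout=600):
    """Spawn a Python subprocess and capture stdout/stderr, mirroring
    how an API bridge might drive a runner script and relay results."""
    proc = subprocess.run(
        [sys.executable, *script_args],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}

# Stand-in for a real runner invocation such as scripts/run_sequence.py.
result = run_python_script(["-c", "print('sequence run complete')"])
```

The `timeout` matters in practice: long-running experiment scripts should be bounded so a hung run does not hold the HTTP request open forever.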
- Click **Connect ChatGPT (OAuth)** in the dashboard header.
- Used for OpenAI-backed sequence and decision runs.
- If the model catalog fails with `api.model.read`, reconnect OAuth and grant the scope again.
Set keys in the UI fields or in your shell environment:

```bash
export OPENAI_API_KEY="..."
export GEMINI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
```

**Sequence pipeline**

- Runs LLM-based sequence benchmarks.
- Backed by: `services/research-engine/scripts/run_sequence.py`
- Main output directory: `services/research-engine/outputs/sequence`
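Sequence runs need a provider key from either the UI or the shell. A minimal sketch of that resolution order, using the environment variable names from the export lines above; `resolve_api_key` and `PROVIDER_ENV_VARS` are illustrative names, not repo code.

```python
import os

# Provider names mapped to the environment variables documented above.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def resolve_api_key(provider, ui_value=None):
    """Prefer a key supplied through the UI field, fall back to the
    shell environment, and fail loudly if neither is set."""
    key = ui_value or os.environ.get(PROVIDER_ENV_VARS[provider], "")
    if not key:
        raise RuntimeError(f"No API key configured for provider '{provider}'")
    return key
```

Failing loudly here, before any run starts, gives a clearer error than a provider-side 401 halfway through a benchmark.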
**Maze pipeline**

- Local algorithmic generation (no LLM provider call required).
- Backed by: `services/research-engine/scripts/run_maze_pipeline.py`
- Default output directory: `services/research-engine/maze_fin`
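To illustrate what "local algorithmic generation" means, here is a classic recursive-backtracking maze sketch. It is purely illustrative of running offline with no provider call; it is not the repo's actual maze algorithm.

```python
import random

def generate_maze(width, height, seed=0):
    """Carve a perfect maze on a width x height cell grid using
    iterative depth-first search (recursive backtracking)."""
    rng = random.Random(seed)  # seeded for reproducible output
    visited = {(0, 0)}
    passages = set()           # unordered pairs of connected cells
    stack = [(0, 0)]
    while stack:
        x, y = stack[-1]
        neighbors = [
            (nx, ny)
            for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in visited
        ]
        if neighbors:
            nxt = rng.choice(neighbors)
            passages.add(frozenset([(x, y), nxt]))  # knock down the wall
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()  # dead end: backtrack
    return passages

maze = generate_maze(5, 5)
```

A perfect maze over N cells always has exactly N − 1 passages, which makes output like this easy to sanity-check in tests.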
**Decision pipeline**

- Runs scenario-based decision experiments across models and providers.
- Backed by: `services/research-engine/scripts/run_decision_experiment.py`
- Main output directory: `services/research-engine/outputs/decision_experiments`
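Running scenarios "across models/providers" is essentially a cross-product loop. A hedged sketch with a stub in place of a real provider call; names like `run_decision` and `run_matrix` are assumptions, not the engine's API.

```python
from itertools import product

def run_decision(scenario, model, call_model):
    """Run one scenario against one model; `call_model` stands in for
    whatever provider client the engine actually uses."""
    return {"scenario": scenario, "model": model, "response": call_model(scenario, model)}

def run_matrix(scenarios, models, call_model):
    """Fan a list of scenarios out over a list of models."""
    return [run_decision(s, m, call_model) for s, m in product(scenarios, models)]

# Stub provider: echoes instead of calling a real API ("model-b" is made up).
results = run_matrix(
    ["samarian_time_pressure"],
    ["gpt-5.2", "model-b"],
    lambda scenario, model: f"{model} answered {scenario}",
)
```

Keeping the provider call injectable like this also makes the loop testable offline.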
```bash
cd safety-not-found-404-codebase/services/research-engine
source .venv/bin/activate

# Sequence
python scripts/run_sequence.py --run-defaults --provider openai

# Maze
python scripts/run_maze_pipeline.py --language ko

# AB Evaluation
python scripts/run_ab_eval.py --provider openai --model gpt-5.2

# Decision (sample)
python scripts/run_decision_experiment.py --scenario samarian_time_pressure --models gpt-5.2
```

Key code locations:

- Routes and API: `safety-not-found-404-codebase/apps/dashboard/src/app`
- Feature modules: `safety-not-found-404-codebase/apps/dashboard/src/features/dashboard`
- Shared UI components: `safety-not-found-404-codebase/apps/dashboard/src/components/ui`
- Core package: `safety-not-found-404-codebase/services/research-engine/src/safety_not_found_404`
- Script entrypoints: `safety-not-found-404-codebase/services/research-engine/scripts`
- Tests: `safety-not-found-404-codebase/services/research-engine/tests`
To add a scenario:

- Add the scenario implementation under `.../decision_experiments/scenarios`.
- Register it in the scenario registry.
- Expose it in the dashboard scenario options (`constants/index.ts`) if needed.
- Validate the API acceptance path in `/api/run`.
- Add or extend tests.
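"Register it in the scenario registry" usually means adding the implementation to a name-to-callable map. A minimal decorator-based sketch of that idea; the repo's actual registry API may look different, and `register_scenario`/`SCENARIOS` are illustrative names.

```python
# Hypothetical scenario registry: name -> callable.
SCENARIOS = {}

def register_scenario(name):
    """Decorator that files a scenario callable under `name`,
    rejecting accidental duplicate registrations."""
    def decorator(fn):
        if name in SCENARIOS:
            raise ValueError(f"duplicate scenario: {name}")
        SCENARIOS[name] = fn
        return fn
    return decorator

@register_scenario("samarian_time_pressure")
def samarian_time_pressure(model):
    # A real scenario would build prompts and score the model's response.
    return f"ran samarian_time_pressure on {model}"
```

With this shape, `/api/run` can validate a requested scenario name simply by checking membership in the registry.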
To add a provider:

- Add the provider/client implementation in the Python `llm` or decision provider layer.
- Update the model catalog fetch logic in the dashboard `/api/models` route.
- Add provider-specific validation and error handling.
- Verify a full run from both the UI and the CLI.
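A new provider typically implements a shared interface plus its own validation. A hedged sketch of one such contract; the class and method names are assumptions, not the repo's API, and `EchoProvider` is an offline stand-in rather than a real client.

```python
from abc import ABC, abstractmethod

class Provider(ABC):
    """Minimal provider contract: validate config, then complete prompts."""

    @abstractmethod
    def validate(self):
        """Raise if the provider is misconfigured (e.g. missing key)."""

    @abstractmethod
    def complete(self, prompt):
        """Return the model's text response for `prompt`."""

class EchoProvider(Provider):
    """Offline stand-in used here instead of a real API client."""

    def __init__(self, api_key):
        self.api_key = api_key

    def validate(self):
        if not self.api_key:
            raise ValueError("missing API key")

    def complete(self, prompt):
        return f"echo: {prompt}"

provider = EchoProvider(api_key="test")
provider.validate()
reply = provider.complete("hello")
```

Separating `validate` from `complete` lets the UI surface configuration errors before any experiment tokens are spent.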
From the dashboard:

```bash
cd safety-not-found-404-codebase/apps/dashboard
npm run lint
npm run build
npx tsc --noEmit
```

From the research engine:

```bash
cd safety-not-found-404-codebase/services/research-engine
source .venv/bin/activate
pytest -q
python -m compileall -q src
```

Troubleshooting:

- OpenAI catalog unavailable and `api.model.read` missing: reconnect OAuth from the dashboard and re-consent.
- `Unsupported provider/lang/scenario` from `/api/run`: check the selected values against the dashboard options and the CLI help.
- Dashboard starts but runs fail immediately: ensure the Python venv is created and dependencies are installed in `services/research-engine`.
- Build errors related to remote fonts: retry the build on a stable network; the failure comes from fetching an external resource.
- Legacy snapshots are kept under `safety-not-found-404-codebase/services/research-engine/legacy`.
- Generated runtime artifacts are intentionally excluded in specific legacy paths.
- Large legacy media is tracked via Git LFS, e.g. `safety-not-found-404-codebase/services/research-engine/legacy/section_3/source.mov`.
Use conventional prefixes:
`feat: ...`, `fix: ...`, `refactor: ...`, `docs: ...`, `test: ...`
Keep each commit focused (one concern per commit).
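A CI step or pre-commit hook can enforce these prefixes mechanically. A small sketch whose regex mirrors the prefix list above; the hook itself and the name `is_conventional` are hypothetical.

```python
import re

# Prefixes from the contribution guidelines above.
COMMIT_PATTERN = re.compile(r"^(feat|fix|refactor|docs|test): .+")

def is_conventional(message):
    """Return True if the first line of `message` uses an allowed prefix."""
    return bool(COMMIT_PATTERN.match(message.splitlines()[0]))
```

Checking only the first line matches how commit subjects are conventionally validated, leaving the body free-form.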