A web-based platform for conducting qualitative evaluations of AI deliverables on the RLI public set. Supports AWS S3 and local filesystem storage.
RLI Example Deliverables: Sample AI deliverables for testing the evaluation platform
RLI Public Set: Complete public evaluation dataset
These datasets are automatically downloaded by the setup script below. You can also download them manually using the links above.
Get started in 2-3 minutes:
# Clone the repository
git clone https://github.com/centerforaisafety/rli_evaluation_platform.git
cd rli_evaluation_platform
# Run automated setup
python setup.py

The setup script will:
- ✅ Install dependencies
- ✅ Authenticate with HuggingFace
- ✅ Download the RLI Public Set (10 tasks with human deliverables)
- ✅ Download example AI outputs from frontier models
- ✅ Configure environment variables
- ✅ Set up Autodesk Forge for 3D file viewing (optional)
After setup completes:
# Start the backend server
cd evaluation_platform
npm run server
# In another terminal, start the frontend
cd evaluation_platform
npm run dev

Visit http://localhost:5173 and log in with your admin password!
# Skip example AI deliverables (evaluate your own models only)
python setup.py --no-examples
# Configure for S3 storage mode
python setup.py --s3
# Deploy to Fly.io after setup
python setup.py --deploy-flyio
# Use a custom benchmark directory
python setup.py --benchmark-dir /path/to/benchmarks
# Clean repository to initial state (deletes all data and config)
python setup.py --clean

Note: If datasets already exist in your benchmark directory, the setup script will detect them and ask if you want to redownload (defaults to "no"). This makes re-running setup much faster when you only need to reconfigure environment variables or reinstall dependencies.
- The RLI public set
- Example AI deliverables for each project in the public set
- The evaluation platform for evaluating AI deliverables against human reference deliverables
Project structure:
benchmarks/public_001/
├── human_deliverable/   # Professional human-created output
├── project/             # Project brief and inputs
│   ├── brief.md
│   └── inputs/
└── ai_deliverable/      # AI model outputs (optional)
    ├── grok_4/
    ├── manus/
    └── sonnet_4_5/
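Before generating comparisons against a task, it can help to verify the directory matches this layout. The following is a small sketch using only the Python standard library; the required paths follow the tree above, and `ai_deliverable/` is treated as optional:

```python
from pathlib import Path

# Required paths per the layout above; ai_deliverable/ is optional.
REQUIRED = ["human_deliverable", "project/brief.md", "project/inputs"]

def missing_paths(task_dir: str) -> list[str]:
    """Return the required paths that are absent from a task directory."""
    root = Path(task_dir)
    return [p for p in REQUIRED if not (root / p).exists()]
```

Run it against each `benchmarks/public_*` folder; an empty list means the task is complete.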
If you prefer manual setup or need custom configuration, follow these detailed instructions.
- Node.js 20.x (matches the Docker image and production deployments)
- npm 9+ (bundled with Node.js 20)
- Python 3.7+ (for dataset download)
- AWS credentials (only for S3 mode)
git clone https://github.com/centerforaisafety/rli_evaluation_platform.git
cd rli_evaluation_platform
cd evaluation_platform
npm install --legacy-peer-deps
pip install huggingface_hub
# Download public set only
python -c "
from huggingface_hub import snapshot_download
from pathlib import Path
snapshot_download(
repo_id='cais/rli-public-set',
repo_type='dataset',
local_dir='./temp_public_set',
local_dir_use_symlinks=False
)
print('Downloaded public set to ./temp_public_set')
"
# Move to benchmarks directory
mkdir -p benchmarks
mv temp_public_set/public_* benchmarks/
rm -rf temp_public_set

pip install huggingface_hub
# Download public set
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='cais/rli-public-set',
repo_type='dataset',
local_dir='./temp_public_set',
local_dir_use_symlinks=False
)
"
# Download example AI deliverables
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='cais/rli-example-deliverables',
repo_type='dataset',
local_dir='./temp_ai_deliverables',
local_dir_use_symlinks=False
)
"
# Merge datasets
mkdir -p benchmarks
for task in temp_public_set/public_*; do
task_id=$(basename $task)
mkdir -p benchmarks/$task_id
cp -r $task/* benchmarks/$task_id/
# Add AI deliverables if they exist
if [ -d "temp_ai_deliverables/$task_id/ai_deliverable" ]; then
cp -r temp_ai_deliverables/$task_id/ai_deliverable benchmarks/$task_id/
fi
done
# Cleanup
rm -rf temp_public_set temp_ai_deliverables

- Visit https://huggingface.co/datasets/cais/rli-public-set
- Download the files manually
- Extract to evaluation_platform/public/tasks/
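The shell merge loop above can also be written as a short Python script (a sketch using only the standard library; the directory names match the temporary folders used in the commands above):

```python
import shutil
from pathlib import Path

def merge_datasets(public_dir: str, ai_dir: str, out_dir: str) -> None:
    """Copy each public task into out_dir, adding AI deliverables when present."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for task in sorted(Path(public_dir).glob("public_*")):
        dest = out / task.name
        # Copy the public task (human deliverable, project brief, inputs)
        shutil.copytree(task, dest, dirs_exist_ok=True)
        # Add AI deliverables if they exist for this task
        ai_src = Path(ai_dir) / task.name / "ai_deliverable"
        if ai_src.is_dir():
            shutil.copytree(ai_src, dest / "ai_deliverable", dirs_exist_ok=True)

# merge_datasets("temp_public_set", "temp_ai_deliverables", "benchmarks")
```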
Expected structure after download:
benchmarks/
├── public_001/
│   ├── human_deliverable/
│   ├── project/
│   │   ├── brief.md
│   │   └── inputs/
│   └── ai_deliverable/   # Only if you downloaded examples
│       ├── grok_4/
│       ├── manus/
│       └── sonnet_4_5/
├── public_002/
│   └── ...
...
└── public_010/
Create a .env file in the evaluation_platform/ directory:
cd evaluation_platform

For Local Mode (recommended for RLI Public Set):
# Storage mode: local
STORAGE_MODE=local
# Benchmark Directory (where evaluation datasets are stored)
BENCHMARK_DIR=./benchmarks
# Server port
PORT=5001
# Authentication (REQUIRED)
ADMIN_PASSWORD=your-secure-admin-password
JWT_SECRET=your-long-random-jwt-secret-string
# Autodesk Forge (for 3D file viewing)
# Get credentials at: https://aps.autodesk.com/myapps
AUTODESK_CLIENT_ID=your-client-id
AUTODESK_CLIENT_SECRET=your-client-secret
AUTODESK_BUCKET=your-unique-bucket-name # Must be globally unique
AUTODESK_CALLBACK_URL=http://localhost:5001

For S3 Mode:
# Storage mode: s3
STORAGE_MODE=s3
# AWS S3 Configuration
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
S3_BUCKET_NAME=your-bucket-name
S3_REGION=us-east-1
S3_TASKS_PREFIX=tasks/
# Server port
PORT=5001
# Authentication (REQUIRED)
ADMIN_PASSWORD=your-admin-password
JWT_SECRET=your-jwt-secret
# Autodesk Forge (for 3D file viewing)
AUTODESK_CLIENT_ID=your-client-id
AUTODESK_CLIENT_SECRET=your-client-secret
AUTODESK_BUCKET=your-unique-bucket-name # Must be globally unique
AUTODESK_CALLBACK_URL=http://localhost:5001

Generate secure secrets:
# Generate JWT secret
python -c "import secrets; print(secrets.token_urlsafe(32))"
# Generate admin password
python -c "import secrets; print(secrets.token_urlsafe(16))"

Development Mode:
# Terminal 1: Start backend server
npm run server
# Terminal 2: Start frontend dev server
npm run dev

Production Mode:
npm run build
npm run server

Visit:
- Development: http://localhost:5173
- Production: http://localhost:5001
- Visit the application URL
- You'll be redirected to the login page
- Login with your admin password
Via Web Dashboard:
- Login as admin
- Click "Generate AI vs Human Comparisons"
- Set required number of completions (1-10)
- Get comparison links to share with evaluators
Via Command Line:
# Login as admin
TOKEN=$(curl -s -X POST http://localhost:5001/api/auth/login \
-H "Content-Type: application/json" \
-d '{"password": "your-admin-password"}' | jq -r '.token')
export AUTH_TOKEN=$TOKEN
# Generate all AI vs Human comparisons
curl -X POST http://localhost:5001/api/generator/generate-all-ai-vs-human \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "Content-Type: application/json" \
-d '{"saveToFile": true}'
# Get comparison links
curl -X GET "http://localhost:5001/api/comparisons/links/all?format=txt" \
-H "Authorization: Bearer $AUTH_TOKEN" \
  -o comparison_links.txt

For detailed command-line workflows, see COMMAND_LINE_WORKFLOW.md.
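The curl workflow above can also be scripted without jq, for example in Python with only the standard library. This is a sketch, not part of the platform; it assumes the server is running locally and uses the same endpoints shown above:

```python
import json
import urllib.request

def login(base_url: str, password: str) -> str:
    """POST the admin password to /api/auth/login and return the JWT token."""
    req = urllib.request.Request(
        f"{base_url}/api/auth/login",
        data=json.dumps({"password": password}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["token"]

def generate_all_ai_vs_human(base_url: str, token: str) -> dict:
    """Trigger generation of all AI vs Human comparisons."""
    req = urllib.request.Request(
        f"{base_url}/api/generator/generate-all-ai-vs-human",
        data=json.dumps({"saveToFile": True}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    token = login("http://localhost:5001", "your-admin-password")
    print(generate_all_ai_vs_human("http://localhost:5001", token))
```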
- Admin generates comparisons using the dashboard or CLI
- Admin sets required number of completions (1-10)
- Admin gets list of incomplete comparison links
- Admin shares links with evaluators (no tokens needed for evaluators)
- Evaluators click links and complete evaluations in their browser
- Admin monitors progress via dashboard or CLI
When STORAGE_MODE=local, the application uses the local filesystem:
- Set STORAGE_MODE=local in your .env file
- Set BENCHMARK_DIR to point to your datasets: BENCHMARK_DIR=./benchmarks (or any path you prefer)
- Optionally configure the data directory: DATA_DIR=./evaluation_platform/data (where the platform writes results)
Best for:
- RLI Public Set evaluation
- Development and testing
- Small to medium task sets
- Offline or air-gapped environments
When STORAGE_MODE=s3, the application reads task files directly from your S3 bucket:
S3 Bucket Structure:
your-bucket/
└── tasks/
    ├── task001-Task_Name/
    │   ├── Model1/
    │   │   └── output.txt
    │   ├── Model2/
    │   │   └── output.txt
    │   ├── human1/
    │   │   └── output.txt
    │   └── brief/
    │       └── brief.md
    └── task002-Another_Task/
        └── ...
S3 Permissions Required:
- s3:ListBucket - to list tasks and directories
- s3:GetObject - to read files
S3 Manifest CSV:
The app reads comparison metadata from evaluation_platform/public/s3_manifest.csv when running in S3 mode. Each row describes one evaluation bundle:
- task_id: Unique identifier for the task
- agent: Name or version of the AI system
- repetition: ISO-8601 date stamp or run identifier
- s3_path_ai_artifact: Full S3 URI pointing at the AI-produced artifact folder
- s3_path_human_artifact: Full S3 URI pointing at the human-produced artifact folder
- s3_path_task_definition: Full S3 URI pointing at the task definition or brief folder
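A quick way to sanity-check the manifest before starting the server is a small stdlib-only script. The column names follow the list above; the validation rules themselves (checking for s3:// URIs) are an illustrative sketch, not the platform's own checks:

```python
import csv

# Column names per the manifest description above
REQUIRED_COLUMNS = [
    "task_id", "agent", "repetition",
    "s3_path_ai_artifact", "s3_path_human_artifact", "s3_path_task_definition",
]

def validate_manifest(path: str) -> list[str]:
    """Return a list of human-readable problems found in the manifest CSV."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"missing columns: {missing}"]
        for line_no, row in enumerate(reader, start=2):  # header is line 1
            for col in ("s3_path_ai_artifact", "s3_path_human_artifact",
                        "s3_path_task_definition"):
                if not row[col].startswith("s3://"):
                    problems.append(f"line {line_no}: {col} is not an s3:// URI")
    return problems
```

An empty return value means every row passed the checks.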
Caching:
Files are automatically cached locally in .cache/ after first access for better performance.
Best for:
- Production deployments
- Large task sets
- Multiple evaluators accessing simultaneously
- Centralized task storage
To verify your S3 configuration is working:
- Start the server:
cd evaluation_platform
npm run server
- Check the console output. You should see:
Server running on port 5001
Storage mode: s3
S3 bucket: your-bucket-name
S3 region: us-east-1
S3 tasks prefix: tasks/
Data directory: /path/to/data
- Visit http://localhost:5001 and check if tasks are listed
- Monitor the server console for any S3 errors
- Create a test script to verify S3 access:

// test-s3-connection.js
const { listTasks } = require('./server/services/s3Service');

async function test() {
  try {
    const tasks = await listTasks();
    console.log('Found tasks:', tasks);
  } catch (error) {
    console.error('S3 Error:', error);
  }
}

test();

Run with:
node test-s3-connection.js
To view 3D and CAD files (.dwg, .fbx, .3dm, .step, etc.), you need Autodesk Forge credentials:
- Go to https://aps.autodesk.com/myapps
- Create a new app (or use existing)
- Copy your Client ID and Client Secret
Add to your .env file:
AUTODESK_CLIENT_ID=your-client-id
AUTODESK_CLIENT_SECRET=your-client-secret
AUTODESK_BUCKET=rli-models # Choose a unique bucket name
AUTODESK_CALLBACK_URL=http://localhost:5001 # Or your deployment URL

The bucket is automatically created on first use by the ensureBucketExists() function in server/routes/autodesk.ts.
Important Notes:
- Bucket names must be globally unique across all Autodesk users
- The default name rli-models may already be taken; choose your own unique name (e.g., rli-models-yourname-2025)
- Files uploaded to Autodesk are cached by bucket+filename; if you had upload issues, change your bucket name to force fresh uploads
- Autodesk requires one bucket per domain (callback URL locked)
If bucket creation fails:
- 409 Conflict: Bucket name already exists globally; choose a different unique name
- 403 Forbidden: Check that your API app has the bucket:create scope
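Since bucket names must be globally unique, a simple way to avoid the 409 conflict is to derive a name with a random suffix. This is a sketch; the exact Autodesk OSS bucket-key rules should be checked against their documentation, and the lowercase-alphanumeric-plus-dash character set used here is an assumption:

```python
import re
import secrets

def unique_bucket_name(prefix: str = "rli-models") -> str:
    """Append a random hex suffix so the bucket key is unlikely to collide."""
    name = f"{prefix}-{secrets.token_hex(4)}".lower()
    # Assumed-safe character set; verify against Autodesk's bucket key rules.
    if not re.fullmatch(r"[a-z0-9-]{3,128}", name):
        raise ValueError(f"invalid bucket name: {name}")
    return name
```

Set the result as AUTODESK_BUCKET once and keep it stable afterwards, since uploaded files are cached by bucket+filename.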
docker build -t rli_evaluation_platform .

docker run -d \
  -p 5001:5001 \
  -e STORAGE_MODE=local \
  -e ADMIN_PASSWORD=your-admin-password \
  -e JWT_SECRET=your-jwt-secret \
  -v $(pwd)/evaluation_platform/data:/app/data \
  --name rli-evaluation-platform \
  rli_evaluation_platform

version: '3.8'
services:
  app:
    build: .
    ports:
      - "5001:5001"
    environment:
      - STORAGE_MODE=${STORAGE_MODE}
      - ADMIN_PASSWORD=${ADMIN_PASSWORD}
      - JWT_SECRET=${JWT_SECRET}
      - AUTODESK_CLIENT_ID=${AUTODESK_CLIENT_ID}
      - AUTODESK_CLIENT_SECRET=${AUTODESK_CLIENT_SECRET}
      - AUTODESK_BUCKET=${AUTODESK_BUCKET}
      - AUTODESK_CALLBACK_URL=${AUTODESK_CALLBACK_URL}
    volumes:
      - app-data:/app/data
      - app-cache:/app/.cache

volumes:
  app-data:
  app-cache:

Then run: docker-compose up -d
- Install the Fly CLI:
curl -L https://fly.io/install.sh | sh
- Sign up and log in:
fly auth signup   # or fly auth login
- Run setup with Fly.io deployment:
python setup.py --deploy-flyio
- Launch the app (first time only):
fly launch
- Choose a unique app name
- Select a region close to your users
- Skip database and Redis setup
- Deploy now: No (set secrets first)
- Create a volume for persistent data:
fly volumes create app_data --size 1 --region <your-region>
- Set environment secrets:
fly secrets set ADMIN_PASSWORD=your-admin-password
fly secrets set JWT_SECRET=your-jwt-secret
fly secrets set STORAGE_MODE=local   # or s3
fly secrets set AUTODESK_CLIENT_ID=your-autodesk-client-id
fly secrets set AUTODESK_CLIENT_SECRET=your-autodesk-client-secret
fly secrets set AUTODESK_BUCKET=your-bucket-name
fly secrets set AUTODESK_CALLBACK_URL=https://your-app.fly.dev
For S3 mode, also set:
fly secrets set AWS_ACCESS_KEY_ID=your-key
fly secrets set AWS_SECRET_ACCESS_KEY=your-secret
fly secrets set S3_BUCKET_NAME=your-bucket
fly secrets set S3_REGION=your-region
fly secrets set S3_TASKS_PREFIX=tasks/
- Deploy:
fly deploy
- View your app:
fly open
# View logs
fly logs
# Check app status
fly status
# SSH into the container
fly ssh console

After making changes:
fly deploy

# Scale to multiple instances
fly scale count 2
# Scale memory/CPU
fly scale vm shared-cpu-2x

The platform supports various file types for comparison:
- Text: .txt, .json, .yml, .yaml, .js, .jsx, .ts, .tsx, .css, .py, .java, .go, .php, .rb, .swift, .xml, .sql, .sh, .c, .cpp, .cs, and more
- Markdown: .md
- HTML: .html, .htm
- PDF: .pdf
- Spreadsheets: .csv (tab- and semicolon-delimited variants supported)
- Microsoft Office: .ppt, .pptx, .doc, .docx, .xls, .xlsx
- LaTeX: .tex
- Jupyter Notebooks: .ipynb
- Anki: .apkg
- Images: .jpg, .jpeg, .png, .gif, .bmp, .webp, .svg, .ico, .avif
- TIFF: .tif, .tiff
- Video: .mp4, .m4v, .mkv, .webm, .mov, .avi, .wmv
- Audio: .mp3, .wav, .ogg, .aac, .m4a
- MIDI: .midi, .mid
- PSD: .psd (limited support; complex layer effects may flatten)
- 3D Models: .obj, .mtl, .stl, .gltf, .glb
- Autodesk / CAD: .dwg, .dxf, .dwf, .dwt, .skp, .stp, .step, .ipt, .3dm, .3ds, .fbx, .rvt, .ifc, .catpart, .catproduct, .cgr, .dae, .dgn, .f3d, .gbxml, .iam, .idw
- SQL Databases: .sqlite, .db (large files may load slowly)
- WebGL / Website: .html entry points with associated .js and .css assets for interactive WebGL builds, React apps, Vite, Three.js, web games, and visualization tools
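When scripting around the platform (for example, filtering which deliverable files a viewer can handle), the extension lists above can be encoded in a small lookup table. This helper is a sketch, not part of the platform's API, and the table below is deliberately abbreviated; extend it with the remaining extensions as needed:

```python
from pathlib import Path

# Extension-to-category map built from the supported-file-type lists above
# (abbreviated; extend with the remaining extensions as needed).
CATEGORIES = {
    "text": {".txt", ".json", ".yml", ".yaml", ".py", ".md", ".tex"},
    "document": {".pdf", ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"},
    "image": {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".tif", ".tiff"},
    "media": {".mp4", ".webm", ".mov", ".mp3", ".wav", ".mid", ".midi"},
    "3d_cad": {".obj", ".stl", ".gltf", ".glb", ".dwg", ".step", ".fbx", ".rvt"},
}

def category_of(filename: str) -> str:
    """Return the viewer category for a file, or 'unsupported'."""
    ext = Path(filename).suffix.lower()
    for name, exts in CATEGORIES.items():
        if ext in exts:
            return name
    return "unsupported"
```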
- Create the task directory structure:
mkdir -p benchmarks/my_task_011/{human_deliverable,project/inputs,ai_deliverable/my_model}
- Add your content to the directories:
benchmarks/my_task_011/
├── human_deliverable/
│   └── [your human reference files]
├── project/
│   ├── brief.md
│   └── inputs/
│       └── [input files]
└── ai_deliverable/
    └── my_model/
        └── [your AI model output]
- Ensure BENCHMARK_DIR in .env points to your benchmarks directory
- Regenerate comparisons:
curl -X POST http://localhost:5001/api/generator/generate-all \
  -H "Content-Type: application/json" \
  -d '{"saveToFile": true}'
- Upload your task files to S3:
s3://your-bucket/tasks/task011-New_Task/
├── Model1/
│   └── output.txt
├── Model2/
│   └── output.txt
├── human1/
│   └── output.txt
└── brief/
    └── brief.md
- Add the task entry to public/s3_manifest.csv:
task_id,agent,repetition,s3_path_ai_artifact,s3_path_human_artifact,s3_path_task_definition
TASK011,agent_name,2025-01-01,s3://bucket/tasks/task011/Model1,s3://bucket/tasks/task011/human1,s3://bucket/tasks/task011/brief
- Regenerate comparisons as above
Problem: 401 Client Error: Unauthorized when downloading datasets
Solution:
- Run huggingface-cli login
- Ensure you have access to the cais organization datasets
- Contact the CAIS team to request access if needed
- Access Denied: Check your AWS credentials and bucket permissions
- No Tasks Found: Verify your bucket structure has a tasks/ prefix
- Slow Performance: Files are cached after first access; subsequent loads are faster
- Cache Issues: Delete the .cache/ directory to force fresh downloads
- Ensure the .env file is in the evaluation_platform/ directory
- Restart the server after changing .env
- Check that variables are not commented out
- Verify Autodesk Forge credentials are set in .env
- Check that the bucket name is unique and valid
- View server logs for Autodesk API errors
- Ensure your Autodesk app has the required scopes
# Find process using port 5001
lsof -i :5001
# Kill the process
kill -9 <PID>
# Or use a different port
PORT=5002 npm run server

If you need to start fresh or clean up after testing, you can reset the repository to its initial state:
python setup.py --clean

This will delete:
- benchmarks/ - downloaded datasets (RLI public set and AI deliverables)
- evaluation_platform/data/ - generated comparisons and evaluation metadata
- evaluation_platform/.env - environment configuration
- evaluation_platform/node_modules/ - installed npm packages
- evaluation_platform/dist/ and evaluation_platform/dist-server/ - build artifacts
- .rli_datasets_temp/ - temporary download cache
After cleaning, you can run python setup.py again for a fresh installation.
When to use --clean:
- When you want to test the setup process from scratch
- When switching between different configurations (local ↔ S3)
- When troubleshooting persistent issues with cached data
- Never commit .env to version control; it's already in .gitignore
- Use IAM roles with minimal required permissions for production
- Consider using AWS Secrets Manager for production secrets
- Admin password should be strong and unique
- JWT secret should be a long, random string (32+ characters)
- Access tokens expire after 7 days by default
- All admin endpoints are protected by authentication middleware
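To illustrate what the JWT_SECRET actually signs and how the 7-day expiry is enforced, here is a stdlib-only HS256 sketch. This is not the platform's implementation (a Node backend would normally use standard JWT tooling); it only demonstrates the token structure:

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_token(secret: str, days: int = 7) -> str:
    """Build a minimal HS256 JWT with an expiry claim `days` in the future."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(
        {"role": "admin", "exp": int(time.time()) + days * 86400}).encode())
    msg = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(secret.encode(), msg, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(secret: str, token: str) -> bool:
    """Check the HMAC signature and the exp claim."""
    header, payload, sig = token.split(".")
    msg = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(secret.encode(), msg, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return False
    pad = "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload + pad))
    return claims["exp"] > time.time()
```

This is why JWT_SECRET must be long and random: anyone who knows it can mint valid admin tokens.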
- Command Line Workflow Guide - Detailed CLI usage for admins
- RLI Public Set on HuggingFace
- Example AI Deliverables on HuggingFace
- Comparison metadata is always stored locally in the data/ directory
- S3 files are streamed and cached for optimal performance
- The cache directory (.cache/) can be safely deleted to free space
- Both storage modes can be used interchangeably by changing STORAGE_MODE
- Search for
-herein config files to find placeholder values to replace
- /evaluation_platform/src - React frontend code
- /evaluation_platform/server - Express backend code
- /evaluation_platform/public - static web assets (fonts, soundfonts, icons)
- /evaluation_platform/data - generated comparisons and evaluation metadata
- /evaluation_platform/.cache - cached S3 files (S3 mode only)
- /benchmarks - RLI public set and evaluation datasets (created by setup.py)
- /setup.py - automated setup script