A web-based platform for conducting qualitative evaluations of AI deliverables on the RLI public set. Supports AWS S3 and local filesystem storage.
RLI Example Deliverables: Sample AI deliverables for testing the evaluation platform
RLI Public Set: Complete public evaluation dataset
These datasets are automatically downloaded by the setup script below. You can also download them manually using the links above.
Get started in 2-3 minutes:
# Clone the repository
git clone https://github.com/centerforaisafety/rli_evaluation_platform.git
cd rli_evaluation_platform
# Run automated setup
python setup.py

The setup script will:
- ✅ Install dependencies
- ✅ Authenticate with HuggingFace
- ✅ Download the RLI Public Set (10 tasks with human deliverables)
- ✅ Download example AI outputs from frontier models
- ✅ Configure environment variables
- ✅ Set up Autodesk Forge for 3D file viewing (optional)
After setup completes:
# Start the backend server
cd evaluation_platform
npm run server
# In another terminal, start the frontend
cd evaluation_platform
npm run dev

Visit http://localhost:5173 and log in with your admin password!
# Skip example AI deliverables (evaluate your own models only)
python setup.py --no-examples
# Configure for S3 storage mode
python setup.py --s3
# Deploy to Fly.io after setup
python setup.py --deploy-flyio
# Use a custom benchmark directory
python setup.py --benchmark-dir /path/to/benchmarks
# Clean repository to initial state (deletes all data and config)
python setup.py --clean

Note: If datasets already exist in your benchmark directory, the setup script will detect them and ask if you want to redownload (defaults to "no"). This makes re-running setup much faster when you only need to reconfigure environment variables or reinstall dependencies.
- The RLI public set
- Example AI deliverables for each project in the public set
- The evaluation platform for evaluating AI deliverables against human reference deliverables
Project structure:
benchmarks/public_001/
├── human_deliverable/   # Professional human-created output
├── project/             # Project brief and inputs
│   ├── brief.md
│   └── inputs/
└── ai_deliverable/      # AI model outputs (optional)
    ├── grok_4/
    ├── manus/
    └── sonnet_4_5/
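Before generating comparisons against a task, it can help to verify the directory matches this layout. The following is a small sketch using only the Python standard library; the required paths follow the tree above, and `ai_deliverable/` is treated as optional:

```python
from pathlib import Path

# Required paths per the layout above; ai_deliverable/ is optional.
REQUIRED = ["human_deliverable", "project/brief.md", "project/inputs"]

def missing_paths(task_dir: str) -> list[str]:
    """Return the required paths that are absent from a task directory."""
    root = Path(task_dir)
    return [p for p in REQUIRED if not (root / p).exists()]
```

Run it against each `benchmarks/public_*` folder; an empty list means the task is complete.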
If you prefer manual setup or need custom configuration, follow these detailed instructions.
- Node.js 20.x (matches the Docker image and production deployments)
- npm 9+ (bundled with Node.js 20)
- Python 3.7+ (for dataset download)
- AWS credentials (only for S3 mode)
git clone https://github.com/centerforaisafety/rli_evaluation_platform.git
cd rli_evaluation_platform
cd evaluation_platform
npm install --legacy-peer-deps
pip install huggingface_hub
# Download public set only
python -c "
from huggingface_hub import snapshot_download
from pathlib import Path
snapshot_download(
repo_id='cais/rli-public-set',
repo_type='dataset',
local_dir='./temp_public_set',
local_dir_use_symlinks=False
)
print('Downloaded public set to ./temp_public_set')
"
# Move to benchmarks directory
mkdir -p benchmarks
mv temp_public_set/public_* benchmarks/
rm -rf temp_public_set

pip install huggingface_hub
# Download public set
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='cais/rli-public-set',
repo_type='dataset',
local_dir='./temp_public_set',
local_dir_use_symlinks=False
)
"
# Download example AI deliverables
python -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='cais/rli-example-deliverables',
repo_type='dataset',
local_dir='./temp_ai_deliverables',
local_dir_use_symlinks=False
)
"
# Merge datasets
mkdir -p benchmarks
for task in temp_public_set/public_*; do
task_id=$(basename $task)
mkdir -p benchmarks/$task_id
cp -r $task/* benchmarks/$task_id/
# Add AI deliverables if they exist
if [ -d "temp_ai_deliverables/$task_id/ai_deliverable" ]; then
cp -r temp_ai_deliverables/$task_id/ai_deliverable benchmarks/$task_id/
fi
done
# Cleanup
rm -rf temp_public_set temp_ai_deliverables

- Visit https://huggingface.co/datasets/cais/rli-public-set
- Download the files manually
- Extract to evaluation_platform/public/tasks/
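The shell merge loop above can also be written as a short Python script (a sketch using only the standard library; the directory names match the temporary folders used in the commands above):

```python
import shutil
from pathlib import Path

def merge_datasets(public_dir: str, ai_dir: str, out_dir: str) -> None:
    """Copy each public task into out_dir, adding AI deliverables when present."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for task in sorted(Path(public_dir).glob("public_*")):
        dest = out / task.name
        # Copy the public task (human deliverable, project brief, inputs)
        shutil.copytree(task, dest, dirs_exist_ok=True)
        # Add AI deliverables if they exist for this task
        ai_src = Path(ai_dir) / task.name / "ai_deliverable"
        if ai_src.is_dir():
            shutil.copytree(ai_src, dest / "ai_deliverable", dirs_exist_ok=True)

# merge_datasets("temp_public_set", "temp_ai_deliverables", "benchmarks")
```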
Expected structure after download:
benchmarks/
├── public_001/
│   ├── human_deliverable/
│   ├── project/
│   │   ├── brief.md
│   │   └── inputs/
│   └── ai_deliverable/   # Only if you downloaded examples
│       ├── grok_4/
│       ├── manus/
│       └── sonnet_4_5/
├── public_002/
│   └── ...
...
└── public_010/
Create a .env file in the evaluation_platform/ directory:
cd evaluation_platform

For Local Mode (recommended for RLI Public Set):
# Storage mode: local
STORAGE_MODE=local
# Benchmark Directory (where evaluation datasets are stored)
BENCHMARK_DIR=./benchmarks
# Server port
PORT=5001
# Authentication (REQUIRED)
ADMIN_PASSWORD=your-secure-admin-password
JWT_SECRET=your-long-random-jwt-secret-string
# Autodesk Forge (for 3D file viewing)
# Get credentials at: https://aps.autodesk.com/myapps
AUTODESK_CLIENT_ID=your-client-id
AUTODESK_CLIENT_SECRET=your-client-secret
AUTODESK_BUCKET=your-unique-bucket-name # Must be globally unique
AUTODESK_CALLBACK_URL=http://localhost:5001

For S3 Mode:
# Storage mode: s3
STORAGE_MODE=s3
# AWS S3 Configuration
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
S3_BUCKET_NAME=your-bucket-name
S3_REGION=us-east-1
S3_TASKS_PREFIX=tasks/
# Server port
PORT=5001
# Authentication (REQUIRED)
ADMIN_PASSWORD=your-admin-password
JWT_SECRET=your-jwt-secret
# Autodesk Forge (for 3D file viewing)
AUTODESK_CLIENT_ID=your-client-id
AUTODESK_CLIENT_SECRET=your-client-secret
AUTODESK_BUCKET=your-unique-bucket-name # Must be globally unique
AUTODESK_CALLBACK_URL=http://localhost:5001

Generate secure secrets:
# Generate JWT secret
python -c "import secrets; print(secrets.token_urlsafe(32))"
# Generate admin password
python -c "import secrets; print(secrets.token_urlsafe(16))"

Development Mode:
# Terminal 1: Start backend server
npm run server
# Terminal 2: Start frontend dev server
npm run dev

Production Mode:
npm run build
npm run server

Visit:
- Development: http://localhost:5173
- Production: http://localhost:5001
- Visit the application URL
- You'll be redirected to the login page
- Login with your admin password
Via Web Dashboard:
- Login as admin
- Click "Generate AI vs Human Comparisons"
- Set required number of completions (1-10)
- Get comparison links to share with evaluators
Via Command Line:
# Login as admin
TOKEN=$(curl -s -X POST http://localhost:5001/api/auth/login \
-H "Content-Type: application/json" \
-d '{"password": "your-admin-password"}' | jq -r '.token')
export AUTH_TOKEN=$TOKEN
# Generate all AI vs Human comparisons
curl -X POST http://localhost:5001/api/generator/generate-all-ai-vs-human \
-H "Authorization: Bearer $AUTH_TOKEN" \
-H "Content-Type: application/json" \
-d '{"saveToFile": true}'
# Get comparison links
curl -X GET "http://localhost:5001/api/comparisons/links/all?format=txt" \
-H "Authorization: Bearer $AUTH_TOKEN" \
  -o comparison_links.txt

For detailed command-line workflows, see COMMAND_LINE_WORKFLOW.md.
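The curl workflow above can also be scripted without jq, for example in Python with only the standard library. This is a sketch, not part of the platform; it assumes the server is running locally and uses the same endpoints shown above:

```python
import json
import urllib.request

def login(base_url: str, password: str) -> str:
    """POST the admin password to /api/auth/login and return the JWT token."""
    req = urllib.request.Request(
        f"{base_url}/api/auth/login",
        data=json.dumps({"password": password}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["token"]

def generate_all_ai_vs_human(base_url: str, token: str) -> dict:
    """Trigger generation of all AI vs Human comparisons."""
    req = urllib.request.Request(
        f"{base_url}/api/generator/generate-all-ai-vs-human",
        data=json.dumps({"saveToFile": True}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    token = login("http://localhost:5001", "your-admin-password")
    print(generate_all_ai_vs_human("http://localhost:5001", token))
```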
- Admin generates comparisons using the dashboard or CLI
- Admin sets required number of completions (1-10)
- Admin gets list of incomplete comparison links
- Admin shares links with evaluators (no tokens needed for evaluators)
- Evaluators click links and complete evaluations in their browser
- Admin monitors progress via dashboard or CLI
When STORAGE_MODE=local, the application uses the local filesystem:
- Set STORAGE_MODE=local in your .env file
- Set BENCHMARK_DIR to point to your datasets: BENCHMARK_DIR=./benchmarks (or any path you prefer)
- Optionally configure the data directory: DATA_DIR=./evaluation_platform/data (where the platform writes results)
Best for:
- RLI Public Set evaluation
- Development and testing
- Small to medium task sets
- Offline or air-gapped environments
When STORAGE_MODE=s3, the application reads task files directly from your S3 bucket:
S3 Bucket Structure:
your-bucket/
└── tasks/
    ├── task001-Task_Name/
    │   ├── Model1/
    │   │   └── output.txt
    │   ├── Model2/
    │   │   └── output.txt
    │   ├── human1/
    │   │   └── output.txt
    │   └── brief/
    │       └── brief.md
    └── task002-Another_Task/
        └── ...
S3 Permissions Required:
- s3:ListBucket - to list tasks and directories
- s3:GetObject - to read files
S3 Manifest CSV:
The app reads comparison metadata from evaluation_platform/public/s3_manifest.csv when running in S3 mode. Each row describes one evaluation bundle:
- task_id: Unique identifier for the task
- agent: Name or version of the AI system
- repetition: ISO-8601 date stamp or run identifier
- s3_path_ai_artifact: Full S3 URI pointing at the AI-produced artifact folder
- s3_path_human_artifact: Full S3 URI pointing at the human-produced artifact folder
- s3_path_task_definition: Full S3 URI pointing at the task definition or brief folder
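A quick way to sanity-check the manifest before starting the server is a small stdlib-only script. The column names follow the list above; the validation rules themselves (checking for s3:// URIs) are an illustrative sketch, not the platform's own checks:

```python
import csv

# Column names per the manifest description above
REQUIRED_COLUMNS = [
    "task_id", "agent", "repetition",
    "s3_path_ai_artifact", "s3_path_human_artifact", "s3_path_task_definition",
]

def validate_manifest(path: str) -> list[str]:
    """Return a list of human-readable problems found in the manifest CSV."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"missing columns: {missing}"]
        for line_no, row in enumerate(reader, start=2):  # header is line 1
            for col in ("s3_path_ai_artifact", "s3_path_human_artifact",
                        "s3_path_task_definition"):
                if not row[col].startswith("s3://"):
                    problems.append(f"line {line_no}: {col} is not an s3:// URI")
    return problems
```

An empty return value means every row passed the checks.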
Caching:
Files are automatically cached locally in .cache/ after first access for better performance.
Best for:
- Production deployments
- Large task sets
- Multiple evaluators accessing simultaneously
- Centralized task storage
To verify your S3 configuration is working:
- Start the server:
cd evaluation_platform
npm run server
- Check the console output. You should see:
Server running on port 5001
Storage mode: s3
S3 bucket: your-bucket-name
S3 region: us-east-1
S3 tasks prefix: tasks/
Data directory: /path/to/data
- Visit http://localhost:5001 and check if tasks are listed
- Monitor the server console for any S3 errors
- Create a test script to verify S3 access:

// test-s3-connection.js
const { listTasks } = require('./server/services/s3Service');

async function test() {
  try {
    const tasks = await listTasks();
    console.log('Found tasks:', tasks);
  } catch (error) {
    console.error('S3 Error:', error);
  }
}

test();

Run with:
node test-s3-connection.js
To view 3D and CAD files (.dwg, .fbx, .3dm, .step, etc.), you need Autodesk Forge credentials:
- Go to https://aps.autodesk.com/myapps
- Create a new app (or use existing)
- Copy your Client ID and Client Secret
Add to your .env file:
AUTODESK_CLIENT_ID=your-client-id
AUTODESK_CLIENT_SECRET=your-client-secret
AUTODESK_BUCKET=rli-models # Choose a unique bucket name
AUTODESK_CALLBACK_URL=http://localhost:5001 # Or your deployment URL

The bucket is automatically created on first use by the ensureBucketExists() function in server/routes/autodesk.ts.
Important Notes:
- Bucket names must be globally unique across all Autodesk users
- The default name rli-models may already be taken; choose your own unique name (e.g., rli-models-yourname-2025)
- Files uploaded to Autodesk are cached by bucket+filename; if you had upload issues, change your bucket name to force fresh uploads
- Autodesk requires one bucket per domain (callback URL locked)
If bucket creation fails:
- 409 Conflict: Bucket name already exists globally; choose a different unique name
- 403 Forbidden: Check that your API app has the bucket:create scope
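Since bucket names must be globally unique, a simple way to avoid the 409 conflict is to derive a name with a random suffix. This is a sketch; the exact Autodesk OSS bucket-key rules should be checked against their documentation, and the lowercase-alphanumeric-plus-dash character set used here is an assumption:

```python
import re
import secrets

def unique_bucket_name(prefix: str = "rli-models") -> str:
    """Append a random hex suffix so the bucket key is unlikely to collide."""
    name = f"{prefix}-{secrets.token_hex(4)}".lower()
    # Assumed-safe character set; verify against Autodesk's bucket key rules.
    if not re.fullmatch(r"[a-z0-9-]{3,128}", name):
        raise ValueError(f"invalid bucket name: {name}")
    return name
```

Set the result as AUTODESK_BUCKET once and keep it stable afterwards, since uploaded files are cached by bucket+filename.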
docker build -t rli_evaluation_platform .

docker run -d \
  -p 5001:5001 \
  -e STORAGE_MODE=local \
  -e ADMIN_PASSWORD=your-admin-password \
  -e JWT_SECRET=your-jwt-secret \
  -v $(pwd)/evaluation_platform/data:/app/data \
  --name rli-evaluation-platform \
  rli_evaluation_platform

version: '3.8'
services:
  app:
    build: .
    ports:
      - "5001:5001"
    environment:
      - STORAGE_MODE=${STORAGE_MODE}
      - ADMIN_PASSWORD=${ADMIN_PASSWORD}
      - JWT_SECRET=${JWT_SECRET}
      - AUTODESK_CLIENT_ID=${AUTODESK_CLIENT_ID}
      - AUTODESK_CLIENT_SECRET=${AUTODESK_CLIENT_SECRET}
      - AUTODESK_BUCKET=${AUTODESK_BUCKET}
      - AUTODESK_CALLBACK_URL=${AUTODESK_CALLBACK_URL}
    volumes:
      - app-data:/app/data
      - app-cache:/app/.cache

volumes:
  app-data:
  app-cache:

Then run: docker-compose up -d
- Install the Fly CLI:
curl -L https://fly.io/install.sh | sh
- Sign up and log in:
fly auth signup   # or fly auth login
- Run setup with Fly.io deployment:
python setup.py --deploy-flyio
- Launch the app (first time only):
fly launch
- Choose a unique app name
- Select a region close to your users
- Skip database and Redis setup
- Deploy now: No (set secrets first)
- Create a volume for persistent data:
fly volumes create app_data --size 1 --region <your-region>
- Set environment secrets:
fly secrets set ADMIN_PASSWORD=your-admin-password
fly secrets set JWT_SECRET=your-jwt-secret
fly secrets set STORAGE_MODE=local   # or s3
fly secrets set AUTODESK_CLIENT_ID=your-autodesk-client-id
fly secrets set AUTODESK_CLIENT_SECRET=your-autodesk-client-secret
fly secrets set AUTODESK_BUCKET=your-bucket-name
fly secrets set AUTODESK_CALLBACK_URL=https://your-app.fly.dev
For S3 mode, also set:
fly secrets set AWS_ACCESS_KEY_ID=your-key
fly secrets set AWS_SECRET_ACCESS_KEY=your-secret
fly secrets set S3_BUCKET_NAME=your-bucket
fly secrets set S3_REGION=your-region
fly secrets set S3_TASKS_PREFIX=tasks/
- Deploy:
fly deploy
- View your app:
fly open
# View logs
fly logs
# Check app status
fly status
# SSH into the container
fly ssh console

After making changes:
fly deploy

# Scale to multiple instances
fly scale count 2
# Scale memory/CPU
fly scale vm shared-cpu-2x

The platform supports various file types for comparison:
- Text: .txt, .json, .yml, .yaml, .js, .jsx, .ts, .tsx, .css, .py, .java, .go, .php, .rb, .swift, .xml, .sql, .sh, .c, .cpp, .cs, and more
- Markdown: .md
- HTML: .html, .htm
- PDF: .pdf
- Spreadsheets: .csv (tab- and semicolon-delimited variants supported)
- Microsoft Office: .ppt, .pptx, .doc, .docx, .xls, .xlsx
- LaTeX: .tex
- Jupyter Notebooks: .ipynb
- Anki: .apkg
- Images: .jpg, .jpeg, .png, .gif, .bmp, .webp, .svg, .ico, .avif
- TIFF: .tif, .tiff
- Video: .mp4, .m4v, .mkv, .webm, .mov, .avi, .wmv
- Audio: .mp3, .wav, .ogg, .aac, .m4a
- MIDI: .midi, .mid
- PSD: .psd (limited support; complex layer effects may flatten)
- 3D Models: .obj, .mtl, .stl, .gltf, .glb
- Autodesk / CAD: .dwg, .dxf, .dwf, .dwt, .skp, .stp, .step, .ipt, .3dm, .3ds, .fbx, .rvt, .ifc, .catpart, .catproduct, .cgr, .dae, .dgn, .f3d, .gbxml, .iam, .idw
- SQL Databases: .sqlite, .db (large files may load slowly)
- WebGL / Website: .html entry points with associated .js and .css assets for interactive WebGL builds, React apps, Vite, Three.js, web games, and visualization tools
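When scripting around the platform (for example, filtering which deliverable files a viewer can handle), the extension lists above can be encoded in a small lookup table. This helper is a sketch, not part of the platform's API, and the table below is deliberately abbreviated; extend it with the remaining extensions as needed:

```python
from pathlib import Path

# Extension-to-category map built from the supported-file-type lists above
# (abbreviated; extend with the remaining extensions as needed).
CATEGORIES = {
    "text": {".txt", ".json", ".yml", ".yaml", ".py", ".md", ".tex"},
    "document": {".pdf", ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"},
    "image": {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".tif", ".tiff"},
    "media": {".mp4", ".webm", ".mov", ".mp3", ".wav", ".mid", ".midi"},
    "3d_cad": {".obj", ".stl", ".gltf", ".glb", ".dwg", ".step", ".fbx", ".rvt"},
}

def category_of(filename: str) -> str:
    """Return the viewer category for a file, or 'unsupported'."""
    ext = Path(filename).suffix.lower()
    for name, exts in CATEGORIES.items():
        if ext in exts:
            return name
    return "unsupported"
```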
- Create the task directory structure:
mkdir -p benchmarks/my_task_011/{human_deliverable,project/inputs,ai_deliverable/my_model}
- Add your content to the directories:
benchmarks/my_task_011/
├── human_deliverable/
│   └── [your human reference files]
├── project/
│   ├── brief.md
│   └── inputs/
│       └── [input files]
└── ai_deliverable/
    └── my_model/
        └── [your AI model output]
- Ensure BENCHMARK_DIR in .env points to your benchmarks directory
- Regenerate comparisons:
curl -X POST http://localhost:5001/api/generator/generate-all \
  -H "Content-Type: application/json" \
  -d '{"saveToFile": true}'
- Upload your task files to S3:
s3://your-bucket/tasks/task011-New_Task/
├── Model1/
│   └── output.txt
├── Model2/
│   └── output.txt
├── human1/
│   └── output.txt
└── brief/
    └── brief.md
- Add the task entry to public/s3_manifest.csv:
task_id,agent,repetition,s3_path_ai_artifact,s3_path_human_artifact,s3_path_task_definition
TASK011,agent_name,2025-01-01,s3://bucket/tasks/task011/Model1,s3://bucket/tasks/task011/human1,s3://bucket/tasks/task011/brief
- Regenerate comparisons as above
Problem: 401 Client Error: Unauthorized when downloading datasets
Solution:
- Run huggingface-cli login
- Ensure you have access to the cais organization datasets
- Contact the CAIS team to request access if needed
- Access Denied: Check your AWS credentials and bucket permissions
- No Tasks Found: Verify your bucket structure has a tasks/ prefix
- Slow Performance: Files are cached after first access; subsequent loads are faster
- Cache Issues: Delete the .cache/ directory to force fresh downloads
- Ensure the .env file is in the evaluation_platform/ directory
- Restart the server after changing .env
- Check that variables are not commented out
- Verify Autodesk Forge credentials are set in .env
- Check that the bucket name is unique and valid
- View server logs for Autodesk API errors
- Ensure your Autodesk app has the required scopes
# Find process using port 5001
lsof -i :5001
# Kill the process
kill -9 <PID>
# Or use a different port
PORT=5002 npm run server

If you need to start fresh or clean up after testing, you can reset the repository to its initial state:
python setup.py --clean

This will delete:
- benchmarks/ - downloaded datasets (RLI public set and AI deliverables)
- evaluation_platform/data/ - generated comparisons and evaluation metadata
- evaluation_platform/.env - environment configuration
- evaluation_platform/node_modules/ - installed npm packages
- evaluation_platform/dist/ and evaluation_platform/dist-server/ - build artifacts
- .rli_datasets_temp/ - temporary download cache
After cleaning, you can run python setup.py again for a fresh installation.
When to use --clean:
- When you want to test the setup process from scratch
- When switching between different configurations (local ↔ S3)
- When troubleshooting persistent issues with cached data
- Never commit .env to version control; it's already in .gitignore
- Use IAM roles with minimal required permissions for production
- Consider using AWS Secrets Manager for production secrets
- Admin password should be strong and unique
- JWT secret should be a long, random string (32+ characters)
- Access tokens expire after 7 days by default
- All admin endpoints are protected by authentication middleware
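To illustrate what the JWT_SECRET actually signs and how the 7-day expiry is enforced, here is a stdlib-only HS256 sketch. This is not the platform's implementation (a Node backend would normally use standard JWT tooling); it only demonstrates the token structure:

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_token(secret: str, days: int = 7) -> str:
    """Build a minimal HS256 JWT with an expiry claim `days` in the future."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(
        {"role": "admin", "exp": int(time.time()) + days * 86400}).encode())
    msg = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(secret.encode(), msg, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(secret: str, token: str) -> bool:
    """Check the HMAC signature and the exp claim."""
    header, payload, sig = token.split(".")
    msg = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(secret.encode(), msg, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return False
    pad = "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload + pad))
    return claims["exp"] > time.time()
```

This is why JWT_SECRET must be long and random: anyone who knows it can mint valid admin tokens.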
- Command Line Workflow Guide - Detailed CLI usage for admins
- RLI Public Set on HuggingFace
- Example AI Deliverables on HuggingFace
- Comparison metadata is always stored locally in the data/ directory
- S3 files are streamed and cached for optimal performance
- The cache directory (.cache/) can be safely deleted to free space
- Both storage modes can be used interchangeably by changing STORAGE_MODE
- Search for
-herein config files to find placeholder values to replace
- /evaluation_platform/src - React frontend code
- /evaluation_platform/server - Express backend code
- /evaluation_platform/public - static web assets (fonts, soundfonts, icons)
- /evaluation_platform/data - generated comparisons and evaluation metadata
- /evaluation_platform/.cache - cached S3 files (S3 mode only)
- /benchmarks - RLI public set and evaluation datasets (created by setup.py)
- /setup.py - automated setup script