Public repository for the Remote Labor Index (RLI)
RLI Evaluation Platform

A web-based platform for conducting qualitative evaluations of AI deliverables on the RLI public set. Supports AWS S3 and local filesystem storage.

πŸ“¦ Example Datasets

RLI Example Deliverables: Sample AI deliverables for testing the evaluation platform

RLI Public Set: Complete public evaluation dataset

These datasets are automatically downloaded by the setup script below. You can also download them manually using the links above.

πŸš€ Quick Start

Get started in 2-3 minutes:

# Clone the repository
git clone https://github.com/centerforaisafety/rli_evaluation_platform.git
cd rli_evaluation_platform

# Run automated setup
python setup.py

The setup script will:

  • βœ… Install dependencies
  • βœ… Authenticate with HuggingFace
  • βœ… Download the RLI Public Set (10 tasks with human deliverables)
  • βœ… Download example AI outputs from frontier models
  • βœ… Configure environment variables
  • βœ… Set up Autodesk Forge for 3D file viewing (optional)

After setup completes:

# Start the backend server
cd evaluation_platform
npm run server

# In another terminal, start the frontend
cd evaluation_platform
npm run dev

Visit http://localhost:5173 and login with your admin password!

Setup Options

# Skip example AI deliverables (evaluate your own models only)
python setup.py --no-examples

# Configure for S3 storage mode
python setup.py --s3

# Deploy to Fly.io after setup
python setup.py --deploy-flyio

# Use a custom benchmark directory
python setup.py --benchmark-dir /path/to/benchmarks

# Clean repository to initial state (deletes all data and config)
python setup.py --clean

Note: If datasets already exist in your benchmark directory, the setup script will detect them and ask if you want to redownload (defaults to "no"). This makes re-running setup much faster when you only need to reconfigure environment variables or reinstall dependencies.

What You Get

  • The RLI public set
  • Example AI deliverables for each project in the public set
  • The evaluation platform for evaluating AI deliverables against human reference deliverables

Project structure:

benchmarks/public_001/
β”œβ”€β”€ human_deliverable/    # Professional human-created output
β”œβ”€β”€ project/              # Project brief and inputs
β”‚   β”œβ”€β”€ brief.md
β”‚   └── inputs/
└── ai_deliverable/       # AI model outputs (optional)
    β”œβ”€β”€ grok_4/
    β”œβ”€β”€ manus/
    └── sonnet_4_5/

πŸ“– Manual Setup

If you prefer manual setup or need custom configuration, follow these detailed instructions.

Prerequisites

  • Node.js 20.x (matches the Docker image and production deployments)
  • npm 9+ (bundled with Node.js 20)
  • Python 3.7+ (for dataset download)
  • AWS credentials (only for S3 mode)

1. Clone the Repository

git clone https://github.com/centerforaisafety/rli_evaluation_platform.git
cd rli_evaluation_platform

2. Install Dependencies

cd evaluation_platform
npm install --legacy-peer-deps

3. Download RLI Public Set

Option A: Using Python Script

pip install huggingface_hub

# Download public set only
python -c "
from huggingface_hub import snapshot_download
from pathlib import Path

snapshot_download(
    repo_id='cais/rli-public-set',
    repo_type='dataset',
    local_dir='./temp_public_set',
    local_dir_use_symlinks=False
)
print('Downloaded public set to ./temp_public_set')
"

# Move to benchmarks directory
mkdir -p benchmarks
mv temp_public_set/public_* benchmarks/
rm -rf temp_public_set

Option B: Download with Example AI Deliverables

pip install huggingface_hub

# Download public set
python -c "
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id='cais/rli-public-set',
    repo_type='dataset',
    local_dir='./temp_public_set',
    local_dir_use_symlinks=False
)
"

# Download example AI deliverables
python -c "
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id='cais/rli-example-deliverables',
    repo_type='dataset',
    local_dir='./temp_ai_deliverables',
    local_dir_use_symlinks=False
)
"

# Merge datasets
mkdir -p benchmarks
for task in temp_public_set/public_*; do
    task_id=$(basename $task)
    mkdir -p benchmarks/$task_id
    cp -r $task/* benchmarks/$task_id/
    
    # Add AI deliverables if they exist
    if [ -d "temp_ai_deliverables/$task_id/ai_deliverable" ]; then
        cp -r temp_ai_deliverables/$task_id/ai_deliverable benchmarks/$task_id/
    fi
done

# Cleanup
rm -rf temp_public_set temp_ai_deliverables

Option C: Manual Download from HuggingFace

  1. Visit https://huggingface.co/datasets/cais/rli-public-set
  2. Download the files manually
  3. Extract into the benchmarks/ directory so it matches the expected structure below

Expected structure after download:

benchmarks/
β”œβ”€β”€ public_001/
β”‚   β”œβ”€β”€ human_deliverable/
β”‚   β”œβ”€β”€ project/
β”‚   β”‚   β”œβ”€β”€ brief.md
β”‚   β”‚   └── inputs/
β”‚   └── ai_deliverable/      # Only if you downloaded examples
β”‚       β”œβ”€β”€ grok_4/
β”‚       β”œβ”€β”€ manus/
β”‚       └── sonnet_4_5/
β”œβ”€β”€ public_002/
β”‚   └── ...
β”‚   ...
└── public_010/

4. Configure Environment Variables

Create a .env file in the evaluation_platform/ directory:

cd evaluation_platform

For Local Mode (recommended for RLI Public Set):

# Storage mode: local
STORAGE_MODE=local

# Benchmark Directory (where evaluation datasets are stored)
BENCHMARK_DIR=./benchmarks

# Server port
PORT=5001

# Authentication (REQUIRED)
ADMIN_PASSWORD=your-secure-admin-password
JWT_SECRET=your-long-random-jwt-secret-string

# Autodesk Forge (for 3D file viewing)
# Get credentials at: https://aps.autodesk.com/myapps
AUTODESK_CLIENT_ID=your-client-id
AUTODESK_CLIENT_SECRET=your-client-secret
AUTODESK_BUCKET=your-unique-bucket-name  # Must be globally unique
AUTODESK_CALLBACK_URL=http://localhost:5001

For S3 Mode:

# Storage mode: s3
STORAGE_MODE=s3

# AWS S3 Configuration
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
S3_BUCKET_NAME=your-bucket-name
S3_REGION=us-east-1
S3_TASKS_PREFIX=tasks/

# Server port
PORT=5001

# Authentication (REQUIRED)
ADMIN_PASSWORD=your-admin-password
JWT_SECRET=your-jwt-secret

# Autodesk Forge (for 3D file viewing)
AUTODESK_CLIENT_ID=your-client-id
AUTODESK_CLIENT_SECRET=your-client-secret
AUTODESK_BUCKET=your-unique-bucket-name  # Must be globally unique
AUTODESK_CALLBACK_URL=http://localhost:5001
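As a sanity check before starting the server, the required variables for each mode can be validated with a short script (a hypothetical sketch; the variable lists mirror the .env examples above, and `missing_vars` is an illustrative name):

```python
# Hypothetical pre-flight check for the .env values described above.
REQUIRED = {
    "local": ["STORAGE_MODE", "ADMIN_PASSWORD", "JWT_SECRET", "BENCHMARK_DIR"],
    "s3": ["STORAGE_MODE", "ADMIN_PASSWORD", "JWT_SECRET",
           "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET_NAME"],
}

def missing_vars(env: dict) -> list:
    """Return required variable names that are unset for the chosen mode."""
    mode = env.get("STORAGE_MODE", "local")
    return [k for k in REQUIRED.get(mode, REQUIRED["local"]) if not env.get(k)]
```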

Generate secure secrets:

# Generate JWT secret
python -c "import secrets; print(secrets.token_urlsafe(32))"

# Generate admin password
python -c "import secrets; print(secrets.token_urlsafe(16))"

5. Start the Application

Development Mode:

# Terminal 1: Start backend server
npm run server

# Terminal 2: Start frontend dev server
npm run dev

Production Mode:

npm run build
npm run server

Visit:

  • Development: http://localhost:5173
  • Production: http://localhost:5001

🎯 Usage

Authentication

  1. Visit the application URL
  2. You'll be redirected to the login page
  3. Login with your admin password

Generate Comparisons

Via Web Dashboard:

  1. Login as admin
  2. Click "Generate AI vs Human Comparisons"
  3. Set required number of completions (1-10)
  4. Get comparison links to share with evaluators

Via Command Line:

# Login as admin
TOKEN=$(curl -s -X POST http://localhost:5001/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"password": "your-admin-password"}' | jq -r '.token')

export AUTH_TOKEN=$TOKEN

# Generate all AI vs Human comparisons
curl -X POST http://localhost:5001/api/generator/generate-all-ai-vs-human \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"saveToFile": true}'

# Get comparison links
curl -X GET "http://localhost:5001/api/comparisons/links/all?format=txt" \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -o comparison_links.txt
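The same workflow can be driven from Python instead of curl. The sketch below only builds the requests, using the endpoints and payloads shown in the commands above; actually sending them against a live server is left commented out:

```python
# Hypothetical Python version of the curl workflow above (endpoints from this README).
import json
import urllib.request

BASE = "http://localhost:5001"

def login_request(password: str) -> urllib.request.Request:
    """POST /api/auth/login with the admin password; the response carries a JWT."""
    body = json.dumps({"password": password}).encode()
    return urllib.request.Request(
        f"{BASE}/api/auth/login", data=body,
        headers={"Content-Type": "application/json"}, method="POST")

def generate_request(token: str) -> urllib.request.Request:
    """POST /api/generator/generate-all-ai-vs-human with the bearer token."""
    body = json.dumps({"saveToFile": True}).encode()
    return urllib.request.Request(
        f"{BASE}/api/generator/generate-all-ai-vs-human", data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"}, method="POST")

# Against a running server:
# token = json.load(urllib.request.urlopen(login_request("your-admin-password")))["token"]
# urllib.request.urlopen(generate_request(token))
```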

For detailed command-line workflows, see COMMAND_LINE_WORKFLOW.md.

Workflow

  1. Admin generates comparisons using the dashboard or CLI
  2. Admin sets required number of completions (1-10)
  3. Admin gets list of incomplete comparison links
  4. Admin shares links with evaluators (no tokens needed for evaluators)
  5. Evaluators click links and complete evaluations in their browser
  6. Admin monitors progress via dashboard or CLI

πŸ› οΈ Storage Modes

Local Mode (Default)

When STORAGE_MODE=local, the application uses the local filesystem:

  1. Set STORAGE_MODE=local in your .env file
  2. Set BENCHMARK_DIR to point to your datasets:
    BENCHMARK_DIR=./benchmarks  # or any path you prefer
  3. Optionally configure data directory:
    DATA_DIR=./evaluation_platform/data  # where platform writes results

Best for:

  • RLI Public Set evaluation
  • Development and testing
  • Small to medium task sets
  • Offline or air-gapped environments

S3 Mode

When STORAGE_MODE=s3, the application reads task files directly from your S3 bucket:

S3 Bucket Structure:

your-bucket/
└── tasks/
    β”œβ”€β”€ task001-Task_Name/
    β”‚   β”œβ”€β”€ Model1/
    β”‚   β”‚   └── output.txt
    β”‚   β”œβ”€β”€ Model2/
    β”‚   β”‚   └── output.txt
    β”‚   β”œβ”€β”€ human1/
    β”‚   β”‚   └── output.txt
    β”‚   └── brief/
    β”‚       └── brief.md
    └── task002-Another_Task/
        └── ...

S3 Permissions Required:

  • s3:ListBucket - to list tasks and directories
  • s3:GetObject - to read files

S3 Manifest CSV:

The app reads comparison metadata from evaluation_platform/public/s3_manifest.csv when running in S3 mode. Each row describes one evaluation bundle:

  • task_id: Unique identifier for the task
  • agent: Name or version of the AI system
  • repetition: ISO-8601 date stamp or run identifier
  • s3_path_ai_artifact: Full S3 URI pointing at the AI-produced artifact folder
  • s3_path_human_artifact: Full S3 URI pointing at the human-produced artifact folder
  • s3_path_task_definition: Full S3 URI pointing at the task definition or brief folder
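A minimal reader for this schema might look like the following (hypothetical sketch; the platform's actual parser may differ):

```python
# Hypothetical reader for the s3_manifest.csv schema described above.
import csv
import io

MANIFEST_COLUMNS = ["task_id", "agent", "repetition",
                    "s3_path_ai_artifact", "s3_path_human_artifact",
                    "s3_path_task_definition"]

def read_manifest(text: str) -> list:
    """Parse manifest rows, rejecting rows with missing or empty columns."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        missing = [c for c in MANIFEST_COLUMNS if not row.get(c)]
        if missing:
            raise ValueError(f"row {row.get('task_id')!r} missing {missing}")
    return rows
```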

Caching: Files are automatically cached locally in .cache/ after first access for better performance.

Best for:

  • Production deployments
  • Large task sets
  • Multiple evaluators accessing simultaneously
  • Centralized task storage

πŸ§ͺ Testing S3 Connection

To verify your S3 configuration is working:

  1. Start the server:

    cd evaluation_platform
    npm run server
  2. Check the console output. You should see:

    Server running on port 5001
    Storage mode: s3
    S3 bucket: your-bucket-name
    S3 region: us-east-1
    S3 tasks prefix: tasks/
    Data directory: /path/to/data
    
  3. Visit http://localhost:5001 and check if tasks are listed

  4. Monitor the server console for any S3 errors

  5. Create a test script to verify S3 access:

    // test-s3-connection.js
    const { listTasks } = require('./server/services/s3Service');
    
    async function test() {
      try {
        const tasks = await listTasks();
        console.log('Found tasks:', tasks);
      } catch (error) {
        console.error('S3 Error:', error);
      }
    }
    
    test();

    Run with: node test-s3-connection.js


πŸ”§ Autodesk Forge Setup

To view 3D and CAD files (.dwg, .fbx, .3dm, .step, etc.), you need Autodesk Forge credentials:

1. Get API Credentials

  1. Go to https://aps.autodesk.com/myapps
  2. Create a new app (or use existing)
  3. Copy your Client ID and Client Secret

2. Configure Environment Variables

Add to your .env file:

AUTODESK_CLIENT_ID=your-client-id
AUTODESK_CLIENT_SECRET=your-client-secret
AUTODESK_BUCKET=rli-models  # Choose a unique bucket name
AUTODESK_CALLBACK_URL=http://localhost:5001  # Or your deployment URL

3. Bucket Creation

The bucket is automatically created on first use by the ensureBucketExists() function in server/routes/autodesk.ts.

Important Notes:

  • Bucket names must be globally unique across all Autodesk users
  • The default name rli-models may already be taken - choose your own unique name (e.g., rli-models-yourname-2025)
  • Files uploaded to Autodesk are cached by bucket+filename - if you had upload issues, change your bucket name to force fresh uploads
  • Autodesk locks each bucket to its app's callback URL, so use one bucket per deployment domain

If bucket creation fails:

  • 409 Conflict β†’ Bucket name already exists globally β†’ choose a different unique name
  • 403 Forbidden β†’ Check API app has bucket:create scope
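Those two failure modes can be summarized in a small helper (hypothetical; `bucket_error_hint` is an illustrative name, not part of the codebase):

```python
# Hypothetical helper mapping the APS bucket-creation errors above to fixes.
def bucket_error_hint(status: int) -> str:
    """Translate a bucket-creation HTTP status into the remedy listed above."""
    if status == 409:
        return "Bucket name already exists globally: choose a different unique name."
    if status == 403:
        return "Check that the API app has the bucket:create scope."
    return f"Unexpected status {status}: check the server logs."
```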

🐳 Docker Support

Building with S3 Support

docker build -t rli_evaluation_platform .

Running with Environment Variables

docker run -d \
  -p 5001:5001 \
  -e STORAGE_MODE=local \
  -e ADMIN_PASSWORD=your-admin-password \
  -e JWT_SECRET=your-jwt-secret \
  -v $(pwd)/evaluation_platform/data:/app/data \
  --name rli-evaluation-platform \
  rli_evaluation_platform

Docker Compose

version: '3.8'
services:
  app:
    build: .
    ports:
      - "5001:5001"
    environment:
      - STORAGE_MODE=${STORAGE_MODE}
      - ADMIN_PASSWORD=${ADMIN_PASSWORD}
      - JWT_SECRET=${JWT_SECRET}
      - AUTODESK_CLIENT_ID=${AUTODESK_CLIENT_ID}
      - AUTODESK_CLIENT_SECRET=${AUTODESK_CLIENT_SECRET}
      - AUTODESK_BUCKET=${AUTODESK_BUCKET}
      - AUTODESK_CALLBACK_URL=${AUTODESK_CALLBACK_URL}
    volumes:
      - app-data:/app/data
      - app-cache:/app/.cache

volumes:
  app-data:
  app-cache:

Then run: docker-compose up -d


☁️ Deployment with Fly.io

Prerequisites

  1. Install Fly CLI:

    curl -L https://fly.io/install.sh | sh
  2. Sign up and login:

    fly auth signup
    # or
    fly auth login

Automated Deployment

# Run setup with Fly.io deployment
python setup.py --deploy-flyio

Manual Deployment Steps

  1. Launch the app (first time only):

    fly launch
    • Choose a unique app name
    • Select a region close to your users
    • Skip database and Redis setup
    • Deploy now: No (set secrets first)
  2. Create a volume for persistent data:

    fly volumes create app_data --size 1 --region <your-region>
  3. Set environment secrets:

    fly secrets set ADMIN_PASSWORD=your-admin-password
    fly secrets set JWT_SECRET=your-jwt-secret
    fly secrets set STORAGE_MODE=local  # or s3
    fly secrets set AUTODESK_CLIENT_ID=your-autodesk-client-id
    fly secrets set AUTODESK_CLIENT_SECRET=your-autodesk-client-secret
    fly secrets set AUTODESK_BUCKET=your-bucket-name
    fly secrets set AUTODESK_CALLBACK_URL=https://your-app.fly.dev

    For S3 mode, also set:

    fly secrets set AWS_ACCESS_KEY_ID=your-key
    fly secrets set AWS_SECRET_ACCESS_KEY=your-secret
    fly secrets set S3_BUCKET_NAME=your-bucket
    fly secrets set S3_REGION=your-region
    fly secrets set S3_TASKS_PREFIX=tasks/
  4. Deploy:

    fly deploy
  5. View your app:

    fly open

Monitoring and Logs

# View logs
fly logs

# Check app status
fly status

# SSH into the container
fly ssh console

Updating the App

After making changes:

fly deploy

Scaling

# Scale to multiple instances
fly scale count 2

# Scale memory/CPU
fly scale vm shared-cpu-2x

πŸ“ File Support

The platform supports various file types for comparison:

Documents

  • Text: .txt, .json, .yml, .yaml, .js, .jsx, .ts, .tsx, .css, .py, .java, .go, .php, .rb, .swift, .xml, .sql, .sh, .c, .cpp, .cs, and more
  • Markdown: .md
  • HTML: .html, .htm
  • PDF: .pdf
  • Spreadsheets: .csv (tab- and semicolon-delimited variants supported)
  • Microsoft Office: .ppt, .pptx, .doc, .docx, .xls, .xlsx
  • LaTeX: .tex
  • Jupyter Notebooks: .ipynb
  • Anki: .apkg

Media

  • Images: .jpg, .jpeg, .png, .gif, .bmp, .webp, .svg, .ico, .avif
  • TIFF: .tif, .tiff
  • Video: .mp4, .m4v, .mkv, .webm, .mov, .avi, .wmv
  • Audio: .mp3, .wav, .ogg, .aac, .m4a
  • MIDI: .midi, .mid

Design & 3D

  • PSD: .psd (limited support; complex layer effects may flatten)
  • 3D Models: .obj, .mtl, .stl, .gltf, .glb
  • Autodesk / CAD: .dwg, .dxf, .dwf, .dwt, .skp, .stp, .step, .ipt, .3dm, .3ds, .fbx, .rvt, .ifc, .catpart, .catproduct, .cgr, .dae, .dgn, .f3d, .gbxml, .iam, .idw

Data & Interactive

  • SQL Databases: .sqlite, .db (large files may load slowly)
  • WebGL / Website: .html entry points with associated .js and .css assets for interactive WebGL builds, React apps, Vite, Three.js, web games, and visualization tools
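Viewer selection presumably keys off the file extension. A much-abridged sketch of such a mapping (hypothetical, covering only a few of the extensions listed above; the category names are illustrative):

```python
# Hypothetical, abridged extension-to-viewer map built from the lists above.
VIEWER_CATEGORY = {
    ".md": "document", ".pdf": "document", ".csv": "document", ".ipynb": "document",
    ".png": "image", ".svg": "image", ".mp4": "video", ".mp3": "audio",
    ".stl": "3d", ".dwg": "cad", ".step": "cad",
    ".sqlite": "database", ".html": "web",
}

def categorize(filename: str) -> str:
    """Pick a viewer category from the extension; default to plain text."""
    suffix = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return VIEWER_CATEGORY.get(suffix, "text")
```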

βž• Adding New Tasks

For Local Mode

  1. Create task directory structure:

    mkdir -p benchmarks/my_task_011/{human_deliverable,project/inputs,ai_deliverable/my_model}
  2. Add your content to the directories:

    benchmarks/my_task_011/
    β”œβ”€β”€ human_deliverable/
    β”‚   └── [your human reference files]
    β”œβ”€β”€ project/
    β”‚   β”œβ”€β”€ brief.md
    β”‚   └── inputs/
    β”‚       └── [input files]
    └── ai_deliverable/
        └── my_model/
            └── [your AI model output]
    
  3. Ensure BENCHMARK_DIR in .env points to your benchmarks directory

  4. Regenerate comparisons:

    curl -X POST http://localhost:5001/api/generator/generate-all \
      -H "Authorization: Bearer $AUTH_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"saveToFile": true}'

For S3 Mode

  1. Upload your task files to S3:

    s3://your-bucket/tasks/task011-New_Task/
    β”œβ”€β”€ Model1/
    β”‚   └── output.txt
    β”œβ”€β”€ Model2/
    β”‚   └── output.txt
    β”œβ”€β”€ human1/
    β”‚   └── output.txt
    └── brief/
        └── brief.md
    
  2. Add the task entry to public/s3_manifest.csv:

    task_id,agent,repetition,s3_path_ai_artifact,s3_path_human_artifact,s3_path_task_definition
    TASK011,agent_name,2025-01-01,s3://bucket/tasks/task011/Model1,s3://bucket/tasks/task011/human1,s3://bucket/tasks/task011/brief
  3. Regenerate comparisons as above


πŸ” Troubleshooting

HuggingFace Authentication Issues

Problem: 401 Client Error: Unauthorized when downloading datasets

Solution:

  1. Run huggingface-cli login
  2. Ensure you have access to the cais organization datasets
  3. Contact the CAIS team to request access if needed

S3 Connection Issues

  1. Access Denied: Check your AWS credentials and bucket permissions
  2. No Tasks Found: Verify your bucket structure has a tasks/ prefix
  3. Slow Performance: Files are cached after first access - subsequent loads are faster
  4. Cache Issues: Delete .cache/ directory to force fresh downloads

Environment Variables Not Loading

  1. Ensure .env file is in the evaluation_platform/ directory
  2. Restart the server after changing .env
  3. Check that variables are not commented out

3D Files Not Displaying

  1. Verify Autodesk Forge credentials are set in .env
  2. Check that bucket name is unique and valid
  3. View server logs for Autodesk API errors
  4. Ensure your Autodesk app has the required scopes

Port Already in Use

# Find process using port 5001
lsof -i :5001

# Kill the process
kill -9 <PID>

# Or use a different port
PORT=5002 npm run server

🧹 Resetting the Repository

If you need to start fresh or clean up after testing, you can reset the repository to its initial state:

python setup.py --clean

This will delete:

  • benchmarks/ - Downloaded datasets (RLI public set and AI deliverables)
  • evaluation_platform/data/ - Generated comparisons and evaluation metadata
  • evaluation_platform/.env - Environment configuration
  • evaluation_platform/node_modules/ - Installed npm packages
  • evaluation_platform/dist/ and evaluation_platform/dist-server/ - Build artifacts
  • .rli_datasets_temp/ - Temporary download cache

After cleaning, you can run python setup.py again for a fresh installation.

When to use --clean:

  • When you want to test the setup process from scratch
  • When switching between different configurations (local ↔ S3)
  • When troubleshooting persistent issues with cached data

πŸ”’ Security Notes

  • Never commit .env to version control - it's already in .gitignore
  • Use IAM roles with minimal required permissions for production
  • Consider using AWS Secrets Manager for production secrets
  • Admin password should be strong and unique
  • JWT secret should be a long, random string (32+ characters)
  • Access tokens expire after 7 days by default
  • All admin endpoints are protected by authentication middleware


πŸ’‘ Technical Notes

  • Comparison metadata is always stored locally in the data/ directory
  • S3 files are streamed and cached for optimal performance
  • The cache directory (.cache/) can be safely deleted to free space
  • Both storage modes can be used interchangeably by changing STORAGE_MODE
  • The app automatically adds base URLs to HTML files for proper resource loading
  • Search for -here in config files to find placeholder values to replace
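The base-URL injection mentioned above can be sketched as follows (hypothetical; the platform's actual implementation lives in the Node server and may differ):

```python
# Hypothetical sketch of injecting a <base> tag so relative assets resolve.
import re

def inject_base_href(html: str, base_url: str) -> str:
    """Insert a <base> tag right after <head>, unless one already exists."""
    if "<base" in html:
        return html  # respect an existing base tag
    return re.sub(r"(<head[^>]*>)", rf'\1<base href="{base_url}/">', html, count=1)
```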

Project Structure

  • /evaluation_platform/src - React frontend code
  • /evaluation_platform/server - Express backend code
  • /evaluation_platform/public - Static web assets (fonts, soundfonts, icons)
  • /evaluation_platform/data - Generated comparisons and evaluation metadata
  • /evaluation_platform/.cache - Cached S3 files (S3 mode only)
  • /benchmarks - RLI public set and evaluation datasets (created by setup.py)
  • /setup.py - Automated setup script
