Merged
44 commits
7814d4a
initial evaluations app
warmbowski Mar 5, 2026
8b0060b
set up tiered eval config and docker image build/run locally
warmbowski Mar 5, 2026
c9b9ec5
get cloud run working, start adding skiff2 configs
warmbowski Mar 9, 2026
98f0e92
change tier name
warmbowski Mar 9, 2026
a4d1f41
configure skiff2 build, push, and deploy
warmbowski Mar 9, 2026
a179b7c
cleanup unused code and add .env support
warmbowski Mar 9, 2026
98f80df
replace shell scripts with python
warmbowski Mar 9, 2026
69de4ed
remove schedule prop
warmbowski Mar 9, 2026
2170148
fix registry path and comment out storage from jobs for now
warmbowski Mar 9, 2026
f253e7e
refactor to use local model provider config instead of model presets
warmbowski Mar 9, 2026
3e7d640
remove click dep, and remove tier-list and job-list args
warmbowski Mar 9, 2026
b769fed
remove some logger formatting
warmbowski Mar 10, 2026
528cc75
refactor harness overrides
warmbowski Mar 10, 2026
516febb
make ci deploy and local deploy use the same script
warmbowski Mar 10, 2026
9ac16f7
remove unused method
warmbowski Mar 10, 2026
7758de7
test ci build and deploy
warmbowski Mar 10, 2026
7256d6f
fix lint issues
warmbowski Mar 10, 2026
d73f798
workaround for skiff2 setup action.yaml
warmbowski Mar 10, 2026
501753b
fix path
warmbowski Mar 10, 2026
075110b
fix action path
warmbowski Mar 10, 2026
fc9272a
troubleshooting actions.yaml
warmbowski Mar 10, 2026
7211936
fix filename
warmbowski Mar 10, 2026
742d5a9
troubleshooting
warmbowski Mar 10, 2026
5c01dcc
fix running skiff2 action
warmbowski Mar 10, 2026
3676ddf
test ci actions
warmbowski Mar 10, 2026
cc1b3af
trigger gcr build and deploy
warmbowski Mar 10, 2026
0d85973
fix ci to only run on main
warmbowski Mar 10, 2026
b404573
fix some of the lint rules
warmbowski Mar 10, 2026
70eddae
generate jobs from template and deploy
warmbowski Mar 11, 2026
52a0dbf
convert to using terraform and pydantic settings
warmbowski Mar 12, 2026
d0b5420
add standard logger to replace print statements
warmbowski Mar 12, 2026
7380688
add uv setup to ci build
warmbowski Mar 12, 2026
c9a0628
add github token to ci so that it can load private repo
warmbowski Mar 12, 2026
07efd97
remove test branch from ci build on push
warmbowski Mar 12, 2026
8530bab
refactor to one cloud job that takes args for tier, and add support f…
warmbowski Mar 13, 2026
84fd8d8
fix adhoc api_base
warmbowski Mar 13, 2026
1782d06
refactor updated-env-vars format and add validation to parsing
warmbowski Mar 17, 2026
42da390
fix lint
warmbowski Mar 17, 2026
7e40dc4
refactor build-and-push-evals action to use docker actions
warmbowski Mar 17, 2026
2f417f4
fix delimiter in adhoc overrides
warmbowski Mar 17, 2026
0b4a597
add parser unit tests
warmbowski Mar 17, 2026
fd40615
fix tests in ci
warmbowski Mar 17, 2026
b52959a
use pydantic settings, + small fixes
warmbowski Mar 23, 2026
8e1221b
move run-local into app and update settings defaults
warmbowski Mar 23, 2026

52 changes: 52 additions & 0 deletions .github/actions/skiff2/setup/action.yml
@@ -0,0 +1,52 @@
name: Setup GCP and Docker
description: Sets up Google Cloud authentication, Cloud SDK, and Docker for GCR
author: Skiff

Review comment (PR author): skiff2 shared actions aren't available to public repos. This file will be removed once @CalebOuellette gets a fix in.

# This is a temp workaround until skiff2 shared actions are accessible from public repos
# Please remove this file and switch to shared action when available.

inputs:
  workload_identity_provider:
    description: "Workload Identity Provider resource name (e.g. projects/123/locations/global/workloadIdentityPools/my-pool/providers/my-provider)"
    required: true
  service_account:
    description: "Service account email to impersonate"
    required: true
  project_id:
    description: "GCP project ID"
    required: true

runs:
  using: composite
  steps:
    - name: Check branch is main
      shell: bash
      run: |
        if [ "${{ github.ref }}" != "refs/heads/main" ]; then
          echo "This action can only run on the main branch. Current ref: ${{ github.ref }}"
          exit 1
        fi

    - name: Checkout calling repository
      uses: actions/checkout@v4

    - name: Authenticate to Google Cloud
      uses: google-github-actions/auth@v2
      with:
        workload_identity_provider: ${{ inputs.workload_identity_provider }}
        service_account: ${{ inputs.service_account }}

    - name: Set up Cloud SDK
      uses: google-github-actions/setup-gcloud@v2
      with:
        project_id: ${{ inputs.project_id }}

    - name: Configure Docker for GCR
      shell: bash
      run: gcloud auth configure-docker

    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3

branding:
  icon: cloud
  color: blue
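
Note on the "Check branch is main" step: the guard compares the full ref string, so anything other than `refs/heads/main` fails, including pull request refs of the form `refs/pull/<number>/merge` and a `workflow_dispatch` started from another branch. A minimal Python sketch of the same comparison, for reasoning about which triggers pass; the function names are hypothetical and not part of this PR:

```python
def is_main_ref(ref: str) -> bool:
    """Mirror the shell test `[ "$ref" != "refs/heads/main" ]` from the action."""
    return ref == "refs/heads/main"


def guard_exit_code(ref: str) -> int:
    """Return the exit code the 'Check branch is main' step would produce."""
    if not is_main_ref(ref):
        print(f"This action can only run on the main branch. Current ref: {ref}")
        return 1
    return 0


if __name__ == "__main__":
    for ref in ("refs/heads/main", "refs/heads/feature-x", "refs/pull/12/merge"):
        print(ref, "->", guard_exit_code(ref))
```

Because the comparison is an exact string match, a branch merely named like `main-backup` is also rejected.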
114 changes: 114 additions & 0 deletions .github/workflows/build-and-push-evals.yml
@@ -0,0 +1,114 @@
name: Build and Deploy Evaluations Cloud Run Jobs

on:
  push:
    branches:
      - main
    paths:
      - 'apps/evaluations/**'
      - '.github/workflows/build-and-push-evals.yml'
  pull_request:
    paths:
      - 'apps/evaluations/**'
      - '.github/workflows/build-and-push-evals.yml'
  workflow_dispatch:

permissions:
  contents: read
  id-token: write

env:
  SERVICE_NAME: evaluations
  REGISTRY: us-west1-docker.pkg.dev
  REPO: model-evals

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - name: Setup uv
        uses: astral-sh/setup-uv@v7

      - name: Run Tests
        working-directory: apps/evaluations
        run: uv run --only-group dev pytest -v

  build-and-deploy:
    needs: test
    if: github.event_name == 'push' || github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    environment:
      name: ${{ github.ref_name }}
    steps:
      - uses: actions/checkout@v6 # remove this when switching back to shared action

      - name: Skiff2 Setup
        id: setup
        uses: ./.github/actions/skiff2/setup # temporary workaround until shared action is available
        with:
          workload_identity_provider: ${{ vars.SKIFF2_WORKLOAD_IDENTITY_PROVIDER }}
          service_account: ${{ vars.SKIFF2_SERVICE_ACCOUNT }}
          project_id: ${{ vars.SKIFF2_PROJECT_ID }}

      # Configure Docker for Artifact Registry
      - name: Configure Docker
        run: gcloud auth configure-docker ${REGISTRY} --quiet

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      # Custom build step for evaluations (handles GITHUB_TOKEN for private repo)
      # Once olmo-eval-internal is public, this can be replaced with Skiff2 Build
      - name: Build and Push Evaluations Image
        id: build
        uses: docker/build-push-action@v6
        with:
          context: apps/evaluations
          file: apps/evaluations/Dockerfile
          platforms: linux/amd64
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ vars.SKIFF2_PROJECT_ID }}/${{ env.REPO }}/${{ env.SERVICE_NAME }}:latest
            ${{ env.REGISTRY }}/${{ vars.SKIFF2_PROJECT_ID }}/${{ env.REPO }}/${{ env.SERVICE_NAME }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          secrets: |
            GITHUB_TOKEN=${{ secrets.OLMO_EVAL_INTERNAL_TOKEN }}

      # Setup uv for Python package management
      - name: Setup uv
        uses: astral-sh/setup-uv@v7

      # Configure git to use token for private repo access
      - name: Configure Git for Private Repos
        run: git config --global url."https://${{ secrets.OLMO_EVAL_INTERNAL_TOKEN }}@github.com/".insteadOf "https://github.com/"

      # Generate Terraform variables from Python tier configs
      - name: Generate Terraform Variables
        working-directory: apps/evaluations
        run: uv run generate-tfvars -o terraform/terraform.tfvars.json

      # Setup Terraform
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.5"

      # Deploy with Terraform
      - name: Terraform Init
        working-directory: apps/evaluations/terraform
        run: terraform init

      - name: Terraform Plan
        working-directory: apps/evaluations/terraform
        run: |
          terraform plan \
            -var="project_id=${{ vars.SKIFF2_PROJECT_ID }}" \
            -var="image_tag=${{ github.sha }}" \
            -out=tfplan

      - name: Terraform Apply
        working-directory: apps/evaluations/terraform
        run: terraform apply -auto-approve tfplan
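
Note on image tagging: the build step pushes two tags per run, `latest` and the commit SHA, and the Terraform Plan step passes that same SHA as `image_tag`. A small Python sketch of how the two tag strings are composed from the workflow's `env` values; the function name, project ID, and SHA below are placeholders for illustration, not values from this PR:

```python
def image_tags(registry: str, project_id: str, repo: str, service: str, sha: str) -> list[str]:
    """Compose the two image references the workflow pushes: ':latest' and ':<sha>'."""
    base = f"{registry}/{project_id}/{repo}/{service}"
    return [f"{base}:latest", f"{base}:{sha}"]


if __name__ == "__main__":
    tags = image_tags(
        registry="us-west1-docker.pkg.dev",  # env.REGISTRY in the workflow
        project_id="example-project",        # placeholder for vars.SKIFF2_PROJECT_ID
        repo="model-evals",                  # env.REPO
        service="evaluations",               # env.SERVICE_NAME
        sha="0123456789abcdef",              # placeholder for github.sha
    )
    for tag in tags:
        print(tag)
```

The SHA tag is the one that identifies a specific build; `latest` moves with every push to main.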
2 changes: 1 addition & 1 deletion .github/workflows/verify-api.yml
@@ -23,7 +23,7 @@ jobs:
         uses: ./.github/actions/set-up-uv

       - name: Test with pytest
-        run: uv run pytest --ignore ./apps/flask-api/e2e --ignore ./apps/api/e2e
+        run: uv run pytest --ignore ./apps/flask-api/e2e --ignore ./apps/api/e2e --ignore ./apps/evaluations

   type-check:
     runs-on: ubuntu-latest
17 changes: 17 additions & 0 deletions apps/evaluations/.env.local.example
@@ -0,0 +1,17 @@
# Local environment variables for evaluations
# Copy to .env.local and fill in values

# Required for Docker build (private repo access)
GITHUB_TOKEN=

# Required for running evaluations
LITELLM_PROXY_API_KEY=

# Required for storage (Postgres)
PGHOST=
PGPASSWORD=

# Required for storage (S3)
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
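
Note on `.env.local`: every key in the example file ships empty, so a quick pre-flight check can catch a value that was never filled in. The sketch below uses only the standard library and is not the app's own loader (the commit list indicates the app uses pydantic settings); the function names are hypothetical, and it treats all six keys as required even though the storage keys may be unnecessary in local mode without storage:

```python
from pathlib import Path

# Keys listed in .env.local.example
REQUIRED_KEYS = (
    "GITHUB_TOKEN",
    "LITELLM_PROXY_API_KEY",
    "PGHOST",
    "PGPASSWORD",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return required keys that are absent or left empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    path = Path(".env.local")
    text = path.read_text() if path.exists() else ""
    print("missing:", missing_keys(parse_env_text(text)))
```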

11 changes: 11 additions & 0 deletions apps/evaluations/.gitignore
@@ -0,0 +1,11 @@
# Python
__pycache__/
*.py[cod]
*.egg-info/

# Build artifacts
dist/
build/

# Local environment
.env.local
77 changes: 77 additions & 0 deletions apps/evaluations/Dockerfile
@@ -0,0 +1,77 @@
# Evaluations Docker Image for Cloud Run Jobs
#
# Docker image that runs a list of model evals based on tier configuration,
# running each individual model eval as its own Google Cloud Job
#
# Build (requires GitHub token for private repo access):
# docker build --platform linux/amd64 --secret id=GITHUB_TOKEN \
# -t evaluations -f apps/evaluations/Dockerfile apps/evaluations
#
# Once olmo-eval-internal is public, remove --secret GITHUB_TOKEN
#
# Run tier (local mode, no storage):
# docker run -e EVAL_TIER=standard -e CLOUD_RUN_TASK_INDEX=0 -e LOCAL=true \
# -e LITELLM_PROXY_API_KEY=$LITELLM_PROXY_API_KEY evaluations
#
# Run builds/evals locally with helper script:
# uv run run-local --tier standard --build
# uv run run-local --build-only
#

# ============================================================================
# Stage 1: Builder
# ============================================================================
FROM --platform=linux/amd64 ghcr.io/astral-sh/uv:python3.14-bookworm-slim AS builder

ENV UV_COMPILE_BYTECODE=1 UV_LINK_MODE=copy
ENV UV_PYTHON_DOWNLOADS=0

# Install git for cloning olmo-eval-internal
RUN apt-get update -qq && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*

# GitHub token for private repo access (mounted as secret)
RUN --mount=type=secret,id=GITHUB_TOKEN \
    git config --global url."https://$(cat /run/secrets/GITHUB_TOKEN)@github.com/".insteadOf "https://github.com/"

WORKDIR /app

# Copy evaluations package
COPY src /app/src
COPY pyproject.toml /app/pyproject.toml

# Install evaluations package (pulls olmo-eval-internal from git)
RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install --system /app

# ============================================================================
# Stage 2: Runtime
# ============================================================================
FROM --platform=linux/amd64 python:3.14-slim-bookworm AS runner

# Install runtime dependencies
RUN apt-get update -qq && \
    apt-get install -y --no-install-recommends ca-certificates && \
    rm -rf /var/lib/apt/lists/*

# Setup non-root user
RUN groupadd --system --gid 999 nonroot \
    && useradd --system --gid 999 --uid 999 --create-home nonroot

# Copy installed packages from builder
COPY --from=builder /usr/local/lib/python3.14/site-packages /usr/local/lib/python3.14/site-packages
COPY --from=builder /usr/local/bin/olmo-eval /usr/local/bin/olmo-eval
COPY --from=builder /usr/local/bin/evaluations /usr/local/bin/evaluations

WORKDIR /app

# Use non-root user
USER nonroot

ENV PYTHONUNBUFFERED=1
ENV TERM=dumb
ENV NO_COLOR=1

# Use Python CLI as entrypoint
ENTRYPOINT ["python", "-m", "evaluations.cli"]
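
Note on the run example in the header comment: it sets `EVAL_TIER` and `CLOUD_RUN_TASK_INDEX`, and Cloud Run Jobs sets `CLOUD_RUN_TASK_INDEX` for each task of a job. How `evaluations.cli` consumes these values is not shown in this part of the diff, so the following is only a sketch of one plausible mapping from task index to a single eval. The `TIERS` table, model names, and `select_job` are hypothetical:

```python
import os
from collections.abc import Mapping

# Hypothetical tier table; the real tier configs live in the evaluations package.
TIERS: dict[str, list[str]] = {
    "standard": ["model-a", "model-b", "model-c"],
}


def select_job(env: Mapping[str, str]) -> str:
    """Pick the one eval this task should run, based on tier and task index."""
    tier = env.get("EVAL_TIER", "")
    if tier not in TIERS:
        raise ValueError(f"unknown EVAL_TIER: {tier!r}")
    index = int(env.get("CLOUD_RUN_TASK_INDEX", "0"))
    jobs = TIERS[tier]
    if not 0 <= index < len(jobs):
        raise IndexError(f"task index {index} out of range for tier {tier!r}")
    return jobs[index]


if __name__ == "__main__":
    # Reads the real environment, e.g. EVAL_TIER=standard CLOUD_RUN_TASK_INDEX=0
    print(select_job(os.environ))
```

Validating the index up front means a task count larger than the tier's job list fails fast instead of silently doing nothing.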