feat(model-eval-ingest): sync promoted bench evals #3258
Merged
14 commits
941c2fe feat(model-eval-ingest): sync promoted bench evals (lambertjosh)
e515b70 fix(model-eval-ingest): address CI and review feedback (lambertjosh)
8d91b92 fix(model-eval-ingest): resolve review follow-ups (lambertjosh)
e150d89 fix(model-eval-ingest): document local sync env wiring (lambertjosh)
e23ff4d docs(model-eval-ingest): document local worker setup (lambertjosh)
a09a73e docs(model-eval-ingest): align local setup with dev env sync (lambertjosh)
975147b fix(model-eval-ingest): clarify idempotent sync result (lambertjosh)
f94e285 fix(model-eval-ingest): retain deployed dev sync secret (lambertjosh)
c60f22e fix(model-eval-ingest): align audit storage conventions (lambertjosh)
702245e fix(model-eval-ingest): widen average cost storage (lambertjosh)
6469ebb fix(model-eval-ingest): rename admin benchmark surface (lambertjosh)
adb88ca Merge origin/main into rogue-timbale (lambertjosh)
78944b8 Merge origin/main into rogue-timbale (lambertjosh)
1517140 Merge branch 'main' into rogue-timbale (lambertjosh)
217 changes: 217 additions & 0 deletions
apps/web/src/app/admin/model-eval-ingest/ModelEvalIngestContent.tsx
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,217 @@ | ||
| 'use client'; | ||
| | ||
| import { useState } from 'react'; | ||
| import { useMutation, useQuery } from '@tanstack/react-query'; | ||
| import { ExternalLink, RefreshCw } from 'lucide-react'; | ||
| import { toast } from 'sonner'; | ||
| import { useTRPC } from '@/lib/trpc/utils'; | ||
| import { Button } from '@/components/ui/button'; | ||
| import { Card, CardContent, CardDescription, CardHeader, CardTitle } from '@/components/ui/card'; | ||
| import { | ||
| Table, | ||
| TableBody, | ||
| TableCell, | ||
| TableHead, | ||
| TableHeader, | ||
| TableRow, | ||
| } from '@/components/ui/table'; | ||
| | ||
| const PAGE_SIZE = 50; | ||
| | ||
| export function ModelEvalIngestContent() { | ||
| const trpc = useTRPC(); | ||
| const [page, setPage] = useState(1); | ||
| const historyQuery = useQuery( | ||
| trpc.admin.modelEvalIngest.list.queryOptions({ page, limit: PAGE_SIZE }) | ||
| ); | ||
| const syncMutation = useMutation( | ||
| trpc.admin.modelEvalIngest.syncNow.mutationOptions({ | ||
| onSuccess: result => { | ||
| toast.success(formatSyncToast(result)); | ||
| void historyQuery.refetch(); | ||
| }, | ||
| onError: error => toast.error(error.message || 'Model eval sync failed'), | ||
| }) | ||
| ); | ||
| const repullMutation = useMutation( | ||
| trpc.admin.modelEvalIngest.repullPromotion.mutationOptions({ | ||
| onSuccess: result => { | ||
| toast.success( | ||
| `Promotion re-pull fetched ${result.fetched} records and refreshed ${result.cacheRecomputes} caches` | ||
| ); | ||
| void historyQuery.refetch(); | ||
| }, | ||
| onError: error => toast.error(error.message || 'Promotion re-pull failed'), | ||
| }) | ||
| ); | ||
| | ||
| const rows = historyQuery.data?.rows ?? []; | ||
| const pagination = historyQuery.data?.pagination; | ||
| | ||
| return ( | ||
| <div className="flex w-full flex-col gap-6"> | ||
| <div className="flex flex-col justify-between gap-4 lg:flex-row lg:items-start"> | ||
| <div className="space-y-2"> | ||
| <h2 className="text-2xl font-bold">Model Benchmarks</h2> | ||
| <p className="text-muted-foreground max-w-4xl"> | ||
| Audit promoted kilo-bench evals that cloud has pulled, then refresh the public Kilo | ||
| Bench cache on demand. Bench remains the aggregate source; this table is the cloud-side | ||
| ingest history. | ||
| </p> | ||
| </div> | ||
| <Button onClick={() => syncMutation.mutate()} disabled={syncMutation.isPending}> | ||
| <RefreshCw className={`size-4 ${syncMutation.isPending ? 'animate-spin' : ''}`} /> | ||
| {syncMutation.isPending ? 'Syncing...' : 'Sync now'} | ||
| </Button> | ||
| </div> | ||
| | ||
| <Card> | ||
| <CardHeader> | ||
| <CardTitle>Promotion history</CardTitle> | ||
| <CardDescription> | ||
| Rows are append-only and deduplicated by bench eval name. Promoter email and bench links | ||
| stay admin-only here, never in the public model-stats cache. | ||
| </CardDescription> | ||
| </CardHeader> | ||
| <CardContent className="flex flex-col gap-4"> | ||
| <div className="overflow-x-auto rounded-lg border"> | ||
| <Table> | ||
| <TableHeader> | ||
| <TableRow> | ||
| <TableHead>Bench eval</TableHead> | ||
| <TableHead>Model</TableHead> | ||
| <TableHead>Task</TableHead> | ||
| <TableHead className="text-right">Score</TableHead> | ||
| <TableHead className="text-right">Trials</TableHead> | ||
| <TableHead>Promoted</TableHead> | ||
| <TableHead>Promoter</TableHead> | ||
| <TableHead>Ingested</TableHead> | ||
| <TableHead className="text-right">Action</TableHead> | ||
| </TableRow> | ||
| </TableHeader> | ||
| <TableBody> | ||
| {rows.length === 0 ? ( | ||
| <TableRow> | ||
| <TableCell colSpan={9} className="text-muted-foreground h-24 text-center"> | ||
| {historyQuery.isLoading | ||
| ? 'Loading ingest history...' | ||
| : 'No ingested promotions yet.'} | ||
| </TableCell> | ||
| </TableRow> | ||
| ) : ( | ||
| rows.map(row => ( | ||
| <TableRow key={row.id}> | ||
| <TableCell className="min-w-64"> | ||
| <a | ||
| href={row.benchEvalUrl} | ||
| target="_blank" | ||
| rel="noreferrer" | ||
| className="inline-flex max-w-80 items-center gap-1 text-sm text-blue-400 hover:text-blue-300" | ||
| > | ||
| <span className="truncate">{row.benchEvalName}</span> | ||
| <ExternalLink className="size-3 shrink-0" /> | ||
| </a> | ||
| </TableCell> | ||
| <TableCell className="min-w-56 font-mono text-xs"> | ||
| <div>{row.model}</div> | ||
| <div className="text-muted-foreground"> | ||
| {row.provider} | ||
| {row.variant ? ` / ${row.variant}` : ''} | ||
| </div> | ||
| </TableCell> | ||
| <TableCell>{row.taskSource}</TableCell> | ||
| <TableCell className="text-right font-mono tabular-nums"> | ||
| {formatScore(row.overallScore)} | ||
| <div className="text-muted-foreground text-xs"> | ||
| total {formatScore(row.totalScore)} | ||
| </div> | ||
| </TableCell> | ||
| <TableCell className="text-right font-mono tabular-nums"> | ||
| {row.nTotalTrials} | ||
| <div className="text-muted-foreground text-xs">{row.nErrored} errored</div> | ||
| </TableCell> | ||
| <TableCell className="whitespace-nowrap"> | ||
| {formatTimestamp(row.promotedAt)} | ||
| </TableCell> | ||
| <TableCell className="whitespace-nowrap text-sm"> | ||
| {row.promotedByEmail} | ||
| </TableCell> | ||
| <TableCell className="whitespace-nowrap"> | ||
| {formatTimestamp(row.createdAt)} | ||
| </TableCell> | ||
| <TableCell className="text-right"> | ||
| <Button | ||
| variant="secondary" | ||
| size="sm" | ||
| onClick={() => | ||
| repullMutation.mutate({ promotionName: row.benchEvalName }) | ||
| } | ||
| disabled={repullMutation.isPending} | ||
| > | ||
| Repull | ||
| </Button> | ||
| </TableCell> | ||
| </TableRow> | ||
| )) | ||
| )} | ||
| </TableBody> | ||
| </Table> | ||
| </div> | ||
| | ||
| <div className="flex items-center justify-between gap-3 text-sm"> | ||
| <p className="text-muted-foreground"> | ||
| {pagination ? `${pagination.total} ingested promotion rows` : 'Loading row count...'} | ||
| </p> | ||
| <div className="flex gap-2"> | ||
| <Button | ||
| variant="secondary" | ||
| onClick={() => setPage(current => Math.max(1, current - 1))} | ||
| disabled={page <= 1 || historyQuery.isFetching} | ||
| > | ||
| Previous | ||
| </Button> | ||
| <Button | ||
| variant="secondary" | ||
| onClick={() => setPage(current => current + 1)} | ||
| disabled={!pagination || page >= pagination.totalPages || historyQuery.isFetching} | ||
| > | ||
| Next | ||
| </Button> | ||
| </div> | ||
| </div> | ||
| </CardContent> | ||
| </Card> | ||
| </div> | ||
| ); | ||
| } | ||
| | ||
| function formatTimestamp(value: string): string { | ||
| return new Date(value).toLocaleString(); | ||
| } | ||
| | ||
| function formatScore(value: number): string { | ||
| return value.toFixed(4); | ||
| } | ||
| | ||
| function formatSyncToast(result: { | ||
| inserted: number; | ||
| alreadyHad: number; | ||
| fetched: number; | ||
| }): string { | ||
| if (result.inserted > 0) { | ||
| const inserted = `Bench sync inserted ${formatCount(result.inserted, 'new promotion')}.`; | ||
| return result.alreadyHad > 0 | ||
| ? `${inserted} ${formatCount(result.alreadyHad, 'existing promotion')} rechecked.` | ||
| : inserted; | ||
| } | ||
| | ||
| if (result.alreadyHad > 0) { | ||
| return `Bench sync is up to date; ${formatCount(result.alreadyHad, 'existing promotion')} rechecked.`; | ||
| } | ||
| | ||
| return `Bench sync is up to date; ${formatCount(result.fetched, 'promotion')} returned.`; | ||
| } | ||
| | ||
| function formatCount(count: number, label: string): string { | ||
| return `${count} ${label}${count === 1 ? '' : 's'}`; | ||
| } |
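
To make the toast wording concrete, here is a small standalone sketch that copies `formatSyncToast` and `formatCount` from the diff above and runs them on three sync results. The `SyncCounts` alias and the input values are assumptions made up for illustration; they are not part of the PR.

```typescript
// Copied from the diff above so the example runs on its own.
function formatCount(count: number, label: string): string {
  return `${count} ${label}${count === 1 ? '' : 's'}`;
}

// Hypothetical alias for the inline parameter type used in the PR.
type SyncCounts = { inserted: number; alreadyHad: number; fetched: number };

function formatSyncToast(result: SyncCounts): string {
  if (result.inserted > 0) {
    const inserted = `Bench sync inserted ${formatCount(result.inserted, 'new promotion')}.`;
    return result.alreadyHad > 0
      ? `${inserted} ${formatCount(result.alreadyHad, 'existing promotion')} rechecked.`
      : inserted;
  }
  if (result.alreadyHad > 0) {
    return `Bench sync is up to date; ${formatCount(result.alreadyHad, 'existing promotion')} rechecked.`;
  }
  return `Bench sync is up to date; ${formatCount(result.fetched, 'promotion')} returned.`;
}

// Invented inputs covering the three branches.
const mixed = formatSyncToast({ inserted: 2, alreadyHad: 1, fetched: 3 });
const idempotent = formatSyncToast({ inserted: 0, alreadyHad: 3, fetched: 3 });
const empty = formatSyncToast({ inserted: 0, alreadyHad: 0, fetched: 0 });

console.log(mixed); // Bench sync inserted 2 new promotions. 1 existing promotion rechecked.
console.log(idempotent); // Bench sync is up to date; 3 existing promotions rechecked.
console.log(empty); // Bench sync is up to date; 0 promotions returned.
```

The second case is the idempotent path: a repeat sync that inserts nothing still reports how many existing promotions were rechecked.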
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| import { BreadcrumbItem, BreadcrumbPage } from '@/components/ui/breadcrumb'; | ||
| import AdminPage from '@/app/admin/components/AdminPage'; | ||
| import { ModelEvalIngestContent } from './ModelEvalIngestContent'; | ||
| | ||
| const breadcrumbs = ( | ||
| <BreadcrumbItem> | ||
| <BreadcrumbPage>Model Benchmarks</BreadcrumbPage> | ||
| </BreadcrumbItem> | ||
| ); | ||
| | ||
| export default function ModelEvalIngestPage() { | ||
| return ( | ||
| <AdminPage breadcrumbs={breadcrumbs}> | ||
| <ModelEvalIngestContent /> | ||
| </AdminPage> | ||
| ); | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| import 'server-only'; | ||
| import * as z from 'zod'; | ||
| import { INTERNAL_API_SECRET, MODEL_EVAL_INGEST_URL } from '@/lib/config.server'; | ||
| | ||
| const ModelEvalSyncResultSchema = z.object({ | ||
| success: z.literal(true), | ||
| inserted: z.number().int().nonnegative(), | ||
| alreadyHad: z.number().int().nonnegative(), | ||
| cacheRecomputes: z.number().int().nonnegative(), | ||
| fetched: z.number().int().nonnegative(), | ||
| }); | ||
| const ModelEvalSyncErrorSchema = z.object({ error: z.string().optional() }); | ||
| | ||
| export type ModelEvalSyncResult = z.infer<typeof ModelEvalSyncResultSchema>; | ||
| | ||
| type ModelEvalSyncRequest = { | ||
| promotionName?: string; | ||
| }; | ||
| | ||
| export async function syncModelEvalPromotions( | ||
| request: ModelEvalSyncRequest = {} | ||
| ): Promise<ModelEvalSyncResult> { | ||
| if (!MODEL_EVAL_INGEST_URL) { | ||
| throw new Error('MODEL_EVAL_INGEST_URL is not configured'); | ||
| } | ||
| | ||
| const response = await fetch(`${MODEL_EVAL_INGEST_URL}/internal/sync`, { | ||
| method: 'POST', | ||
| headers: { | ||
| 'Content-Type': 'application/json', | ||
| 'x-internal-api-key': INTERNAL_API_SECRET, | ||
| }, | ||
| body: JSON.stringify(request), | ||
| }); | ||
| | ||
| const body: unknown = await response.json(); | ||
| if (!response.ok) { | ||
| const errorBody = ModelEvalSyncErrorSchema.safeParse(body); | ||
| throw new Error( | ||
| errorBody.success && errorBody.data.error ? errorBody.data.error : `HTTP ${response.status}` | ||
| ); | ||
| } | ||
| | ||
| return ModelEvalSyncResultSchema.parse(body); | ||
| } | ||
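
One edge case in the error path: as written in the diff, `response.json()` runs before the `response.ok` check, so a failed response with a non-JSON body (for example an HTML error page from a proxy) would reject with a JSON parse error instead of producing `HTTP <status>`. Below is a minimal sketch of a more tolerant error path, assuming the body is read as text first. `extractSyncError` is a hypothetical helper that is not part of the PR, and it uses a hand-written check in place of the zod schema only to stay dependency-free.

```typescript
// Hypothetical helper (not in the PR): derive an error message from a failed
// sync response whose body has already been read as text.
function extractSyncError(status: number, rawBody: string): string {
  try {
    const parsed: unknown = JSON.parse(rawBody);
    if (typeof parsed === 'object' && parsed !== null && 'error' in parsed) {
      const message = (parsed as { error?: unknown }).error;
      if (typeof message === 'string' && message.length > 0) {
        return message;
      }
    }
  } catch {
    // Body was not valid JSON; fall back to the status code below.
  }
  return `HTTP ${status}`;
}

// Invented inputs for illustration.
const fromJson = extractSyncError(401, '{"error":"Unauthorized"}');
const fromHtml = extractSyncError(502, '<html>Bad Gateway</html>');
const fromEmptyObject = extractSyncError(500, '{}');

console.log(fromJson); // Unauthorized
console.log(fromHtml); // HTTP 502
console.log(fromEmptyObject); // HTTP 500
```

In the client this could be called with `await response.text()` when `response.ok` is false, leaving the success path and its schema parse unchanged.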
I thought you had moved to using Cloudflare Access service credentials? If so, you'd need to plumb through your client_id/secret and add the headers:
@jrf0110 - I did for the rest of it. This is authentication between the Vercel app and the Worker, used to manually trigger a re-scan for promoted models. (Otherwise it's 15 min.)
This is an existing secret/pattern used by other app->worker requests, so I just reused it.
I can swap it out and implement a new Client Access based approach if we want.
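
For context on the alternative discussed in this thread: Cloudflare Access service tokens are sent as the `CF-Access-Client-Id` and `CF-Access-Client-Secret` request headers. Below is a minimal sketch of how the sync request headers could be built under that approach. `buildAccessHeaders`, `AccessCredentials`, and the credential values are hypothetical and do not appear in the PR, which keeps the existing `x-internal-api-key` header.

```typescript
// Hypothetical shape for a Cloudflare Access service token pair.
type AccessCredentials = { clientId: string; clientSecret: string };

// Hypothetical helper (not in the PR): build request headers that carry a
// Cloudflare Access service token instead of the shared internal API secret.
function buildAccessHeaders(credentials: AccessCredentials): Record<string, string> {
  if (!credentials.clientId || !credentials.clientSecret) {
    throw new Error('Cloudflare Access service token is not configured');
  }
  return {
    'Content-Type': 'application/json',
    'CF-Access-Client-Id': credentials.clientId,
    'CF-Access-Client-Secret': credentials.clientSecret,
  };
}

// Placeholder values for illustration only.
const headers = buildAccessHeaders({
  clientId: 'example-id.access',
  clientSecret: 'example-secret',
});

console.log(Object.keys(headers).join(', '));
```

Failing fast on missing credentials mirrors the `MODEL_EVAL_INGEST_URL` guard in the sync client, so a misconfigured environment surfaces as a clear error rather than a rejected request.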