Cost freezing take 3 #67

Merged
ca16 merged 1 commit into main from chloea-cost-freezing-take-3
Aug 22, 2025

Conversation

@ca16 ca16 commented Aug 22, 2025

Related to https://github.com/allenai/astabench-issues/issues/391. Builds on #63, #66.

I initially thought pinning to 1.67.4.post1 worked, but I was wrong about that (I thought I'd tested with this commit, but it looks like CI_tests didn't run for it, and I missed that).

CI_test does seem to pass for this, which uses the same range as this PR.

Why does the range work when pinning doesn't?

I think there are two different, incompatible litellm requirements, and neither is in the core requirements of asta-bench: sqa wants 1.68.0, as @mdarcy220 remembered, and futurehouse wants 1.67.4.post1. AFAICT we build separate sqa and futurehouse images on top of a base image that has the core requirements, so we need to be able to build either one, but we never try to install both sets of requirements at the same time. I think that's why the range works and pinning doesn't: a single pin in the core requirements conflicts with at least one of the two, while a range can admit both. So this PR shifts to @jbragg's preference of using a range.
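To make the pin-versus-range reasoning concrete, here is a minimal sketch. It assumes a hypothetical inclusive core range of 1.67.4.post1 through 1.68.0 (the actual bounds used in this PR are not shown in this conversation) and uses a simplified version parser that only handles the two version shapes mentioned above. It is an illustration of the argument, not how pip resolves dependencies.

```python
import re


def parse_version(version: str) -> tuple[int, int, int, int]:
    """Parse 'X.Y.Z' or 'X.Y.Z.postN' into a sortable tuple.

    Simplified for illustration; this is not a full PEP 440 parser.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?", version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch, post = match.groups()
    # A post-release sorts after its base release, so "no post" maps to -1.
    return (int(major), int(minor), int(patch), int(post) if post is not None else -1)


# Exact litellm versions each extra wants, as described in the comment above.
EXTRA_PINS = {"sqa": "1.68.0", "futurehouse": "1.67.4.post1"}

# Hypothetical core constraints, chosen only to illustrate the argument.
CORE_PIN = "1.67.4.post1"
CORE_RANGE = ("1.67.4.post1", "1.68.0")  # assumed inclusive bounds


def pin_allows(core_pin: str, extra_pin: str) -> bool:
    """An exact core pin is only compatible with an identical extra pin."""
    return parse_version(core_pin) == parse_version(extra_pin)


def range_allows(bounds: tuple[str, str], extra_pin: str) -> bool:
    """A core range is compatible with any extra pin that falls inside it."""
    low, high = bounds
    return parse_version(low) <= parse_version(extra_pin) <= parse_version(high)


# Each image installs core plus exactly one extra, so check each extra alone.
pin_results = {name: pin_allows(CORE_PIN, pin) for name, pin in EXTRA_PINS.items()}
range_results = {name: range_allows(CORE_RANGE, pin) for name, pin in EXTRA_PINS.items()}

print("core pin:  ", pin_results)
print("core range:", range_results)
```

Under these assumptions the exact pin can only build the futurehouse image, while the range can build either image, because each image only ever combines the core requirements with one of the two extras.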

@ca16 ca16 requested review from jbragg and mdarcy220 August 22, 2025 00:46

@mdarcy220 mdarcy220 left a comment

3️⃣


@jbragg jbragg left a comment

Ofc, like Mike, I do want fixed behavior during scoring.

ca16 commented Aug 22, 2025

summary stats from test run:

    "overall": {
      "score": 0.3148783028247495,
      "score_stderr": null,
      "cost": 0.03517972139633324,
      "cost_stderr": null
    },
    "tag/lit": {
      "score": 0.359073892718989,
      "score_stderr": null,
      "cost": 0.0432097658085206,
      "cost_stderr": null
    },
    "tag/data": {
      "score": 0.2692091729748633,
      "score_stderr": null,
      "cost": 0.011127473221757321,
      "cost_stderr": null
    },
    "tag/code": {
      "score": 0.5052960102960103,
      "score_stderr": null,
      "cost": 0.05169765155505505,
      "cost_stderr": null
    },
    "tag/discovery": {
      "score": 0.12593413530913533,
      "score_stderr": null,
      "cost": 0.034683994999999995,
      "cost_stderr": null
    },
    "task/paper_finder_test": {
      "score": 0.1680882137565247,
      "score_stderr": 0.01736167991412297,
      "cost": 0.0384160127340824,
      "cost_stderr": 0.004496229162562603
    },
    "task/paper_finder_litqa2_test": {
      "score": 0.6133333333333333,
      "score_stderr": 0.05661099544085763,
      "cost": 0.11436817,
      "cost_stderr": 0.017359728699967197
    },
    "task/sqa_test": {
      "score": 0.2672615497644688,
      "score_stderr": 0.03773623587370571,
      "cost": 0.0270534035,
      "cost_stderr": 0.0019422575674411598
    },
    "task/arxivdigestables_test": {
      "score": 0.3209458073549626,
      "score_stderr": 0.016795287448771803,
      "cost": 0.012757592,
      "cost_stderr": 0.000598238094979927
    },
    "task/litqa2_test": {
      "score": 0.7466666666666667,
      "score_stderr": 0.05055844297598726,
      "cost": 0.07485594,
      "cost_stderr": 0.014531272631798089
    },
    "task/discoverybench_test": {
      "score": 0.2692091729748633,
      "score_stderr": 0.024402794451474107,
      "cost": 0.011127473221757321,
      "cost_stderr": 0.0006222390294466313
    },
    "task/core_bench_test": {
      "score": 0.4594594594594595,
      "score_stderr": 0.08305895907471071,
      "cost": 0.04720825405405405,
      "cost_stderr": 0.007319669323511871
    },
    "task/ds1000_test": {
      "score": 0.71,
      "score_stderr": 0.015133811749341808,
      "cost": 0.002989897277777778,
      "cost_stderr": 0.0000549948574693956
    },
    "task/e2e_discovery_test": {
      "score": 0.09482323232323234,
      "score_stderr": 0.03868709687867958,
      "cost": 0.02972574375,
      "cost_stderr": 0.0029628950153995407
    },
    "task/e2e_discovery_hard_test": {
      "score": 0.1570450382950383,
      "score_stderr": 0.04217558424596078,
      "cost": 0.03964224625,
      "cost_stderr": 0.004099116765299048
    },
    "task/super_test": {
      "score": 0.3464285714285715,
      "score_stderr": 0.0673848147211148,
      "cost": 0.10489480333333333,
      "cost_stderr": 0.023534508573437658
    }
  }
}
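As a quick sanity check on how these numbers fit together: the "overall" score and cost appear to be the unweighted mean of the four "tag/..." entries. This is an observation from the values above, not something stated in this PR, and the helper name `macro_average` is made up for the sketch.

```python
import math

# Values copied from the summary stats above.
tag_scores = {
    "tag/lit": 0.359073892718989,
    "tag/data": 0.2692091729748633,
    "tag/code": 0.5052960102960103,
    "tag/discovery": 0.12593413530913533,
}
tag_costs = {
    "tag/lit": 0.0432097658085206,
    "tag/data": 0.011127473221757321,
    "tag/code": 0.05169765155505505,
    "tag/discovery": 0.034683994999999995,
}
reported_overall = {"score": 0.3148783028247495, "cost": 0.03517972139633324}


def macro_average(values: dict[str, float]) -> float:
    """Unweighted mean over categories (hypothetical helper for this check)."""
    return sum(values.values()) / len(values)


# Compare with a tolerance rather than exact float equality.
score_matches = math.isclose(
    macro_average(tag_scores), reported_overall["score"], rel_tol=1e-9
)
cost_matches = math.isclose(
    macro_average(tag_costs), reported_overall["cost"], rel_tol=1e-9
)

print("overall score is mean of tag scores:", score_matches)
print("overall cost is mean of tag costs:", cost_matches)
```

If the aggregation is a plain mean of category means, that would also be consistent with the aggregate entries reporting `null` for the stderr fields, though the conversation does not confirm how they are computed.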

@ca16 ca16 merged commit 960c611 into main Aug 22, 2025
4 checks passed
@ca16 ca16 deleted the chloea-cost-freezing-take-3 branch August 22, 2025 01:22