
Conversation

@andrew-aisi (Contributor) commented May 27, 2025

This PR contains:

  • New features

What is the current behavior? (You can also link to an open issue here)

Users cannot set a price-based limit for samples (#1152).

What is the new behavior?

Users can set a price limit for samples if they provide a JSON file that maps model names to per-token prices for output, cached input, and unique (uncached) input. As an example, the model_prices_and_context_window.json file from litellm provides this information.
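For illustration, a minimal pricing file might be written like this (the field names follow litellm's model_prices_and_context_window.json and are illustrative - I'm not asserting this is the exact schema the PR validates):

import json
from pathlib import Path

# Hypothetical example: per-token USD prices for one model, in the style of
# litellm's pricing file. The keys are assumptions, not the PR's enforced schema.
prices = {
    "gpt-4o": {
        "input_cost_per_token": 2.5e-06,          # unique (uncached) input
        "cache_read_input_token_cost": 1.25e-06,  # provider-cached input
        "output_cost_per_token": 1.0e-05,         # output
    }
}
Path("model_prices.json").write_text(json.dumps(prices, indent=2))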

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No

Other information:

My goal here is to allow for cost-normalized comparisons, not to ensure end users are actually billed a specified amount. As such, this design ignores the fact that Inspect could load model responses from its own internal cache. Assuming token prices are correctly specified, a sample should never cost the user more than the limit they've set, but it may cost less due to local caching. Note that provider-side caching is accounted for in the cost calculations.
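As a sketch of the idealized-cost calculation described above (the ModelUsage field names are my assumption of the relevant ones; adjust to whatever the PR actually uses):

from decimal import Decimal

from inspect_ai.model import ModelUsage


def idealized_sample_cost(usage: ModelUsage, price: dict[str, Decimal]) -> Decimal:
    """Sketch: price output, provider-cached input, and unique input separately.

    `price` is a hypothetical mapping with "input", "cached_input", and
    "output" per-token USD prices; local (Inspect) cache hits are not
    discounted here, matching the design goal above.
    """
    cached = usage.input_tokens_cache_read or 0
    unique_input = usage.input_tokens - cached  # assumes input_tokens includes cache reads
    return (
        Decimal(unique_input) * price["input"]
        + Decimal(cached) * price["cached_input"]
        + Decimal(usage.output_tokens) * price["output"]
    )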

I don't like the current setup where the pricing file path is passed all the way down to individual samples, each of which has to read it. I'd prefer to store pricing info globally, but wasn't sure how best to do so - any guidance would be appreciated.

In this design I was thinking it could be good to support a scenario where multiple models are used for a single sample (e.g., with a multi-agent design), but I'm not sure this is something Inspect actually supports.

This is related to, but doesn't solve, #980. I think it could be extended to do so fairly easily, but I wanted to get initial feedback on this approach first. I think that issue might be best solved by tracking actual cost details via the API rather than computing them from tokens, but perhaps this "idealized cost" could be worth reporting to the user (I'm inclined to think it might be more confusing).

Leaving this as a draft for now due to the above, but also this still needs:

  • Find a better way to store pricing information globally
  • Documentation
  • Validate web UI changes actually work as intended
  • Think through and address pricing edge cases I've missed (e.g., if the price of reasoning output tokens differs from regular output tokens)

@craigwalton-dsit (Collaborator) left a comment

Thanks for pioneering this Andrew, nice job! As requested, I've done a pass over this and have left some initial thoughts as comments.

I think the biggest remaining questions are around token price schemas, validation, and whether we support all the different token cost methodologies (e.g. reasoning, cache writes, batching, etc.).

return _PriceLimit(limit, price_file)


def record_model_usage_price(usage: ModelUsage) -> None:
Collaborator:

I think we'll have to also accept a model: str parameter here, as we'll need to cost the tokens based on which model produced them (I appreciate you mentioned this in a todo). It is entirely possible to use multiple models within a single sample - the easiest way is get_model("other model").generate().
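Something along these lines, sketch-wise:

def record_model_usage_price(model: str, usage: ModelUsage) -> None:
    # Cost the usage against the prices configured for this particular model,
    # so samples that call get_model(...).generate() with several models are
    # priced correctly.
    ...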

I'm also weighing up whether we should merge record_model_usage_price() and record_model_usage().

@andrew-aisi (Contributor, Author) commented Jun 4, 2025:

I'm also weighing up whether we should merge record_model_usage_price() and record_model_usage().

I've been thinking about this too. I think it could work to just use a single tree tracking token usage and then in the _CostLimit and _TokenLimit implementations check if the limits are exceeded. But I wasn't sure if this would cause problems. Will hack on this and see if I get somewhere.

Collaborator:

Ah, I meant you could keep the two tree structures (it might be more understandable that way), but just combine record_model_usage_price() and record_model_usage() and call out to the two trees within the one function.
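i.e. a sketch along these lines (the helper names are placeholders, not existing functions):

def record_model_usage(model: str, usage: ModelUsage) -> None:
    # One entry point, two internal trees: the existing token-usage tree and
    # the new cost tree added in this PR.
    _record_token_usage(usage)        # placeholder for the token tree update
    _record_usage_cost(model, usage)  # placeholder for the cost tree update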

prices = json.load(f)

# TODO: validate price format is correct!
# TODO: Can we load this file elsewhere instead of in every sample?
Collaborator:

I think this is a good idea, rather than loading + validating the file every time we instantiate a _PriceLimit.

I'd lean towards putting this in its own module (e.g. _cost.py or _price.py).

Something as simple as a global variable in that module would be my starting point. Here's a precedent of this sort of thing

global _inspect_ai_eps_loaded_all
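Concretely, the starting point could be as simple as this sketch of a hypothetical _cost.py (names are placeholders):

import json

# Module-level cache: loaded and validated once, then shared by all samples.
_model_prices: dict | None = None


def init_model_prices(path: str) -> None:
    global _model_prices
    with open(path) as f:
        _model_prices = json.load(f)
        # validation of the price schema would happen here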

Contributor (Author):

I used an lru cache within the _cost.py file to cache reads of the file. But I'm wondering if it might be better to initialize a global with the path to the costs file (and perhaps another global with the contents) instead of passing that path around everywhere.
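Roughly, the caching looks like this (a simplified sketch, not the exact code in the PR):

import json
from functools import lru_cache


@lru_cache(maxsize=1)
def load_model_prices(path: str) -> dict:
    # Repeated calls with the same path return the cached dict instead of
    # re-reading the file; the path still has to be threaded through to each
    # caller, which is the part I'd like to avoid.
    with open(path) as f:
        return json.load(f)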

set_model_usage(model, usage, sample_model_usage_context_var.get(None))
set_model_usage(model, usage, model_usage_context_var.get(None))
record_model_usage(usage)
record_model_usage_price(usage)
Collaborator:

I think this record_and_check_model_usage() function is only called from _model.generate() if the inspect cache was not hit - which I don't think is what you were hoping for?

return existing, event

I'd recommend double checking this claim though - this is just based on my skimming the source.

}
price_file.write_text(json.dumps(data))

model = get_model(
Contributor (Author):

@craigwalton-dsit any tips for generating a single trajectory using two (mock) models in a unit test? I don't think I can just do eval(model=[model1, model2]) if I want to generate a single trajectory using both models

Collaborator:

My first attempt would be something like defining your own solver (rather than using the default generate()) and, within the solver, doing something like:

get_model("mockllm/model1").generate(...)
get_model("mockllm/model2").generate(...)

I've not tried this before, but it looks like mockllm doesn't care about what model name you give it.
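A rough, untested sketch of that shape (assuming mockllm accepts arbitrary model names):

from inspect_ai.model import get_model
from inspect_ai.solver import Generate, TaskState, solver


@solver
def two_model_solver():
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # Generate with two different mock models within the same sample so
        # the cost limit has to aggregate usage across both of them.
        await get_model("mockllm/model1").generate(state.input_text)
        await get_model("mockllm/model2").generate(state.input_text)
        return state

    return solve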

@andrew-aisi force-pushed the feat/cost_cutoff branch 4 times, most recently from 3d397c5 to e810133 on June 9, 2025 at 02:29
@andrew-aisi (Contributor, Author) commented Jun 9, 2025

I incorporated the suggested changes and squashed the commits to keep the history from getting too messy. I think this is now working, but it looks like I'll need to fix a few more things to make CI happy.

Beyond the basic functionality, I have a few open questions and could use help writing docs (or I can get to that eventually):

  1. Should we show the running/final cost per sample in the UI somewhere? Right now the cost numbers aren't shown or stored anywhere unless the limit is exceeded.
  2. The details of some usage (messages and tokens) are stored in the active sample (e.g., here) while other usage info (e.g., working time) isn't. It seems somewhat redundant with the main limit code, but I tried to support this in 22e6446. Since I'm not sure of the point of it, I didn't test it - I could remove it if it's unnecessary. Maybe it's connected with showing these things in the UI?
  3. I'm still a bit confused by caching. I checked that the limit was working as expected based on the token usage info I get from inspect eval (e.g., with cache reads labeled "CR"). I think this is provider-side caching (which is a discount I do want to account for here), but I didn't get local inspect caching to work. In 411c573 I tried to add support for treating these cache hits as non-free and to add a test case, but I got stuck on the test case. Are there any existing tests with multiple models that I could build off?

Thanks and let me know if there's anything else you'd want changed here.

@craigwalton-dsit (Collaborator) commented:

Apologies for the delay in getting back to you Andrew. This is progressing well.

  1. I'd lean toward keeping this simple and not showing it in the TUI (e.g. rich) for now, but I'm happy to go with whatever JJ thinks. This might enable you to revert some changes.
  2. You can remove this. The reason the token and message limits are stored on the TaskState (and that they're mutable) is to preserve backward compatibility. (Before we had "scoped" limits, users used to modify TaskState.token_limit on the fly as a mechanism to, for example, limit how many tokens a sub-agent could use.)
  3. I believe that tokens shown as "CR" are indeed provider-side cache reads, and it sounds reasonable that you're accounting for this. In terms of tests using multiple models, the only ones coming to mind are ones with a distinct scorer model (which won't help you, as you're not tracking cost on the scoring). But see my inline comment about trying mockllm with multiple model names.

I'll do a pass over the code now and see if I have anything else to add. For reference, I have 3 weeks remaining at AISI and am happy to advise on this feature, or contribute myself - whatever works best for you.

)
@click.option(
"--cost-limit",
type=float,
Collaborator:

I wonder if this can be Decimal too? No worries if not.
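For what it's worth, click generally accepts a plain conversion callable as the type, so something like this sketch might be enough (though invalid input may surface as a raw decimal exception rather than a friendly usage error unless it's wrapped in a custom ParamType):

from decimal import Decimal

import click


@click.command()
@click.option(
    "--cost-limit",
    type=Decimal,  # click wraps the callable, so "1.50" parses to Decimal("1.50")
    default=None,
    help="Limit on idealized inference cost (in USD) for each sample.",
)
def main(cost_limit: Decimal | None) -> None:
    click.echo(f"cost limit: {cost_limit}")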

@click.option(
"--cost-limit",
type=float,
help="Limit on idealized inference cost for each sample, assuming no local caching (treats the local inspect cache reads as real token spend). Must be used with a cost file (--cost-file)",
Collaborator:

You mention USD in some of the param docstrings, which I think is reasonable, but if we're settling on USD as the cost unit (rather than leaving it undefined and letting the user decide based on their cost file) we should maybe mention USD here too, e.g.

Suggested change
help="Limit on idealized inference cost for each sample, assuming no local caching (treats the local inspect cache reads as real token spend). Must be used with a cost file (--cost-file)",
help="Limit on idealized inference cost (in USD) for each sample, assuming no local caching (treats the local inspect cache reads as real token spend). Must be used with a cost file (--cost-file)",

"type": "number"
},
{
"type": "string"
Collaborator:

Is the string type required in the schema, or will number suffice? Haven't run this myself, so appreciate there may be a good reason for it!

active.total_messages = total_messages


def set_active_sample_cost_limit(cost_limit: Decimal | None) -> None:
Collaborator:

I think these can be removed if we decide to not show the live cost/cost limit in the TUI (as asked in your question 1.)

total_cost = calculate_model_usage_cost(
{cache_entry.model: existing.usage}
)
record_model_usage_cost(total_cost)
@craigwalton-dsit (Collaborator) commented Jun 13, 2025:

Should we also be checking the cost limit here too?

The fact that this takes a different code path from the case where we don't use the local cache highlights that we're taking a different approach with cost than we do with tokens: token usage does not include locally cached tokens.

Just a note to say I think we'll have to make the distinction crystal clear in the docs site or it could lead to surprises. I note you've already made this clear in the docstrings.
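i.e. roughly (reusing names from this diff; check_model_usage_cost() is a hypothetical stand-in for whatever raises when the limit is exceeded):

# Sketch: mirror the non-cached path by also enforcing the limit after
# recording cost for a local cache hit.
total_cost = calculate_model_usage_cost({cache_entry.model: existing.usage})
record_model_usage_cost(total_cost)
check_model_usage_cost()  # hypothetical limit check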


@property
def cost_limit(self) -> Decimal | None:
"""Limit on total messages allowed per conversation."""
Collaborator:

I think we're removing this as discussed in the other comment, but if not something along these lines:

Suggested change
"""Limit on total messages allowed per conversation."""
"""Limit on total token cost allowed per conversation."""

if TYPE_CHECKING:
from inspect_ai.model import ModelUsage

getcontext().prec = 10
Collaborator:

I think this will affect the precision of all decimals in the process. While there aren't any other usages of the decimal module in Inspect itself, there could potentially be some in user code. I don't think this is necessarily a problem, just something to be aware of. I presume this is set just to reduce the data storage requirements and compute requirements compared to the default 28?
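If scoping the precision ever becomes necessary, a local context is one option (a sketch, not a request to change it):

from decimal import Decimal, localcontext


def cost_at_reduced_precision(tokens: int, price_per_token: Decimal) -> Decimal:
    # Only arithmetic inside this block uses prec=10; code elsewhere keeps
    # the default context (prec=28).
    with localcontext() as ctx:
        ctx.prec = 10
        return Decimal(tokens) * price_per_token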
