
Conversation

@andrew-aisi (Contributor) commented May 27, 2025

This PR contains:

  • New features

What is the current behavior? (You can also link to an open issue here)

Users cannot set a price-based limit for samples (#1152).

What is the new behavior?

Users can set a price limit for samples if they provide a JSON file that maps model names to per-token prices for output, cached input, and unique (uncached) input. As an example, the model_prices_and_context_window.json file from litellm provides this information.
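For illustration, a minimal pricing file might be written like this (the field names follow litellm's model_prices_and_context_window.json and are illustrative - I'm not asserting this is the exact schema the PR validates):

import json
from pathlib import Path

# Hypothetical example: per-token USD prices for one model, in the style of
# litellm's pricing file. The keys are assumptions, not the PR's enforced schema.
prices = {
    "gpt-4o": {
        "input_cost_per_token": 2.5e-06,          # unique (uncached) input
        "cache_read_input_token_cost": 1.25e-06,  # provider-cached input
        "output_cost_per_token": 1.0e-05,         # output
    }
}
Path("model_prices.json").write_text(json.dumps(prices, indent=2))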

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No

Other information:

My goal here is to allow for cost-normalized comparisons, not to ensure end users are actually billed a specified amount. As such, this design ignores the fact that Inspect could load model responses from its own internal cache. Assuming token prices are correctly specified, a sample should never cost the user more than the limit they've set, but it may cost less due to local caching. Note that provider-side caching is accounted for in the cost calculations.
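As a sketch of the idealized-cost calculation described above (the ModelUsage field names are my assumption of the relevant ones; adjust to whatever the PR actually uses):

from decimal import Decimal

from inspect_ai.model import ModelUsage


def idealized_sample_cost(usage: ModelUsage, price: dict[str, Decimal]) -> Decimal:
    """Sketch: price output, provider-cached input, and unique input separately.

    `price` is a hypothetical mapping with "input", "cached_input", and
    "output" per-token USD prices; local (Inspect) cache hits are not
    discounted here, matching the design goal above.
    """
    cached = usage.input_tokens_cache_read or 0
    unique_input = usage.input_tokens - cached  # assumes input_tokens includes cache reads
    return (
        Decimal(unique_input) * price["input"]
        + Decimal(cached) * price["cached_input"]
        + Decimal(usage.output_tokens) * price["output"]
    )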

I don't like the current setup where the pricing file path is passed all the way down to individual samples, each of which has to read it. I'd prefer to store pricing info globally, but wasn't sure how best to do so - any guidance would be appreciated.

In this design I was thinking it could be good to support a scenario where multiple models are used for a single sample (e.g., with a multi-agent design), but I'm not sure this is something Inspect actually supports.

This is related to, but doesn't solve, #980. I think it could be extended to do so fairly easily, but I wanted to get initial feedback on this approach first. I think that issue might be best solved by tracking actual cost details via the API rather than computing them from tokens, but perhaps this "idealized cost" could be worth reporting to the user (I'm inclined to think it might be more confusing).

Leaving this as a draft for now due to the above, but also this still needs:

  • Find a better way to store pricing information globally
  • Documentation
  • Validate web UI changes actually work as intended
  • Think through and address pricing edge cases I've missed (e.g., if the price of reasoning output tokens differs from regular output tokens)

@craigwalton-dsit (Collaborator) left a comment

Thanks for pioneering this Andrew, nice job! As requested, I've done a pass over this and have left some initial thoughts as comments.

I think the biggest remaining questions are around token price schemas, validation, and whether we support all the different token cost methodologies (e.g. reasoning, cache writes, batching, etc.).

return _PriceLimit(limit, price_file)


def record_model_usage_price(usage: ModelUsage) -> None:
Collaborator:

I think we'll have to also accept a model: str parameter here, as we'll need to cost the tokens based on which model produced them (I appreciate you mentioned this in a todo). It is entirely possible to use multiple models within a single sample - the easiest way is get_model("other model").generate().
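Something along these lines, sketch-wise:

def record_model_usage_price(model: str, usage: ModelUsage) -> None:
    # Cost the usage against the prices configured for this particular model,
    # so samples that call get_model(...).generate() with several models are
    # priced correctly.
    ...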

I'm also weighing up whether we should merge record_model_usage_price() and record_model_usage().

@andrew-aisi (Contributor, Author) commented Jun 4, 2025:

I'm also weighing up whether we should merge record_model_usage_price() and record_model_usage().

I've been thinking about this too. I think it could work to just use a single tree tracking token usage and then in the _CostLimit and _TokenLimit implementations check if the limits are exceeded. But I wasn't sure if this would cause problems. Will hack on this and see if I get somewhere.

Collaborator:

Ah, I meant you could keep the two tree structures (it might be more understandable that way), but just combine record_model_usage_price() and record_model_usage() and call out to the two trees within the one function.
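i.e. a sketch along these lines (the helper names are placeholders, not existing functions):

def record_model_usage(model: str, usage: ModelUsage) -> None:
    # One entry point, two internal trees: the existing token-usage tree and
    # the new cost tree added in this PR.
    _record_token_usage(usage)        # placeholder for the token tree update
    _record_usage_cost(model, usage)  # placeholder for the cost tree update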

prices = json.load(f)

# TODO: validate price format is correct!
# TODO: Can we load this file elsewhere instead of in every sample?
Collaborator:

I think this is a good idea, rather than loading + validating the file every time we instantiate a _PriceLimit.

I'd lean towards putting this in its own module (e.g. _cost.py or _price.py).

Something as simple as a global variable in that module would be my starting point. Here's a precedent of this sort of thing

global _inspect_ai_eps_loaded_all
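Concretely, the starting point could be as simple as this sketch of a hypothetical _cost.py (names are placeholders):

import json

# Module-level cache: loaded and validated once, then shared by all samples.
_model_prices: dict | None = None


def init_model_prices(path: str) -> None:
    global _model_prices
    with open(path) as f:
        _model_prices = json.load(f)
        # validation of the price schema would happen here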

Contributor (Author):

I used an lru cache within the _cost.py file to cache reads of the file. But I'm wondering if it might be better to initialize a global with the path to the costs file (and perhaps another global with the contents) instead of passing that path around everywhere.
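Roughly, the caching looks like this (a simplified sketch, not the exact code in the PR):

import json
from functools import lru_cache


@lru_cache(maxsize=1)
def load_model_prices(path: str) -> dict:
    # Repeated calls with the same path return the cached dict instead of
    # re-reading the file; the path still has to be threaded through to each
    # caller, which is the part I'd like to avoid.
    with open(path) as f:
        return json.load(f)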

set_model_usage(model, usage, sample_model_usage_context_var.get(None))
set_model_usage(model, usage, model_usage_context_var.get(None))
record_model_usage(usage)
record_model_usage_price(usage)
Collaborator:

I think this record_and_check_model_usage() function is only called from _model.generate() if the inspect cache was not hit - which I don't think is what you were hoping for?

return existing, event

I'd recommend double checking this claim though - this is just based on my skimming the source.

}
price_file.write_text(json.dumps(data))

model = get_model(
Contributor (Author):

@craigwalton-dsit any tips for generating a single trajectory using two (mock) models in a unit test? I don't think I can just do eval(model=[model1, model2]) if I want to generate a single trajectory using both models

Collaborator:

My first attempt would be something like defining your own solver (rather than using the default generate()) and, within the solver, doing something like:

get_model("mockllm/model1").generate(...)
get_model("mockllm/model2").generate(...)

I've not tried this before, but it looks like mockllm doesn't care about what model name you give it.
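A rough, untested sketch of that shape (assuming mockllm accepts arbitrary model names):

from inspect_ai.model import get_model
from inspect_ai.solver import Generate, TaskState, solver


@solver
def two_model_solver():
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # Generate with two different mock models within the same sample so
        # the cost limit has to aggregate usage across both of them.
        await get_model("mockllm/model1").generate(state.input_text)
        await get_model("mockllm/model2").generate(state.input_text)
        return state

    return solve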

@andrew-aisi force-pushed the feat/cost_cutoff branch 4 times, most recently from 3d397c5 to e810133 on June 9, 2025 at 02:29
@andrew-aisi (Contributor, Author) commented Jun 9, 2025

I incorporated the suggested changes and squashed the commits to keep the history from getting too messy. I think this is now working, but it looks like I'll need to fix a few more things to make CI happy.

Beyond the basic functionality, I have a few open questions and could use help writing docs (or I can get to that eventually):

  1. Should we show the running/final cost per sample in the UI somewhere? Right now the cost numbers aren't shown or stored anywhere unless the limit is exceeded.
  2. The details of some usage (messages and tokens) are stored in the active sample (e.g., here) while other usage info (e.g., working time) isn't. It seems somewhat redundant with the main limit code, but I tried to support this in 22e6446. Since I'm not sure of the point of it, I didn't test it - I could remove it if it's unnecessary. Maybe it's connected with showing these things in the UI?
  3. I'm still a bit confused by caching. I checked that the limit was working as expected based on the token usage info I get from inspect eval (e.g., with cache reads labeled "CR"). I think this is provider-side caching (which is a discount I do want to account for here), but I didn't get local inspect caching to work. In 411c573 I tried to add support for treating these cache hits as non-free and to add a test case, but I got stuck on the test case. Are there any existing tests with multiple models that I could build off?

Thanks and let me know if there's anything else you'd want changed here.

@craigwalton-dsit (Collaborator) commented:

Apologies for the delay in getting back to you Andrew. This is progressing well.

  1. I'd lean toward keeping this simple and not showing it in the TUI (e.g. rich) for now, but I'm happy to go with whatever JJ thinks. This might enable you to revert some changes.
  2. You can remove this. The reason the token and message limits are stored on the TaskState (and that they're mutable) is to preserve backward compatibility. (Before we had "scoped" limits, users used to modify TaskState.token_limit on the fly as a mechanism to, for example, limit how many tokens a sub-agent could use.)
  3. I believe that tokens shown as "CR" are indeed provider-side cache reads, and it sounds reasonable that you're accounting for this. In terms of tests using multiple models, the only ones coming to mind are ones with a distinct scorer model (which won't help you, as you're not tracking cost on the scoring). But see my inline comment about trying mockllm with multiple model names.

I'll do a pass over the code now and see if I have anything else to add. For reference, I have 3 weeks remaining at AISI and am happy to advise on this feature, or contribute myself - whatever works best for you.

)
@click.option(
"--cost-limit",
type=float,
Collaborator:

I wonder if this can be Decimal too? No worries if not.
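For what it's worth, click generally accepts a plain conversion callable as the type, so something like this sketch might be enough (though invalid input may surface as a raw decimal exception rather than a friendly usage error unless it's wrapped in a custom ParamType):

from decimal import Decimal

import click


@click.command()
@click.option(
    "--cost-limit",
    type=Decimal,  # click wraps the callable, so "1.50" parses to Decimal("1.50")
    default=None,
    help="Limit on idealized inference cost (in USD) for each sample.",
)
def main(cost_limit: Decimal | None) -> None:
    click.echo(f"cost limit: {cost_limit}")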

@click.option(
"--cost-limit",
type=float,
help="Limit on idealized inference cost for each sample, assuming no local caching (treats the local inspect cache reads as real token spend). Must be used with a cost file (--cost-file)",
Collaborator:

You mention USD in some of the param docstrings, which I think is reasonable, but if we're settling on USD as the cost unit (rather than leaving it undefined and letting the user decide based on their cost file) we should maybe mention USD here too, e.g.

Suggested change
help="Limit on idealized inference cost for each sample, assuming no local caching (treats the local inspect cache reads as real token spend). Must be used with a cost file (--cost-file)",
help="Limit on idealized inference cost (in USD) for each sample, assuming no local caching (treats the local inspect cache reads as real token spend). Must be used with a cost file (--cost-file)",

"type": "number"
},
{
"type": "string"
Collaborator:

Is the string type required in the schema, or will number suffice? Haven't run this myself, so appreciate there may be a good reason for it!

active.total_messages = total_messages


def set_active_sample_cost_limit(cost_limit: Decimal | None) -> None:
Collaborator:

I think these can be removed if we decide to not show the live cost/cost limit in the TUI (as asked in your question 1.)

total_cost = calculate_model_usage_cost(
{cache_entry.model: existing.usage}
)
record_model_usage_cost(total_cost)
@craigwalton-dsit (Collaborator) commented Jun 13, 2025:

Should we also be checking the cost limit here too?

The fact that this takes a different code path from the case where we don't use the local cache highlights that we're taking a different approach with cost than we do with tokens: token usage does not include locally cached tokens.

Just a note to say I think we'll have to make the distinction crystal clear in the docs site or it could lead to surprises. I note you've already made this clear in the docstrings.
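i.e. roughly (reusing names from this diff; check_model_usage_cost() is a hypothetical stand-in for whatever raises when the limit is exceeded):

# Sketch: mirror the non-cached path by also enforcing the limit after
# recording cost for a local cache hit.
total_cost = calculate_model_usage_cost({cache_entry.model: existing.usage})
record_model_usage_cost(total_cost)
check_model_usage_cost()  # hypothetical limit check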


@property
def cost_limit(self) -> Decimal | None:
"""Limit on total messages allowed per conversation."""
Collaborator:

I think we're removing this as discussed in the other comment, but if not something along these lines:

Suggested change
"""Limit on total messages allowed per conversation."""
"""Limit on total token cost allowed per conversation."""

if TYPE_CHECKING:
from inspect_ai.model import ModelUsage

getcontext().prec = 10
Collaborator:

I think this will affect the precision of all decimals in the process. While there aren't any other usages of the decimal module in Inspect itself, there could potentially be some in user code. I don't think this is necessarily a problem, just something to be aware of. I presume this is set just to reduce the data storage requirements and compute requirements compared to the default 28?
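If scoping the precision ever becomes necessary, a local context is one option (a sketch, not a request to change it):

from decimal import Decimal, localcontext


def cost_at_reduced_precision(tokens: int, price_per_token: Decimal) -> Decimal:
    # Only arithmetic inside this block uses prec=10; code elsewhere keeps
    # the default context (prec=28).
    with localcontext() as ctx:
        ctx.prec = 10
        return Decimal(tokens) * price_per_token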
