Add price limit for samples #1930
Conversation
craigwalton-dsit left a comment
Thanks for pioneering this Andrew, nice job! As requested, I've done a pass over this and have left some initial thoughts as comments.
I think the biggest remaining questions are around token price schemas, validation, and whether we support all the different token cost methodologies (e.g. reasoning, cache writes, batched requests, etc.).
src/inspect_ai/util/_limit.py
Outdated
return _PriceLimit(limit, price_file)

def record_model_usage_price(usage: ModelUsage) -> None:
I think we'll also have to accept a model: str parameter here, as we'll need to cost the tokens based on which model produced them (I appreciate you mentioned this in a todo). It's entirely possible to use multiple models within a single sample - the easiest way is get_model("other model").generate().
I'm also weighing up whether we should merge record_model_usage_price() and record_model_usage().
I'm also weighing up whether we should merge record_model_usage_price() and record_model_usage().
I've been thinking about this too. I think it could work to just use a single tree tracking token usage and then in the _CostLimit and _TokenLimit implementations check if the limits are exceeded. But I wasn't sure if this would cause problems. Will hack on this and see if I get somewhere.
Ah, I meant you could keep the two tree structures (might be more understandable that way), but combine record_model_usage_price() and record_model_usage() and call out to both trees from within the one function.
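Something along these lines is what I have in mind (the tree bookkeeping and the cost helper below are just placeholders for illustration, not the actual Inspect internals):

```python
from decimal import Decimal

from inspect_ai.model import ModelUsage

# Placeholder per-sample trackers standing in for the real token and cost trees.
_token_usage: list[ModelUsage] = []
_cost_usage: list[Decimal] = []


def _cost_of(model: str, usage: ModelUsage) -> Decimal:
    # Stand-in for the PR's calculate_model_usage_cost({model: usage}).
    return Decimal(0)


def record_model_usage(model: str, usage: ModelUsage) -> None:
    """Single entry point that feeds both the token tree and the cost tree."""
    _token_usage.append(usage)                  # token-limit tracking (unchanged)
    _cost_usage.append(_cost_of(model, usage))  # cost-limit tracking, keyed by model
```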
src/inspect_ai/util/_limit.py
Outdated
prices = json.load(f)

# TODO: validate price format is correct!
# TODO: Can we load this file elsewhere instead of in every sample?
I think this is a good idea, rather than loading + validating the file every time we instantiate a _PriceLimit.
I'd lean towards putting this in its own module (e.g. _cost.py or _price.py).
Something as simple as a global variable in that module would be my starting point. Here's a precedent of this sort of thing:
global _inspect_ai_eps_loaded_all
I used an lru_cache within the _cost.py file to cache reads of the file, but I'm wondering whether it would be better to initialize a global with the path to the costs file (and perhaps another global with its contents) instead of passing that path around everywhere.
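For illustration, here's roughly what that module-level variant could look like (a hypothetical _price.py; the names and globals are just a sketch, not what the PR currently does):

```python
# _price.py (hypothetical) -- load and cache the price file once per process.
import json
from decimal import Decimal
from functools import lru_cache
from pathlib import Path

# Set once at eval startup (e.g. from --cost-file); read by every sample.
_price_file: Path | None = None


def init_prices(price_file: Path) -> None:
    """Record the price file path globally so samples don't need it passed in."""
    global _price_file
    _price_file = price_file


@lru_cache(maxsize=1)
def _load_prices(price_file: Path) -> dict[str, dict[str, Decimal]]:
    """Parse (and eventually validate) the price file a single time per path."""
    raw = json.loads(price_file.read_text())
    return {
        model: {
            key: Decimal(str(value))
            for key, value in fields.items()
            if isinstance(value, (int, float))  # keep only numeric price fields
        }
        for model, fields in raw.items()
    }


def model_prices(model: str) -> dict[str, Decimal]:
    """Look up per-token prices for a model from the cached file contents."""
    if _price_file is None:
        raise RuntimeError("price file has not been initialised")
    return _load_prices(_price_file)[model]
```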
src/inspect_ai/model/_model.py
Outdated
set_model_usage(model, usage, sample_model_usage_context_var.get(None))
set_model_usage(model, usage, model_usage_context_var.get(None))
record_model_usage(usage)
record_model_usage_price(usage)
I think this record_and_check_model_usage() function is only called from _model.generate() if the inspect cache was not hit, which I don't think is what you were hoping for?
inspect_ai/src/inspect_ai/model/_model.py, line 617 in 0ae0df9:
return existing, event
I'd recommend double checking this claim though - this is just based on my skimming the source.
}
price_file.write_text(json.dumps(data))

model = get_model(
@craigwalton-dsit any tips for generating a single trajectory using two (mock) models in a unit test? I don't think I can just do eval(model=[model1, model2]) if I want to generate a single trajectory using both models
My first attempt would be something like defining your own solver (rather than using the default generate()) and, within the solver, doing something like:
get_model("mockllm/model1").generate(...)
get_model("mockllm/model2").generate(...)
I've not tried this before, but it looks like mockllm doesn't care about what model name you give it.
Force-pushed from 3d397c5 to e810133.
I incorporated the suggested changes and squashed the commits to keep the history from getting too messy. I think this is now working, but it looks like I'll need to fix a few more things to make CI happy. Beyond the basic functionality, I have a few open questions and could use help writing docs (or I can get to that eventually).
Thanks, and let me know if there's anything else you'd want changed here.
Adds test with multiple models to validate behavior.
Force-pushed from af11026 to 7aac770.
Apologies for the delay in getting back to you Andrew. This is progressing well.
I'll do a pass over the code now and see if I have anything else to add. For reference, I have 3 weeks remaining at AISI and am happy to advise on this feature, or contribute myself - whatever works best for you.
)
@click.option(
    "--cost-limit",
    type=float,
I wonder if this can be Decimal too? No worries if not.
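I believe click supports custom parameter types, so something like this untested sketch should do it (the command and option defaults here are just for illustration):

```python
from decimal import Decimal, InvalidOperation

import click


class DecimalParamType(click.ParamType):
    """Click parameter type that converts option values to decimal.Decimal."""

    name = "decimal"

    def convert(self, value, param, ctx):
        if isinstance(value, Decimal):
            return value
        try:
            return Decimal(value)
        except InvalidOperation:
            self.fail(f"{value!r} is not a valid decimal number", param, ctx)


@click.command()
@click.option(
    "--cost-limit",
    type=DecimalParamType(),
    default=None,
    help="Limit on idealized inference cost (in USD) for each sample.",
)
def eval_command(cost_limit: Decimal | None) -> None:
    click.echo(f"cost limit: {cost_limit}")
```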
@click.option(
    "--cost-limit",
    type=float,
    help="Limit on idealized inference cost for each sample, assuming no local caching (treats the local inspect cache reads as real token spend). Must be used with a cost file (--cost-file)",
You mention USD in some of the param docstrings, which I think is reasonable, but if we're settling on having the cost unit as USD (rather than undefined, allowing the user to decide based on their cost file) we should maybe mention USD here too e.g.
| help="Limit on idealized inference cost for each sample, assuming no local caching (treats the local inspect cache reads as real token spend). Must be used with a cost file (--cost-file)", | |
| help="Limit on idealized inference cost (in USD) for each sample, assuming no local caching (treats the local inspect cache reads as real token spend). Must be used with a cost file (--cost-file)", |
| "type": "number" | ||
| }, | ||
| { | ||
| "type": "string" |
Is the string type required in the schema, or will number suffice? Haven't run this myself, so appreciate there may be a good reason for it!
active.total_messages = total_messages

def set_active_sample_cost_limit(cost_limit: Decimal | None) -> None:
I think these can be removed if we decide to not show the live cost/cost limit in the TUI (as asked in your question 1.)
total_cost = calculate_model_usage_cost(
    {cache_entry.model: existing.usage}
)
record_model_usage_cost(total_cost)
Should we also be checking the cost limit here too?
The fact that this takes a different code path from the case where we don't use the local cache highlights that we're taking a different approach with cost than we do with tokens: token usage does not include locally cached tokens.
Just a note to say I think we'll have to make the distinction crystal clear in the docs site or it could lead to surprises. I note you've already made this clear in the docstrings.
@property
def cost_limit(self) -> Decimal | None:
    """Limit on total messages allowed per conversation."""
I think we're removing this as discussed in the other comment, but if not something along these lines:
| """Limit on total messages allowed per conversation.""" | |
| """Limit on total token cost allowed per conversation.""" |
if TYPE_CHECKING:
    from inspect_ai.model import ModelUsage

getcontext().prec = 10
I think this will affect the precision of all decimals in the process. While there aren't any other usages of the decimal module in Inspect itself, there could potentially be some in user code. I don't think this is necessarily a problem, just something to be aware of. I presume this is set just to reduce the data storage requirements and compute requirements compared to the default 28?
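If the global context ever becomes a problem, one option would be to scope the precision to the cost arithmetic with a local context - a quick illustration of the difference:

```python
from decimal import Decimal, localcontext

# Scoping the precision avoids mutating the process-wide context that user
# code might also be relying on.
with localcontext() as ctx:
    ctx.prec = 10
    per_token = Decimal(1) / Decimal(3)   # rounded to 10 significant digits here

print(per_token)                 # 0.3333333333
print(Decimal(1) / Decimal(3))   # outside the block: default 28-digit precision
```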
This PR contains:
What is the current behavior? (You can also link to an open issue here)
Users cannot set a price-based limit for samples (#1152).
What is the new behavior?
Users can set a price limit for samples if they provide a JSON file that maps model names to per-token prices for output, cached input, and unique (uncached) input. As an example, the model_prices_and_context_window.json file from litellm provides this information.
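For illustration, a minimal price file could be generated like this (the field names follow litellm's model_prices_and_context_window.json as I understand it, and the dollar figures are made up; the exact keys expected here may differ):

```python
import json
from pathlib import Path

# Hypothetical per-token prices in USD; keys mirror litellm's published file.
prices = {
    "openai/gpt-4o": {
        "input_cost_per_token": 2.5e-06,          # unique (uncached) input tokens
        "output_cost_per_token": 1e-05,           # output tokens
        "cache_read_input_token_cost": 1.25e-06,  # provider-cached input tokens
    }
}
Path("model_prices.json").write_text(json.dumps(prices, indent=2))
```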
Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
No
Other information:
My goal here is to allow for cost-normalized comparisons, not to ensure end users are actually billed a specified amount. As such, this design ignores the fact that Inspect could load model responses from its own internal cache. Assuming token prices are correctly specified, a sample should never cost the user more than the limit they've set, but it may cost less due to local caching. Note that provider-side caching is accounted for in the cost calculations.
I don't like the current setup where the pricing file path is passed all the way to individual samples that each have to read it. I'd prefer to store pricing info globally, but wasn't sure how to best do so - any guidance would be appreciated.
In this design I was thinking it could be good to support a scenario where multiple models are used for a single sample (e.g., with a multi-agent design), but I'm not sure this is something Inspect actually supports.
This is related to, but doesn't solve, #980. I think it could be extended to do so fairly easily, but wanted to get initial feedback on this approach first. I think that issue might be best solved by tracking actual cost details via the API rather than based on tokens, but perhaps this "idealized cost" could be worth reporting to the user (though I'm inclined to think it might be more confusing).
Leaving this as a draft for now due to the above, but also this still needs: