feat(langchain): (SummarizationMiddleware) support use of model context windows when triggering summarization #33825
Conversation
This looks awesome! Super excited about this improvement
total_tokens + max_output_tokens + self.buffer_tokens > max_input_tokens
Have we checked with the applied AI team whether it makes sense to have self.buffer_tokens default to 0? Related: do we want to max out at max_input_tokens? Is there a downside to filling the window to the brim?
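For reference, a minimal sketch of the check being discussed, assuming the limits come from the model's profile; the function and parameter names here are illustrative, not the middleware's actual attributes:

```python
from langchain_core.messages import AnyMessage
from langchain_core.messages.utils import count_tokens_approximately


def would_overflow(
    messages: list[AnyMessage],
    max_input_tokens: int,
    max_output_tokens: int,
    buffer_tokens: int = 0,  # a 0 default fills the window to the brim
) -> bool:
    """Return True if the next model call would likely exceed the context window."""
    total_tokens = count_tokens_approximately(messages)
    # Reserve room for the model's reply plus an optional safety buffer.
    return total_tokens + max_output_tokens + buffer_tokens > max_input_tokens
```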
        token_counter: TokenCounter = count_tokens_approximately,
        summary_prompt: str = DEFAULT_SUMMARY_PROMPT,
        summary_prefix: str = SUMMARY_PREFIX,
        *,
let's move this up higher
        self.summary_prompt = summary_prompt
        self.summary_prefix = summary_prefix
        self.buffer_tokens = buffer_tokens
        self.trim_token_limit = trim_token_limit
I'm not sure how necessary trim_token_limit is
It's a bit confusing as an argument (a few thoughts I had):
- does this mean the maximum number of tokens that we can trim?
- does this mean the max number of tokens left after trimming?
- when does trimming even occur, and why? Are we trimming instead of summarizing?
If someone runs into an error while summarizing, shouldn't they just lower max_tokens_before_summary?
Pass None to skip trimming entirely (risking summary model overflows if the history is too long).
Imo the best solution here is just to "compact earlier" instead of "trimming at compaction time". The summarization call is maybe an extra few thousand tokens at most?
I renamed this parameter to trim_tokens_to_summarize. The purpose of adding it as a parameter is to let us bypass the trimming entirely. The current behavior is to always trim what is sent to the LLM for summarization to 4000 tokens:
langchain/libs/langchain_v1/langchain/agents/middleware/summarization.py
Lines 221 to 249 in 915c446
    def _create_summary(self, messages_to_summarize: list[AnyMessage]) -> str:
        """Generate summary for the given messages."""
        if not messages_to_summarize:
            return "No previous conversation history."

        trimmed_messages = self._trim_messages_for_summary(messages_to_summarize)
        if not trimmed_messages:
            return "Previous conversation was too long to summarize."

        try:
            response = self.model.invoke(self.summary_prompt.format(messages=trimmed_messages))
            return cast("str", response.content).strip()
        except Exception as e:  # noqa: BLE001
            return f"Error generating summary: {e!s}"

    def _trim_messages_for_summary(self, messages: list[AnyMessage]) -> list[AnyMessage]:
        """Trim messages to fit within summary generation limits."""
        try:
            return trim_messages(
                messages,
                max_tokens=_DEFAULT_TRIM_TOKEN_LIMIT,
                token_counter=self.token_counter,
                start_on="human",
                strategy="last",
                allow_partial=True,
                include_system=True,
            )
        except Exception:  # noqa: BLE001
            return messages[-_DEFAULT_FALLBACK_MESSAGE_COUNT:]
So this was a minimal, non-breaking change that lets us disable that behavior. Let me know if that makes sense, or if I've misunderstood you.
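For illustration, a rough sketch of what opting out could look like with the renamed parameter; the model name is a placeholder and the exact constructor signature is whatever lands in this PR:

```python
from langchain.agents.middleware.summarization import SummarizationMiddleware
from langchain.chat_models import init_chat_model

middleware = SummarizationMiddleware(
    model=init_chat_model("openai:gpt-4o-mini"),  # placeholder model
    max_tokens_before_summary=4000,  # existing trigger threshold
    # None skips the pre-summary trim entirely instead of capping at 4000 tokens.
    trim_tokens_to_summarize=None,
)
```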
        tokens_to_keep: float | None = None,
        token_counter: TokenCounter = count_tokens_approximately,
        summary_prompt: str = DEFAULT_SUMMARY_PROMPT,
        summary_prefix: str = SUMMARY_PREFIX,
this parameter is unused
Here we use model profiles to allow summarization behavior to vary with model context window sizes.
We make these changes:
When is summarization triggered?
We add a `trigger` parameter which subsumes `max_tokens_before_summary` (usage is sketched below):
- `trigger=("tokens", 1000)` is equivalent to `max_tokens_before_summary=1000`.
- We also support `trigger=("messages", X)` and `trigger=("fraction", X)`. The latter will use a fraction of a model's context window if model profiles are available.
- `trigger=[("tokens", 1000), ("messages", 50)]` will trigger summarization if either condition is met.
- We retain runtime support for `max_tokens_before_summary`, although it will emit a deprecation warning and generate a typing error.
- `trigger=None` (or `max_tokens_before_summary=None`) disables summarization, consistent with how `max_tokens_before_summary` was documented (and resolving #33701).
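A rough sketch of how these trigger forms would be passed, assuming the constructor described in this PR; the model name and threshold values are placeholders:

```python
from langchain.agents.middleware.summarization import SummarizationMiddleware
from langchain.chat_models import init_chat_model

model = init_chat_model("openai:gpt-4o-mini")  # placeholder model

# Equivalent to the old max_tokens_before_summary=1000.
by_tokens = SummarizationMiddleware(model=model, trigger=("tokens", 1000))

# Summarize once ~80% of the model's context window is used
# (requires a model profile that reports the window size).
by_fraction = SummarizationMiddleware(model=model, trigger=("fraction", 0.8))

# Summarize when either condition is met.
by_either = SummarizationMiddleware(
    model=model,
    trigger=[("tokens", 1000), ("messages", 50)],
)
```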
What context is summarized?
We add a `keep` parameter which subsumes `messages_to_keep` (examples below):
- `keep=("messages", 20)` is equivalent to `messages_to_keep=20`.
- We also support `keep=("tokens", X)` and `keep=("fraction", X)`.
- We default to `("messages", 20)` as before.

Note: we also enable all of this context to be sent to the LLM for summarization.