increase timeout to receive first token #263

Merged
mtblanton merged 1 commit into main from raise-overloaded-timeout on Apr 3, 2025
@mtblanton
Contributor

Based on discussion from Slack: https://allenai.slack.com/archives/C07530Y79Q9/p1743707972179919?thread_ts=1743706445.356419&cid=C07530Y79Q9

The timeout we set for the Tulu3 405B release was a little too short to work for every model. This was causing Molmo to get "overloaded" errors when it shouldn't have. @codeviking said that Molmo's p99 time to first token (TTFT) was 10s, so we're raising the timeout to 15s, a little above that, to play it safe.

@mtblanton mtblanton requested review from schmmd and yensung April 3, 2025 19:26
@mtblanton mtblanton self-assigned this Apr 3, 2025
@mtblanton mtblanton requested a review from codeviking April 3, 2025 19:27

  # We handle the first chunk differently since we want to timeout if it takes longer than 5 seconds
- first_chunk = results.get(5.0)
+ first_chunk = results.get(15.0)
Contributor Author

This is the real change, the rest is ruff formatting.
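
For readers without the surrounding file, here is a minimal sketch of how a first-chunk timeout like this can work. It assumes `results` is a `multiprocessing.pool.AsyncResult`, whose `get(timeout)` raises `multiprocessing.TimeoutError` when the deadline passes. The names `stream_with_first_chunk_timeout` and `FirstChunkTimeout` and the single-worker `ThreadPool` are hypothetical and are not taken from this repository.

```python
from multiprocessing import TimeoutError as PoolTimeoutError
from multiprocessing.pool import ThreadPool
from typing import Iterator

# Value from this PR: a little above Molmo's reported TTFT p99 of 10s.
FIRST_CHUNK_TIMEOUT_SECONDS = 15.0

# Marks an upstream stream that ended before producing anything.
_EMPTY = object()


class FirstChunkTimeout(Exception):
    """Hypothetical stand-in for the service's "overloaded" error."""


def stream_with_first_chunk_timeout(
    chunks: Iterator[str],
    timeout: float = FIRST_CHUNK_TIMEOUT_SECONDS,
) -> Iterator[str]:
    """Yield chunks, failing fast if the first one is slower than `timeout`."""
    pool = ThreadPool(processes=1)
    try:
        # Fetch only the first chunk on a worker thread so the wait can be bounded.
        results = pool.apply_async(next, (chunks, _EMPTY))
        try:
            first_chunk = results.get(timeout)
        except PoolTimeoutError as error:
            raise FirstChunkTimeout(
                f"no first chunk within {timeout} seconds"
            ) from error
        if first_chunk is _EMPTY:
            return
        yield first_chunk
        # Later chunks are streamed without a deadline in this sketch.
        yield from chunks
    finally:
        pool.terminate()
```

Under these assumptions only the first chunk is guarded, so a model that starts slowly but then streams normally is not cut off mid-response.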

@codeviking codeviking left a comment

lgtm!

@mtblanton mtblanton merged commit 21f053a into main Apr 3, 2025
3 checks passed
@mtblanton mtblanton deleted the raise-overloaded-timeout branch April 3, 2025 19:30
yensung added a commit that referenced this pull request Apr 4, 2025
* main:
  increase timeout to receive first token (#263)
  handle the server-overloaded case (#261)
mtblanton added a commit that referenced this pull request Jun 13, 2025