-
Notifications
You must be signed in to change notification settings - Fork 10k
[Changelog] Workers AI #21610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
[Changelog] Workers AI #21610
Changes from 5 commits
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
401dbdd
[Changelog] Workers AI
kodster28 2b85c60
hidden prop
kodster28 c5cb810
Update src/content/changelog/workers-ai/2025-04-11-new-models-faster-…
kodster28 c96ccd9
Update src/content/changelog/workers-ai/2025-04-11-new-models-faster-…
kodster28 289cfa7
Merge branch 'production' into workers-ai-changelog-04-10
kodster28 3ae6eba
Update src/content/changelog/workers-ai/2025-04-11-new-models-faster-…
kodster28 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
50 changes: 50 additions & 0 deletions
50
src/content/changelog/workers-ai/2025-04-11-new-models-faster-inference.mdx
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| --- | ||
| title: Workers AI for Developer Week - faster inference, new models, async batch API, expanded LoRA support | ||
| description: Workers AI now has more models, faster inference, async batch API, and expanded LoRA | ||
| date: 2025-04-11T18:00:00Z | ||
| hidden: true | ||
| --- | ||
|
|
||
| Happy Developer Week 2025! Workers AI is excited to announce a couple of new features and improvements available today. Check out our [blog](https://blog.cloudflare.com/workers-ai-improvements) for all the announcement details. | ||
|
|
||
| ### Faster inference + New models | ||
|
|
||
| We’re rolling out some in-place improvements to our models that can help speed up inference by 2-4x! Users of the models below will enjoy an automatic speed boost starting today: | ||
|
|
||
| - [`@cf/meta/llama-3.3-70b-instruct-fp8-fast`](/workers-ai/models/llama-3.3-70b-instruct-fp8-fast/) gets a speed boost of 2-4x, leveraging techniques like speculative decoding, prefix caching, and an updated inference backend. | ||
| - [`@cf/baai/bge-small-en-v1.5`](/workers-ai/models/bge-small-en-v1.5/), [`@cf/baai/bge-base-en-v1.5`](/workers-ai/models/bge-base-en-v1.5/), [`@cf/baai/bge-large-en-v1.5`](/workers-ai/models/bge-large-en-v1.5/) get an updated back end, which should improve inference times by 2x. | ||
| - With the `bge` models, we’re also announcing a new parameter called `pooling` which can take `cls` or `mean` as options. We highly recommend using `pooling: cls` which will help generate more accurate embeddings. However, embeddings generated with cls pooling are not backwards compatible with mean pooling. For this to not be a breaking change, the default remains as mean pooling. Please specify `pooling: cls` to enjoy more accurate embeddings going forward. | ||
|
|
||
| We’re also excited to launch a few new models in our catalog to help round out your experience with Workers AI. We’ll be deprecating some older models in the future, so stay tuned for a deprecation announcement. Today’s new models include: | ||
|
|
||
| - [`@cf/mistralai/mistral-small-3.1-24b-instruct`](/workers-ai/models/mistral-small-3.1-24b-instruct/): a 24B parameter model achieving state-of-the-art capabilities comparable to larger models, with support for vision and tool calling. | ||
| - [`@cf/google/gemma-3-12b-it`](/workers-ai/models/gemma-3-12b-it/): well-suited for a variety of text generation and image understanding tasks, including question answering, summarization and reasoning, with a 128K context window, and multilingual support in over 140 languages. | ||
| - [`@cf/qwen/qwq-32b`](/workers-ai/models/qwq-32b/): a medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini. | ||
| - [`@cf/qwen/qwen2.5-coder-32b-instruct`](/workers-ai/models/qwen2.5-coder-32b-instruct/): the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o. | ||
|
|
||
| ### Batch Inference | ||
|
|
||
| Introducing a new batch inference feature that allows you to send us an array of requests, which we will fulfill as fast as possible and send them back as an array. This is really helpful for large workloads such as summarization, embeddings, etc. where you don’t have a human-in-the-loop. Using the batch API will guarantee that your requests are fulfilled eventually, rather than erroring out if we don’t have enough capacity at a given time. | ||
|
|
||
| Check out the [tutorial](/workers-ai/features/batch-api/) to get started! Models that support batch inference today include: | ||
|
|
||
| - [`@cf/meta/llama-3.3-70b-instruct-fp8-fast`](/workers-ai/models/llama-3.3-70b-instruct-fp8-fast/) | ||
| - [`@cf/baai/bge-small-en-v1.5`](/workers-ai/models/bge-small-en-v1.5/) | ||
| - [`@cf/baai/bge-base-en-v1.5`](/workers-ai/models/bge-base-en-v1.5/) | ||
| - [`@cf/baai/bge-large-en-v1.5`](/workers-ai/models/bge-large-en-v1.5/) | ||
| - [`@cf/baai/bge-m3`](/workers-ai/models/bge-m3/) | ||
| - [`@cf/meta/m2m100-1.2b`](/workers-ai/models/m2m100-1.2b/) | ||
|
|
||
| ### Expanded LoRA support | ||
|
|
||
| We’ve upgraded our LoRA experience to include 8 newer models, and can support ranks of up to 32 with a 300MB safetensors file limit (previously limited to rank of 8 and 100MB safetensors) Check out our [LoRAs page](/workers-ai/features/fine-tunes/loras/) to get started. Models that support LoRAs now include: | ||
|
|
||
| - [`@cf/meta/llama-3.2-11b-vision-instruct`](/workers-ai/models/llama-3.2-11b-vision-instruct/) | ||
| - [`@cf/meta/llama-3.3-70b-instruct-fp8-fast`](/workers-ai/models/llama-3.3-70b-instruct-fp8-fast/) | ||
| - [`@cf/meta/llama-guard-3-8b`](/workers-ai/models/llama-guard-3-8b/) | ||
| - [`@cf/meta/llama-3.1-8b-instruct-fast`](/workers-ai/models/llama-3.1-8b-instruct-fast/) (coming soon) | ||
| - [`@cf/deepseek-ai/deepseek-r1-distill-qwen-32b`](/workers-ai/models/deepseek-r1-distill-qwen-32b/) (coming soon) | ||
| - [`@cf/qwen/qwen2.5-coder-32b-instruct`](/workers-ai/models/qwen2.5-coder-32b-instruct/) | ||
| - [`@cf/qwen/qwq-32b`](/workers-ai/models/qwq-32b/) | ||
| - [`@cf/mistralai/mistral-small-3.1-24b-instruct`](/workers-ai/models/mistral-small-3.1-24b-instruct/) | ||
| - [`@cf/google/gemma-3-12b-it`](/workers-ai/models/gemma-3-12b-it/) | ||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.