docs/projects/projects.mdx (10 additions, 2 deletions)
@@ -89,7 +89,7 @@ Deployments can scale based on SLO-driven metrics such as queue depth, TTFT, end
### 3.11 SLO-aware autoscaling
-Auto-scaling is essential for delivering fast and cost-effective GenAI workloads, but every application has different service level objectives—some prioritize time to first token, others focus on end-to-end latency, throughput, or resource utilization. That’s why Bud AI Foundry supports SLO-aware autoscaling, enabling deployments to scale based on the SLOs and business priorities that matter most. The result is smarter scaling, predictable performance, and optimized costs tailored to your specific SLO demands.
+Auto-scaling is essential for delivering fast and cost-effective GenAI workloads, but every application has different service level objectives: some prioritize time to first token, while others focus on end-to-end latency, throughput, or resource utilization. That’s why Bud AI Foundry supports SLO-aware autoscaling, enabling deployments to scale based on the SLOs and business priorities that matter most. The result is smarter scaling, predictable performance, and optimized costs tailored to your specific SLO demands.
- Enable autoscaling in deployment Settings to scale replicas between a min/max range.
- Add schedule hints for predictable traffic windows and enable predictive scaling for demand forecasting.
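The scaling behavior described above can be sketched as a simple decision function: compare each observed SLO metric to its target, scale replicas by the worst observed/target ratio, and clamp the result to the configured min/max range. This is an illustrative sketch only, not Bud AI Foundry's actual autoscaling controller; the function, metric names, and targets below are all assumptions.

```python
# Illustrative sketch of SLO-driven replica selection. NOT the actual
# Bud AI Foundry controller; names and thresholds are assumptions.

def desired_replicas(current: int, metrics: dict, targets: dict,
                     min_replicas: int, max_replicas: int) -> int:
    """Scale by the worst (largest) observed/target ratio, clamped to bounds."""
    ratios = [metrics[name] / targets[name] for name in targets if name in metrics]
    worst = max(ratios, default=1.0)
    proposed = round(current * worst)
    return max(min_replicas, min(max_replicas, proposed))

# Example: TTFT is 2x over its target, so replicas roughly double.
replicas = desired_replicas(
    current=4,
    metrics={"queue_depth": 30, "ttft_ms": 800},
    targets={"queue_depth": 50, "ttft_ms": 400},
    min_replicas=2,
    max_replicas=16,
)
print(replicas)  # 8
```

Clamping to the min/max range is what keeps a transient SLO spike from scaling a deployment without bound.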
@@ -166,6 +166,10 @@ Deployments can scale based on SLO-driven metrics such as queue depth, TTFT, end
1. View Settings to configure rate limits, retries, and fallback chains per deployment.
2. Toggle Rate Limit and choose the algorithm (fixed, sliding, or token bucket), then set per-second/minute/hour quotas and burst size.
3. Add Fallback deployment and Retry limits to harden reliability, then Save to persist the policy.
+4. Enable Autoscaling to activate SLO-aware scaling controls inside Settings.
+5. Set min/max replicas and choose the metric sources (queue depth, TTFT, TPOT, end-to-end latency, or embedding/classify latency) that should trigger scaling.
+6. Add Schedule Hints for planned traffic windows, or enable Predictive Scaling to look ahead using historical demand.
+7. Tune scaling behavior (stabilization windows and scaling policies) to keep capacity changes smooth, then save the autoscale configuration.
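The token-bucket algorithm offered in step 2 can be illustrated with a minimal limiter: the bucket refills at the per-second quota, and the burst size caps how many requests can be admitted at once. A hedged sketch under those assumptions; the product's actual implementation and parameter names may differ.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: `rate` tokens/sec refill,
    `burst` maximum capacity. Illustrative sketch only; not the
    Bud AI Foundry implementation."""

    def __init__(self, rate: float, burst: float, now=time.monotonic):
        self.rate, self.burst, self.now = rate, burst, now
        self.tokens = burst        # start full: a full burst is available
        self.last = now()

    def allow(self) -> bool:
        t = self.now()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1       # spend one token for this request
            return True
        return False               # over quota: reject (or queue/retry)
```

With `rate=1.0` and `burst=5`, five requests are admitted immediately and the sixth is rejected until the bucket refills, which is why burst size matters independently of the steady-state quota.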
#### 4.6.5 Benchmarks
@@ -336,4 +340,8 @@ Open Use this model from the deployment row to copy ready-made snippets in cURL,
**Q10. What happens when I publish a model?**
+Publishing sets token pricing (input/output, USD per selected token block) and makes the endpoint available in the Bud customer dashboard for org users. You can revisit Publish Details to review pricing history, adjust prices, or unpublish without deleting the deployment.
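As a worked example of block-based token pricing, the cost of a request is each token count divided by the block size, times the per-block price. The prices and block size below are hypothetical placeholders, not actual Bud AI Foundry pricing.

```python
# Hypothetical per-block pricing (illustrative values only, NOT actual
# Bud AI Foundry prices): USD per 1M-token block.
INPUT_USD_PER_BLOCK = 0.50
OUTPUT_USD_PER_BLOCK = 1.50
BLOCK_SIZE = 1_000_000  # tokens per pricing block

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request under block-based token pricing."""
    return (input_tokens / BLOCK_SIZE * INPUT_USD_PER_BLOCK
            + output_tokens / BLOCK_SIZE * OUTPUT_USD_PER_BLOCK)

# 200k input tokens + 50k output tokens under the rates above:
print(round(request_cost(200_000, 50_000), 4))  # 0.175
```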
+
+**Q11. How does autoscaling work for deployments?**
+
+Autoscaling is configured in the deployment Settings tab. Enable it to set min/max replicas, choose SLO-driven metrics (queue depth, TTFT, TPOT, end-to-end latency, embedding/classify latency), and optionally add schedule hints or predictive scaling. These controls let the deployment scale intelligently against performance and cost objectives.
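The predictive-scaling idea in the answer can be sketched as forecasting the next window from historical demand, for example by averaging the same hour across previous days and pre-provisioning enough replicas for the forecast. This is a deliberately simplified illustration; the product's actual forecasting model is not documented here, and all names and numbers below are assumptions.

```python
import math

def forecast_demand(history: list, hour: int) -> float:
    """Average requests/sec observed at `hour` across prior days.
    Simplified stand-in for predictive scaling, not the product's model."""
    samples = [day[hour] for day in history]
    return sum(samples) / len(samples)

def prewarm_replicas(forecast_rps: float, per_replica_rps: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas to pre-provision so the forecast load fits capacity."""
    needed = math.ceil(forecast_rps / per_replica_rps)
    return max(min_replicas, min(max_replicas, needed))

# Three days of hourly rps; forecast hour 9, each replica handles 8 rps.
history = [[10] * 24, [20] * 24, [30] * 24]
print(prewarm_replicas(forecast_demand(history, 9), 8, 2, 16))  # 3
```

Scaling ahead of a forecast window, rather than reacting after latency degrades, is what distinguishes predictive scaling and schedule hints from purely metric-driven autoscaling.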