dstackai
diff --git a/docs/docs/concepts/gateways.md b/docs/docs/concepts/gateways.md
Lines changed: 1 addition & 2 deletions
diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md
Lines changed: 56 additions & 3 deletions
diff --git a/docs/docs/quickstart.md b/docs/docs/quickstart.md
Lines changed: 3 additions & 1 deletion
diff --git a/docs/docs/reference/dstack.yml/service.md b/docs/docs/reference/dstack.yml/service.md
Lines changed: 32 additions & 0 deletions
diff --git a/scripts/docs/gen_schema_reference.py b/scripts/docs/gen_schema_reference.py
Lines changed: 1 addition & 5 deletions
diff --git a/src/dstack/_internal/core/models/configurations.py b/src/dstack/_internal/core/models/configurations.py
Lines changed: 78 additions & 0 deletions
diff --git a/src/dstack/_internal/proxy/gateway/resources/nginx/service.jinja2 b/src/dstack/_internal/proxy/gateway/resources/nginx/service.jinja2
Lines changed: 14 additions & 4 deletions
diff --git a/src/dstack/_internal/proxy/gateway/routers/registry.py b/src/dstack/_internal/proxy/gateway/routers/registry.py
Lines changed: 1 addition & 0 deletions
diff --git a/src/dstack/_internal/proxy/gateway/schemas/registry.py b/src/dstack/_internal/proxy/gateway/schemas/registry.py
Lines changed: 2 additions & 0 deletions
diff --git a/src/dstack/_internal/proxy/gateway/services/nginx.py b/src/dstack/_internal/proxy/gateway/services/nginx.py
Lines changed: 19 additions & 0 deletions
@@ -1,8 +1,7 @@
 # Gateways
 
 Gateways manage the ingress traffic of running [services](services.md),
-provide an HTTPS endpoint mapped to your domain,
-and handle auto-scaling.
+provide an HTTPS endpoint mapped to your domain, and handle auto-scaling and rate limits.
 
 > If you're using [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"},
 > the gateway is already set up for you.
 
@@ -100,7 +100,7 @@ If [authorization](#authorization) is not disabled, the service endpoint require
 
     However, you'll need a gateway in the following cases:
 
-    * To use auto-scaling
+    * To use auto-scaling or rate limits
     * To enable HTTPS for the endpoint and map it to your domain
     * If your service requires WebSockets
     * If your service cannot work with a [path prefix](#path-prefix)
@@ -161,8 +161,7 @@ case `dstack` adjusts the number of replicas (scales up or down) automatically b
 
 Setting the minimum number of replicas to `0` allows the service to scale down to zero when there are no requests.
 
->The `scaling` property currently requires creating a [gateway](gateways.md).
-This requirement is expected to be removed soon.
+> The `scaling` property requires creating a [gateway](gateways.md).
 
 ### Model
 
@@ -238,6 +237,60 @@ set [`strip_prefix`](../reference/dstack.yml/service.md#strip_prefix) to `false`
 If your app cannot be configured to work with a path prefix, you can host it
 on a dedicated domain name by setting up a [gateway](gateways.md).
 
+### Rate Limits { #rate-limits }
+
+If you have a [gateway](gateways.md), you can configure rate limits for your service
+using the [`rate_limits`](../reference/dstack.yml/service.md#rate_limits) property.
+
+<div editor-title="service.dstack.yml"> 
+
+```yaml
+type: service
+image: my-app:latest
+port: 80
+
+rate_limits:
+# For /api/auth/* - 1 request per second, no bursts
+- prefix: /api/auth/
+  rps: 1
+# For other URLs - 4 requests per second + bursts of up to 9 requests
+- rps: 4
+  burst: 9
+```
+
+</div>
+
+The limit is specified in requests per second, but requests are tracked with millisecond
+granularity. For example, `rps: 4` means at most 1 request every 250 milliseconds.
+For most applications, it is recommended to set the `burst` property, which allows
+temporary bursts, but keeps the average request rate at the limit specified in `rps`.
+
+Rate limits are applied to the entire service regardless of the number of replicas.
+They are applied to each client separately, as determined by the client's IP address.
+If a client violates a limit, it receives an error with status code `429`.
+
+??? info "Partitioning key"
+    Instead of partitioning requests by client IP address,
+    you can choose to partition by the value of a header.
+
+    <div editor-title="service.dstack.yml"> 
+
+    ```yaml
+    type: service
+    image: my-app:latest
+    port: 80
+
+    rate_limits:
+    - rps: 4
+      burst: 9
+      # Apply to each user, as determined by the `Authorization` header
+      key:
+        type: header
+        header: Authorization
+    ```
+
+    </div>
+
 ### Resources
 
 If you specify memory size, you can either specify an explicit size (e.g. `24GB`) or a 
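
The interaction between `rps` and `burst` described in the Rate Limits section can be sketched with a small leaky-bucket model. This is a simplified illustration, assuming the gateway relies on Nginx `limit_req` with `nodelay`, as the `service.jinja2` template in this diff suggests. The `LeakyBucket` class and its method names are hypothetical and are not part of dstack.

```python
class LeakyBucket:
    """Simplified per-client model of a rate limit with an optional burst."""

    def __init__(self, rps: float, burst: int = 0) -> None:
        self.rps = rps
        self.burst_milli = burst * 1000  # burst in thousandths of a request
        self.excess_milli = 0.0          # how far the client is ahead of the rate
        self.last_ms = None              # arrival time of the last accepted request

    def allow(self, now_ms: int) -> bool:
        """Return True if a request arriving at now_ms passes, False for a 429."""
        if self.last_ms is None:
            self.last_ms = now_ms  # the first request always passes
            return True
        elapsed_ms = now_ms - self.last_ms
        # rps requests per second drain rps thousandths of a request per millisecond
        excess = max(0.0, self.excess_milli - self.rps * elapsed_ms + 1000)
        if excess > self.burst_milli:
            return False  # rejected requests do not change the state
        self.excess_milli = excess
        self.last_ms = now_ms
        return True


# rps: 4 without burst: requests must be at least 250 ms apart
strict = LeakyBucket(rps=4)
print([strict.allow(t) for t in (0, 100, 250, 500)])

# rps: 4 with burst: 9: count how many of 20 simultaneous requests pass
bursty = LeakyBucket(rps=4, burst=9)
print(sum(bursty.allow(0) for _ in range(20)))
```

In this model, the request at 100 ms is rejected while those spaced 250 ms apart pass. With `burst: 9`, ten back-to-back requests pass (one plus nine ahead of the rate) before the client starts receiving `429`.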
 
@@ -191,7 +191,9 @@ $ dstack init
     </div>
 
     !!! info "Gateway"
-        To enable [auto-scaling](concepts/services.md#replicas-and-scaling), or use a custom domain with HTTPS, 
+        To enable [auto-scaling](concepts/services.md#replicas-and-scaling)
+        or [rate limits](concepts/services.md#rate-limits),
+        or to use a custom domain with HTTPS,
         set up a [gateway](concepts/gateways.md) before running the service.
         If you're using [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai){:target="_blank"},
         a gateway is pre-configured for you.
 
@@ -74,6 +74,38 @@ The `service` configuration type allows running [services](../../concepts/servic
       type:
         required: true
 
+### `rate_limits`
+
+#### `rate_limits[n]`
+
+#SCHEMA# dstack._internal.core.models.configurations.RateLimit
+    overrides:
+      show_root_heading: false
+      type:
+        required: true
+
+##### `rate_limits[n].key` { data-toc-label="key" }
+
+=== "IP address"
+
+    Partition requests by client IP address.
+
+    #SCHEMA# dstack._internal.core.models.configurations.IPAddressPartitioningKey
+        overrides:
+          show_root_heading: false
+          type:
+            required: true
+
+=== "Header"
+
+    Partition requests by the value of a header.
+
+    #SCHEMA# dstack._internal.core.models.configurations.HeaderPartitioningKey
+        overrides:
+          show_root_heading: false
+          type:
+            required: true
+
 ### `retry`
 
 #SCHEMA# dstack._internal.core.models.profiles.ProfileRetry
 
@@ -76,11 +76,7 @@ def generate_schema_reference(
             # TODO: This is a dirty workaround
             if field_type:
                 if field.annotation.__name__ == "Annotated":
-                    if field_type.__name__ == "Optional":
-                        field_type = get_args(field_type)[0]
-                    if field_type.__name__ == "List":
-                        field_type = get_args(field_type)[0]
-                    if field_type.__name__ == "Union":
+                    if field_type.__name__ in ["Optional", "List", "list", "Union"]:
                         field_type = get_args(field_type)[0]
                 base_model = (
                     inspect.isclass(field_type)
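
The consolidated check above unwraps a single layer of a generic annotation through `get_args`. A minimal standalone sketch of that step follows. `unwrap_once` is a hypothetical helper that checks for type arguments instead of matching on `__name__` as the script does.

```python
from typing import List, Optional, Union, get_args


def unwrap_once(tp):
    """Return the first type argument of a generic alias, or tp itself if it has none."""
    args = get_args(tp)
    return args[0] if args else tp


print(unwrap_once(Optional[int]))        # first argument of Optional[int]
print(unwrap_once(List[int]))            # element type of typing.List
print(unwrap_once(list[int]))            # element type of the built-in generic
print(unwrap_once(Union[int, str]))      # first member of the union
print(unwrap_once(Optional[List[int]]))  # only the outer layer is removed
```

Because only one layer is removed, a nested annotation such as `Optional[List[int]]` yields `List[int]` rather than `int` in this sketch.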
 
@@ -1,4 +1,5 @@
 import re
+from collections import Counter
 from enum import Enum
 from typing import Any, Dict, List, Optional, Union
 
@@ -18,6 +19,7 @@
 
 CommandsList = List[str]
 ValidPort = conint(gt=0, le=65536)
+MAX_INT64 = 2**63 - 1
 SERVICE_HTTPS_DEFAULT = True
 STRIP_PREFIX_DEFAULT = True
 
@@ -85,6 +87,70 @@ class ScalingSpec(CoreModel):
     ] = Duration.parse("10m")
 
 
+class IPAddressPartitioningKey(CoreModel):
+    type: Annotated[Literal["ip_address"], Field(description="Partitioning type")] = "ip_address"
+
+
+class HeaderPartitioningKey(CoreModel):
+    type: Annotated[Literal["header"], Field(description="Partitioning type")] = "header"
+    header: Annotated[
+        str,
+        Field(
+            description="Name of the header to use for partitioning",
+            regex=r"^[a-zA-Z0-9-_]+$",  # prevent Nginx config injection
+            max_length=500,  # chosen arbitrarily, Nginx limit is higher
+        ),
+    ]
+
+
+class RateLimit(CoreModel):
+    prefix: Annotated[
+        str,
+        Field(
+            description=(
+                "URL path prefix to which this limit is applied."
+                " If an incoming request matches several prefixes, the longest prefix is applied"
+            ),
+            max_length=4094,  # Nginx limit
+            regex=r"^/[^\s\\{}]*$",  # prevent Nginx config injection
+        ),
+    ] = "/"
+    key: Annotated[
+        Union[IPAddressPartitioningKey, HeaderPartitioningKey],
+        Field(
+            discriminator="type",
+            description=(
+                "The partitioning key. Each incoming request belongs to a partition"
+                " and rate limits are applied per partition."
+                " Defaults to partitioning by client IP address"
+            ),
+        ),
+    ] = IPAddressPartitioningKey()
+    rps: Annotated[
+        float,
+        Field(
+            description=(
+                "Max allowed number of requests per second."
+                " Requests are tracked at millisecond granularity."
+                " For example, `rps: 10` means at most 1 request per 100ms"
+            ),
+            # should fit into Nginx limits after being converted to requests per minute
+            ge=1 / 60,
+            le=MAX_INT64 // 60,
+        ),
+    ]
+    burst: Annotated[
+        int,
+        Field(
+            ge=0,
+            le=MAX_INT64,  # Nginx limit
+            description=(
+                "Max number of requests that can be passed to the service ahead of the rate limit"
+            ),
+        ),
+    ] = 0
+
+
 class BaseRunConfiguration(CoreModel):
     type: Literal["none"]
     name: Annotated[
@@ -306,6 +372,7 @@ class ServiceConfigurationParams(CoreModel):
         Optional[ScalingSpec],
         Field(description="The auto-scaling rules. Required if `replicas` is set to a range"),
     ] = None
+    rate_limits: Annotated[list[RateLimit], Field(description="Rate limiting rules")] = []
 
     @validator("port")
     def convert_port(cls, v) -> PortMapping:
@@ -358,6 +425,17 @@ def validate_scaling(cls, values):
             raise ValueError("To use `scaling`, `replicas` must be set to a range.")
         return values
 
+    @validator("rate_limits")
+    def validate_rate_limits(cls, v: list[RateLimit]) -> list[RateLimit]:
+        counts = Counter(limit.prefix for limit in v)
+        duplicates = [prefix for prefix, count in counts.items() if count > 1]
+        if duplicates:
+            raise ValueError(
+                f"Prefixes {duplicates} are used more than once."
+                " Each rate limit should have a unique path prefix"
+            )
+        return v
+
 
 class ServiceConfiguration(
     ProfileParams, BaseRunConfigurationWithCommands, ServiceConfigurationParams
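
The two prefix rules in this file (each prefix may appear only once, and the longest matching prefix wins) can be illustrated without pydantic. The sketch below is an assumption-laden illustration: `find_duplicate_prefixes` mirrors the `Counter` logic of `validate_rate_limits`, while `longest_matching_prefix` is a hypothetical helper that imitates the matching behavior described in the `prefix` field description. Neither function is part of dstack.

```python
from collections import Counter
from typing import List, Optional


def find_duplicate_prefixes(prefixes: List[str]) -> List[str]:
    """Return the prefixes that occur more than once, in first-seen order."""
    counts = Counter(prefixes)
    return [prefix for prefix, count in counts.items() if count > 1]


def longest_matching_prefix(path: str, prefixes: List[str]) -> Optional[str]:
    """Return the longest prefix that the path starts with, or None if none match."""
    matches = [prefix for prefix in prefixes if path.startswith(prefix)]
    return max(matches, key=len, default=None)


configured = ["/api/auth/", "/"]
print(find_duplicate_prefixes(configured + ["/"]))
print(longest_matching_prefix("/api/auth/login", configured))
print(longest_matching_prefix("/index.html", configured))
```

Rejecting duplicates up front keeps the longest-prefix rule unambiguous: two limits with the same prefix would otherwise compete for the same requests.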
 
@@ -1,3 +1,7 @@
+{% for zone in limit_req_zones %}
+limit_req_zone {{ zone.key }} zone={{ zone.name }}:10m rate={{ zone.rpm }}r/m;
+{% endfor %}
+
 {% if replicas %}
 upstream {{ run_name }} {
     {% for replica in replicas %}
@@ -9,21 +13,27 @@ upstream {{ run_name }} {
 {% endif %}
 server {
     server_name {{ domain }};
-
+    limit_req_status 429;
     access_log {{ access_log_path }} dstack_stat;
     client_max_body_size {{ client_max_body_size }};
 
-    location / {
+    {% for location in locations %}
+    location {{ location.prefix }} {
         {% if auth %}
-        auth_request /auth;
+        auth_request /_dstack_auth;
         {% endif %}
 
         {% if replicas %}
         try_files /nonexistent @$http_upgrade;
         {% else %}
         return 503;
         {% endif %}
+
+        {% if location.limit_req %}
+        limit_req zone={{ location.limit_req.zone }}{% if location.limit_req.burst %} burst={{ location.limit_req.burst }} nodelay{% endif %};
+        {% endif %}
     }
+    {% endfor %}
 
     {% if replicas %}
     location @websocket {
@@ -44,7 +54,7 @@ server {
     {% endif %}
 
     {% if auth %}
-    location = /auth {
+    location = /_dstack_auth {
         internal;
         if ($remote_addr = 127.0.0.1) {
             return 200;
 
@@ -30,6 +30,7 @@ async def register_service(
         run_name=body.run_name.lower(),
         domain=body.domain.lower(),
         https=body.https,
+        rate_limits=body.rate_limits,
         auth=body.auth,
         client_max_body_size=body.client_max_body_size,
         model=body.options.openai.model if body.options.openai is not None else None,
 
@@ -3,6 +3,7 @@
 from pydantic import BaseModel, Field
 
 from dstack._internal.core.models.instances import SSHConnectionParams
+from dstack._internal.proxy.lib.models import RateLimit
 
 
 class BaseChatModel(BaseModel):
@@ -42,6 +43,7 @@ class RegisterServiceRequest(BaseModel):
     client_max_body_size: int
     options: Options
     ssh_private_key: str
+    rate_limits: tuple[RateLimit, ...] = ()
 
 
 class RegisterReplicaRequest(BaseModel):
 
@@ -3,6 +3,7 @@
 import tempfile
 from asyncio import Lock
 from pathlib import Path
+from typing import Optional
 
 import jinja2
 from pydantic import BaseModel
@@ -38,13 +39,31 @@ class ReplicaConfig(BaseModel):
     socket: Path
 
 
+class LimitReqZoneConfig(BaseModel):
+    name: str
+    key: str
+    rpm: int
+
+
+class LimitReqConfig(BaseModel):
+    zone: str
+    burst: int
+
+
+class LocationConfig(BaseModel):
+    prefix: str
+    limit_req: Optional[LimitReqConfig]
+
+
 class ServiceConfig(SiteConfig):
     type: Literal["service"] = "service"
     project_name: str
     run_name: str
     auth: bool
     client_max_body_size: int
     access_log_path: Path
+    limit_req_zones: list[LimitReqZoneConfig]
+    locations: list[LocationConfig]
     replicas: list[ReplicaConfig]
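
The diff shows the new config models but not the code that fills them, so the following is only a sketch of how rate limits might be translated into zones and locations for the template. Everything here is an assumption: the dataclasses stand in for the pydantic models, the zone naming scheme and the rounding to requests per minute are invented for illustration, and the key mapping uses the standard Nginx variables `$binary_remote_addr` and `$http_<name>`.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Zone:  # stand-in for LimitReqZoneConfig
    name: str
    key: str
    rpm: int


@dataclass
class Location:  # stand-in for LocationConfig with LimitReqConfig flattened in
    prefix: str
    zone: Optional[str] = None
    burst: int = 0


def nginx_key(key: dict) -> str:
    """Map a partitioning key to an Nginx variable (assumed mapping)."""
    if key.get("type") == "header":
        # Nginx exposes request headers as $http_<name>: lowercase, dashes as underscores
        return "$http_" + key["header"].lower().replace("-", "_")
    return "$binary_remote_addr"


def build_nginx_config(run_name: str, rate_limits: List[dict]) -> Tuple[List[Zone], List[Location]]:
    zones, locations = [], []
    for i, limit in enumerate(rate_limits):
        zone_name = f"{run_name}-{i}"  # hypothetical naming scheme
        key = nginx_key(limit.get("key", {"type": "ip_address"}))
        # The template expresses the rate in requests per minute
        zones.append(Zone(zone_name, key, round(limit["rps"] * 60)))
        locations.append(Location(limit.get("prefix", "/"), zone_name, limit.get("burst", 0)))
    # Keep a catch-all location so unmatched paths are still served
    if all(location.prefix != "/" for location in locations):
        locations.append(Location("/"))
    return zones, locations


zones, locations = build_nginx_config(
    "my-service",
    [{"prefix": "/api/auth/", "rps": 1}, {"rps": 4, "burst": 9}],
)
for zone in zones:
    print(zone)
for location in locations:
    print(location)
```

Converting to requests per minute is what allows fractional `rps` values down to `1 / 60` to be expressed as a whole number in the `rate=...r/m` directive.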