From f07e2efe075a22853ece1e6449a7740fbfdf700f Mon Sep 17 00:00:00 2001
From: Maedah Batool
Date: Thu, 26 Jun 2025 15:38:11 -0700
Subject: [PATCH] Move AWS latency optimization to model-config-examples docs

---
 .../enterprise/completions-configuration.mdx  | 93 +------------------
 .../cody/enterprise/model-config-examples.mdx | 87 +++++++++++++++++
 2 files changed, 90 insertions(+), 90 deletions(-)

diff --git a/docs/cody/enterprise/completions-configuration.mdx b/docs/cody/enterprise/completions-configuration.mdx
index 46e6e0bb2..952876b9f 100644
--- a/docs/cody/enterprise/completions-configuration.mdx
+++ b/docs/cody/enterprise/completions-configuration.mdx
@@ -87,62 +87,9 @@ For `endpoint`, you can either:
 
 For `accessToken`, you can either:
 
-- Leave it empty and rely on instance role bindings or other AWS configurations in the `frontend` service
-- Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>` if directly configuring the credentials
-- Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>:<SESSION_TOKEN>` if a session token is also required
-
-#### AWS Bedrock: Latency optimization
-
-Optimization for latency with AWS Bedrock is available in Sourcegraph v6.5 and more.
-
-AWS Bedrock supports [Latency Optimized Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/latency-optimized-inference.html) which can reduce autocomplete latency with models like Claude 3.5 Haiku by up to ~40%.
-
-To use Bedrock's latency optimized inference feature for a specific model with Cody, configure the `"latencyOptimization": "optimized"` setting under the `serverSideConfig` of any model in `modelOverrides`. For example:
-
-```json
-"modelOverrides": [
-  {
-    "modelRef": "aws-bedrock::v1::claude-3-5-haiku-latency-optimized",
-    "modelName": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
-    "displayName": "Claude 3.5 Haiku (latency optimized)",
-    "capabilities": [
-      "chat",
-      "autocomplete"
-    ],
-    "category": "speed",
-    "status": "stable",
-    "contextWindow": {
-      "maxInputTokens": 200000,
-      "maxOutputTokens": 4096
-    },
-    "serverSideConfig": {
-      "type": "awsBedrock",
-      "latencyOptimization": "optimized"
-    }
-  },
-  {
-    "modelRef": "aws-bedrock::v1::claude-3-5-haiku",
-    "modelName": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
-    "displayName": "Claude 3.5 Haiku",
-    "capabilities": [
-      "chat",
-      "autocomplete"
-    ],
-    "category": "speed",
-    "status": "stable",
-    "contextWindow": {
-      "maxInputTokens": 200000,
-      "maxOutputTokens": 4096
-    },
-    "serverSideConfig": {
-      "type": "awsBedrock",
-      "latencyOptimization": "standard"
-    }
-  }
-]
-```
-
-See also [Debugging: running a latency test](#debugging-running-a-latency-test).
+- Leave it empty and rely on instance role bindings or other AWS configurations in the `frontend` service
+- Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>` if directly configuring the credentials
+- Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>:<SESSION_TOKEN>` if a session token is also required
 
 ### Example: Using GCP Vertex AI
@@ -237,37 +184,3 @@ To enable StarCoder, go to **Site admin > Site configuration** (`/site-admin/con
 ```
 
 Users of the Cody extensions will automatically pick up this change when connected to your Enterprise instance.
-
-## Debugging: Running a latency test
-
-Debugging latency optimizated inference is supported in Sourcegraph v6.5 and more.
-
-Site administrators can test completions latency by sending a special debug command in any Cody chat window (in the web, in the editor, etc.):
-
-```shell
-cody_debug:::{"latencytest": 100}
-```
-
-Cody will then perform `100` quick `Hello, please respond with a short message.` requests to the LLM model selected in the dropdown, and measure the time taken to get the first streaming event back (for example first token from the model.) It records all of these requests timing information, and then responds with a report indicating the latency between the Sourcegraph `frontend` container and the LLM API:
-
-```shell
-Starting latency test with 10 requests...
-
-Individual timings:
-
-[... how long each request took ...]
-
-Summary:
-
-* Requests: 10/10 successful
-* Average: 882ms
-* Minimum: 435ms
-* Maximum: 1.3s
-```
-
-This can be helpful to get a feel for the latency of particular models, or models with different configurations - such as when using the AWS Bedrock Latency Optimized Inference feature.
-
-Few important considerations:
-
-- Debug commands are only available to site administrators and have no effect when used by regular users.
-- Sourcegraph's built-in Grafana monitoring also has a full `Completions` dashboard for monitoring LLM requests, performance, etc.
diff --git a/docs/cody/enterprise/model-config-examples.mdx b/docs/cody/enterprise/model-config-examples.mdx
index a138bcc8a..6ce3434f7 100644
--- a/docs/cody/enterprise/model-config-examples.mdx
+++ b/docs/cody/enterprise/model-config-examples.mdx
@@ -792,4 +792,91 @@ Provisioned throughput for Amazon Bedrock models can be configured using the `"a
 ](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_InstanceMetadataOptionsRequest.html#:~:text=HttpPutResponseHopLimit) instance metadata option to a higher value (e.g., 2) to ensure that the metadata service can be accessed from the frontend container running in the EC2 instance. See [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-IMDS-existing-instances.html) for instructions.
 
+## AWS Bedrock: Latency optimization
+
+Latency optimization for AWS Bedrock is available in Sourcegraph v6.5 and later.
+
+AWS Bedrock supports [Latency Optimized Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/latency-optimized-inference.html), which can reduce autocomplete latency with models like Claude 3.5 Haiku by up to ~40%.
+
+To use Bedrock's latency-optimized inference feature for a specific model with Cody, configure the `"latencyOptimization": "optimized"` setting under the `serverSideConfig` of any model in `modelOverrides`. For example:
+
+```json
+"modelOverrides": [
+  {
+    "modelRef": "aws-bedrock::v1::claude-3-5-haiku-latency-optimized",
+    "modelName": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
+    "displayName": "Claude 3.5 Haiku (latency optimized)",
+    "capabilities": [
+      "chat",
+      "autocomplete"
+    ],
+    "category": "speed",
+    "status": "stable",
+    "contextWindow": {
+      "maxInputTokens": 200000,
+      "maxOutputTokens": 4096
+    },
+    "serverSideConfig": {
+      "type": "awsBedrock",
+      "latencyOptimization": "optimized"
+    }
+  },
+  {
+    "modelRef": "aws-bedrock::v1::claude-3-5-haiku",
+    "modelName": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
+    "displayName": "Claude 3.5 Haiku",
+    "capabilities": [
+      "chat",
+      "autocomplete"
+    ],
+    "category": "speed",
+    "status": "stable",
+    "contextWindow": {
+      "maxInputTokens": 200000,
+      "maxOutputTokens": 4096
+    },
+    "serverSideConfig": {
+      "type": "awsBedrock",
+      "latencyOptimization": "standard"
+    }
+  }
+]
+```
+
+See also [Debugging: running a latency test](#debugging-running-a-latency-test).
+
+### Debugging: Running a latency test
+
+Debugging latency-optimized inference is supported in Sourcegraph v6.5 and later.
+
+Site administrators can test completions latency by sending a special debug command in any Cody chat window (in the web, in the editor, etc.):
+
+```shell
+cody_debug:::{"latencytest": 100}
+```
+
+Cody will then perform `100` quick `Hello, please respond with a short message.` requests to the LLM model selected in the dropdown and measure the time taken to get the first streaming event back (for example, the first token from the model). It records the timing of each request and then responds with a report indicating the latency between the Sourcegraph `frontend` container and the LLM API. The report from a 10-request run looks like this:
+
+```shell
+Starting latency test with 10 requests...
+
+Individual timings:
+
+[... how long each request took ...]
+
+Summary:
+
+* Requests: 10/10 successful
+* Average: 882ms
+* Minimum: 435ms
+* Maximum: 1.3s
+```
+
+This can be helpful for getting a feel for the latency of particular models, or of models with different configurations, such as when using the AWS Bedrock Latency Optimized Inference feature.
+
+A few important considerations:
+
+- Debug commands are only available to site administrators and have no effect when used by regular users.
+- Sourcegraph's built-in Grafana monitoring also has a full `Completions` dashboard for monitoring LLM requests, performance, etc.
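+
+If you want to sanity-check Bedrock's latency-optimized inference outside of Sourcegraph, you can also measure time-to-first-token against Bedrock directly. The sketch below is illustrative only and is not part of any Sourcegraph configuration: it assumes AWS credentials with Bedrock access and a boto3 release recent enough to support the Converse API's `performanceConfig` parameter, it reuses the Claude 3.5 Haiku model ID from the example above, and the region is a placeholder you should replace.
+
+```python
+# Rough time-to-first-token comparison against Bedrock directly (not via Sourcegraph).
+import time
+
+import boto3  # assumes a version new enough to support performanceConfig on Converse
+
+MODEL_ID = "us.anthropic.claude-3-5-haiku-20241022-v1:0"  # model ID from the example above
+REGION = "us-west-2"  # placeholder: use the region your Bedrock models run in
+
+client = boto3.client("bedrock-runtime", region_name=REGION)
+
+
+def first_token_latency(latency_mode: str) -> float:
+    """Send one short prompt and return seconds until the first streamed event arrives."""
+    start = time.monotonic()
+    response = client.converse_stream(
+        modelId=MODEL_ID,
+        messages=[{"role": "user", "content": [{"text": "Hello, please respond with a short message."}]}],
+        inferenceConfig={"maxTokens": 32},
+        performanceConfig={"latency": latency_mode},  # "standard" or "optimized"
+    )
+    for _event in response["stream"]:
+        # The first streamed event is enough for a time-to-first-token measurement.
+        return time.monotonic() - start
+    raise RuntimeError("stream ended without any events")
+
+
+for mode in ("standard", "optimized"):
+    timings = [first_token_latency(mode) for _ in range(10)]
+    print(f"{mode}: avg {sum(timings) / len(timings) * 1000:.0f}ms over {len(timings)} requests")
+```
+
+Numbers from a script like this will not match the in-product latency test exactly, because the `cody_debug` test measures from the Sourcegraph `frontend` container rather than from your workstation, but it can help confirm that latency-optimized inference is actually available for the model and region you have configured.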