diff --git a/docs/cody/enterprise/completions-configuration.mdx b/docs/cody/enterprise/completions-configuration.mdx
index cdab972ab..de561e1c1 100644
--- a/docs/cody/enterprise/completions-configuration.mdx
+++ b/docs/cody/enterprise/completions-configuration.mdx
@@ -91,15 +91,58 @@ For `accessToken`, you can either:
 - Set it to `:` if directly configuring the credentials
 - Set it to `::` if a session token is also required
 
-
- We only recommend configuring AWS Bedrock to use an accessToken for
- authentication. Specifying no accessToken (e.g. to use [IAM roles for EC2 /
- instance role
- binding](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html))
- is not currently recommended (there is a known performance bug with this
- method which will prevent autocomplete from working correctly. (internal
- issue: PRIME-662)
-
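+For illustration only, a minimal sketch of the two token formats, assuming the `completions` provider block used elsewhere on this page (the angle-bracketed placeholders are ours, not literal values):
+
+```json
+"completions": {
+  "provider": "aws-bedrock",
+  // Static credentials, joined with a colon (illustrative placeholders):
+  "accessToken": "<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>"
+  // Or, when a temporary session token is also required:
+  // "accessToken": "<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>:<SESSION_TOKEN>"
+}
+```
+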
+#### AWS Bedrock: Latency optimization
+
+Latency optimization for AWS Bedrock is available in Sourcegraph v6.5 and later.
+
+AWS Bedrock supports [Latency Optimized Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/latency-optimized-inference.html), which can reduce autocomplete latency with models like Claude 3.5 Haiku by up to ~40%.
+
+To use Bedrock's latency-optimized inference feature for a specific model with Cody, configure the `"latencyOptimization": "optimized"` setting under the `serverSideConfig` of any model in `modelOverrides`. For example:
+
+```json
+"modelOverrides": [
+  {
+    "modelRef": "aws-bedrock::v1::claude-3-5-haiku-latency-optimized",
+    "modelName": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
+    "displayName": "Claude 3.5 Haiku (latency optimized)",
+    "capabilities": [
+      "chat",
+      "autocomplete"
+    ],
+    "category": "speed",
+    "status": "stable",
+    "contextWindow": {
+      "maxInputTokens": 200000,
+      "maxOutputTokens": 4096
+    },
+    "serverSideConfig": {
+      "type": "awsBedrock",
+      "latencyOptimization": "optimized"
+    }
+  },
+  {
+    "modelRef": "aws-bedrock::v1::claude-3-5-haiku",
+    "modelName": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
+    "displayName": "Claude 3.5 Haiku",
+    "capabilities": [
+      "chat",
+      "autocomplete"
+    ],
+    "category": "speed",
+    "status": "stable",
+    "contextWindow": {
+      "maxInputTokens": 200000,
+      "maxOutputTokens": 4096
+    },
+    "serverSideConfig": {
+      "type": "awsBedrock",
+      "latencyOptimization": "standard"
+    }
+  }
+]
+```
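+
+If you also want autocomplete to use the latency-optimized override by default, it can be referenced from `defaultModels`. A minimal sketch, assuming the `modelConfiguration`-based setup (other keys omitted; adjust to your own configuration):
+
+```json
+"modelConfiguration": {
+  "modelOverrides": [
+    // ... the two Claude 3.5 Haiku entries shown above ...
+  ],
+  "defaultModels": {
+    // Route autocomplete to the latency-optimized override (illustrative):
+    "codeCompletion": "aws-bedrock::v1::claude-3-5-haiku-latency-optimized"
+  }
+}
+```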
+
+See also [Debugging: running a latency test](#debugging-running-a-latency-test).
 
 ### Example: Using GCP Vertex AI
 
@@ -194,3 +237,37 @@ To enable StarCoder, go to **Site admin > Site configuration** (`/site-admin/con
 ```
 
 Users of the Cody extensions will automatically pick up this change when connected to your Enterprise instance.
+
+## Debugging: Running a latency test
+
+Debugging latency-optimized inference is supported in Sourcegraph v6.5 and later.
+
+Site administrators can test completions latency by sending a special debug command in any Cody chat window (in the web UI, in the editor, etc.):
+
+```shell
+cody_debug:::{"latencytest": 10}
+```
+
+Cody will then perform `10` quick `Hello, please respond with a short message.` requests to the LLM model selected in the dropdown and measure the time taken to receive the first streaming event (for example, the first token from the model). It records the timing of each request and then responds with a report showing the latency between the Sourcegraph `frontend` container and the LLM API:
+
+```shell
+Starting latency test with 10 requests...
+
+Individual timings:
+
+[... how long each request took ...]
+
+Summary:
+
+* Requests: 10/10 successful
+* Average: 882ms
+* Minimum: 435ms
+* Maximum: 1.3s
+```
+
+This can be helpful for getting a feel for the latency of particular models, or of models with different configurations, such as when using the AWS Bedrock Latency Optimized Inference feature.
+
+A few important considerations:
+
+- Debug commands are only available to site administrators and have no effect when used by regular users.
+- Sourcegraph's built-in Grafana monitoring also has a full `Completions` dashboard for monitoring LLM requests, performance, and more.
diff --git a/docs/cody/enterprise/model-config-examples.mdx b/docs/cody/enterprise/model-config-examples.mdx
index 99671fc94..a138bcc8a 100644
--- a/docs/cody/enterprise/model-config-examples.mdx
+++ b/docs/cody/enterprise/model-config-examples.mdx
@@ -792,14 +792,4 @@ Provisioned throughput for Amazon Bedrock models can be configured using the `"a
 ](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_InstanceMetadataOptionsRequest.html#:~:text=HttpPutResponseHopLimit)
 instance metadata option to a higher value (e.g., 2) to ensure that the metadata service can be accessed
 from the frontend container running in the EC2 instance.
 See [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-IMDS-existing-instances.html) for instructions.
-
- We only recommend configuring AWS Bedrock to use an accessToken for
- authentication. Specifying no accessToken (e.g. to use [IAM roles for EC2 /
- instance role
- binding](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html))
- is not currently recommended. There is a known performance bug with this
- method which will prevent autocomplete from working correctly (internal
- issue: CORE-819)
-
-
diff --git a/public/llms.txt b/public/llms.txt
index 96fc55723..89c54c5e5 100644
--- a/public/llms.txt
+++ b/public/llms.txt
@@ -15668,16 +15668,6 @@ Provisioned throughput for Amazon Bedrock models can be configured using the `"a
 ](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_InstanceMetadataOptionsRequest.html#:~:text=HttpPutResponseHopLimit)
 instance metadata option to a higher value (e.g., 2) to ensure that the metadata service can be accessed
 from the frontend container running in the EC2 instance.
 See [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-IMDS-existing-instances.html) for instructions.
-
- We only recommend configuring AWS Bedrock to use an accessToken for
- authentication. Specifying no accessToken (e.g. to use [IAM roles for EC2 /
- instance role
- binding](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html))
- is not currently recommended. There is a known performance bug with this
- method which will prevent autocomplete from working correctly (internal
- issue: CORE-819)
-
-
@@ -15897,16 +15887,6 @@ For `accessToken`, you can either:
 - Set it to `:` if directly configuring the credentials
 - Set it to `::` if a session token is also required
-
- We only recommend configuring AWS Bedrock to use an accessToken for
- authentication. Specifying no accessToken (e.g. to use [IAM roles for EC2 /
- instance role
- binding](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html))
- is not currently recommended (there is a known performance bug with this
- method which will prevent autocomplete from working correctly. (internal
- issue: PRIME-662)
-
-
 
 ### Example: Using GCP Vertex AI
 
 On [GCP Vertex](https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-claude), we only support Anthropic Claude models.