93 changes: 3 additions & 90 deletions docs/cody/enterprise/completions-configuration.mdx
@@ -87,62 +87,9 @@ For `endpoint`, you can either:

For `accessToken`, you can either:

- Leave it empty and rely on instance role bindings or other AWS configurations in the `frontend` service
- Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>` if directly configuring the credentials
- Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>:<SESSION_TOKEN>` if a session token is also required

#### AWS Bedrock: Latency optimization

<Callout type="note">Latency optimization for AWS Bedrock is available in Sourcegraph v6.5 and later.</Callout>

AWS Bedrock supports [Latency Optimized Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/latency-optimized-inference.html), which can reduce autocomplete latency with models like Claude 3.5 Haiku by up to ~40%.

To use Bedrock's latency-optimized inference feature for a specific model with Cody, set `"latencyOptimization": "optimized"` under the `serverSideConfig` of any model in `modelOverrides`. For example:

```json
"modelOverrides": [
{
"modelRef": "aws-bedrock::v1::claude-3-5-haiku-latency-optimized",
"modelName": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
"displayName": "Claude 3.5 Haiku (latency optimized)",
"capabilities": [
"chat",
"autocomplete"
],
"category": "speed",
"status": "stable",
"contextWindow": {
"maxInputTokens": 200000,
"maxOutputTokens": 4096
},
"serverSideConfig": {
"type": "awsBedrock",
"latencyOptimization": "optimized"
}
},
{
"modelRef": "aws-bedrock::v1::claude-3-5-haiku",
"modelName": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
"displayName": "Claude 3.5 Haiku",
"capabilities": [
"chat",
"autocomplete"
],
"category": "speed",
"status": "stable",
"contextWindow": {
"maxInputTokens": 200000,
"maxOutputTokens": 4096
},
"serverSideConfig": {
"type": "awsBedrock",
"latencyOptimization": "standard"
}
}
]
```

See also [Debugging: running a latency test](#debugging-running-a-latency-test).
- Leave it empty and rely on instance role bindings or other AWS configurations in the `frontend` service
- Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>` if directly configuring the credentials
- Set it to `<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>:<SESSION_TOKEN>` if a session token is also required
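
For illustration, directly configured credentials might look like the following minimal sketch. The surrounding `completions` provider block, the `endpoint` URL, and the key values are placeholders assumed for this example; adapt them to wherever your Bedrock provider is configured:

```json
{
  "completions": {
    "provider": "aws-bedrock",
    "endpoint": "https://bedrock-runtime.us-west-2.amazonaws.com",
    "accessToken": "<ACCESS_KEY_ID>:<SECRET_ACCESS_KEY>"
  }
}
```

Appending `:<SESSION_TOKEN>` to the same value covers the session-token case, and omitting `accessToken` entirely falls back to instance role bindings, as described above.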

### Example: Using GCP Vertex AI

@@ -237,37 +184,3 @@ To enable StarCoder, go to **Site admin > Site configuration** (`/site-admin/con
```

Users of the Cody extensions will automatically pick up this change when connected to your Enterprise instance.

## Debugging: Running a latency test

<Callout type="note">Debugging latency-optimized inference is supported in Sourcegraph v6.5 and later.</Callout>

Site administrators can test completions latency by sending a special debug command in any Cody chat window (on the web, in the editor, and so on):

```shell
cody_debug:::{"latencytest": 100}
```

Cody then performs `100` quick `Hello, please respond with a short message.` requests against the LLM model selected in the dropdown and measures the time until the first streaming event comes back (for example, the first token from the model). It records the timing of every request and then responds with a report like the following, indicating the latency between the Sourcegraph `frontend` container and the LLM API:

```shell
Starting latency test with 10 requests...

Individual timings:

[... how long each request took ...]

Summary:

* Requests: 10/10 successful
* Average: 882ms
* Minimum: 435ms
* Maximum: 1.3s
```

This can be helpful for getting a feel for the latency of particular models, or of the same model under different configurations, such as when using the AWS Bedrock Latency Optimized Inference feature.

A few important considerations:

- Debug commands are only available to site administrators and have no effect when used by regular users.
- Sourcegraph's built-in Grafana monitoring also has a full `Completions` dashboard for monitoring LLM requests, performance, etc.
87 changes: 87 additions & 0 deletions docs/cody/enterprise/model-config-examples.mdx
Expand Up @@ -792,4 +792,91 @@ Provisioned throughput for Amazon Bedrock models can be configured using the `"a
](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_InstanceMetadataOptionsRequest.html#:~:text=HttpPutResponseHopLimit) instance metadata option to a higher value (e.g., 2) to ensure that the metadata service can be accessed from the frontend container running in the EC2 instance. See [here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-IMDS-existing-instances.html) for instructions.
</Callout>
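
For reference, raising the hop limit typically amounts to a single AWS CLI call along the lines of the sketch below (the instance ID is a placeholder; follow the linked AWS instructions for the authoritative steps):

```shell
# Allow the instance metadata service to be reached from one extra network hop away,
# e.g. from the frontend container running on the EC2 instance.
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-put-response-hop-limit 2 \
  --http-endpoint enabled
```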

## AWS Bedrock: Latency optimization

<Callout type="note">Latency optimization for AWS Bedrock is available in Sourcegraph v6.5 and later.</Callout>

AWS Bedrock supports [Latency Optimized Inference](https://docs.aws.amazon.com/bedrock/latest/userguide/latency-optimized-inference.html), which can reduce autocomplete latency with models like Claude 3.5 Haiku by up to ~40%.

To use Bedrock's latency-optimized inference feature for a specific model with Cody, set `"latencyOptimization": "optimized"` under the `serverSideConfig` of any model in `modelOverrides`. For example:

```json
"modelOverrides": [
{
"modelRef": "aws-bedrock::v1::claude-3-5-haiku-latency-optimized",
"modelName": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
"displayName": "Claude 3.5 Haiku (latency optimized)",
"capabilities": [
"chat",
"autocomplete"
],
"category": "speed",
"status": "stable",
"contextWindow": {
"maxInputTokens": 200000,
"maxOutputTokens": 4096
},
"serverSideConfig": {
"type": "awsBedrock",
"latencyOptimization": "optimized"
}
},
{
"modelRef": "aws-bedrock::v1::claude-3-5-haiku",
"modelName": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
"displayName": "Claude 3.5 Haiku",
"capabilities": [
"chat",
"autocomplete"
],
"category": "speed",
"status": "stable",
"contextWindow": {
"maxInputTokens": 200000,
"maxOutputTokens": 4096
},
"serverSideConfig": {
"type": "awsBedrock",
"latencyOptimization": "standard"
}
}
]
```

See also [Debugging: running a latency test](#debugging-running-a-latency-test).

### Debugging: Running a latency test

<Callout type="note">Debugging latency-optimized inference is supported in Sourcegraph v6.5 and later.</Callout>

Site administrators can test completions latency by sending a special debug command in any Cody chat window (on the web, in the editor, and so on):

```shell
cody_debug:::{"latencytest": 100}
```

Cody then performs `100` quick `Hello, please respond with a short message.` requests against the LLM model selected in the dropdown and measures the time until the first streaming event comes back (for example, the first token from the model). It records the timing of every request and then responds with a report like the following, indicating the latency between the Sourcegraph `frontend` container and the LLM API:

```shell
Starting latency test with 10 requests...

Individual timings:

[... how long each request took ...]

Summary:

* Requests: 10/10 successful
* Average: 882ms
* Minimum: 435ms
* Maximum: 1.3s
```

This can be helpful for getting a feel for the latency of particular models, or of the same model under different configurations, such as when using the AWS Bedrock Latency Optimized Inference feature.

A few important considerations:

- Debug commands are only available to site administrators and have no effect when used by regular users.
- Sourcegraph's built-in Grafana monitoring also has a full `Completions` dashboard for monitoring LLM requests, performance, etc.

</Accordion>