---
title: Optimize Apache Kafka® performance
---

Follow these best practices to optimize the performance and reliability of your Aiven for Apache Kafka® service.

## Check your topic replication factors

Apache Kafka uses replication between brokers to protect data in case of node failures.
The replication factor (RF) determines how many copies of each partition are maintained
across the cluster.

Evaluate the importance of each topic and set a replication factor that balances
durability requirements with cost and performance. An RF of 3 is recommended
for production because it improves durability and availability. In multi-AZ deployments,
replication traffic across availability zones can increase network costs, especially
for high-throughput workloads.

For Diskless Topics architecture and considerations, see
[Diskless Topics overview](/docs/products/kafka/diskless/concepts/diskless-overview).

Set the replication factor when creating or editing a
[topic](/docs/products/kafka/howto/create-topic) in the
[Aiven Console](https://console.aiven.io/).
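
Alternatively, you can set it with the [Aiven CLI](/docs/tools/cli). A minimal
sketch, using a placeholder service and topic name:

```bash
# Create a topic with replication factor 3 (SERVICE_NAME and example-topic are placeholders).
avn service topic-create SERVICE_NAME \
  --topic example-topic \
  --partitions 3 \
  --replication 3
```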

:::note
Replication factors below 2 are not allowed, to prevent data loss from unexpected node
terminations.
:::

## Choose a reasonable number of partitions for a topic

Too few partitions can create processing bottlenecks. A single partition processes
messages sequentially, which limits throughput. Too many partitions increase overhead
and reduce cluster efficiency. Because partition counts cannot be reduced, start with a
number that supports parallel processing and increase it as needed.

A maximum of 4,000 partitions per broker and 200,000 per cluster is recommended.
For details, see this
[Apache Kafka blog post](https://blogsarchive.apache.org/kafka/entry/apache-kafka-supports-more-partitions).
Keep the total number of topics under 7,000.

:::note
Ordering is guaranteed only within a partition. To maintain ordering of related records,
place them in the same partition.
:::
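
Because partition counts can only grow, a common pattern is to create a topic with a
modest count and raise it once consumer parallelism requires it. A minimal sketch using
the [Aiven CLI](/docs/tools/cli), with placeholder names:

```bash
# Start with a modest partition count (SERVICE_NAME and example-topic are placeholders).
avn service topic-create SERVICE_NAME --topic example-topic --partitions 3 --replication 3

# Increase the partition count later as needed; it can never be decreased.
avn service topic-update SERVICE_NAME --topic example-topic --partitions 12
```

Keep in mind that adding partitions changes how keys map to partitions, which affects
ordering for keyed topics.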

## Check entity-based partitions for imbalances

Partitioning messages by an entity identifier, such as a user ID, can create imbalanced
partitions. This results in uneven load distribution and reduces parallel processing
efficiency.

You can view the size of each partition by selecting the
[topic](/docs/products/kafka/howto/create-topic) in the Topics list and opening
the **Partitions** tab in the [Aiven Console](https://console.aiven.io/).
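
To see how entity keys map to partitions, you can produce keyed records with Kafka's
console producer. All records that share a key land in the same partition, so one highly
active entity concentrates load on a single partition. A minimal sketch, with placeholder
connection details:

```bash
# Produce a keyed record; records with the same key always go to the same partition.
# BROKER_HOST:PORT and client.properties (TLS settings) are placeholders.
echo "user-42:clicked-button" | kafka-console-producer.sh \
  --bootstrap-server BROKER_HOST:PORT \
  --producer.config client.properties \
  --topic example-topic \
  --property parse.key=true \
  --property key.separator=:
```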

## Balance between throughput and latency

Adjust producer and consumer batch sizes to balance throughput and latency. Larger
batches increase throughput but add latency. Smaller batches reduce latency but increase
overhead, which can lower throughput.

Settings such as `batch.size` and `linger.ms` can be configured in the producer. For
more details, refer to the
[Apache Kafka documentation](https://kafka.apache.org/documentation/).
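
One way to explore this trade-off is Kafka's bundled producer performance tool: vary
`batch.size` and `linger.ms` and compare the reported throughput and latency. A minimal
sketch, with placeholder connection details:

```bash
# Larger batch.size with a small linger.ms favors throughput; lowering both favors latency.
# BROKER_HOST:PORT and client.properties are placeholders.
kafka-producer-perf-test.sh \
  --topic example-topic \
  --num-records 100000 \
  --record-size 1024 \
  --throughput -1 \
  --producer.config client.properties \
  --producer-props bootstrap.servers=BROKER_HOST:PORT batch.size=131072 linger.ms=20
```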

## Configure acknowledgments for received data

The `acks` parameter in the producer configuration controls how write operations are
acknowledged. Choose a setting that matches your reliability requirements, as shown in
the example after this list:

- **`acks=0`**: The producer does not wait for confirmation. This minimizes latency but
increases the risk of data loss if the broker fails during transmission. Use this
setting only when some data loss is acceptable.

- **`acks=1`** (default and recommended): The producer waits for the leader broker to
confirm receipt. This reduces the risk of data loss but does not protect against
leader failure before replication completes.

- **`acks=all`**: The producer waits for acknowledgment from the leader and all in-sync
replicas. This prevents data loss but increases latency.
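
For example, to require acknowledgment from the leader and all in-sync replicas when
producing from the console producer (placeholder connection details):

```bash
# acks=all waits for the leader and all in-sync replicas to acknowledge each record.
# BROKER_HOST:PORT and client.properties are placeholders.
kafka-console-producer.sh \
  --bootstrap-server BROKER_HOST:PORT \
  --producer.config client.properties \
  --topic example-topic \
  --producer-property acks=all
```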

## Configure single availability zone (AZ) for BYOC

Deploying Aiven for Apache Kafka in a single availability zone (AZ) reduces inter-zone
data transfer costs. Single-AZ deployment places all brokers and replicas in one
failure domain, so the cluster cannot tolerate an AZ outage. If the zone becomes
unavailable, the service cannot recover until the zone is restored.

:::note
Before enabling this configuration, contact your account team to discuss your use case
and agree on the reduced SLA. The standard uptime SLA does not apply to services
deployed in a single AZ.
:::

### Replication factor considerations in a single AZ

**Replication factor 1 (RF=1):**
Creates a single copy of each partition. In a single-AZ deployment, a broker or AZ
failure results in data loss. Use RF=1 only when losing data is acceptable.

**Replication factor 3 (RF=3):**
Protects against individual broker failures. It does not protect against an AZ failure
when all replicas are in the same zone. If the AZ becomes unavailable, all replicas can
be lost, and the cluster cannot recover until the zone is restored.

### When to use single AZ

Avoid single-AZ deployment for production workloads or any data that cannot be
recreated. Use single AZ only for:

- Development, QA, or test workloads
- Temporary proof-of-concept environments
- Workloads where data can be recreated

### Risks and considerations

- All brokers and replicas exist within one failure domain, increasing the impact of an
AZ outage.
- Recovery options are limited because the cluster cannot fail over to another zone.
- Service downtime may increase during an AZ failure because no cross-zone redundancy
exists.
- SLA terms for single-AZ deployments must be agreed with your account team.

### Enable single-AZ allocation

Single-AZ allocation must be configured during service creation. It cannot be enabled
for existing Kafka services.

To enable this option for your project, contact [Aiven support](mailto:[email protected])
or your account team.

To create a single-AZ Kafka service using the [Aiven CLI](/docs/tools/cli),
set `single_zone.enabled=true`. For example, with placeholder plan and cloud values:

```bash
# PLAN and CLOUD_REGION are placeholders; replace them with your plan and region.
avn service create SERVICE_NAME \
  --service-type kafka \
  --plan PLAN \
  --cloud CLOUD_REGION \
  -c single_zone.enabled=true
```