
## Overview

Deploy the Observability Pipelines Worker into your infrastructure, like you would any other service, to intercept, manipulate, and forward data to your destinations. Each Observability Pipelines Worker instance is designed to operate independently, allowing you to scale your architecture with load balancing.

This guide walks you through the recommended aggregator pattern for new Observability Pipelines Worker users, specifically:

- [Architecture models and approaches](#architecture)
- [Optimizing the instance](#optimize-the-instance) so you can horizontally scale the Observability Pipelines Worker aggregator.
- Starting points to estimate your resource capacity for [capacity planning and scaling](#capacity-planning-and-scaling) the Observability Pipelines Worker.

## Architecture

This section covers:

- Architecture models:
- [VM-based model](#vm-based-architecture)
- [Kubernetes-based model](#kubernetes-based-architecture)
- [Centralized vs decentralized approach](#centralized-vs-decentralized-approach)
- [Choosing a VM-based vs Kubernetes-based architecture](#choosing-a-vm-based-vs-kubernetes-based-architecture)

### Architecture models

There are two common architecture models:

- **Virtual-machine-based (VM-based) architecture**: A host-based model fronted by a load balancer.
- **Kubernetes-based architecture**: A container-based model. A Kubernetes service handles requests from inside the cluster; for sources external to the cluster, the Workers can optionally be fronted with an ingress controller or load balancer.

Both models can be applied to a centralized or decentralized approach. In a centralized approach, Workers operate on a global scale, across data centers or regions. In a decentralized approach, Workers operate on a local scale, in the region, data center, or cluster where the data source is located. For large-scale environments spanning many data centers, regions, or cloud provider accounts, a hybrid model may be appropriate.

Generally, Datadog recommends operating the Worker as close to the data source as possible. This might require more administrative and infrastructure overhead, but it reduces concerns about network transit issues and single points of failure.

For both models, Datadog recommends scaling Workers [horizontally][1] to handle increased load and maintain high availability. You can achieve this using a managed instance group (such as an autoscaling group) or horizontal pod autoscaling.

The Worker can also be scaled [vertically][2], which takes advantage of additional cores and memory without any additional configuration. For heavy processing use cases, or for certain processors such as the Sensitive Data Scanner processor with many rules enabled, the Worker benefits from additional cores for parallel thread execution. When vertically scaling, Datadog recommends capping an instance's size so that it processes no more than 33% of your total volume. This allows for high availability in the event of a node failure.

#### VM-based architecture

The following architecture diagram is for a host-based architecture, where a load balancer accepts traffic from push-based sources. If only pull-based sources are used, a load balancer is not required. In the diagram, the Worker is part of a managed instance group that scales based on processing needs. The Observability Pipelines Worker is almost always CPU-constrained, so CPU utilization is the strongest signal for autoscaling because CPU utilization metrics do not produce false positives.

{{< img src="observability_pipelines/scaling_best_practices/vm-infra.png" alt="Diagram showing the Worker as part of a managed instance group" style="width:100%;" >}}

#### Kubernetes-based architecture

The following architecture diagram is for a container-based architecture, where the Kubernetes service acts as the router to the statefulset and accepts traffic from push-based sources. If you are sending telemetry from outside the cluster, set the [service.type to `LoadBalancer`][3] or install an [ingress controller][4] and configure an [ingress][5] for routing. The Worker runs as part of a statefulset and supports horizontal pod autoscaling to adjust capacity based on processing needs. Like the VM-based architecture, Workers can also scale vertically and take advantage of multiple cores for parallel processing.

{{< img src="observability_pipelines/scaling_best_practices/containerized-infra.png" alt="Diagram showing the Worker as part of a statefulset" style="width:100%;" >}}
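To expose the Worker to push-based sources outside the cluster through the Helm chart, a minimal values sketch might look like the following. Only `service.type` comes from this guide; the comments and the surrounding layout are illustrative, so verify the field names against your chart version's `values.yaml`:

```yaml
# Sketch of Helm values for the observability-pipelines-worker chart.
# Verify field names against your chart version before applying.
service:
  # ClusterIP (the default) serves in-cluster sources;
  # LoadBalancer exposes the Worker to sources outside the cluster.
  type: LoadBalancer
```

Alternatively, as described above, keep the service internal and route external traffic through an ingress controller and an ingress resource.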

### Choosing a VM-based vs Kubernetes-based architecture

Choose the Kubernetes-based architecture if:

- Your log sources are within a Kubernetes cluster and you want to use the decentralized approach
- Your organization uses Kubernetes heavily and is proficient with it

Choose the VM-based architecture if your organization is more VM-centric and not proficient with Kubernetes.

Choosing between the two models comes down to what your organization is best equipped to do from an infrastructure perspective. Each model offers the ability to automatically scale based on CPU utilization, which is generally the primary constraint for Observability Pipelines. See [Optimize the instance][6] for more information.

### Centralized vs decentralized approach

Datadog recommends the decentralized approach of deploying the Workers as close to the data source as possible. This means placing Workers within each location where the data originates, such as the region, cluster, or data center. The decentralized model is better for environments with large volumes of data because it:

- Minimizes cross-region or cross-data-center network transit

- Avoids potential performance issues related to inter-region or inter-account data transfer
- Helps reduce data transfer costs by keeping processing local to the data sources

A centralized deployment runs Workers in a single location, aggregating data from multiple regions, clusters, or data centers. This approach works best for lower data volumes or when network peering already exists. Be aware that high-volume data transfers across regions or accounts may incur additional costs.

A hybrid model is a good compromise between the decentralized and centralized approaches, particularly for large, widespread infrastructure deployments. For example, if you have six regions, each with 10 Kubernetes clusters, rather than:

- Deploying Workers into each cluster, which results in 60 deployments
- Deploying Workers into one region and routing traffic across regions, which introduces a single point of failure

A hybrid approach uses a dedicated Kubernetes cluster or managed instance group in each region, resulting in only six deployments. The 10 clusters within each region send their data to the regional Observability Pipelines Worker (OPW) deployment.

## Optimize the instance

### Instance sizing

Based on performance benchmarking for a pipeline that is using 12 processors to transform data, the Worker can handle approximately 1 TB per vCPU per day. For example, if you have 4 TB of events per day, you should provision enough compute plus headroom to account for your volumes. This could be three two-core machines or containers, or one six-core machine or container. Datadog recommends deploying Workers as part of an autoscaling group or deployed with [Horizontal Pod Autoscaling][7] enabled. Do not rely on a statically configured number of VMs or containers. This helps ensure you can safely handle traffic spikes without data loss and maintain high availability if a Worker goes down.
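The sizing arithmetic above can be sketched as follows. The 1 TB per vCPU per day figure is the benchmark cited in this guide; the 50% headroom factor is an illustrative assumption, not a Datadog recommendation:

```python
import math

def worker_vcpus_needed(daily_volume_tb: float,
                        tb_per_vcpu_per_day: float = 1.0,
                        headroom: float = 0.5) -> int:
    """Estimate total vCPUs for an Observability Pipelines Worker deployment.

    tb_per_vcpu_per_day reflects the benchmark in this guide (a pipeline
    with 12 processors); headroom is an illustrative buffer for spikes.
    """
    steady_state = daily_volume_tb / tb_per_vcpu_per_day
    return math.ceil(steady_state * (1 + headroom))

# 4 TB/day -> 4 vCPUs of steady-state capacity, 6 vCPUs with 50% headroom,
# for example three 2-core instances or one 6-core instance.
print(worker_vcpus_needed(4))  # -> 6
```

Spread the resulting cores across instances behind autoscaling rather than provisioning them statically, as described above.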

For high-throughput environments, Datadog recommends larger machine types because they typically have higher network bandwidth. Consult your cloud provider's documentation for details (for example, [Amazon EC2 instance network bandwidth][8]).

| Cloud provider | Recommendation (minimum) |
| -------------- | ------------------------ |

For push-based sources, front your Observability Pipelines Worker instances with a network load balancer and scale them up and down as needed.

{{< img src="observability_pipelines/production_deployment_overview/horizontal_scaling_push.png" alt="A diagram showing a cloud region broken down into agents, network load balancers, and an Observability Pipelines Worker aggregator, and the data from the agents are sent to the load balancer, Observability Pipelines Workers, and then to other destinations" style="width:60%;" >}}

A load balancer is not required for pull-based sources. Deploy the Observability Pipelines Worker and scale it up and down as needed. Your publish-subscribe system coordinates exclusive access to the data when the Observability Pipelines Worker asks to read it.

{{< img src="observability_pipelines/production_deployment_overview/horizontal_scaling_pull.png" alt="A diagram showing a cloud region broken down into agents, brokers, and an Observability Pipelines aggregator. Data from the agents are sent to the brokers, and then sent and received between the broker and the Observability Pipelines Workers, and then sent from the Workers out to the other destinations" style="width:60%;" >}}

##### Load balancing

A load balancer is only required for push-based sources, such as agents. You do not need a load balancer if you are exclusively using pull-based sources, such as Kafka.
- Average CPU with an 85% utilization target.
- A five-minute stabilization period for scaling up and down.
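Expressed as a Kubernetes `autoscaling/v2` manifest, those starting points might look like the following sketch. The 85% CPU target and five-minute stabilization window come from this guide; the resource names and replica bounds are illustrative, and the Helm chart can render an equivalent HPA for you:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: opw-aggregator            # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: opw-aggregator          # illustrative; match your Worker statefulset
  minReplicas: 2                  # illustrative bounds
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 85  # average CPU target from this guide
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300   # five-minute stabilization
    scaleDown:
      stabilizationWindowSeconds: 300
```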

[1]: /observability_pipelines/scaling_and_performance/best_practices_for_scaling_observability_pipelines/#horizontal-scaling
[2]: /observability_pipelines/scaling_and_performance/best_practices_for_scaling_observability_pipelines/#vertical-scaling
[3]: https://github.com/DataDog/helm-charts/blob/main/charts/observability-pipelines-worker/values.yaml#L208-L209
[4]: https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/
[5]: https://github.com/DataDog/helm-charts/blob/main/charts/observability-pipelines-worker/values.yaml#L238
[6]: /observability_pipelines/scaling_and_performance/best_practices_for_scaling_observability_pipelines/#optimize-the-instance
[7]: https://github.com/DataDog/helm-charts/blob/main/charts/observability-pipelines-worker/values.yaml#L70-L85
[8]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html