Conversation

Contributor

@stzou stzou commented Aug 11, 2025

What does this PR do?

Add recommended ECS Fargate monitors

Motivation

CAP-2761 - part of OKR to improve the overall ECS alerting and troubleshooting experience in Datadog.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

⚠️ Recommendation: Add qa/skip-qa Label

This PR does not modify any files shipped with the agent.

To help streamline the release process, please consider adding the qa/skip-qa label if these changes do not require QA testing.

@stzou stzou added the qa/skip-qa label (Automatically skip this PR for the next QA) Aug 11, 2025
steveny91
steveny91 previously approved these changes Aug 14, 2025
@temporal-github-worker-1 temporal-github-worker-1 bot dismissed steveny91’s stale review August 18, 2025 20:46

Review from steveny91 is dismissed. Related teams and files:

  • agent-integrations
    • ecs_fargate/assets/monitors/ecs_fargate_cpu_usage.json
    • ecs_fargate/assets/monitors/ecs_fargate_ephemeral_storage.json
    • ecs_fargate/assets/monitors/ecs_fargate_mem_usage.json
    • ecs_fargate/assets/monitors/ecs_fargate_net_rcvd.json
    • ecs_fargate/assets/monitors/ecs_fargate_net_sent.json
sblumenthal
sblumenthal previously approved these changes Aug 18, 2025

@sumedham sumedham left a comment

LGTM!

  1. We discussed this on Slack, but over time we want to drive customers to the ECS page as soon as the Slack message comes through in whatever #ops channel they use. Here is a monitor example that does well with deep links and has driven more traffic to the Serverless page.
  2. When looking at metrics most used in monitors I noticed that ecs.fargate.mem.hierarchical_memory_limit is used quite a bit by some customers. Is there any reason customers would use that?

@temporal-github-worker-1 temporal-github-worker-1 bot dismissed sblumenthal’s stale review August 19, 2025 19:52

Review from sblumenthal is dismissed. Related teams and files:

  • container-integrations
    • ecs_fargate/assets/monitors/ecs_fargate_cpu_usage.json
    • ecs_fargate/assets/monitors/ecs_fargate_ephemeral_storage.json
    • ecs_fargate/assets/monitors/ecs_fargate_mem_usage.json
    • ecs_fargate/assets/monitors/ecs_fargate_net_rcvd.json
    • ecs_fargate/assets/monitors/ecs_fargate_net_sent.json
Contributor Author

stzou commented Aug 19, 2025

2. When looking at metrics most used in monitors I noticed that ecs.fargate.mem.hierarchical_memory_limit is used quite a bit by some customers. Is there any reason customers would use that?

ecs.fargate.mem.hierarchical_memory_limit reports the memory limit for containers without an explicit memory limit defined. In the case of ECS containers, it will be the task memory limit. The use case for this metric would be to receive alerts when the sum of all containers' resource usage in a task is approaching the task limit. This use case is covered by the CPU and memory monitors.
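
To make the use case concrete, here is a rough sketch of the arithmetic behind such an alert: sum the memory usage of every container in a task and compare it to the task limit. The function names, the container names, and the 0.9 threshold are all hypothetical and chosen for illustration; this is not the query the monitors in this PR use.

```python
from typing import Mapping


def task_memory_utilization(
    container_usage_bytes: Mapping[str, float],
    task_limit_bytes: float,
) -> float:
    """Return summed container memory usage as a fraction of the task limit."""
    if task_limit_bytes <= 0:
        raise ValueError("task_limit_bytes must be positive")
    return sum(container_usage_bytes.values()) / task_limit_bytes


def should_alert(
    container_usage_bytes: Mapping[str, float],
    task_limit_bytes: float,
    threshold: float = 0.9,  # hypothetical threshold, not taken from the PR
) -> bool:
    """True when the task's total usage is at or above the threshold."""
    return task_memory_utilization(container_usage_bytes, task_limit_bytes) >= threshold


MIB = 1024 ** 2

# Hypothetical task: two containers sharing a 1024 MiB task memory limit.
usage = {"app": 700 * MIB, "sidecar": 250 * MIB}
limit = 1024 * MIB

if __name__ == "__main__":
    print(f"utilization: {task_memory_utilization(usage, limit):.2%}")
    print(f"alert: {should_alert(usage, limit)}")
```

Neither container is close to the task limit on its own in this example, but together they use 950 of 1024 MiB, which is the situation a task-level alert is meant to catch.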

@stzou stzou added this pull request to the merge queue Aug 28, 2025
Merged via the queue into master with commit 8d1b2a5 Aug 28, 2025
51 of 52 checks passed
@stzou stzou deleted the stzou/CAP-2761 branch August 28, 2025 16:41
github-actions bot pushed a commit that referenced this pull request Aug 28, 2025
* Add ecs_fargate monitors

* address validation issues

* shorten monitor description

* Link back to ecs explorer 8d1b2a5
5 participants