Conversation

@tatiana tatiana commented Nov 25, 2025

Details

Currently, Astronomer Cosmos has an overwhelming number of configurations. These were introduced organically over the last three years. Unfortunately, they lead to customer confusion, increased CRE support, and a lot of work for the DevRel team (they wrote and maintain a 100-page eBook!).

We aim to understand how customers use Cosmos configurations to decide what to keep in Cosmos 2.0, planned for next year. This information will also help shape the features we implement next. Given this context, telemetry is one of the OSS Build squad's Q4 P1s.

I had a meeting with @stuart23 last week, and he shared a few pointers. I hope this is on the right track.

Related Issues

Request for feedback

This is my first statsd exporter mapping contribution, so I'd love any feedback, in particular:

  1. Do we prefer to use * or regex? I thought regex was safer to avoid over-matching, but it is harder to read. The existing mapping includes both.
  2. I made sure that all labels have low cardinality: operator_name has the highest, with around 80 values, and all others should have fewer than 20. Is this acceptable?
  3. I tried to track multiple values to avoid further increasing the number of metrics created/expanded. I assumed this would be cost-effective. Do you think this is a good practice? Is it better to break them down into multiple metric definitions?
  4. For the duration metrics, does it make sense to have three definitions - as it is currently implemented - or would it make more sense to have cosmos_rendering_dbt_nodes_parsing_duration, cosmos_rendering_dbt_nodes_filtering_duration and cosmos_rendering_airflow_dag_generation_duration as a single metric, and have the "operation" ("dbt_nodes_parsing", "dbt_nodes_filtering", "rendering_airflow_dag") as a label?

How this was tested

I installed statsd_exporter by cloning the repo and running make build.

In one terminal, I ran statsd_exporter with the new version of the mapping:

statsd_exporter --statsd.mapping-config=statsd-exporter/include/mappings-gen2.yml --log.level=debug

In another terminal, I sent statsd events, and confirmed there were no errors in the statsd_exporter logs:

echo "cosmos.task.operator_name.DbtRunLocalOperator.is_subclass.False.execution_mode.local.invocation_mode.subprocess.dbt_command.run.install_deps.True.origin.DbtTaskGroup.has_callback.False.status.success.counter:1|c"| nc -w1 -u 127.0.0.1 9125

echo "cosmos.profile.database.bigquery.profile_strategy.yaml_file.profile_mapping_class.None.counter:1|c" | nc -w1 -u 127.0.0.1 9125

echo "cosmos.rendering.used_automatic_load_mode.True.actual_load_mode.dbt_ls_cache.invocation_mode.dbt_runner.install_deps.False.uses_node_converter.False.test_behavior.after_each.source_behavior.none.total_dbt_models.100.selected_dbt_models.8:1|c" | nc -w1 -u 127.0.0.1 9125

echo "cosmos.rendering.actual_load_mode.dbt_ls.duration:34500|ms" | nc -w1 -u 127.0.0.1 9125

echo "cosmos.rendering.actual_load_mode.dbt_ls.dbt_nodes_parsing.duration:34500|ms" | nc -w1 -u 127.0.0.1 9125

echo "cosmos.rendering.actual_load_mode.dbt_ls.dbt_nodes_filtering.duration:30|ms" | nc -w1 -u 127.0.0.1 9125

echo "cosmos.rendering.actual_load_mode.dbt_ls.airflow_dag_generation.duration:140|ms" | nc -w1 -u 127.0.0.1 9125

And I confirmed the statsd_exporter logs didn't contain errors:

time=2025-12-08T14:32:33.388Z level=INFO source=main.go:296 msg="Starting StatsD -> Prometheus Exporter" version="(version=0.28.0, branch=master, revision=d63f22b266f72e6d832fbf89bc7341bf625185f6)"
time=2025-12-08T14:32:33.388Z level=INFO source=main.go:297 msg="Build context" context="(go=go1.25.1, platform=darwin/arm64, [email protected], date=20251205-13:20:28, tags=unknown)"
time=2025-12-08T14:32:33.390Z level=INFO source=main.go:346 msg="Accepting StatsD Traffic" udp=:9125 tcp=:9125 unixgram=""
time=2025-12-08T14:32:33.390Z level=INFO source=main.go:347 msg="Accepting Prometheus Requests" addr=:9102
time=2025-12-08T14:32:35.083Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=cosmos.task.operator_name.DbtRunLocalOperator.is_subclass.False.execution_mode.local.invocation_mode.subprocess.dbt_command.run.install_deps.True.origin.DbtTaskGroup.has_callback.False.status.success.counter:1|c
time=2025-12-08T14:32:35.084Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=""

time=2025-12-08T14:32:40.935Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=cosmos.profile.database.bigquery.profile_strategy.yaml_file.profile_mapping_class.None.counter:1|c
time=2025-12-08T14:32:40.935Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=""

time=2025-12-08T14:32:45.220Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=cosmos.rendering.used_automatic_load_mode.True.actual_load_mode.dbt_ls_cache.invocation_mode.dbt_runner.install_deps.False.uses_node_converter.False.test_behavior.after_each.source_behavior.none.total_dbt_models.100.selected_dbt_models.8:1|c
time=2025-12-08T14:32:45.220Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=""

time=2025-12-08T14:32:48.736Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=cosmos.rendering.actual_load_mode.dbt_ls.duration:34500|ms
time=2025-12-08T14:32:48.736Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=""

time=2025-12-08T14:32:52.633Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=cosmos.rendering.actual_load_mode.dbt_ls.dbt_nodes_parsing.duration:34500|ms
time=2025-12-08T14:32:52.633Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=""

time=2025-12-08T14:32:56.413Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=cosmos.rendering.actual_load_mode.dbt_ls.dbt_nodes_filtering.duration:30|ms
time=2025-12-08T14:32:56.413Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=""

time=2025-12-08T14:33:00.178Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=cosmos.rendering.actual_load_mode.dbt_ls.airflow_dag_generation.duration:140|ms
time=2025-12-08T14:33:00.178Z level=DEBUG source=listener.go:96 msg="Incoming line" proto=udp line=""

@pankajkoti pankajkoti self-requested a review November 25, 2025 14:03
@tatiana tatiana requested a review from pankajastro November 26, 2025 10:25
@pgvishnuram pgvishnuram marked this pull request as ready for review December 3, 2025 14:02
@pgvishnuram pgvishnuram requested review from a team as code owners December 3, 2025 14:02
@jpweber jpweber requested a review from a team December 3, 2025 18:03
@jpweber
Contributor

jpweber commented Dec 3, 2025

Added @astronomer/airflow-infra as reviewers, as they consume this image as part of what is deployed in dataplanes.

@tatiana
Author

tatiana commented Dec 5, 2025

Thanks, @jpweber, for adding @astronomer/airflow-infra as reviewers.
Also, thanks for the review, @pgvishnuram! I believe I addressed all the comments.
Is there anyone else who should review and eventually approve this? I'd love to take it over the finish line.

@stuart23 stuart23 requested a review from Copilot December 5, 2025 23:30
Copilot AI left a comment

Pull request overview

This PR adds comprehensive statsd metrics collection for Astronomer Cosmos to understand customer configuration usage patterns and inform Cosmos 2.0 planning. The metrics capture task execution details, profile configurations, and rendering performance characteristics.

Key changes:

  • Added three counter metrics tracking task execution, profile usage, and rendering configurations
  • Added three duration metrics tracking parsing, filtering, and DAG generation performance
  • Incremented statsd-exporter version to reflect the new mappings

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
statsd-exporter/version.txt Version bump to 0.28.0-4 to reflect new Cosmos metrics
statsd-exporter/include/mappings-gen2.yml Added six new metric mappings for Cosmos telemetry collection

Contributor

@ianbuss ianbuss left a comment

The only concern I have is that this image is shared by APC. Do APC customers also use Cosmos?

@tatiana
Author

tatiana commented Dec 8, 2025

> The only concern I have is that this image is shared by APC. Do APC customers also use Cosmos?

@ianbuss thanks a lot for the review!

We proposed these changes in this repo after a call and recommendation from @stuart23 - I didn't realise it was used in multiple parts of our services.

I don't know who the APC customers are, so I don't know which of those use Cosmos. How can we find this information? If not in this repository, in which repo should we define these metrics?

uses_node_converter: "$5" # True or False
test_behavior: "$6" # after_each, after_all, none, build
source_behavior: "$7" # all, with_tests_or_freshness, none
total_dbt_models: "$8" # Total number of dbt models in the project

Having numerical values as labels isn't really great, because you can't use them in calculations at all, and if they change they end up having an impact on cardinality.

statsd also has a gauge type, could you emit total_dbt_models and selected_dbt_models as gauges please?
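
For illustration, a minimal sketch of the sending side if the two counts were emitted as gauges, assuming the plain statsd line format `<name>:<value>|g`. The metric names and helper functions are hypothetical, not taken from the Cosmos codebase or from this PR's mapping.

```python
import socket


def gauge_line(name: str, value: int) -> str:
    """Format a statsd gauge line: <name>:<value>|g."""
    return f"{name}:{value}|g"


def send_lines(lines: list[str], host: str = "127.0.0.1", port: int = 9125) -> None:
    """Send each statsd line as its own UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("utf-8"), (host, port))


# Hypothetical metric names, used only for illustration.
lines = [
    gauge_line("cosmos.rendering.total_dbt_models", 100),
    gauge_line("cosmos.rendering.selected_dbt_models", 8),
]
```

Calling `send_lines(lines)` would push both gauges to a local exporter listening on UDP 9125, as in the manual test in the PR description. The numbers then live in the sample value rather than in a label, so they stay usable in calculations.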

Author

Totally makes sense, thanks a lot, @stuart23 , I'll do this

# 2. Durations
# These are identified by the suffix ".counter" or ".duration"

# What is the name of the operator class used to run the task? Did the end-user subclass it?

Just be careful if you're using unsanitized values (e.g. user classes) as labels. Statsd is a very out-of-date and terrible way to move telemetry around, and using period delimiters for "labels" means that if any label has a period in it, you end up in a mess (e.g. Airflow emits mapped tasks or task group tasks as dag.task_group.task_name instead of dag.task_name, and it breaks a lot of these rules). TL;DR: just sanitize the labels please!
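
One way the sanitization could look on the emitting side, as a rough sketch. The function names and the replacement policy (anything outside `[A-Za-z0-9_]` becomes `_`) are assumptions for illustration, not existing Cosmos code.

```python
import re

# Anything outside this set (including ".", ":" and "|") is replaced, so a
# label value can never add an extra dot-separated component to the metric name
# or break the statsd line format.
_UNSAFE = re.compile(r"[^A-Za-z0-9_]")


def sanitize_label_value(value: object) -> str:
    """Return a statsd-safe version of a label value."""
    cleaned = _UNSAFE.sub("_", str(value))
    return cleaned or "unknown"  # avoid empty components such as "a..b"


def build_metric_name(prefix: str, labels: dict[str, object], suffix: str) -> str:
    """Serialize labels as alternating key/value components."""
    parts = [prefix]
    for key, value in labels.items():
        parts.extend([key, sanitize_label_value(value)])
    parts.append(suffix)
    return ".".join(parts)
```

With this policy a user class such as `my_pkg.operators.MyOperator` would be reported as `my_pkg_operators_MyOperator`, which keeps the component count fixed at the cost of losing the original dots.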

Author

Thanks a lot, @stuart23. We are confident the label values we're collecting don't have periods, but we'll add some validation in the Cosmos codebase to prevent this from happening over time.

Are there any other better alternatives than period-delimited regex?

Makes sense. Unfortunately this is the only way for now - the vanilla statsd implementation does not officially support labels (ref) so we have to serialize them into the metric name.

stuart23 commented Dec 9, 2025

@ianbuss, APC uses statsd-exporter/include/mappings.yml, or they might mount something over the top of it now. Astro still has the mappings hard-coded in mappings-gen2.yml.

@tatiana thanks for testing by manually pushing statsd metrics to the exporter. There is a small testing framework in https://github.com/astronomer/ap-vendor/tree/main/statsd-exporter/test but if you don't want to use that, can you run it again and then curl http://localhost:9102/metrics (or open it in a browser) after you send your test messages to it? It would be good to verify those label extracts: you should see them as the metric name followed by key-value pairs, e.g.:

airflow_dagrun_duration{dag_id="5_retries_pass_on_5th",quantile="0.5"} 246.767874
airflow_dagrun_duration{dag_id="5_retries_pass_on_5th",quantile="0.9"} 246.767874
airflow_dagrun_duration{dag_id="5_retries_pass_on_5th",quantile="0.99"} 246.767874
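
If eyeballing the output gets tedious, a small parser along these lines could check the extracted labels. This is only a sketch: it handles simple `name{key="value"} number` samples like the ones above, does not unescape label values, and `parse_samples` is a made-up helper, not part of the test framework linked above.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)"
)
# One key="value" pair inside the braces.
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

The text itself could come from `urllib.request.urlopen("http://localhost:9102/metrics").read().decode()` while the exporter is running; the check is then a matter of filtering the tuples by metric name and comparing the label dictionaries with what was sent.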

instance: "$1"
mount_path: "$2"

# ------------------------------------------------------------

This is relying heavily on ([^.]+) which is fine if you control sanitisation, but brittle otherwise.

Author

This goes back to one of my questions in the PR description:

Do we prefer to use * or regex? I thought regex was safer to avoid over-matching, but it is harder to read. The existing mapping includes both.

If we're confident * is a better standard and wouldn't over-match, I'd be happy to remove the regex patterns ([^.]+)
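
As far as the statsd_exporter documentation describes it, in the default glob mode `*` stands for a single dot-separated component, so it should behave much like `([^.]+)`. The sketch below only emulates that reading in Python (it is not the exporter's code, and `glob_to_regex` is a hypothetical helper) to compare the two styles on metric names from the PR description.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Emulate a glob where each * matches exactly one dot-separated component."""
    parts = [r"([^.]*)" if part == "*" else re.escape(part) for part in glob.split(".")]
    return re.compile("^" + r"\.".join(parts) + "$")


GLOB = glob_to_regex("cosmos.rendering.actual_load_mode.*.duration")
REGEX = re.compile(r"^cosmos\.rendering\.actual_load_mode\.([^.]+)\.duration$")

# Metric names taken from the manual test commands.
clean = "cosmos.rendering.actual_load_mode.dbt_ls.duration"
nested = "cosmos.rendering.actual_load_mode.dbt_ls.dbt_nodes_parsing.duration"
# Hypothetical case: a label value that contains a period.
dotted = "cosmos.rendering.actual_load_mode.dbt.ls.duration"

for metric in (clean, nested, dotted):
    print(metric, bool(GLOB.match(metric)), bool(REGEX.match(metric)))
```

Under this emulation both styles accept `clean` and reject `nested` and `dotted`, which suggests the choice is mostly about readability; either way, a period inside a label value makes the line fall through the mapping rather than over-match.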

dbt_command: "$5" # Example: "run", "build", "test"
install_deps: "$6" # True or False
origin: "$7" # DbtTaskGroup, DbtDag or StandaloneTask
has_callback: "$8" # True or False
Author

Add is_mapped_task so we are consistent with the introduced argument: https://github.com/astronomer/astronomer-cosmos/pull/2195/changes#r2630935633

Contributor

@ianbuss ianbuss left a comment

No objection to adding these, just one comment about the match format.

# What is the name of the operator class used to run the task? Did the end-user subclass it?
# Which dbt command was used to run the task?
# What execution mode was used? What invocation mode was used?
- match: cosmos\.task\.operator_name\.([^.]+)\.is_subclass\.([^.]+)\.execution_mode\.([^.]+)\.invocation_mode\.([^.]+)\.dbt_command\.([^.]+)\.install_deps\.([^.]+)\.origin\.([^.]+)\.has_callback\.([^.]+)\.status\.([^.]+)\.counter$
Contributor

I think if possible, let's avoid the regex -- we should be able to test it. Also as a general point, do we need the names of the labels in the metric itself? This could probably become something similar to the below:

cosmos.task.counter.*.*.*.*.*.*.*.*.*
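
A rough sketch of what the positional variant implies for the emitter: the label order becomes a contract shared with the mapping, since the names no longer travel in the metric. `TASK_LABELS`, `PREFIX`, `encode_positional` and `decode_positional` are hypothetical names; the order simply mirrors the nine capture groups in the regex above.

```python
# Order mirrors the nine capture groups of the cosmos.task regex mapping.
TASK_LABELS = [
    "operator_name", "is_subclass", "execution_mode", "invocation_mode",
    "dbt_command", "install_deps", "origin", "has_callback", "status",
]
PREFIX = "cosmos.task.counter"


def encode_positional(values: dict) -> str:
    """Serialize label values in the agreed order, without label names."""
    missing = [name for name in TASK_LABELS if name not in values]
    if missing:
        raise ValueError(f"missing label values: {missing}")
    return ".".join([PREFIX, *(str(values[name]) for name in TASK_LABELS)])


def decode_positional(metric: str) -> dict:
    """Recover labels by position, the way $1..$9 would be assigned."""
    if not metric.startswith(PREFIX + "."):
        raise ValueError("unexpected prefix")
    components = metric[len(PREFIX) + 1:].split(".")
    if len(components) != len(TASK_LABELS):
        raise ValueError("component count does not match the label list")
    return dict(zip(TASK_LABELS, components))
```

The trade-off is that the metric name gets shorter, but adding or reordering a label on one side without the other would silently shift values into the wrong labels, so the component-count check (or a test in the mapping repo) carries more weight.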

7 participants