-
Notifications
You must be signed in to change notification settings - Fork 10
Add statsd metrics collection for Astronomer Cosmos #1102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Added @astronomer/airflow-infra as reviewers as they consume this image as part what is deployed in dataplanes. |
|
Thanks, @jpweber, for adding @astronomer/airflow-infra as reviewers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds comprehensive statsd metrics collection for Astronomer Cosmos to understand customer configuration usage patterns and inform Cosmos 2.0 planning. The metrics capture task execution details, profile configurations, and rendering performance characteristics.
Key changes:
- Added three counter metrics tracking task execution, profile usage, and rendering configurations
- Added three duration metrics tracking parsing, filtering, and DAG generation performance
- Incremented statsd-exporter version to reflect the new mappings
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| statsd-exporter/version.txt | Version bump to 0.28.0-4 to reflect new Cosmos metrics |
| statsd-exporter/include/mappings-gen2.yml | Added six new metric mappings for Cosmos telemetry collection |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
ianbuss
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only concern I have is that this image is shared by APC. Do APC customers also use Cosmos?
We proposed these changes in this repo after a call and recommendation from @stuart23 - I didn't realise it was used in multiple parts of our services. I don't know who the APC customers are, so I don't know which of those use Cosmos. How can we find this information? If not in this repository, in which repo should we define these metrics? |
| uses_node_converter: "$5" # True or False | ||
| test_behavior: "$6" # after_each, after_all, none, build | ||
| source_behavior: "$7" # all, with_tests_or_freshness, none | ||
| total_dbt_models: "$8" # Total number of dbt models in the project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having numerical values as labels isn't really great, because you can't use them in calculations at all, and if they change they end up having an impact on cardinality.
statsd also has a gauge type, could you emit total_dbt_models and selected_dbt_models as gauges please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally makes sense, thanks a lot, @stuart23 , I'll do this
| # 2. Durations | ||
| # These are identified by the suffix ".counter" or ".duration" | ||
|
|
||
| # What is the name of the operator class used to run the task? Did the end-user subclass it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just be careful if you're using unsanitized values (e.g. users classes) as labels. Statsd is a very out-of-date and terrible way to move telemetry around and using period delimination for "labels" means that if any label has a period in it, you end up in a mess (e.g. airflow emits mapped tasks or task group tasks as dag.task_group.task_name instead of dag.task_name, and it breaks a lot of these rules. TLDR, just sanitize the labels please!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot, @stuart23. We are confident the label values we're collecting don't have periods, but we'll add some validation in the Cosmos codebase to avoid this from happening over time.
Are there any other better alternatives than period-delimited regex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Unfortunately this is the only way for now - the vanilla statsd implementation does not officially support labels (ref) so we have to serialize them into the metric name.
|
@ianbuss , APC use statsd-exporter/include/mappings.yml or they might mount something over the top of it now. Astro still has the mappings hard coded in @tatiana thanks for doing the testing with manually pushing statsd metrics to the exporter. There is a small testing framework in https://github.com/astronomer/ap-vendor/tree/main/statsd-exporter/test but if you don't want to do that, can you run it again and then |
| instance: "$1" | ||
| mount_path: "$2" | ||
|
|
||
| # ------------------------------------------------------------ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is relying heavily on ([^.]+) which is fine if you control sanitisation, but brittle otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This goes back to one of my questions in the PR description:
Do we prefer to use * or regex? I thought regex was safer to avoid over-matching, but it is harder to read. The existing mapping includes both.
If we're confident * is a better standard and wouldn't over-match, I'd be happy to remove the regex patterns ([^.]+)
| dbt_command: "$5" # Example: "run", "build", "test" | ||
| install_deps: "$6" # True or False | ||
| origin: "$7" # DbtTaskGroup, DbtDag or StandaloneTask | ||
| has_callback: "$8" # True or False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add is_mapped_task so we are consistent with the introduced argument: https://github.com/astronomer/astronomer-cosmos/pull/2195/changes#r2630935633
ianbuss
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No objection to adding these, just one comment about the match format.
| # What is the name of the operator class used to run the task? Did the end-user subclass it? | ||
| # Which dbt command was used to run the task? | ||
| # What execution mode was used? What invocation mode was used? | ||
| - match: cosmos\.task\.operator_name\.([^.]+)\.is_subclass\.([^.]+)\.execution_mode\.([^.]+)\.invocation_mode\.([^.]+)\.dbt_command\.([^.]+)\.install_deps\.([^.]+)\.origin\.([^.]+)\.has_callback\.([^.]+)\.status\.([^.]+)\.counter$ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if possible, let's avoid the regex -- we should be able to test it. Also as a general point, do we need the names of the labels in the metric itself? This could probably become something similar to the below:
cosmos.task.counter.*.*.*.*.*.*.*.*.*
Details
Currently, Astronomer Cosmos has an overwhelming number of configurations. These were introduced organically over the last three years. Unfortunately, they lead to customer confusion, increased CRE support, and a lot of work for the DevRel team (they wrote and maintain a 100-page eBook!).
We aim to understand how customers use Cosmos configurations to decide what to keep in Cosmos 2.0, planned for next year. This information will also help shape the following features we implement. Given this context, telemetry is set as one of the OSS Build squad's Q4 P1s.
I had a meeting with @stuart23 last week, and he shared a few pointers. I hope this is on the right track.
Related Issues
Request for feedback
This is my first statsd exporter mapping contribution, so I'd love any feedback, in particular:
*or regex? I thought regex was safer to avoid over-matching, but it is harder to read. The existing mapping includes both.operator_nameis the one with the highest, with around 80 values. All others should have less than 20 values). Is this acceptable?cosmos_rendering_dbt_nodes_parsing_duration,cosmos_rendering_dbt_nodes_filtering_durationandcosmos_rendering_airflow_dag_generation_durationas a single metric, and have the "operation" ("dbt_nodes_parsing", "dbt_nodes_filtering", "rendering_airflow_dag") as a label?How this was tested
I installed
statsd_exporterby cloning the repo and runningmake build.In one terminal, I ran the
statstd_exporterwith the new version of the mappingIn another terminal, I sent statsd events, and confirmed there were no errors in the
statsd_exporterlogs:And I confirmed the
statsd_exporterlogs didn't contain errors: