
@Ziinc Ziinc commented Nov 12, 2025

System monitoring for users:

  • Provisions system-managed sources for each User.
  • Each system source is specific to one type of event.
  • OTel-format metrics are collected and then exported, pre-aggregated, into a :metrics source.
  • A centralized OtelMetricExporter is used to avoid running the :keep callback multiple times for each SourceSup.
  • export_callback/2 overrides the default library behaviour.
  • We rely on string metadata keys as a declarative way of deciding which metadata keys to keep and set as OTel metric attributes.
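
One reading of the last bullet is that only metadata entries whose keys are strings become metric attributes, while atom-keyed entries are dropped. A minimal sketch of that filter follows; the module name, function name, and sample keys are hypothetical and not taken from the PR:

```elixir
defmodule MetricAttributesSketch do
  @moduledoc "Illustrative only: keeps string-keyed metadata as metric attributes."

  # Assumption: a string key marks an entry as "keep as attribute";
  # atom-keyed entries are treated as internal and dropped.
  def to_attributes(metadata) when is_map(metadata) do
    for {key, value} <- metadata, is_binary(key), into: %{} do
      {key, value}
    end
  end
end

# Hypothetical metadata mixing string keys (kept) and atom keys (dropped).
metadata = %{"source_id" => 42, "backend_id" => 7, request_id: "abc", conn: nil}
attrs = MetricAttributesSketch.to_attributes(metadata)
true = attrs == %{"source_id" => 42, "backend_id" => 7}
```

The appeal of this style is that the call site decides what is exported simply by choosing the key type, with no separate allow-list to maintain.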

For user-specific logs, routing to the source depends on whether specific identifying metadata keys are present. If a log pertains to a specific source or backend belonging to the user, it is ingested into that source.
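
The routing decision described above could be sketched as a lookup for the first identifying key present in the log metadata. The key names, return tuples, and module name here are assumptions for illustration; the PR's actual keys and routing code are not shown in this conversation:

```elixir
defmodule SystemLogRoutingSketch do
  @moduledoc "Illustrative only: decides whether a log is user-specific."

  # Hypothetical identifying keys, checked in order.
  @identifying_keys ["source_id", "backend_id"]

  # Returns {:user_log, key, id} for the first identifying key with a
  # non-nil value, or :not_user_specific when none is present.
  def route(metadata) when is_map(metadata) do
    Enum.find_value(@identifying_keys, :not_user_specific, fn key ->
      case Map.fetch(metadata, key) do
        {:ok, id} when not is_nil(id) -> {:user_log, key, id}
        _ -> nil
      end
    end)
  end
end

{:user_log, "source_id", 42} = SystemLogRoutingSketch.route(%{"source_id" => 42})
:not_user_specific = SystemLogRoutingSketch.route(%{"message" => "boot"})
```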

Four specific user metrics are implemented:

  • egress on HTTP adaptors
  • total ingested events
  • total ingested bytes
  • total bytes processed for queries

Refactors tests to avoid testing implementation.

  • docs
  • test
  • log drains egress metric
  • endpoints query bytes metric

Closes ANL-1202

@Ziinc Ziinc requested a review from amokan November 12, 2025 12:46
@Ziinc Ziinc force-pushed the feat/ingestion-metrics-system-monitoring-refactor branch from 61b6aa1 to 1707e1e Compare November 14, 2025 11:05
@Ziinc Ziinc force-pushed the feat/ingestion-metrics-system-monitoring-refactor branch from 0c76910 to 3e3b809 Compare November 24, 2025 20:48
@Ziinc Ziinc marked this pull request as ready for review November 24, 2025 20:49
@Ziinc Ziinc requested a review from a team November 24, 2025 20:50

@josevalim josevalim left a comment

Hi @Ziinc! 👋

Comments inline!

[label] -> {label, [label]}
end)
end)
|> Map.new()

For completeness, you could use custom Ecto.Types to store those already as lists in the database: https://hexdocs.pm/ecto/Ecto.Type.html - although I am not expecting this code to be performance sensitive enough to matter.

end

def get_related_user_id(map) do
case map do

This is a separate conversation but I wonder if something could be done to reduce the number of "sources" here.

Also, it is worth noting that you are caching whole structs, but you may only need the user_id field. For example, Cache.get_endpoint_query seems to be used exclusively here and you don't care about the other fields, so there is no need to store or read them. You may benefit from adding a get_user_id_by_endpoint_query. This may also apply to other operations, but please double check!
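
As an illustration of the suggestion above, a cache that stores only the user_id per endpoint query might look like the sketch below. It uses a plain ETS table instead of the project's Cachex-backed cache to stay dependency-free, and the table name, module name, and `fetch_fun` argument are all hypothetical:

```elixir
defmodule UserIdCacheSketch do
  @moduledoc "Illustrative only: caches user_id per endpoint query id."

  @table :user_id_by_endpoint_query_sketch

  # Creates the named ETS table; call once before any lookups.
  def init do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    :ok
  end

  # `fetch_fun` stands in for the real lookup and must return a map or
  # struct with a :user_id field. Only the user_id is written to the cache.
  def get_user_id_by_endpoint_query(endpoint_id, fetch_fun)
      when is_function(fetch_fun, 1) do
    case :ets.lookup(@table, endpoint_id) do
      [{^endpoint_id, user_id}] ->
        user_id

      [] ->
        %{user_id: user_id} = fetch_fun.(endpoint_id)
        :ets.insert(@table, {endpoint_id, user_id})
        user_id
    end
  end
end

UserIdCacheSketch.init()
# Hypothetical loader returning a wide record; only :user_id gets cached.
loader = fn id -> %{id: id, user_id: 7, name: "example", query: "select 1"} end
7 = UserIdCacheSketch.get_user_id_by_endpoint_query(1, loader)
[{1, 7}] = :ets.lookup(:user_id_by_endpoint_query_sketch, 1)
```

Storing a single integer per entry keeps both the memory footprint and the copy cost of each read small, which is the point of the review comment.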

@Ziinc (author) replied:

Yes, I intend to refactor that out, as it isn't as performant. We should be able to derive the user_id from wherever the telemetry is emitted and set it on the metadata, thereby bypassing the need to fetch from Cachex altogether.

@Ziinc Ziinc dismissed chasers’s stale review November 26, 2025 10:55

removed rejected events docs

@Ziinc Ziinc force-pushed the feat/ingestion-metrics-system-monitoring-refactor branch from 4cfbaba to 475ecfb Compare November 26, 2025 19:18
@amokan amokan merged commit dfdca79 into main Nov 26, 2025
10 checks passed
@amokan amokan deleted the feat/ingestion-metrics-system-monitoring-refactor branch November 26, 2025 20:28