-
Notifications
You must be signed in to change notification settings - Fork 149
Add opentelemetry integration to Oban #6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
7bbe628
to
ef1a06f
Compare
Thanks for opening this @indrekj. Once I get the phoenix migration complete, I will try to write up a guide in the next few days on migrating a library to here. I couldn't find the original repo out on github or the package in hex. Do you have links to those? Also, one thing to get a jump on is verify anyone who has contributed to the original codebase has signed the CLA https://identity.linuxfoundation.org/projects/cncf. |
I deleted the repository. Basically ported the initial (non-merged) PR to here. There have been no releases yet. That's why you cannot find the repo nor the package. I can release the first version (v0.1) after this gets merged.
Again, not an issue. There were no contributions before that. The current code is written by myself. |
f4170a0
to
2958c39
Compare
We'll do a formal review later as part of this, but I had a couple of general questions on the implementation that could be considered while we work out CI and processes.
|
I did look into this but didn't find a good transparent option. I initially thought that maybe using the Oban plugins mechanism could help here. But the jobs are inserted like this. This is basically a direct insertion to the database so there are no plugins involved. There is a So if the user is willing to change the codebase then we could propagate the information through that. E.g. before:
after:
where
|
I would strongly suggest supporting that and selection of metadata keys for inclusion as attributes. This is from defp decode_ctx("undefined"), do: :undefined
defp decode_ctx(ctx) do
Enum.map(ctx, fn
el when is_binary(el) -> String.to_existing_atom(el)
el -> el
end)
|> List.to_tuple()
end |
2958c39
to
baf26b9
Compare
@bryannaegele I did a bunch of changes here. But basically now if the user uses Also added some instructions to the project README file. I'll deal with CI after opentelemetry_phoenix gets merged to avoid duplicate work. The tests pass locally, but for CI (Github actions) I need to somehow set up Postgres. |
baf26b9
to
41852f3
Compare
Nice! And sadly this is the reality of libraries using
Not that I don't think that was completely reasonable to do :), just stating the state of things. There is likely a need for a push for libraries that use Because some users who want to enable tracing are going to be using Oban indirectly and won't be able to simply use the new function. Same issue exists for HTTP, database, MQ, Kafka, etc clients |
I just came across this PR, FYI we have also an implementation available at https://github.com/qdentity/opentelemetry_oban. For this I had to get two PRs into Oban: https://github.com/sorentwo/oban/pulls?q=is%3Apr+author%3Advic+is%3Aclosed. We can have a look and, if needed, merge missing pieces into this PR. |
41852f3
to
c20a9ad
Compare
@tsloughter @bryannaegele Rebased my PR and addressed the checklist in the open checklist PR. @dvic I did see that repository, though only after most of my code was already done. I would like to get this PR approved/merged before porting over your changes. This current PR takes care of job processing and trace propagation. I see you added tracing for circuits, plugins etc. It likely makes sense to port them over but as it needs some testing and making sure they follow semantic conventions I'd like to start with the basic features that this PR should take care of. |
@indrekj I just tried your PR and I must say, it's looking good :) I like the way you organized the attributes and distinguished between process/send spans, while also supporting linking them to each other. I archived https://github.com/qdentity/opentelemetry_oban and referred to this PR. Let me know if there's anything I can do to get this PR merged and published 👍 |
Another thing that you could add this PR: |
Ack, I'll take a look tomorrow |
👍 Another thing I just encountered that you might want tackle as well: ricardoccpaiva/opentelemetry_tesla#5 (it seems we have to do |
a38db18
to
763b5d5
Compare
@dvic Updated |
Perfect, thanks 👍 |
Great :) Should we add to the docs a line about how to exclude the internal query (by tracing the plugin events and then using a root sampler)? |
Agreed on a separate PR for adding propagation style. |
I have used the following sampler: defmodule MySampler do
@behaviour :otel_sampler
defstruct [:opts]
@impl true
def setup(opts), do: %__MODULE__{opts: opts}
@impl true
def description(%__MODULE__{}), do: "MySampler"
@impl true
def should_sample(ctx, _trace_id, _links, span_name, _span_kind, attributes, _opts) do
drop =
String.ends_with?(span_name, ".repo.query:oban_producers") ||
String.ends_with?(span_name, ".repo.query:oban_jobs") ||
String.starts_with?(span_name, "Elixir.Oban.Plugins") ||
String.starts_with?(span_name, "Elixir.Oban.Pro.Plugins") ||
Keyword.get(attributes, :"db.statement") in ["begin", "commit"]
decision =
if drop do
:drop
else
:record_and_sample
end
attributes = []
tracestate =
OpenTelemetry.Tracer.current_span_ctx(ctx)
|> OpenTelemetry.Span.tracestate()
{decision, attributes, tracestate}
end
end and then in the config: my_sampler = {MySampler, []}
config :opentelemetry,
sampler:
{:parent_based,
%{
root: my_sampler,
local_parent_sampled: my_sampler,
remote_parent_sampled: my_sampler
}} If you think it's useful we could add this to the docs as a starter. |
I'm personally not a fan of that approach. It requires the user to manually list every possible span there could be. If the underlying library changes and there's a new span then this has to be modified again. This can be quite cumbersome when there are many services. In general, this seems to be a common problem and I think API/SDK should provide a way to start a non-recording span. E.g. In ruby there's a helper method |
Interesting, but children of a non-recording span are still sampled right? How would this pan out with the ecto queries problem? |
No, they're not. The decision is propagated to all the children spans. The sidekiq library had exactly the same problem. The Poller was running in the background and that caused Redis spans to be created without any context/parents. So creating a non recording span/trace removed them. |
Ah I see, didn't know that, thanks! |
483ed85
to
9ff9f64
Compare
@indrekj so you've got the tests against postgres working in CI? Making this good to merge now? |
Yup I think so
…On Sun, Nov 28, 2021, 18:22 Tristan Sloughter ***@***.***> wrote:
@indrekj <https://github.com/indrekj> so you've got the tests against
postgres working in CI? Making this good to merge now?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#6 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAA23X5SULDQK7VEUKDXHDUOJJMZANCNFSM5DBWAPHQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
kind: :consumer, | ||
links: links | ||
}) | ||
|> Span.set_attributes(attributes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@indrekj why is this done with Span.set_attributes
? It's not possible to set the attributes directly in the start_opts
here? The reason why I'm asking is that I don't see the attributes in my sampler (but they do appear in Honeycomb). I'm receiving this span without the attributes in my sampler. @tsloughter is this intended behaviour of the sampler / span design?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I used some other project as an example. The start_opts
works with attributes
key indeed: https://github.com/open-telemetry/opentelemetry-erlang-contrib/compare/9ff9f64044d47306c0764dd9e12472a1b7eb3349..5b50df46ca8c0847f89e595cf74d4a1e2892c6e8 . I'm also using set_attributes in two other places (insert functions) but there I don't have access to all attributes immediately so there it makes sense to use set_attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change now I do get the attributes in the sampler, but I'm not sure if this is a bug or not, I was expecting this to work also with Span.set_attributes
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be intentional. Imagine you have this flow:
- Start span A
- Send outbound request to service B
- Set attributes to span A
- Do something and finish span A
In the step 2, the sampling decision has to be known. Otherwise, service B doesn't know if the spans should be sampled or not. So the sampler has to be run in the step 1.
d7e0794
to
db0e1f1
Compare
:otel_propagator_text_map.extract(Map.to_list(job_meta)) | ||
parent = OpenTelemetry.Tracer.current_span_ctx() | ||
links = if parent == :undefined, do: [], else: [OpenTelemetry.link(parent)] | ||
OpenTelemetry.Tracer.set_current_span(:undefined) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@indrekj Due to the force pushes on this PR I lost a bit track when this was introduced (maybe it's a good idea to do one force push at the end ;)), but currently I'm having the problem that the span started inside a job are not children of the job span, is this intended? Are you also seeing this behaviour? The consequence of this is that my custom sampler is not that useful anymore :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, I can always start another wrapper span inside the job, but I just wanted to double check and understand that this is intended behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not seeing this. I also added a test to ensure it: https://github.com/open-telemetry/opentelemetry-erlang-contrib/compare/db0e1f1a67bbd51992b10a0020da3b21119f6edb..ddf26187cee991e329c9dcbeaa42ad50eacc0c5d
We however had a similar issue when we were using def timeout(_job) do
in the jobs. If timeout function is defined then oban uses perform_timed to execute the worker, see: https://github.com/sorentwo/oban/blob/60274b76ff2c59d660ae1bf7721a3ff14854700b/lib/oban/queue/executor.ex#L257-L269 . This creates a new process so the current span information is lost. We only had one job in the entire system with that method, so we used:
# We use our own `with_timeout` instead of Oban.Worker.timeout/1 because if
# the latter is used then the trace context is not propagated.
defp with_timeout(fun) do
task = Task.async(OpentelemetryFunction.wrap(fun))
case Task.yield(task, @timeout) || Task.shutdown(task) do
{:ok, reply} ->
reply
nil ->
{:error, :timeout}
end
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to get this PR merged to start addressing these kind of rare edge cases. As you said, the PR is huge and it's hard to follow the changes now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh, yes in this case we also used Oban.Worker.timeout/1
, thanks! 👍
I'd like to get this PR merged to start addressing these kind of rare edge cases. As you said, the PR is huge and it's hard to follow the changes now.
I agree, maybe we can track these open issues / edge cases in the description so we can work on it when the PR is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I can merge and then you should start opening issues.
I saw some question about set_attributes
and sampling but can't find it again, so answering here: The sampler won't see attributes set with set_attributes
because it has already been sampled at that point. A span is started and the attributes passed to the function that starts the span will be available but any attributes set after starting the span are of course not available.
Not sure what the code was, maybe it was something that used a pipe to look like it was using the "builder pattern" to construct a span? But we don't use that pattern, so there is no "build" function that would come after doing something like span("name") |> set_attributes(...) |> start()
. So yea, must pass any attributes that need to be available for sampling to the start call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. It was here: https://github.com/open-telemetry/opentelemetry-erlang-contrib/pull/6/files/9ff9f64044d47306c0764dd9e12472a1b7eb3349#diff-ca8cd6f4f23c284564be4b34cb5caf868177e4190300720bc3c6fdb255c18b90 . For that particular case we were able to avoid set_attributes
Yea, I can merge and then you should start opening issues.
👍
By default a new trace is automatically started when a job is processed by monitoring these events: * `[:oban, :job, :start]` — at the point a job is fetched from the database and will execute * `[:oban, :job, :stop]` — after a job succeeds and the success is recorded in the database * `[:oban, :job, :exception]` — after a job fails and the failure is recorded in the database To also record a span when a job is created and to link traces together `Oban.insert/2` has to be replaced by `OpentelemetryOban.insert/2`. Before: ```elixir %{id: 1, in_the: "business", of_doing: "business"} |> MyApp.Business.new() |> Oban.insert() ``` After: ```elixir %{id: 1, in_the: "business", of_doing: "business"} |> MyApp.Business.new() |> OpentelemetryOban.insert() ```
db0e1f1
to
ddf2618
Compare
PR has been approved for some time now, and there doesn't seem to be more comments. Could somebody merge it? |
Thanks for the persistence :). I keep preparing to merge it, hit the update button, checks start running again... and I move on to something else and forget :(. Checks are again running after I hit the |
Merged. I looked up the package on hex and it appears to be owned by a company/org "salemove". Is that you? We need the opentelemetry org/team added to it on hex so we can also publish if need be -- as in, we won't be publishing the new release but in case the other owner went away or wasn't available and there was a security vulnerability, stuff like that, we'd need to be able to publish. |
@tsloughter Thx. Ownership has been transferred to |
Thanks, you should stay as an owner to be able to publish. It is much better to release manually and not github actions. Unless github actions has a way to require manual intervention during an action which may make it a viable option. |
Yes, I believe you can do this in public repos (and github enterprise). See https://docs.github.com/en/actions/managing-workflow-runs/reviewing-deployments |
Could you run |
Hm, that didn't work, but I swear you can have both a user and an owner as the owner. It returned "cannot add owner that is not a member of the organization" |
By default a new trace is automatically started when a job is processed
by monitoring these events:
[:oban, :job, :start]
— at the point a job is fetched from the database and will execute[:oban, :job, :stop]
— after a job succeeds and the success is recorded in the database[:oban, :job, :exception]
— after a job fails and the failure is recorded in the databaseTo also record a span when a job is created and to link traces together
Oban.insert/2
has to be replaced byOpentelemetryOban.insert/2
.Before:
After:
Checklist
Acknowledgements
Project Type
Prior Art
Existing Project
All Projects