Skip to content

Conversation

indrekj
Copy link
Contributor

@indrekj indrekj commented Aug 30, 2021

By default a new trace is automatically started when a job is processed
by monitoring these events:

  • [:oban, :job, :start] — at the point a job is fetched from the database and will execute
  • [:oban, :job, :stop] — after a job succeeds and the success is recorded in the database
  • [:oban, :job, :exception] — after a job fails and the failure is recorded in the database

To also record a span when a job is created and to link traces together
Oban.insert/2 has to be replaced by OpentelemetryOban.insert/2.

Before:

  %{id: 1, in_the: "business", of_doing: "business"}
  |> MyApp.Business.new()
  |> Oban.insert()

After:

  %{id: 1, in_the: "business", of_doing: "business"}
  |> MyApp.Business.new()
  |> OpentelemetryOban.insert()

Checklist

Acknowledgements

  • I certify that I am authorized to assign ownership of this code to the OpenTelemetry project.
  • I acknowledge that ownership of the Hex package must be given/transferred to the OpenTelemetry Hex Org

Project Type

  • Instrumentation
  • Exporter
  • Propagator
  • Metapackage
  • Example
  • Util

Prior Art

  • Is this an existing project? Includes private libraries being open sourced.
    • Yes
    • No
  • Are there any other packages on Hex providing similar functionality? - I'm not aware of it.

Existing Project

  • Moving from an existing public repository? Provide a link in the description.
  • All authors who have contributed code contained in this PR have signed the CNCF CLA - no other authors at the moment

All Projects

  • Label created (maintainers)
  • labeler.yml updated
  • CI
    • Elixir or Erlang language file is updated
    • Test strategy should use the matrix strategy with all supported language/OTP combinations
    • The job is scoped to only run for relevant code changes
  • Codeowners adds scoped ownership for project maintainers
  • License file (Apache2 - copyright to OpenTelemetry Authors)
  • All code complies with the Otel Specification and Otel Semantic Conventions - Trace Metrics

@indrekj indrekj marked this pull request as ready for review August 30, 2021 11:41
@indrekj indrekj requested a review from a team August 30, 2021 11:41
@indrekj indrekj force-pushed the opentelemetry_oban branch from 7bbe628 to ef1a06f Compare August 30, 2021 11:47
@bryannaegele
Copy link
Contributor

Thanks for opening this @indrekj. Once I get the phoenix migration complete, I will try to write up a guide in the next few days on migrating a library to here.

I couldn't find the original repo out on github or the package in hex. Do you have links to those?

Also, one thing to get a jump on is verify anyone who has contributed to the original codebase has signed the CLA https://identity.linuxfoundation.org/projects/cncf.

@indrekj
Copy link
Contributor Author

indrekj commented Aug 30, 2021

I couldn't find the original repo out on github or the package in hex. Do you have links to those?

I deleted the repository. Basically ported the initial (non-merged) PR to here. There have been no releases yet. That's why you cannot find the repo nor the package. I can release the first version (v0.1) after this gets merged.

Also, one thing to get a jump on is verify anyone who has contributed to the original codebase has signed the CLA https://identity.linuxfoundation.org/projects/cncf.

Again, not an issue. There were no contributions before that. The current code is written by myself.

@indrekj indrekj force-pushed the opentelemetry_oban branch from f4170a0 to 2958c39 Compare August 30, 2021 18:21
@bryannaegele bryannaegele marked this pull request as draft August 30, 2021 18:22
@bryannaegele
Copy link
Contributor

We'll do a formal review later as part of this, but I had a couple of general questions on the implementation that could be considered while we work out CI and processes.

  • Propagation: There's currently no ability for propagating context to the jobs for linking spans to parents. In our implementation at work, we trace directly and aren't using telemetry. Is there a mechanism you can think of to propagate that ctx through job metadata that could be implemented?
  • Job creation: We have a process event here but no corresponding send or receive. Are there mechanisms or opportunities here we could leverage?
  • Batches: It would be nice to support batches with the batch id. This also relates to propagation in general.

@indrekj
Copy link
Contributor Author

indrekj commented Aug 31, 2021

Propagation: There's currently no ability for propagating context to the jobs for linking spans to parents. In our implementation at work, we trace directly and aren't using telemetry. Is there a mechanism you can think of to propagate that ctx through job metadata that could be implemented?

I did look into this but didn't find a good transparent option. I initially thought that maybe using the Oban plugins mechanism could help here. But the jobs are inserted like this. This is basically a direct insertion to the database so there are no plugins involved.

There is a meta field on the job that could be used to propagate tracing context.

So if the user is willing to change the codebase then we could propagate the information through that.

E.g. before:

      %{account_id: 1, url: "https://example.com"}
      |> MyApp.Worker.new()
      |> Oban.insert()

after:

      %{account_id: 1, url: "https://example.com"}
      |> MyApp.Worker.new()
      |> OpentelemetryOban.enhance()
      |> Oban.insert()

where OpentelemetryOban.enhance(worker) would add propagation information to worker.meta field. Is this something you think would be a good solution?

Job creation: We have a process event here but no corresponding send or receive. Are there mechanisms or opportunities here we could leverage?

send: The whole send is basically just an insert call to the Postgres database. So there aren't any hooks/plugins that we can use (at least I don't see them). Again, if we'd use OpentelemetryOban.enhance() helper method then theoretically we could create a span there. But that seems very brittle because it doesn't actually wrap the insert call and if the user inserts the job later then the timings will be wrong. It would be better to wrap or replace the insert call somehow, but I'm not exactly sure how. I'm open to ideas.

receive: Correct me if I'm wrong but to my understanding, this is meant to be used in case of batch processing or when receive and process and done separately. I checked a couple of ruby libraries (1, 2, 3) and they all seemed to use process without the receive event. Oban seems to support batch workers but that's available only through the Pro version. I'd prefer to focus on the free version first.

@bryannaegele
Copy link
Contributor

bryannaegele commented Sep 1, 2021

So if the user is willing to change the codebase then we could propagate the information through that.

I would strongly suggest supporting that and selection of metadata keys for inclusion as attributes.

This is from 0.4 code. We have this in our workers, but with the new span ctx format, I'd recommend using text_map_inject. Oban serializes metadata to json, so it's easier to just use propagators.

  defp decode_ctx("undefined"), do: :undefined

  defp decode_ctx(ctx) do
    Enum.map(ctx, fn
      el when is_binary(el) -> String.to_existing_atom(el)
      el -> el
    end)
    |> List.to_tuple()
  end

@indrekj
Copy link
Contributor Author

indrekj commented Sep 2, 2021

@bryannaegele I did a bunch of changes here.

But basically now if the user uses OpentelemeryOban.insert/2 instead of Oban.insert/2 then a send span is created and also the context is propagated through the Oban job metadata. On the receiving side, we create a new trace and link them together.

Also added some instructions to the project README file.

I'll deal with CI after opentelemetry_phoenix gets merged to avoid duplicate work. The tests pass locally, but for CI (Github actions) I need to somehow set up Postgres.

@tsloughter
Copy link
Member

But basically now if the user uses OpentelemeryOban.insert/2 instead of Oban.insert/2 then a send span is created and also the context is propagated through the Oban job metadata. On the receiving side, we create a new trace and link them together.

Nice!

And sadly this is the reality of libraries using telemetry for instrumentation related to tracing :(, especially if they don't offer any middleware/hooking functionality.

telemetry is filling the role that opentelemetry_api is meant to, but with less functionality -- the "role" being a way for everyone to instrument their code with a common library that by default does nothing, so users who don't want it can ignore it, and is completely pluggable in what it does do if the user chooses to enable it.

Not that I don't think that was completely reasonable to do :), just stating the state of things.

There is likely a need for a push for libraries that use telemetry and make these sort of external service calls to all include a pluggable way to extend those calls. Then when opentelemetry_oban starts it can register a function based on the configured opentelemetry propagators.

Because some users who want to enable tracing are going to be using Oban indirectly and won't be able to simply use the new function. Same issue exists for HTTP, database, MQ, Kafka, etc clients

@dvic
Copy link
Contributor

dvic commented Sep 13, 2021

I just came across this PR, FYI we have also an implementation available at https://github.com/qdentity/opentelemetry_oban. For this I had to get two PRs into Oban: https://github.com/sorentwo/oban/pulls?q=is%3Apr+author%3Advic+is%3Aclosed. We can have a look and, if needed, merge missing pieces into this PR.

@indrekj
Copy link
Contributor Author

indrekj commented Sep 13, 2021

@tsloughter @bryannaegele Rebased my PR and addressed the checklist in the open checklist PR.

@dvic I did see that repository, though only after most of my code was already done. I would like to get this PR approved/merged before porting over your changes. This current PR takes care of job processing and trace propagation. I see you added tracing for circuits, plugins etc. It likely makes sense to port them over but as it needs some testing and making sure they follow semantic conventions I'd like to start with the basic features that this PR should take care of.

@dvic
Copy link
Contributor

dvic commented Sep 17, 2021

@indrekj I just tried your PR and I must say, it's looking good :) I like the way you organized the attributes and distinguished between process/send spans, while also supporting linking them to each other. I archived https://github.com/qdentity/opentelemetry_oban and referred to this PR.

Let me know if there's anything I can do to get this PR merged and published 👍

@dvic
Copy link
Contributor

dvic commented Sep 20, 2021

Another thing that you could add this PR: OpentelemetryOban.insert!/1, I noticed that one was missing (Oban has this one)

@indrekj
Copy link
Contributor Author

indrekj commented Sep 20, 2021

Ack, I'll take a look tomorrow

@dvic
Copy link
Contributor

dvic commented Sep 20, 2021

Ack, I'll take a look tomorrow

👍 Another thing I just encountered that you might want tackle as well: ricardoccpaiva/opentelemetry_tesla#5 (it seems we have to do &__MODULE__.my_handle_fun/4 instead of &my_handle_fun/4)

@indrekj indrekj force-pushed the opentelemetry_oban branch 2 times, most recently from a38db18 to 763b5d5 Compare September 21, 2021 13:20
@indrekj
Copy link
Contributor Author

indrekj commented Sep 21, 2021

@dvic Updated

@dvic
Copy link
Contributor

dvic commented Sep 21, 2021

@dvic Updated

Perfect, thanks 👍

@indrekj indrekj marked this pull request as ready for review September 27, 2021 10:52
@dvic
Copy link
Contributor

dvic commented Nov 23, 2021

@tsloughter @bryannaegele Updated

  • Sampler has been removed
  • CI is now green - Postgres configuration was indeed missing
  • Span linking - I would like to add a propagation_style configuration option so the user can choose between Links and Child-Of style. Similar to Ruby libraries. But I would like to do that in a separate PR after this PR, because this PR is already huge and I don't want to make it even bigger.

Great :) Should we add to the docs a line about how to exclude the internal query (by tracing the plugin events and then using a root sampler)?

@tsloughter
Copy link
Member

Agreed on a separate PR for adding propagation style.

@dvic
Copy link
Contributor

dvic commented Nov 24, 2021

I have used the following sampler:

defmodule MySampler do
  @behaviour :otel_sampler

  defstruct [:opts]

  @impl true
  def setup(opts), do: %__MODULE__{opts: opts}

  @impl true
  def description(%__MODULE__{}), do: "MySampler"

  @impl true
  def should_sample(ctx, _trace_id, _links, span_name, _span_kind, attributes, _opts) do
    drop =
      String.ends_with?(span_name, ".repo.query:oban_producers") ||
        String.ends_with?(span_name, ".repo.query:oban_jobs") ||
        String.starts_with?(span_name, "Elixir.Oban.Plugins") ||
        String.starts_with?(span_name, "Elixir.Oban.Pro.Plugins") ||
        Keyword.get(attributes, :"db.statement") in ["begin", "commit"]

    decision =
      if drop do
        :drop
      else
        :record_and_sample
      end

    attributes = []

    tracestate =
      OpenTelemetry.Tracer.current_span_ctx(ctx)
      |> OpenTelemetry.Span.tracestate()

    {decision, attributes, tracestate}
  end
end

and then in the config:

my_sampler = {MySampler, []}

config :opentelemetry,
  sampler:
    {:parent_based,
     %{
       root: my_sampler,
       local_parent_sampled: my_sampler,
       remote_parent_sampled: my_sampler
     }}

If you think it's useful we could add this to the docs as a starter.

@indrekj
Copy link
Contributor Author

indrekj commented Nov 24, 2021

I'm personally not a fan of that approach. It requires the user to manually list every possible span there could be. If the underlying library changes and there's a new span then this has to be modified again. This can be quite cumbersome when there are many services.

In general, this seems to be a common problem and I think API/SDK should provide a way to start a non-recording span. E.g. In ruby there's a helper method untraced for instrumentation libraries. These are used in libraries like sidekiq.

@dvic
Copy link
Contributor

dvic commented Nov 24, 2021

I'm personally not a fan of that approach. It requires the user to manually list every possible span there could be. If the underlying library changes and there's a new span then this has to be modified again. This can be quite cumbersome when there are many services.

In general, this seems to be a common problem and I think API/SDK should provide a way to start a non-recording span. E.g. In ruby there's a helper method [untraced](https://github.com/open-telemetry/opentelemetry-ruby/blob/2bf0e6aed060867bc65560a92e2c464d48b26a02/common/lib/opentelemetry/common/utilities.rb#L75-L77) for instrumentation libraries. These are used in libraries like sidekiq.

Interesting, but children of a non-recording span are still sampled right? How would this pan out with the ecto queries problem?

@indrekj
Copy link
Contributor Author

indrekj commented Nov 24, 2021

Interesting, but children of a non-recording span are still sampled right?

No, they're not. The decision is propagated to all the children spans.

The sidekiq library had exactly the same problem. The Poller was running in the background and that caused Redis spans to be created without any context/parents. So creating a non recording span/trace removed them.

@dvic
Copy link
Contributor

dvic commented Nov 24, 2021

Interesting, but children of a non-recording span are still sampled right?

No, they're not. The decision is propagated to all the children spans.

Ah I see, didn't know that, thanks!

@tsloughter
Copy link
Member

@indrekj so you've got the tests against postgres working in CI? Making this good to merge now?

@indrekj
Copy link
Contributor Author

indrekj commented Nov 28, 2021 via email

kind: :consumer,
links: links
})
|> Span.set_attributes(attributes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@indrekj why is this done with Span.set_attributes? It's not possible to set the attributes directly in the start_opts here? The reason why I'm asking is that I don't see the attributes in my sampler (but they do appear in Honeycomb). I'm receiving this span without the attributes in my sampler. @tsloughter is this intended behaviour of the sampler / span design?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I used some other project as an example. The start_opts works with attributes key indeed: https://github.com/open-telemetry/opentelemetry-erlang-contrib/compare/9ff9f64044d47306c0764dd9e12472a1b7eb3349..5b50df46ca8c0847f89e595cf74d4a1e2892c6e8 . I'm also using set_attributes in two other places (insert functions) but there I don't have access to all attributes immediately so there it makes sense to use set_attributes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change now I do get the attributes in the sampler, but I'm not sure if this is a bug or not, I was expecting this to work also with Span.set_attributes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be intentional. Imagine you have this flow:

  1. Start span A
  2. Send outbound request to service B
  3. Set attributes to span A
  4. Do something and finish span A

In the step 2, the sampling decision has to be known. Otherwise, service B doesn't know if the spans should be sampled or not. So the sampler has to be run in the step 1.

@indrekj indrekj force-pushed the opentelemetry_oban branch 3 times, most recently from d7e0794 to db0e1f1 Compare November 30, 2021 09:28
:otel_propagator_text_map.extract(Map.to_list(job_meta))
parent = OpenTelemetry.Tracer.current_span_ctx()
links = if parent == :undefined, do: [], else: [OpenTelemetry.link(parent)]
OpenTelemetry.Tracer.set_current_span(:undefined)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@indrekj Due to the force pushes on this PR I lost a bit track when this was introduced (maybe it's a good idea to do one force push at the end ;)), but currently I'm having the problem that the span started inside a job are not children of the job span, is this intended? Are you also seeing this behaviour? The consequence of this is that my custom sampler is not that useful anymore :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, I can always start another wrapper span inside the job, but I just wanted to double check and understand that this is intended behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not seeing this. I also added a test to ensure it: https://github.com/open-telemetry/opentelemetry-erlang-contrib/compare/db0e1f1a67bbd51992b10a0020da3b21119f6edb..ddf26187cee991e329c9dcbeaa42ad50eacc0c5d

We however had a similar issue when we were using def timeout(_job) do in the jobs. If timeout function is defined then oban uses perform_timed to execute the worker, see: https://github.com/sorentwo/oban/blob/60274b76ff2c59d660ae1bf7721a3ff14854700b/lib/oban/queue/executor.ex#L257-L269 . This creates a new process so the current span information is lost. We only had one job in the entire system with that method, so we used:

  # We use our own `with_timeout` instead of Oban.Worker.timeout/1 because if
  # the latter is used then the trace context is not propagated.
  defp with_timeout(fun) do
    task = Task.async(OpentelemetryFunction.wrap(fun))

    case Task.yield(task, @timeout) || Task.shutdown(task) do
      {:ok, reply} ->
        reply

      nil ->
        {:error, :timeout}
    end
  end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to get this PR merged to start addressing these kind of rare edge cases. As you said, the PR is huge and it's hard to follow the changes now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, yes in this case we also used Oban.Worker.timeout/1, thanks! 👍

I'd like to get this PR merged to start addressing these kind of rare edge cases. As you said, the PR is huge and it's hard to follow the changes now.

I agree, maybe we can track these open issues / edge cases in the description so we can work on it when the PR is merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I can merge and then you should start opening issues.

I saw some question about set_attributes and sampling but can't find it again, so answering here: The sampler won't see attributes set with set_attributes because it has already been sampled at that point. A span is started and the attributes passed to the function that starts the span will be available but any attributes set after starting the span are of course not available.

Not sure what the code was, maybe it was something that used a pipe to look like it was using the "builder pattern" to construct a span? But we don't use that pattern, so there is no "build" function that would come after doing something like span("name") |> set_attributes(...) |> start(). So yea, must pass any attributes that need to be available for sampling to the start call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. It was here: https://github.com/open-telemetry/opentelemetry-erlang-contrib/pull/6/files/9ff9f64044d47306c0764dd9e12472a1b7eb3349#diff-ca8cd6f4f23c284564be4b34cb5caf868177e4190300720bc3c6fdb255c18b90 . For that particular case we were able to avoid set_attributes

Yea, I can merge and then you should start opening issues.

👍

By default a new trace is automatically started when a job is processed
by monitoring these events:
* `[:oban, :job, :start]` — at the point a job is fetched from the database and will execute
* `[:oban, :job, :stop]` — after a job succeeds and the success is recorded in the database
* `[:oban, :job, :exception]` — after a job fails and the failure is recorded in the database

To also record a span when a job is created and to link traces together
`Oban.insert/2` has to be replaced by `OpentelemetryOban.insert/2`.

Before:

```elixir
  %{id: 1, in_the: "business", of_doing: "business"}
  |> MyApp.Business.new()
  |> Oban.insert()
```

After:

```elixir
  %{id: 1, in_the: "business", of_doing: "business"}
  |> MyApp.Business.new()
  |> OpentelemetryOban.insert()
```
@indrekj
Copy link
Contributor Author

indrekj commented Dec 8, 2021

PR has been approved for some time now, and there doesn't seem to be more comments. Could somebody merge it?

@tsloughter
Copy link
Member

Thanks for the persistence :). I keep preparing to merge it, hit the update button, checks start running again... and I move on to something else and forget :(. Checks are again running after I hit the update button, this time I'll try not to move on to something else.

@tsloughter tsloughter merged commit eecb238 into open-telemetry:main Dec 8, 2021
@tsloughter
Copy link
Member

Merged.

I looked up the package on hex and it appears to be owned by a company/org "salemove". Is that you?

We need the opentelemetry org/team added to it on hex so we can also publish if need be -- as in, we won't be publishing the new release but in case the other owner went away or wasn't available and there was a security vulnerability, stuff like that, we'd need to be able to publish.

@indrekj
Copy link
Contributor Author

indrekj commented Dec 8, 2021

@tsloughter Thx.

Ownership has been transferred to opentelemetry now. See https://hex.pm/packages/opentelemetry_oban . I think I lost all permissions myself now, but that doesn't really matter I think. Probably releasing new versions should be part of the github actions flow anyway.

@tsloughter
Copy link
Member

Thanks, you should stay as an owner to be able to publish. It is much better to release manually and not github actions. Unless github actions has a way to require manual intervention during an action which may make it a viable option.

@dvic
Copy link
Contributor

dvic commented Dec 8, 2021

Thanks, you should stay as an owner to be able to publish. It is much better to release manually and not github actions. Unless github actions has a way to require manual intervention during an action which may make it a viable option.

Yes, I believe you can do this in public repos (and github enterprise). See https://docs.github.com/en/actions/managing-workflow-runs/reviewing-deployments

@indrekj
Copy link
Contributor Author

indrekj commented Dec 9, 2021

Thanks, you should stay as an owner to be able to publish.

Could you run mix hex.owner add opentelemetry_oban indrek in that case? All my permissions were lost when I transferred this between organizations.

@tsloughter
Copy link
Member

Hm, that didn't work, but I swear you can have both a user and an owner as the owner. It returned "cannot add owner that is not a member of the organization"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants