-
Notifications
You must be signed in to change notification settings - Fork 217
feat: sidekiq scheduled jobs improvements - scheduled
operation and add links
#1717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: sidekiq scheduled jobs improvements - scheduled
operation and add links
#1717
Conversation
03a5fdb
to
4eca039
Compare
…d add links When a Job is pushed via `perform_in` or friends, it goes through the following life-cycle: 1. First the Job is pushed with `at` attribute in Redis from `Sidekiq::Client` - this will now create `<...> scheduled` span (before it was `<...> publish` 2. Then the Job is handled from `Sidekiq::Scheduled#enqueue` and pushed without `at` attribute in Redis from `Sidekiq::Client` - this will keep creating `<...> publish` span 3. Then the Job is handled by the Worker as usual Additionally when using `:propagation_style = :link` and `:trace_poller_enqueue = true` a handy set of links are added between the 3 actors above.
44cda99
to
6a3983f
Compare
This reverts commit 4ed7a70.
scheduled
operation and add linksscheduled
operation and add links
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your submission. I have some concerns about how this changes/breaks primary use cases.
Please review my comments and provide any feedback to help me understand how we can support your needs.
attributes[SemanticConventions::Trace::PEER_SERVICE] = instrumentation_config[:peer_service] if instrumentation_config[:peer_service] | ||
|
||
scheduled_at = job['at'] | ||
op = scheduled_at.nil? ? 'publish' : 'scheduled' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 This naming does not match any of the semantic convention naming for messaging.
Is there something special about scheduled spans that help you disambiguate them from any other publisher?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will answer in the next comment ...
tracer.in_span(span_name, attributes: attributes, kind: :producer) do |span| | ||
# In case this is Scheduled job, there is already context injected, so link to that context | ||
# NOTE: :propagation_style = :child is not supported as it is quite tricky when :trace_poller_enqueue = true | ||
extracted_context = OpenTelemetry.propagation.extract(job, context: OpenTelemetry::Context::ROOT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite interesting and do not see a case covered in the spec https://github.com/open-telemetry/semantic-conventions/blob/v1.24.0/docs/messaging/messaging-spans.md#span-name
If I understand this correctly, this is disconnecting the parent/child context and treats a Sidekiq scheduled job from the place where it was enqueued?
This seems unexpected to do from the client all cases of links because it is going to disconnect the clients, publishers and consumers from each other generating 3 traces ids instead of 2. By default, even in a link situation, I would expect the publisher span to use parent/child propagation and the consumer to generate links.
E.g. you have a web request that uses wait_until
, I would expect the http.server
span to have a parent/child relationship with the messaging.producer
span:
def checkout
# ... schedule housekeeping clean up
GuestsCleanupJob.set(wait_until: Date.tomorrow.noon).perform_later(guest)
end
When the GuestsCleanupJob
then executes it would create a new trace, and link
the web request trace to it.
Sidekiq
maps wait_until
to the at
attribute AFAICT: https://github.com/sidekiq/sidekiq/blob/96f867cb58b7fa0a6a832af1a732a339aa0eb61f/lib/sidekiq/job.rb#L194
With these changes, it will end up changing this expected behavior and that is undesirable in my opinion.
My recommendation here would be for you to add a utility method to your code or identify something more specific in Sidekiq
that would address your use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is how Sidekiq is working, to what I have found:
<SomeJob>.perform_in(1.minute)
pushes the Job definition to a dedicated queue (sorted set) with nameschedule
, using theat
to sort the setSidekiq::Scheduled::Poller
runs roughly everyaverage_scheduled_poll_interval
,- During the each run, it picks the items in the
schedule
that are due and pushes to the defined<SomeJob>.queue
<SomeJob>.perform
is invoked
When it comes to OpenTelemetry instrumentation of that:
Currently when when using span_naming=:job_class (default is :queue), propagation_style=:link (default is :link)
this results in the following spans
<SomeJob> publish
and attrs{"messaging.destination.name": "high", ...}
time=T, id=123, trace=ABC, parent=123- If using
trace_poller_enqueue: true
-Sidekiq::Scheduled::Poller#enqueue
time~=T+1.minute, id=567, trace=XYZ, parent=nil, having no parent and it enqueues a batch of Jobs- no relation / link to p.1
- If using
trace_poller_enqueue: true
-<SomeJob> publish
and attrs{"messaging.destination.name": "high", ...}
time~=T+1.minute, id=678, trace=XYZ, parent=567- no relation / link to p.1
<SomeJob> process
and attrs{"messaging.destination.name": "high", ...}
time~=T+1.minute+operational delay, id=987, trace=DEF- relation / link to p.3, but only if using
trace_poller_enqueue: true
- relation / link to p.3, but only if using
Caveats:
- If using the default
trace_poller_enqueue: false
- the P.2, P.3 are completely missed - If using the default
trace_poller_enqueue: true
- the P.2, P.3 and P.4 are correlated, but the Span from P.2 (Sidekiq::Scheduled::Poller#enqueue
) can't be child of many parents as each there are NxP.1 each on its own Trace, thus only Link is available to correlate. - If using the default
trace_poller_enqueue: true
- the The P.3 (<SomeJob> publish
) spans are put on the trace of P.2. The spans of P.2 eventually with quite some work, may be refactored to be put as child of P.1 and have links to P.2, but that is not something I am keen on doing as it is very hard. - If using the default
trace_poller_enqueue: false
(the default) all of the P.2 and P.3 spans and correlation are missing
My goal is to have meaningful correlation from P.1 to P.4, regardless same Trace or Links, whatever is easy and feasible.
With the above caveats in place, this PR changes
<SomeJob> scheduled
and attrs{"messaging.destination.name": "high", ...}
time=T, id=123, trace=ABC, parent=123- changed from
publish
toscheduled
to distinguish from P.3
- changed from
- If using
trace_poller_enqueue: true
-Sidekiq::Scheduled::Poller#enqueue
time~=T+1.minute, id=567, trace=XYZ, parent=nil, having no parent and it enqueues a batch of Jobs- no relation / link to p.1
- If using
trace_poller_enqueue: true
-<SomeJob> publish
and attrs{"messaging.destination.name": "high", ...}
time~=T+1.minute, id=678, trace=XYZ, parent=567- if using
propagation_style=:link
- a Link to P.1 is added
- if using
<SomeJob> process
and attrs{"messaging.destination.name": "high", ...}
time~=T+1.minute+operational delay, id=987, trace=DEF- relation / link to p.3, but only if using
trace_poller_enqueue: true
- relation / link to p.3, but only if using
Resulting in this more meaningful list and correlation with links or traces

I have been thinking to do some monkey patching or whatever, but it is way harder, moreover this PR has value for others.
So @arielvalentin how about, the following?
- Keep correlation as improved (limited Trace and added Links)
- If you are still very keen on obliging the standards, I can easily change P.1 spans addressing feat: sidekiq scheduled jobs improvements -
scheduled
operation and add links #1717 (comment)- from
<SomeJob> **scheduled**
and attrs{"messaging.destination.name": "**high**", ...}
time=T, id=123, trace=ABC, parent=123 - to
<SomeJob> **publish**
and attrs{"messaging.destination.name": "**schedule**", ...}
time=T, id=123, trace=ABC, parent=123
- from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a sidekiq user so I'm thinking about this from the perspective of an active job user who may use the sidekiq adapter.
I see your point though of using non standard semconv in the cases where the naming pattern is Job Name instead of semantic conventions. I think in those cases using scheduled
in the span name should be fine; however I think the semconv cases should remain untouched.
I'd need to review your comments more carefully since again, I'm not a sidekiq user so I don't quite understand the nuances of the cause of disconnect between the producer and consumer spans in the cases you outlined above.
When a Job is pushed via
perform_in
or friends, it goes through the following life-cycle:at
attribute in Redis fromSidekiq::Client
- this will now create<...> scheduled
span (before it was<...> publish
Sidekiq::Scheduled#enqueue
and pushed withoutat
attribute in Redis fromSidekiq::Client
- this will keep creating<...> publish
spanAdditionally when using
:propagation_style = :link
and:trace_poller_enqueue = true
a handy set of links are added between the 3 actors above.