- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.7k
fix(otel): Don't ignore child spans after the root is sent #16416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This fixes a critical issue I was having with Mastra where almost all my spans from a workflow run were getting stuck due to their parent span being sent almost immediately. I'm guessing this has to do something with `async` stuff going on. Regardless, by treating these children as 'root spans' I was able to get them out and displayed correctly in Spotlight. What this does is creates a 5-min capped 'already sent span ids' cache and treats any span that has a parent id in this list as a root span to get them unstuck.
| size-limit report 📦
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
Could we also add a test that showcases child spans being sent despite their root having already been sent?
        
          
                dev-packages/opentelemetry-v2-tests/test/integration/transactions.test.ts
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | 
 Yeah was my next step, just got busy with other stuff. Thanks a lot for the review! | 
| @andreiborza @AbhiPrasad any idea about the Next.js test failures? Not sure if they are releated to my change and if they are, whether the tests need updating or there's something I actually broke? | 
| The test failure is showing the span description is not correct. In my opinion this means either 
 This is not a flake, we don't usually get errors like this. I do lack context about Next specifics nowadays, so we can wait till Monday to get more feedback from the folks who own that SDK now (@chargome and @RulaKhaled) | 
| Spans were actually still correct, but with this change some things started leaking because we ran them with parallel workers | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this!
Follow up to #16416, slightly moving things around and making some small improvements to performance and logic (some not related to that PR but another improvement we noticed when BYK looked int this): 1. Actually debounce the flushing properly, including a maxWait of 100ms. Previously, it was technically possible to debounce flushing forever, if a span was flushed every ms. Now, at least after 100ms it should _always_ flush. 2. Simplify overhead by avoid checking for the 5min timeout of sent spans. As this check `isSpanAlreadySent` has been called for every single span that ended, it is quite high impact, and meant a little bit of additional work (although it should not necessarily run _too_ often, but still). Instead, we now simply check for `map.has(id)` which should be good enough for what we want to achieve, IMHO. We still clean up the sent spans when flushing, so old stuff should still go away eventually. 3. Made stuff that is not needed as public API private on the span exporter class. this is technically breaking but I think it is OK, this should not be public API surface as it does not need to be called from outside. For this I moved the already existing `debounce` function from replay to core. We re-implement this in replay with a different setTimeout impl. which is needed for some angular stuff. --------- Co-authored-by: Abhijeet Prasad <[email protected]>
This fixes a critical issue I was having with Mastra where almost all my spans from a workflow run were getting stuck due to their parent span being sent almost immediately. I'm guessing this has to do something with
asyncstuff going on. Regardless, by treating these children as 'root spans' I was able to get them out and displayed correctly in Spotlight.What this does is creates a 5-min capped 'already sent span ids' cache and treats any span that has a parent id in this list as a root span to get them unstuck.