feat: update/add detailed telemetry for stdio and http mcp transports#1987
feat: update/add detailed telemetry for stdio and http mcp transports#1987AjmeraParth132 wants to merge 8 commits intogoogleapis:mainfrom
Conversation
Summary of ChangesHello @AjmeraParth132, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly upgrades the observability of the Managed Code Platform (MCP) server by implementing comprehensive and detailed telemetry. The changes introduce dynamic trace naming, allowing for more specific tracking of operations like toolset discovery and tool invocation. Furthermore, telemetry spans now capture the transport mechanism used and the name of the invoked tool, alongside improved error reporting. This enhancement aims to provide developers and operators with actionable insights and quicker visibility into the system's behavior and potential issues. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request adds detailed telemetry for MCP sessions over HTTP and stdio transports. The changes are well-implemented, propagating toolName information through the call stack to enrich trace spans. This will provide better visibility into tool discovery and invocation. The changes are consistent across different MCP protocol versions. I have one suggestion to refactor duplicated code to improve maintainability.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request enhances telemetry for MCP sessions across both HTTP and STDIO transports. It introduces a transport attribute to OpenTelemetry spans and provides more specific trace names for tools/list and tools/call methods. Additionally, for tool invocations, the tool_name is now included as a span attribute, which is a great improvement for observability. The implementation involves refactoring several functions to propagate the toolName up the call stack, and these changes have been applied consistently across different MCP protocol versions.
My main feedback is regarding a piece of duplicated logic for determining the trace name, which I've commented on. Addressing this would improve the code's maintainability.
|
Hi @AjmeraParth132 , thank you for your contribution! I'll make sure to review this tomorrow. :) In the meantime, please reopen this issue #1633. There's also some outstanding comments regarding the naming of the traces. Ideally we would want to come into an agreement on the names before making any official changes since updating trace name is considered breaking change~ |
|
Hi @Yuan325, thanks for the update. Sure, we'll just internally discuss the latest scope of work and reopen the updated issue quickly. Will inform you here once done. |
Yuan325
left a comment
There was a problem hiding this comment.
Please also rebase this PR :) Thank you again for your contribution!
|
@Yuan325 thanks for your review suggestions.
Will rebase the PR and update here :) |
|
Some questions: (please advise on best practice within the field if there's any) --
Please add the full visualization of the traces/attributes/metrics that you will be adding in that new issue -- if you're planning to just continue to work from there? (similar to #1633 (comment)) It will be easier for future contributors to reference, and also a way to keep track of the changes that will be applied :) Thank you very much!! I'll try to keep a closer eye on your issue/PR so that we can progress faster. Apologies for the slow response |
|
Hi @Yuan325 , Apologies for the delayed response, we’ve been AFK this week due to some family commitments and New Year festivities. We’ll be back to work after this week and will pick this up with the highest priority to help move things faster. Responding to your points:
We’ll also add a complete visualization of the new traces, attributes, and metrics here shortly, so we can align on the overall approach before proceeding further with development. Thanks. |
|
@AjmeraParth132 Sounds good, sent an email to you :) |
|
@AjmeraParth132 Please lmk once this is rebased and updated (include moving span to |
|
@Yuan325 perfect, updating the same. though, one more clarification: moving/initializing trace in processMCPMessage() or processMethod() will eventually narrow down the span scope - ie, span time and capturing error will now be limited to the new scope. what we can do here is:
which one do you prefer? |
|
@AjmeraParth132 What do you think about this (kind of combination between 1 and 3):
|
064b565 to
f8f8004
Compare
|
@Yuan325 That makes sense. But rather than spamming with 2 traces for the same case, what I feel a better approach is - To start with You can check the PR - it's updated and rebased as well. Lmk if we prefer any other way though. |
|
@AjmeraParth132 Isn't it generally bad practice to rename telemetry traces since it will generate inconsistent traces, and potentially make it harder to track? |
|
Agreed @Yuan325 , but the downsides felt more concerning! For starting
I feel, if we can let decoding the method in handler itself, that could be a solution OR the renaming way itself as the likelihood of inconsistent traces is almost null (as the traces will always be renamed unless we catch an error in handler parsing prior to Would love to hear your thoughts! |
|
@AjmeraParth132 Got it, thanks for the insight. Let me discuss this with the team before finalizing this. Will keep this as priority on my end and circle back to you asap :) |
|
@AjmeraParth132 I discovered that OpenTelemetry actually has established semantic conventions for MCP (here). I wonder if you/your team had a chance to review those yet? I also wanted to flag that we are planning to deprecate the native endpoint transport in the future. Given that, it may not be worth the effort to consolidate spans from both endpoints right now. If you’d prefer to sync up to resolve this quickly, feel free to send over a meeting invite. |
|
Hey @Yuan325! Sorry for the delay. Saw this today - that just didn’t pop in my email. Yes we have been following the OTel convention developments closely, and I think that’s a great way to instrument for end to end telemetry use cases, across multiple services. Given that, while the full fruit of reorganizing our telemetry will come up after we instrument the complete genAI stack, it is still always the best bet to align with the standard conventions. Also, deprecating the native endpoint makes it easier. We can now get the MCP spans in On telemetry, let us reorganize Do let me know if you have any thoughts or need any changes in here. Will update the PR with this plan if it looks good? And once done, we can then align metrics as well with the conventions? |
|
@AjmeraParth132 Awesome, I do agree that deprecating the native endpoint will make the traces much simpler (that way we don't have to keep consistency between the 2 endpoints).
Example or spans: STDIO tool call: SSE connection: HTTP initialize: HTTP tool call: If this looks good to you, feel free to update this PR, thank you so much once again! |
|
That's great @Yuan325! 1,2,4 - perfectly aligned Yes, the flows are exactly what they should be! Will start working on the PR 🚀 |
|
@AjmeraParth132 awesome, thank you! Will take a look when Let's add the toolset name in attribute then 👍 I'll make sure to prioritize for review once this is updated! Thank you so much again :) |
f8f8004 to
686b315
Compare
|
Hey @Yuan325, updated the PR with the required changes, lmk your thoughts. |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces detailed telemetry for MCP sessions over HTTP and STDIO transports, aiming to improve observability by propagating W3C trace context and enriching spans. However, several security vulnerabilities have been identified, including a potential Denial of Service due to unrestricted request body reading, an SSE injection vulnerability in the SSE handler, and telemetry injection via untrusted input in span names. Additionally, there are performance concerns and code duplication due to the request body being unmarshalled multiple times across different functions. Addressing these security and performance issues will enhance the robustness and efficiency of the telemetry implementation.
Yuan325
left a comment
There was a problem hiding this comment.
Thank you for updating this!! Left some feedback for code updates :)
|
Thanks @Yuan325 for the detailed feedback, rightly agreed. Incorporated the above requested changes to the PR. |
Yuan325
left a comment
There was a problem hiding this comment.
yay, last 2 comments! Overall LGTM! I'll merge this after the 2 comments is resolved. Thank you once again!! :)
internal/server/mcp.go
Outdated
There was a problem hiding this comment.
| if err = s.write(msgCtx, res); err != nil { |
internal/server/mcp.go
Outdated
| // Check if message is a notification | ||
| if baseMessage.Id == nil { | ||
| err := mcp.NotificationHandler(ctx, body) | ||
| return "", nil, err | ||
| } |
There was a problem hiding this comment.
Might want to move this to after setting the span attributes since span are added before notification handling? Also will appreciate if you can setStatus as error if there's error.
if err != nil {
span.SetStatus(codes.Error, err.Error())
}
return "", nil, err
|
/gcbrun |
|
Perfect! Pushed the changes. |
Description
This PR adds consistent and actionable telemetry for MCP sessions across HTTP and STDIO transports, enabling quick visibility into toolset discovery and tool invocation activity with minimal setup.
PR Checklist
before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
!if this involve a breaking change