-
Notifications
You must be signed in to change notification settings - Fork 118
Add otel tracing instrumentation #506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
🚨 Unsigned commits detected! Please sign your commits. For instructions on how to set up GPG/SSH signing and verify your commits, |
|
This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the |
9c56950 to
c55c5a6
Compare
884c096 to
8c4a03d
Compare
|
The GAIE entry request span PR is kubernetes-sigs/gateway-api-inference-extension#2057 |
24dc5b9 to
15eb137
Compare
| if request != nil && request.TargetModel != "" { | ||
| span.SetAttributes(attribute.String("gen_ai.request.model", request.TargetModel)) | ||
| } | ||
| if request != nil && request.RequestId != "" { | ||
| span.SetAttributes(attribute.String("gen_ai.request.id", request.RequestId)) | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
| if request != nil && request.TargetModel != "" { | |
| span.SetAttributes(attribute.String("gen_ai.request.model", request.TargetModel)) | |
| } | |
| if request != nil && request.RequestId != "" { | |
| span.SetAttributes(attribute.String("gen_ai.request.id", request.RequestId)) | |
| } | |
| if request != nil { | |
| if request.TargetModel != "" { | |
| span.SetAttributes(attribute.String("gen_ai.request.model", request.TargetModel)) | |
| } | |
| if request.RequestId != "" { | |
| span.SetAttributes(attribute.String("gen_ai.request.id", request.RequestId)) | |
| } | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, ty!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, but I went with a check & handle nil/error early instead
Dockerfile.localdev.epp
Outdated
| COPY llm-d-inference-scheduler/go.mod go.mod | ||
| COPY llm-d-inference-scheduler/go.sum go.sum | ||
|
|
||
| # Copy the go source | ||
| COPY llm-d-inference-scheduler/cmd/ cmd/ | ||
| COPY llm-d-inference-scheduler/pkg/ pkg/ | ||
| COPY llm-d-inference-scheduler/test/ test/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are the "src" part correct here? or the intention is to build image from one level up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this Dockerfile I was mostly using in order to build & test from local sources- I'll clean it up!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add the localdev Dockerfile in a follow-up PR if others want it, but I've removed it from this PR
| tracer := telemetry.Tracer() | ||
| ctx, span := tracer.Start(ctx, "llm_d.epp.startup") | ||
| span.SetAttributes( | ||
| attribute.String("component", "llm-d-inference-scheduler"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use "epp" than "llm-d-inference-scheduler?
| attribute.String("component", "llm-d-pd-proxy"), | ||
| attribute.String("operation", "startup"), | ||
| ) | ||
| defer span.End() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be a "defer" on the "epp" as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
pkg/telemetry/tracing.go
Outdated
| logger := log.FromContext(ctx) | ||
|
|
||
| // Get service name from environment, fallback to default | ||
| svcName := os.Getenv("OTEL_SERVICE_NAME") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems svcName is not really needed?
either serviceName set to os.Getenv("OTEL_SERVICE_NAME") or use defaultServiceName
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed this, ty!
| func (s *Server) chatCompletionsHandler(w http.ResponseWriter, r *http.Request) { | ||
| requestStart := time.Now() | ||
| tracer := telemetry.Tracer() | ||
| ctx, span := tracer.Start(r.Context(), "llm_d.pd_proxy.request", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read the proposal part seem iit has decided to use llm_d.pd_proxy.
not for this PR, but should we change the folder to pkg/pd/proxy than pkg/sidecar/proxy for better naming?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Down the road we may have the sidecar proxy handle more than P/D disagg so I would suggest not renaming.
Dockerfile.localdev.epp
Outdated
| @@ -0,0 +1,110 @@ | |||
| # Build Stage: using Go 1.25 image | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this intended to check-in in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably not - unless it's useful and people want it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add this local-build Dockerfile in a follow-up PR if people want it - I removed it from this PR
pkg/telemetry/tracing.go
Outdated
| func stripScheme(endpoint string) string { | ||
| endpoint = strings.TrimPrefix(endpoint, "http://") | ||
| endpoint = strings.TrimPrefix(endpoint, "https://") | ||
| return endpoint | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like a common func, should it be moved to pkg/common/common.go ?
The GAIE PR merged! |
| if err := shutdownTracing(ctx); err != nil { | ||
| ctrl.Log.Error(err, "Failed to shutdown tracing") | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this is valid.
we could wrap up the code in a function.
and from main to call this function.
func main() {
os.Exit(run())
}
func run() int {
// the current logic
// move os.Exit(1) to return 1
}
then both defer and os.Exit(1) should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree but I'd rather leave that to a follow-up - since it's not related so much to tracing and I'm already out of my neighborhood here in this repo :) I'm trying to keep my changes as minimal as can be.
Add centralized telemetry package and custom spans
following the llm-d distributed tracing proposal.
Co-Authored-By: Claude <[email protected]>
Signed-off-by: sallyom <[email protected]>
Signed-off-by: sallyom <[email protected]>
Signed-off-by: sallyom <[email protected]>
|
@sallyom happy to review and help get it merged. Could you please
|
| ctrl.Log.Error(err, "Failed to initialize tracing") | ||
| } | ||
|
|
||
| // Add startup span to verify tracing is working |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: how does the below verify that tracing is working? Is there an expected error returned anywhere?
| DataParallelPodHeader = "x-data-parallel-host-port" | ||
| ) | ||
|
|
||
| // StripScheme removes http:// or https:// prefix from endpoint URL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This could be improved with url.Parse() to better handle corner cases?
| @@ -0,0 +1,87 @@ | |||
| /* | |||
| Copyright 2025 The llm-d Authors. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: probably (c) 2026-...?
| func TestStripScheme(t *testing.T) { | ||
| tests := []struct { | ||
| name string | ||
| input string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: additional test cases that may make sense?
//192.168.1.1:80HTTP://localhost:4317:9090
| func (p *PrefillHeaderHandler) PreRequest(_ context.Context, request *types.LLMRequest, schedulingResult *types.SchedulingResult) { | ||
| func (p *PrefillHeaderHandler) PreRequest(ctx context.Context, request *types.LLMRequest, schedulingResult *types.SchedulingResult) { | ||
| tracer := telemetry.Tracer() | ||
| _, span := tracer.Start(ctx, "llm_d.epp.prerequest.pd_disaggregation", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: consider making consistent use of llm_d.epp (or llm-d.epp) in all span names (e.g., the "startup" span in main.go)?
| prefillProfileRunResult, exists := schedulingResult.ProfileResults[p.prefillProfile] | ||
| if !exists { | ||
| span.SetAttributes( | ||
| attribute.Bool("llm_d.epp.pd.disaggregation_enabled", false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it is enabled, just not used for this request
| profileResults map[string]*types.ProfileRunResult) map[string]*framework.SchedulerProfile { | ||
| // Start tracing span for profile picking operation | ||
| tracer := telemetry.Tracer() | ||
| ctx, span := tracer.Start(ctx, "llm_d.epp.profile_handler.pick", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: do you want to span to explicitly state this is a P/D profile handler Pick() function? Missing P/D designation - all profile handler plugins have a Pick function...
| func (s *Server) chatCompletionsHandler(w http.ResponseWriter, r *http.Request) { | ||
| requestStart := time.Now() | ||
| tracer := telemetry.Tracer() | ||
| ctx, span := tracer.Start(r.Context(), "llm_d.pd_proxy.request", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Down the road we may have the sidecar proxy handle more than P/D disagg so I would suggest not renaming.
This PR satisfies the llm-d distributed tracing proposal llm-d/llm-d#119 implementation plan.