Skip to content

Conversation

@sallyom
Copy link

@sallyom sallyom commented Dec 5, 2025

This PR satisfies the llm-d distributed tracing proposal llm-d/llm-d#119 implementation plan.

@github-actions
Copy link

github-actions bot commented Dec 5, 2025

🚨 Unsigned commits detected! Please sign your commits.

For instructions on how to set up GPG/SSH signing and verify your commits,
please see GitHub Documentation.

@github-actions
Copy link

This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the lifecycle/stale label.

@sallyom sallyom force-pushed the tracing branch 4 times, most recently from 9c56950 to c55c5a6 Compare January 5, 2026 20:14
@sallyom sallyom force-pushed the tracing branch 3 times, most recently from 884c096 to 8c4a03d Compare January 7, 2026 04:20
@sallyom sallyom changed the title [WIP] Tracing Add otel tracing instrumentation Jan 7, 2026
@sallyom
Copy link
Author

sallyom commented Jan 7, 2026

The GAIE entry request span PR is kubernetes-sigs/gateway-api-inference-extension#2057

@sallyom sallyom force-pushed the tracing branch 7 times, most recently from 24dc5b9 to 15eb137 Compare January 7, 2026 05:44
Comment on lines 150 to 156
if request != nil && request.TargetModel != "" {
span.SetAttributes(attribute.String("gen_ai.request.model", request.TargetModel))
}
if request != nil && request.RequestId != "" {
span.SetAttributes(attribute.String("gen_ai.request.id", request.RequestId))
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
if request != nil && request.TargetModel != "" {
span.SetAttributes(attribute.String("gen_ai.request.model", request.TargetModel))
}
if request != nil && request.RequestId != "" {
span.SetAttributes(attribute.String("gen_ai.request.id", request.RequestId))
}
if request != nil {
if request.TargetModel != "" {
span.SetAttributes(attribute.String("gen_ai.request.model", request.TargetModel))
}
if request.RequestId != "" {
span.SetAttributes(attribute.String("gen_ai.request.id", request.RequestId))
}
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, ty!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, but I went with a check & handle nil/error early instead

Comment on lines 16 to 27
COPY llm-d-inference-scheduler/go.mod go.mod
COPY llm-d-inference-scheduler/go.sum go.sum

# Copy the go source
COPY llm-d-inference-scheduler/cmd/ cmd/
COPY llm-d-inference-scheduler/pkg/ pkg/
COPY llm-d-inference-scheduler/test/ test/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the "src" part correct here? or the intention is to build image from one level up

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this Dockerfile I was mostly using in order to build & test from local sources- I'll clean it up!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add the localdev Dockerfile in a follow-up PR if others want it, but I've removed it from this PR

tracer := telemetry.Tracer()
ctx, span := tracer.Start(ctx, "llm_d.epp.startup")
span.SetAttributes(
attribute.String("component", "llm-d-inference-scheduler"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use "epp" than "llm-d-inference-scheduler?

attribute.String("component", "llm-d-pd-proxy"),
attribute.String("operation", "startup"),
)
defer span.End()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be a "defer" on the "epp" as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

logger := log.FromContext(ctx)

// Get service name from environment, fallback to default
svcName := os.Getenv("OTEL_SERVICE_NAME")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems svcName is not really needed?
either serviceName set to os.Getenv("OTEL_SERVICE_NAME") or use defaultServiceName

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed this, ty!

func (s *Server) chatCompletionsHandler(w http.ResponseWriter, r *http.Request) {
requestStart := time.Now()
tracer := telemetry.Tracer()
ctx, span := tracer.Start(r.Context(), "llm_d.pd_proxy.request",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read the proposal part seem iit has decided to use llm_d.pd_proxy.

not for this PR, but should we change the folder to pkg/pd/proxy than pkg/sidecar/proxy for better naming?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Down the road we may have the sidecar proxy handle more than P/D disagg so I would suggest not renaming.

@@ -0,0 +1,110 @@
# Build Stage: using Go 1.25 image
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this intended to check-in in this PR?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not - unless it's useful and people want it :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add this local-build Dockerfile in a follow-up PR if people want it - I removed it from this PR

Comment on lines 124 to 139
func stripScheme(endpoint string) string {
endpoint = strings.TrimPrefix(endpoint, "http://")
endpoint = strings.TrimPrefix(endpoint, "https://")
return endpoint
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a common func, should it be moved to pkg/common/common.go ?

@sallyom
Copy link
Author

sallyom commented Jan 17, 2026

The GAIE entry request span PR is kubernetes-sigs/gateway-api-inference-extension#2057

The GAIE PR merged!

if err := shutdownTracing(ctx); err != nil {
ctrl.Log.Error(err, "Failed to shutdown tracing")
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this is valid.
we could wrap up the code in a function.
and from main to call this function.

func main() {
	os.Exit(run())
}
func run() int {
// the current logic
// move os.Exit(1) to return 1
}

then both defer and os.Exit(1) should work.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but I'd rather leave that to a follow-up - since it's not related so much to tracing and I'm already out of my neighborhood here in this repo :) I'm trying to keep my changes as minimal as can be.

@sallyom sallyom marked this pull request as ready for review January 20, 2026 15:51
@elevran elevran added this to the v0.6 milestone Jan 22, 2026
sallyom and others added 3 commits January 23, 2026 14:40
    Add centralized telemetry package and custom spans
    following the llm-d distributed tracing proposal.

Co-Authored-By: Claude <[email protected]>
Signed-off-by: sallyom <[email protected]>
@elevran
Copy link
Collaborator

elevran commented Jan 26, 2026

@sallyom happy to review and help get it merged. Could you please

  • ensure all commits are signed; and
  • fix linter error

ctrl.Log.Error(err, "Failed to initialize tracing")
}

// Add startup span to verify tracing is working
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: how does the below verify that tracing is working? Is there an expected error returned anywhere?

DataParallelPodHeader = "x-data-parallel-host-port"
)

// StripScheme removes http:// or https:// prefix from endpoint URL
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This could be improved with url.Parse() to better handle corner cases?

@@ -0,0 +1,87 @@
/*
Copyright 2025 The llm-d Authors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: probably (c) 2026-...?

func TestStripScheme(t *testing.T) {
tests := []struct {
name string
input string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: additional test cases that may make sense?

  • //192.168.1.1:80
  • HTTP://localhost:4317
  • :9090

func (p *PrefillHeaderHandler) PreRequest(_ context.Context, request *types.LLMRequest, schedulingResult *types.SchedulingResult) {
func (p *PrefillHeaderHandler) PreRequest(ctx context.Context, request *types.LLMRequest, schedulingResult *types.SchedulingResult) {
tracer := telemetry.Tracer()
_, span := tracer.Start(ctx, "llm_d.epp.prerequest.pd_disaggregation",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider making consistent use of llm_d.epp (or llm-d.epp) in all span names (e.g., the "startup" span in main.go)?

prefillProfileRunResult, exists := schedulingResult.ProfileResults[p.prefillProfile]
if !exists {
span.SetAttributes(
attribute.Bool("llm_d.epp.pd.disaggregation_enabled", false),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it is enabled, just not used for this request

profileResults map[string]*types.ProfileRunResult) map[string]*framework.SchedulerProfile {
// Start tracing span for profile picking operation
tracer := telemetry.Tracer()
ctx, span := tracer.Start(ctx, "llm_d.epp.profile_handler.pick",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do you want to span to explicitly state this is a P/D profile handler Pick() function? Missing P/D designation - all profile handler plugins have a Pick function...

func (s *Server) chatCompletionsHandler(w http.ResponseWriter, r *http.Request) {
requestStart := time.Now()
tracer := telemetry.Tracer()
ctx, span := tracer.Start(r.Context(), "llm_d.pd_proxy.request",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Down the road we may have the sidecar proxy handle more than P/D disagg so I would suggest not renaming.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants