nicolexin
Contributor

@nicolexin nicolexin commented Aug 21, 2025

Fixes #689

/kind documentation

What this PR does / why we need it:
Adds an initial troubleshooting guide for common issues.

@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Aug 21, 2025

netlify bot commented Aug 21, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 651ff89
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68ade05f7e35e000085ec2ef
😎 Deploy Preview https://deploy-preview-1430--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot requested a review from ahg-g August 21, 2025 21:22
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 21, 2025
@k8s-ci-robot k8s-ci-robot requested a review from liu-cong August 21, 2025 21:22
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 21, 2025
@k8s-ci-robot
Contributor

Hi @nicolexin. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 21, 2025
@nicolexin
Contributor Author

/assign @kfswain

@kfswain
Collaborator

kfswain commented Aug 22, 2025

Looks good! Just some minor tweaks, thanks so much!

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 22, 2025
Contributor

@liu-cong liu-cong left a comment

Thanks, this looks really great!


For unexpected routing behaviors:

* Verify that the expected metrics are being emitted from the model server. Some model servers aren't fully compatible with the default expected metrics; vLLM is generally the most up-to-date in this regard.
Contributor

Contributor Author

Done!

* `DefaultMetricsStalenessThreshold`: This defines the maximum age of metrics data before it's considered outdated. The default is 200 milliseconds. The saturation detector needs up-to-date metrics to make accurate decisions about system load. If the metrics are older than this threshold, the detector won't use them. This value is tied to how often metrics are refreshed; setting it slightly higher than the refresh interval ensures that there's always fresh data available. To override this, set the `SD_METRICS_STALENESS_THRESHOLD` environment variable.
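
As a rough illustration of the override described above, the fragment below sets the variable on the EPP container of a Deployment. This is a sketch under assumptions: the Deployment name, container name, and image placeholder are hypothetical, and the value is assumed to be accepted as a Go-style duration string (for example `500ms`); check the EPP version you run before relying on that format.

```yaml
# Sketch only: names below are hypothetical, not taken from the project manifests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-epp                      # hypothetical EPP Deployment name
spec:
  selector:
    matchLabels:
      app: my-epp
  template:
    metadata:
      labels:
        app: my-epp
    spec:
      containers:
      - name: epp                   # hypothetical container name
        image: <your-epp-image>     # placeholder; use the image you already deploy
        env:
        # Raises the staleness threshold above the 200 ms default.
        # Assumption: the value is parsed as a duration string.
        - name: SD_METRICS_STALENESS_THRESHOLD
          value: "500ms"
```

A larger threshold tolerates slower metric refreshes at the cost of the saturation detector acting on older data.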

## 500 Internal Server Error
### `fault filter abort`
Contributor

Are these errors specific to GKE gateway or are they common Gateway API error codes?

Contributor Author

That's a good question. The 500 `fault filter abort` error is common when the backend does not exist. The 503 is also common when there is no healthy endpoint to route to. The only thing that can diverge is when the port is misconfigured: I believe GKE and Istio give 503 Service Unavailable, but KGateway gives 502 Bad Gateway. I have updated the doc to add the 502 error code.

@liu-cong
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 26, 2025
@kfswain
Collaborator

kfswain commented Aug 26, 2025

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain, nicolexin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 26, 2025
@k8s-ci-robot k8s-ci-robot merged commit fb77ef2 into kubernetes-sigs:main Aug 26, 2025
10 checks passed
Comment on lines +37 to +51
```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: ref-grant
  namespace: ns2
spec:
  from:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    namespace: ns1
  to:
  - group: inference.networking.k8s.io
    kind: InferencePool
    name: my-inference-pool
```
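
For context, a ReferenceGrant like the one above only matters when a route in `ns1` actually points at the pool in `ns2`. The following is a minimal sketch of such a route, assuming a Gateway named `inference-gateway` exists in `ns1`; the route and Gateway names are hypothetical. Whether a given gateway implementation honors cross-namespace references to an InferencePool is an assumption that should be verified for that implementation.

```yaml
# Sketch only: route and Gateway names are hypothetical.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route                  # hypothetical route name
  namespace: ns1                   # must match the `from` namespace in the ReferenceGrant
spec:
  parentRefs:
  - name: inference-gateway        # hypothetical Gateway in ns1
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: my-inference-pool      # must match the `to` entry in the ReferenceGrant
      namespace: ns2               # cross-namespace reference that the grant permits
```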
Contributor

out of curiosity, was this ever tested?

Contributor Author

I'm not sure about other implementations, but we recently found a bug where ReferenceGrant did not work properly with InferencePool; that has been fixed in GKE.

Contributor

I was asking because I've never seen this being used (I have never used it and have never seen a question about it from others).
It does make sense to support it; I was just trying to understand whether it was actually tested.

Is it worth adding a conformance test? cc @robscott @zetxqx

Contributor

I think we'd better have one. I just created a GitHub issue, #1487, to track it.

But note: ReferenceGrant is never mentioned in the inference extension API spec.

### `failed to list <InferencePool or InferenceObjective or Pod>: … is forbidden`
The EPP needs to watch the InferencePool, InferenceObjectives and Pods that belong to them. This constant watching and reconciliation allows the EPP to maintain an up-to-date view of the environment, enabling it to make dynamic decisions. This particular error indicates that the service account used by the EPP doesn't have the necessary permissions to list the resources it’s watching.

**Solution**: Create or update the RBAC configuration to grant the [required permissions](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/137a0b4660b96487caac626ed135b3600be876ed/config/manifests/inferencepool-resources.yaml#L129) to the EPP service account.
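
As a hedged illustration of what such a grant can look like, the sketch below defines a Role and RoleBinding that let an EPP service account get, list, and watch the resources named in the error. The Role, RoleBinding, service account, and namespace names are hypothetical, and the API groups shown for InferencePool and InferenceObjective are assumptions that depend on the CRD versions installed in your cluster; the linked permissions remain the authoritative source.

```yaml
# Sketch only: all metadata names and the namespace are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: epp-read                   # hypothetical Role name
  namespace: default
rules:
# Assumed API group for InferencePool; confirm against the installed CRDs.
- apiGroups: ["inference.networking.k8s.io"]
  resources: ["inferencepools"]
  verbs: ["get", "list", "watch"]
# Assumed API group for InferenceObjective; confirm against the installed CRDs.
- apiGroups: ["inference.networking.x-k8s.io"]
  resources: ["inferenceobjectives"]
  verbs: ["get", "list", "watch"]
# Pods live in the core ("") API group.
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: epp-read-binding           # hypothetical RoleBinding name
  namespace: default
subjects:
- kind: ServiceAccount
  name: my-epp                     # hypothetical EPP service account
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: epp-read
```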
Contributor

We should avoid pointing to inferencepool-resources.yaml and only use the Helm charts. We need to deprecate this manifest; I created #1480 to track deprecating it.

Contributor Author

@nicolexin nicolexin Aug 27, 2025

Thanks for pointing this out! I will create a follow up PR to fix this.

kfswain added a commit to kfswain/llm-instance-gateway that referenced this pull request Sep 1, 2025
* Add initial troubleshooting guide

* Update mkdocs.yml to add troubleshooting guide

* Rename toubleshooting.md to troubleshooting.md

* Fix bullet points

* Fix bullet points

* Make solutions bold and clear

* Update mkdocs.yml

Co-authored-by: Kellen Swain <[email protected]>

* Update site-src/guides/troubleshooting.md

Co-authored-by: Kellen Swain <[email protected]>

* Update site-src/guides/troubleshooting.md

Co-authored-by: Kellen Swain <[email protected]>

* Update site-src/guides/troubleshooting.md

Co-authored-by: Cong Liu <[email protected]>

* Update troubleshooting.md

---------

Co-authored-by: Kellen Swain <[email protected]>
Co-authored-by: Cong Liu <[email protected]>

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Write a troubleshooting guide

7 participants