Add initial troubleshooting guide #1430

nicolexin · 2025-08-21T21:22:04Z

Fixes #689

/kind documentation

What this PR does / why we need it:
Adding initial Troubleshooting guide for common issues

netlify · 2025-08-21T21:22:09Z

✅ Deploy Preview for gateway-api-inference-extension ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | `651ff89` |
| 🔍 Latest deploy log | https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68ade05f7e35e000085ec2ef |
| 😎 Deploy Preview | https://deploy-preview-1430--gateway-api-inference-extension.netlify.app |

k8s-ci-robot · 2025-08-21T21:22:14Z

Hi @nicolexin. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

nicolexin · 2025-08-21T21:34:37Z

/assign @kfswain

mkdocs.yml

site-src/guides/troubleshooting.md

kfswain · 2025-08-22T16:20:43Z

Looks good! Just some minor tweaks, thanks so much!

/ok-to-test

Co-authored-by: Kellen Swain <[email protected]>

liu-cong

Thanks, this looks really great!

liu-cong · 2025-08-25T21:34:05Z

site-src/guides/troubleshooting.md

+
+For unexpected routing behaviors: 
+
+* Verify the expected metrics are being emitted from the model server. Some model servers aren't fully compatible with the default expected metrics, vLLM is generally the most up-to-date in this regard.


Please add a link to the supported model servers page: https://gateway-api-inference-extension.sigs.k8s.io/implementations/model-servers/
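The check suggested in the quoted bullet (confirming that the model server emits the metrics the extension expects) can be sketched in Go as below. This is only an illustration, not code from the PR: the function `missingMetrics` is hypothetical, and the metric names in `main` are assumed examples modeled on vLLM-style naming. The authoritative list is on the supported model servers page linked above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// missingMetrics scans a Prometheus text-format payload and returns the
// names from required that never appear as a sample line.
// Hypothetical helper for illustration; not part of the extension.
func missingMetrics(body string, required []string) []string {
	seen := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blank lines and "# HELP" / "# TYPE" comment lines.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or space (value).
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			end = len(line)
		}
		seen[line[:end]] = true
	}
	var missing []string
	for _, name := range required {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Sample payload; in practice this would be the body of the model
	// server's metrics endpoint.
	sample := "# TYPE vllm:num_requests_waiting gauge\n" +
		"vllm:num_requests_waiting{model_name=\"m\"} 0\n"
	// Assumed example names; verify against the model servers page.
	required := []string{"vllm:num_requests_waiting", "vllm:gpu_cache_usage_perc"}
	fmt.Println("missing:", missingMetrics(sample, required))
}
```

A non-empty result means the server is not exporting everything on the list, which is one plausible explanation for unexpected routing.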

liu-cong · 2025-08-25T21:35:16Z

site-src/guides/troubleshooting.md

+    * `DefaultMetricsStalenessThreshold`: This defines the maximum age of metrics data before it's considered outdated. The default is 200 milliseconds. The saturation detector needs up-to-date metrics to make accurate decisions about system load. If the metrics are older than this threshold, the detector won't use them. This value is tied to how often metrics are refreshed, and setting it slightly higher ensures that there's always fresh data available. To override this, set the `SD_METRICS_STALENESS_THRESHOLD` environment variable.
+
+## 500 Internal Server Error
+### `fault filter abort`


Are these errors specific to GKE gateway or are they common Gateway API error codes?

That's a good question. The 500 `fault filter abort` error is common when the backend does not exist. The 503 is also common when there is no healthy endpoint to route to. The only place implementations diverge is when the port is misconfigured: I believe GKE and Istio return 503 Service Unavailable, while KGateway returns 502 Bad Gateway. I have updated the doc to add the 502 error code.
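As a quick reference, the mapping described in this reply can be written as a small Go lookup. It is a sketch of what the thread reports, not a statement about every Gateway implementation; the function and constant names are hypothetical.

```go
package main

import "fmt"

// Causes as reported in the review thread; wording is illustrative.
const (
	causeMissingBackend = "backend referenced by the route does not exist (fault filter abort)"
	causeBadPort502     = "port misconfigured (reported for KGateway)"
	causeUnavailable    = "no healthy endpoint to route to, or port misconfigured (reported for GKE and Istio)"
	causeUnknown        = "not covered by the thread"
)

// likelyCause returns the first thing to check for a given gateway status code.
func likelyCause(status int) string {
	switch status {
	case 500:
		return causeMissingBackend
	case 502:
		return causeBadPort502
	case 503:
		return causeUnavailable
	default:
		return causeUnknown
	}
}

func main() {
	for _, code := range []int{500, 502, 503} {
		fmt.Printf("%d: %s\n", code, likelyCause(code))
	}
}
```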

Co-authored-by: Cong Liu <[email protected]>

liu-cong · 2025-08-26T16:42:20Z

/lgtm

kfswain · 2025-08-26T16:46:54Z

/approve

k8s-ci-robot · 2025-08-26T16:47:03Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain, nicolexin

Needs approval from an approver in each of these files:

~~OWNERS~~ [kfswain]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

nirrozenbaum · 2025-08-27T11:16:46Z

site-src/guides/troubleshooting.md

+apiVersion: gateway.networking.k8s.io/v1beta1
+kind: ReferenceGrant
+metadata:
+  name: ref-grant
+  namespace: ns2
+spec:
+  from:
+    - group: gateway.networking.k8s.io
+      kind: HTTPRoute
+      namespace: ns1
+  to:
+    - group: inference.networking.k8s.io
+      kind: InferencePool
+      name: my-inference-pool
+```


Out of curiosity, was this ever tested?

Not sure about other implementations, but we recently found a bug where ReferenceGrant did not work properly with InferencePool; that was fixed in GKE.

I was asking because I've never seen this being used (I have never used it and have never seen a question about it from others).
It does make sense to support it; I was just trying to understand whether it was actually tested.

Is it worth adding a conformance test? cc @robscott @zetxqx

I think we'd better have one. I just created GitHub issue #1487 to track it.

But note: ReferenceGrant is never mentioned in the inference extension API spec.

ahg-g · 2025-08-27T11:30:19Z

site-src/guides/troubleshooting.md

+### `failed to list <InferencePool or InferenceObjective or Pod>: … is forbidden`
+The EPP needs to watch the InferencePool, InferenceObjectives and Pods that belong to them. This constant watching and reconciliation allows the EPP to maintain an up-to-date view of the environment, enabling it to make dynamic decisions. This particular error indicates that the service account used by the EPP doesn't have the necessary permissions to list the resources it’s watching.
+
+**Solution**: Create or update the RBAC configuration to grant the [required permissions](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/137a0b4660b96487caac626ed135b3600be876ed/config/manifests/inferencepool-resources.yaml#L129) to the EPP service account.


We should avoid pointing to inferencepool-resources.yaml and use only the Helm charts. We need to deprecate this manifest; I created #1480 to track deprecating it.

Thanks for pointing this out! I will create a follow-up PR to fix this.

* Add initial troubleshooting guide
* Update mkdocs.yml to add troubleshooting guide
* Rename toubleshooting.md to troubleshooting.md
* Fix bullet points
* Fix bullet points
* Make solutions bold and clear
* Update mkdocs.yml
  Co-authored-by: Kellen Swain <[email protected]>
* Update site-src/guides/troubleshooting.md
  Co-authored-by: Kellen Swain <[email protected]>
* Update site-src/guides/troubleshooting.md
  Co-authored-by: Kellen Swain <[email protected]>
* Update site-src/guides/troubleshooting.md
  Co-authored-by: Cong Liu <[email protected]>
* Update troubleshooting.md

---------

Co-authored-by: Kellen Swain <[email protected]>
Co-authored-by: Cong Liu <[email protected]>

nicolexin added 2 commits August 21, 2025 14:16

Add initial troubleshooting guide

dd58183

Update mkdocs.yml to add troubleshooting guide

475d69e

k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Aug 21, 2025

k8s-ci-robot requested a review from ahg-g August 21, 2025 21:22

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 21, 2025

k8s-ci-robot requested a review from liu-cong August 21, 2025 21:22

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 21, 2025

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 21, 2025

nicolexin added 3 commits August 21, 2025 14:25

Rename toubleshooting.md to troubleshooting.md

b5c8e48

Fix bullet points

92a312a

Fix bullet points

c271df2

k8s-ci-robot assigned kfswain Aug 21, 2025

Make solutions bold and clear

fe87eed

kfswain reviewed Aug 22, 2025

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 22, 2025

nicolexin and others added 3 commits August 25, 2025 08:51

Update mkdocs.yml

28826c6

Co-authored-by: Kellen Swain <[email protected]>

Update site-src/guides/troubleshooting.md

08eacde

Co-authored-by: Kellen Swain <[email protected]>

Update site-src/guides/troubleshooting.md

cac2e56

Co-authored-by: Kellen Swain <[email protected]>

liu-cong reviewed Aug 25, 2025

nicolexin and others added 2 commits August 25, 2025 16:37

Update site-src/guides/troubleshooting.md

914d9d2

Co-authored-by: Cong Liu <[email protected]>

Update troubleshooting.md

651ff89

k8s-ci-robot assigned liu-cong Aug 26, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 26, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 26, 2025

k8s-ci-robot merged commit fb77ef2 into kubernetes-sigs:main Aug 26, 2025
10 checks passed

nirrozenbaum reviewed Aug 27, 2025

ahg-g reviewed Aug 27, 2025
