More guidance/resources for troubleshooting a failed Nebari deployment? #3085

mwengren · 2025-06-26T18:34:22Z

mwengren
Jun 26, 2025

I'm working on adapting the Nebari TF to work for my particular AWS environment. Details on our set up can be found in this ongoing topic, specifically this comment for reference.

I'm starting a new discussion here because I've reached a point where the Nebari deployment is failing and I'm struggling to know where to look to diagnose the cause. I've looked through the docs, including the Debug Nebari, Troubleshooting and pretty much everywhere else and while those are good, they don't really go to the level I need and I could use some advice.

Does anyone have a recommendation for how I can understand:

Where to find logs for Nebari deployment errors beyond what's output in the console from nebari deploy. If I've successfully deployed the k8s cluster, does this mean going to the pod logs? Or are there logs on the system where I'm running nebari deploy I can look at?
How to view the TF variables passed from one Nebari stage to the next?

My particular issue is that Nebari is failing in the 04-kubernetes-ingress stage with the following error:

[tofu]: │ Error: client rate limiter Wait returned an error: context deadline exceeded
[tofu]: │
[tofu]: │   with module.kubernetes-ingress.kubernetes_service.main,
[tofu]: │   on modules/kubernetes/ingress/main.tf line 114, in resource "kubernetes_service" "main":
[tofu]: │  114: resource "kubernetes_service" "main" {
[tofu]: │
[tofu]: ╵

My self-guided troubleshooting so far has involved:

Installing OpenTofu and examining output from each of the Nebari stages directories (e.g. tofu state list, tofu output) to try to identify what's been created in each stage. Mixed results here: I can get tofu output from stages 03 and 04 but not 02-infrastructure, which seems odd to me since 02 is where most of the underlying AWS resources are deployed. When I run tofu state list from the 02-infrastructure dir, I get a No state file found error.
Examining Nebari source code and trying to understand what is passed to stage_outputs per Nebari Stages documentation in order to understand which variables are being passed to each successive Nebari stage (this has been challenging, the code is a lot to understand). It's hard to troubleshoot a particular stage in the chain when you don't know what parameters it's received from the previous stage(s) via the stage_outputs dict.
Installing kubectl and k9s to interrogate the k8s cluster to troubleshoot pods etc (just starting down this path).

I'm sure most Nebari installations go much more smoothly than mine and are much less complicated, but I feel like the docs could benefit from some details on how to interrogate some of the Nebari internals like I'm trying to do, or more detail on troubleshooting interactions/dependencies between the Nebari stages in the Nebari Stages page, because that's where I feel like I'm struggling the most and might be the source of my issues.

TIA for any advice or guidance!

verhulstm · 2025-06-26T20:51:39Z

verhulstm
Jun 26, 2025

I also use kubectl and k9s to interrogate the k8s cluster. plus https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/

0 replies

verhulstm · 2025-06-26T20:53:58Z

verhulstm
Jun 26, 2025

the error " Error: client rate limiter Wait returned an error: context deadline exceeded" is common. in our use case it often means the k8s namespace is not correct

0 replies

verhulstm · 2025-06-27T06:37:42Z

verhulstm
Jun 27, 2025

i think you might want to start with adding extra logging statement here

https://github.com/nebari-dev/nebari/blob/main/src/_nebari/subcommands/init.py

0 replies

mwengren · 2025-06-27T13:11:59Z

mwengren
Jun 27, 2025
Author

@verhulstm thanks for the suggestions! I'll look into those.

In the meantime, I have some information to start from as far as troubleshooting goes from both tofu commands and k9s. If anyone has insight, it would be much appreciated.

Using tofu from my Nebari deploy machine, I get the following outputs from the Nebari stages I've been able to reach so far:

03-kubernetes-initialize]$ tofu state list
data.aws_eks_cluster.default
data.aws_eks_cluster_auth.default
module.kubernetes-autoscaling[0].helm_release.autoscaler
module.kubernetes-initialization.kubernetes_namespace.main
module.traefik-crds.kubernetes_manifest.ingress_route
module.traefik-crds.kubernetes_manifest.ingress_route_tcp
module.traefik-crds.kubernetes_manifest.ingress_route_udp
module.traefik-crds.kubernetes_manifest.middleware
module.traefik-crds.kubernetes_manifest.middlewaretcp
module.traefik-crds.kubernetes_manifest.serverstransports
module.traefik-crds.kubernetes_manifest.tls_option
module.traefik-crds.kubernetes_manifest.tls_stores
module.traefik-crds.kubernetes_manifest.traefik_service

04-kubernetes-ingress]$ tofu state list
data.aws_eks_cluster.default
data.aws_eks_cluster_auth.default
module.kubernetes-ingress.kubernetes_cluster_role.main
module.kubernetes-ingress.kubernetes_cluster_role_binding.main
module.kubernetes-ingress.kubernetes_deployment.main
module.kubernetes-ingress.kubernetes_manifest.tlsstore_default[0]
module.kubernetes-ingress.kubernetes_persistent_volume_claim.traefik_certs_pvc
module.kubernetes-ingress.kubernetes_service.main
module.kubernetes-ingress.kubernetes_service.traefik_internal
module.kubernetes-ingress.kubernetes_service_account.main

I think it's odd that there's no similar state available from stages 01 and 02, not sure if that's something to do with Nebari or an issue with my Nebari conda deployment environment:

01-terraform-state]$ tofu state list
No state file was found!

State management commands require a state file. Run this command
in a directory where OpenTofu has been run or use the -state flag
to point the command to a specific state location.

02-infrastructure]$ tofu state list
No state file was found!

State management commands require a state file. Run this command
in a directory where OpenTofu has been run or use the -state flag
to point the command to a specific state location.

In my k8s cluster, I have two pods deployed to the dev environment as named in my nebari-config.yaml:

NAMESPACE↑        NAME                                                            PF      READY      STATUS             RESTARTS IP                 NODE                                             AGE        │
│ dev               cluster-autoscaler-aws-cluster-autoscaler-5696559dc-ld6mf       ●       1/1        Running                   0 10.26.37.126       ip-10-26-37-135.us-east-2.compute.internal       9d         │
│ dev               nebari-traefik-ingress-95bc76545-rt6l2                          ●       1/1        Running                   0 10.26.37.146       ip-10-26-37-135.us-east-2.compute.internal       9d         │
│

The best starting point I have are these error messages in the traefik pod shortly after I ran nebari deploy last week:

 time="2025-06-17T17:11:39Z" level=info msg="Configuration loaded from flags."                                                                                                                                   │
│ time="2025-06-17T17:11:39Z" level=error msg="Secret dev/ does not exist" providerName=kubernetescrd TLSStore=default namespace=dev                                                                              │
│ time="2025-06-17T17:11:57Z" level=error msg="Secret dev/ does not exist" TLSStore=default namespace=dev providerName=kubernetescrd                                                                              │
│ time="2025-06-17T17:11:59Z" level=error msg="Secret dev/ does not exist" TLSStore=default namespace=dev providerName=kubernetescrd

I'm going to see what these turn up.

No doubt the traefik errors are at least a factor in why my Nebari ingress is failing at: https://github.com/nebari-dev/nebari/blob/main/src/_nebari/stages/kubernetes_ingress/template/modules/kubernetes/ingress/main.tf#L114

0 replies

viniciusdc · 2025-06-28T13:06:24Z

viniciusdc
Jun 28, 2025
Maintainer

Hey folks, great to see the discussion here.

@mwengren Sorry for the delayed response. While I’m not certain why you’re hitting this error, I recommend trying out the k9s tool, as @verhulstm suggested. It’s incredibly helpful for debugging Kubernetes clusters, and we rely on it heavily since it saves you from constantly crafting kubectl commands by hand. Here’s a guide to get started: Using k9s with Nebari

As a first step, I suggest checking the status of the Traefik pod:

kubectl get pods -A

This will return a lot of output—look for the Traefik pod and check if its status is Running. If it isn’t, please copy the pod’s name and share the output of:

kubectl logs <pod-name>

(Please review the logs before sharing to ensure there’s no private information.)

Regarding your main challenge, if I understand correctly, you need clearer guidance on debugging variable migrations (input/output) during deployment and understanding why your stage is failing.

A quick note: stages 1 and 2 differ in directory organization. If you run ls inside these folders, you’ll notice they have an additional layer based on your cloud provider, for example:

02-infrastructure/aws
02-infrastructure/gcp

Your stage file should be inside this deeper directory. If it’s not, there may be a bug, and you’ll also need to check your cloud provider’s S3 bucket (look for a bucket named like <cluster-name>-<namespace>-*state) to locate the state files, unless you’re using local state files (which is a separate discussion).

On debugging outputs: for security reasons, outputs aren’t printed to the terminal. If you need to inspect them, adding log or print statements, as @verhulstm suggested, is the quickest way. Since we introduced the stages mechanism, we’ve been working to improve this area of the documentation, but we’ve lacked direct user perspective on what was missing—so this discussion is valuable.

0 replies

mwengren · 2025-07-03T14:25:20Z

mwengren
Jul 3, 2025
Author

@viniciusdc thanks for your feedback. A couple comments:

This will return a lot of output—look for the Traefik pod and check if its status is Running. If it isn’t, please copy the pod’s name and share the output of:
kubectl logs <pod-name>

I've already used k9s and found the traefik pod logs (there are only four lines of logs, clearly not starting up properly):

| time="2025-06-17T17:11:39Z" level=info msg="Configuration loaded from flags."                                                                                                                                   │
│ time="2025-06-17T17:11:39Z" level=error msg="Secret dev/ does not exist" providerName=kubernetescrd TLSStore=default namespace=dev                                                                              │
│ time="2025-06-17T17:11:57Z" level=error msg="Secret dev/ does not exist" TLSStore=default namespace=dev providerName=kubernetescrd                                                                              │
│ time="2025-06-17T17:11:59Z" level=error msg="Secret dev/ does not exist" TLSStore=default namespace=dev providerName=kubernetescrd

It seems to be looking for a secret named /dev which does not exist. Any ideas? This matches the value for my namespace setting in nebari-config.yaml FWIW, so maybe not a coincidence.

For kubectl get pods -A there are only two Nebari-namespaced pods and several others in the kube-system namespace. I'll just copy them all below, I don't think there's anything sensitive beyond private IP addressing:

 | dev               cluster-autoscaler-aws-cluster-autoscaler-5696559dc-s52xq       ●       1/1        Running                   0 10.26.35.129       ip-10-26-34-35.us-east-2.compute.internal       4d4h        │
│ dev               nebari-traefik-ingress-95bc76545-vcqzb                          ●       1/1        Running                   0 10.26.35.156       ip-10-26-34-35.us-east-2.compute.internal       4d4h        │
│ kube-system       aws-node-mmknp                                                  ●       2/2        Running                   0 10.26.34.35        ip-10-26-34-35.us-east-2.compute.internal       4d4h        │
│ kube-system       coredns-7d7dfc6cd4-kjnhr                                        ●       1/1        Running                   0 10.26.35.91        ip-10-26-34-35.us-east-2.compute.internal       4d4h        │
│ kube-system       coredns-7d7dfc6cd4-vvbzt                                        ●       1/1        Running                   0 10.26.35.49        ip-10-26-34-35.us-east-2.compute.internal       4d4h        │
│ kube-system       ebs-csi-controller-6cc8cc8dd5-6fqn4                             ●       6/6        Running                   0 10.26.34.31        ip-10-26-34-35.us-east-2.compute.internal       4d4h        │
│ kube-system       ebs-csi-controller-6cc8cc8dd5-79gzh                             ●       6/6        Running                   0 10.26.34.125       ip-10-26-34-35.us-east-2.compute.internal       4d4h        │
│ kube-system       ebs-csi-node-mh76k                                              ●       3/3        Running                   0 10.26.34.221       ip-10-26-34-35.us-east-2.compute.internal       4d4h        │
│ kube-system       kube-proxy-dn7t9                                                ●       1/1        Running                   0 10.26.34.35        ip-10-26-34-35.us-east-2.compute.internal       4d4h        │
│

Regarding the tofu outputs, thanks, I should have noticed that about the 01 and 02 stages directories. I get proper output from the 02 stage aws directory, but not 01, which I assume is to be expected.

I'll be doing some more redeploys today and if I'm still not successful I'll look into the debug logging options. Thanks for the recommendations!

If you have any insight about causes for the traefik pod log errors, please let me know.

0 replies

viniciusdc · 2025-07-03T22:33:27Z

viniciusdc
Jul 3, 2025
Maintainer

@mwengren These here are expected and should not break your exposure of services:

│ time="2025-06-17T17:11:57Z" level=error msg="Secret dev/ does not exist" TLSStore=default namespace=dev providerName=kubernetescrd │

This issue occurs due to a legacy code line that was previously included for a TLSStore but has since been removed and refactored.

I suggest you port forward the Traefik web UI interface (hover over the Traefik pod and type :shift+f:, select the container Port to be 9000; the local port should be completed automatically. Then confirm with Ok. You should then be able to see the Traefik interface at localhost:9000. Check if there are any apparent errors (there should be none) -- focus on Services and Routers

7 replies

viniciusdc Jul 10, 2025
Maintainer

Hey @mwengren, I think I might've splits my answeres here and there and assumed a few things when writting the posts from before. Let me know if the following would help:

For the traefik port forwarding (ie :shift+f: etc), do you mean that should be done from the Kubernetes web dashboard? So far, I haven't tried deploying the k8s dashboard, and haven't been able to access traefik console (see below).

I was refering to k9s, which allows you to connect to your cluster resources. But you can achieve the same with:

kubectl port-forward pod/mypod 5000 6000

This would allow you to see if there is anything weird with the ingress routes, but not help with the actual SG and policies.

One challenge I'm having is whether/how to open up security groups beyond our VPC (presumably needed to access either the k8s dashboard or traefik console apart from inside my VPC). AFAICT, the SG allows all incoming traffic originating within the VPC and all outgoing traffic, and our route tables are configured appropriately, but it's possible I'm missing something there that might be impacting my deployment.

Based on the overall progress of your deployment, the VPC outgoing/incoming traffic seems to be correct. Can you check what does the traefik service external IP looks like? does it match your internal VPC/subnet CIDR schema?

viniciusdc Jul 10, 2025
Maintainer

For this one, once you reach this point again, I suggest to enable tracing of logs TF_LOG=trace, and check what does the kuberntes provider says near this part:

[tofu]: │ Error: client rate limiter Wait returned an error: context deadline exceeded
[tofu]: │
[tofu]: │   with module.kubernetes-ingress.kubernetes_service.main,
[tofu]: │   on modules/kubernetes/ingress/main.tf line 114, in resource "kubernetes_service" "main":
[tofu]: │  114: resource "kubernetes_service" "main" {
[tofu]: │
[tofu]: ╵

mwengren Jul 10, 2025
Author

Thanks! I was able to port forward the traefik dashboard (with a few extra steps involved due to my VPC set up).

It looks like the traefik external IP is not configured/pending. Sanitized output:

 kubectl get service -n dev
NAME                                        TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                                                                                  AGE
cluster-autoscaler-aws-cluster-autoscaler   ClusterIP      xyz.xyz.50.201   <none>        8085/TCP                                                                                 2d
nebari-traefik-ingress                      LoadBalancer   xyz.xyz.181.128   <pending>     80:30108/TCP,443:31747/TCP,8022:31169/TCP,8023:32742/TCP,9080:30270/TCP,8786:30797/TCP   2d
nebari-traefik-internal                     ClusterIP       xyz.xyz.224.39   <none>        9000/TCP

Could this be because of the failure at the kubernetes_service.main step? I have not attempted another redeploy yet, but that would be my next step I think, with TF_LOG=trace as you mentioned.

FWIW, the cluster IP space for the services listed above that are obfuscated do not match my VPC/private subnet LAN, but I assume that's to be expected and the cluster LAN is internal to k8s? It's the external IPs that matter?

Also, I was noticing after reviewing the code in more detail that I have some certificate-related settings in my nebari-config.yaml that might be contributing? I guess I focused on these because they are relevant to the KubernetesIngressStage input_vars function and that's where my deployment is failing. Also referencing the certificate section of the docs.

I believe this were output as a result of running the nebari guided init when I first set up my nebari-config.yaml

certificate:
  type: self-signed
  secret_name: nebari-coastalsb
  acme_email:
  acme_server: https://acme-v02.api.letsencrypt.org/directory
  acme_challenge_type: tls

If that's not the direction to look, I'll redeploy again and see what results I get.

viniciusdc Jul 14, 2025
Maintainer

Hey @mwengren, regarding this:

FWIW, the cluster IP space for the services listed above that are obfuscated do not match my VPC/private subnet LAN, but I assume that's to be expected, and the cluster LAN is internal to k8s? Is it the external IPs that matter?

Yes, that's internal kubelet networking, all local to the current node VM, and they end up being proxied over through the internal mechanics of Kubernetes. The important thing there is the Traefik external IP, since the cloud provider itself should handle that.

certificate:
  type: self-signed
  secret_name: nebari-coastalsb
  acme_email:
  acme_server: https://acme-v02.api.letsencrypt.org/directory
  acme_challenge_type: tls

The acme_email is sanitized, right? If you want Let's Encrypt to handle your certs, you will need to pass a valid email to it. If you already have a way to generate the certs associated with your IP and domain, you can switch this later to use custom certs

mwengren Jul 14, 2025
Author

Actually, no, these are as-is in my nebari-config.yaml. I don't believe I entered these values myself. I recall running a command against my nebari-config or, possibly as an output flag from running nebari init to output all valid options for my nebari-config (rather than the default output which is more minimal and specific to values entered, such as cloud provider, etc).

If I had to guess, these were output as a result of using that flag in nebari init.

I'll go back an re-read the documentation, but I'm guessing for my case, I'd be better off using a self-signed certificate for now. So I'll update accordingly.

Mostly, I'm wondering if this might be a cause for my deployment failing at, or before, the traefik ingress set up. Thanks!

mwengren · 2025-07-10T20:30:13Z

mwengren
Jul 10, 2025
Author

For reference, here's my traefik dashboard. Not sure how to tell if all is working, but there are no errors at least. No TCP services listed as in yours, however.

0 replies

viniciusdc · 2025-07-14T17:18:54Z

viniciusdc
Jul 14, 2025
Maintainer

I think now, I would try another deployment with the tracing enabled, just to be sure where you are getting held back as of now

0 replies

mwengren · 2025-07-17T17:30:07Z

mwengren
Jul 17, 2025
Author

@viniciusdc I ran another deployment with TF_LOG=debug.

I also modified the certificate section of my nebari-config.yaml as mentioned above to match the self-signed certificate option in the docs to test if that was contributing to the problem, but no change there.

I'll share a few what seem relevant log lines below. I'm going to continue looking through the full log to see if anything jumps out, but wanted to share some parts here, hopefully sanitized sufficiently, in case anything jumps out at you. TIA.

If you think that trace level logs are necessary, pls let me know. I tried a nebari destroy with that setting and it seemed very verbose, but if these logs aren't helpful, I'll try again with trace logs enabled. Also, if you're willing to look at it, I can share the full log with you privately.

[tofu]: 2025-07-15T20:13:59.174Z [WARN]  Provider "registry.opentofu.org/hashicorp/kubernetes" produced an invalid plan for module.kubernetes-ingress.kubernetes_service.main, but we are tolerating it because it is using the legacy plugin SDK.
[tofu]:     The following problems may be the cause of any confusing errors from downstream operations:
[tofu]:       - .spec[0].publish_not_ready_addresses: planned value cty.False for a non-computed attribute
[tofu]:       - .spec[0].session_affinity: planned value cty.StringVal("None") for a non-computed attribute
[tofu]:       - .spec[0].allocate_load_balancer_node_ports: planned value cty.True for a non-computed attribute
[tofu]:       - .spec[0].session_affinity_config: attribute representing nested block must not be unknown itself; set nested attribute values to unknown instead
[tofu]: ^[[0m^[[1mmodule.kubernetes-ingress.kubernetes_service.main: Creating...^[[0m^[[0m
[tofu]: 2025-07-15T20:13:59.174Z [INFO]  Starting apply for module.kubernetes-ingress.kubernetes_service.main

[tofu]: 2025-07-15T20:13:59.252Z [WARN]  Provider "provider[\"registry.opentofu.org/hashicorp/kubernetes\"]" produced an unexpected new value for module.kubernetes-ingress.kubernetes_service.traefik_internal, but we are tolerating it because it is using the legacy plugin SDK.
[tofu]:     The following problems may be the cause of any confusing errors from downstream operations:
[tofu]:       - .metadata[0].generate_name: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].external_name: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].load_balancer_ip: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].load_balancer_class: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].port[0].app_protocol: was null, but now cty.StringVal("")
[tofu]: ^[[0m^[[1mmodule.kubernetes-ingress.kubernetes_service.traefik_internal: Creation complete after 0s [id=dev/nebari-traefik-internal]^[[0m

[tofu]: 2025-07-15T20:13:59.273Z [DEBUG] provider.terraform-provider-kubernetes: 2025/07/15 20:13:59 [DEBUG] Flattened ClusterRoleBinding roleRef: []interface {}{map[string]interface {}{"api_group":"rbac.authorization.k8s.io", "kind":"ClusterRole", "name":"nebari-traefik-ingress"}}
[tofu]: 2025-07-15T20:13:59.273Z [DEBUG] provider.terraform-provider-kubernetes: 2025/07/15 20:13:59 [DEBUG] Flattened ClusterRoleBinding subjects: []interface {}{map[string]interface {}{"kind":"ServiceAccount", "name":"nebari-traefik-ingress", "namespace":"dev"}}
[tofu]: 2025-07-15T20:13:59.273Z [WARN]  Provider "provider[\"registry.opentofu.org/hashicorp/kubernetes\"]" produced an unexpected new value for module.kubernetes-ingress.kubernetes_cluster_role_binding.main, but we are tolerating it because it is using the legacy plugin SDK.
[tofu]:     The following problems may be the cause of any confusing errors from downstream operations:
[tofu]:       - .metadata[0].generate_name: was null, but now cty.StringVal("")
[tofu]: ^[[0m^[[1mmodule.kubernetes-ingress.kubernetes_cluster_role_binding.main: Creation complete after 0s [id=nebari-traefik-ingress]^[[0m
...
[tofu]: 2025-07-15T20:13:59.281Z [DEBUG] provider.terraform-provider-kubernetes: 2025/07/15 20:13:59 [DEBUG] Waiting for load balancer to assign IP/hostname
[tofu]: 2025-07-15T20:13:59.281Z [DEBUG] provider.terraform-provider-kubernetes: 2025/07/15 20:13:59 [DEBUG] Waiting for state to become: [success]
[tofu]: 2025-07-15T20:13:59.351Z [DEBUG] provider.terraform-provider-kubernetes: 2025/07/15 20:13:59 [DEBUG] Waiting for deployment dev/nebari-traefik-ingress to schedule 1 replicas
[tofu]: 2025-07-15T20:13:59.351Z [DEBUG] provider.terraform-provider-kubernetes: 2025/07/15 20:13:59 [INFO] Waiting for deployment dev/nebari-traefik-ingress to rollout
[tofu]: 2025-07-15T20:13:59.351Z [DEBUG] provider.terraform-provider-kubernetes: 2025/07/15 20:13:59 [DEBUG] Waiting for state to become: [success]
[tofu]: ^[[0m^[[1mmodule.kubernetes-ingress.kubernetes_manifest.tlsstore_default[0]: Creating...^[[0m^[[0m
[tofu]: 2025-07-15T20:13:59.561Z [INFO]  Starting apply for module.kubernetes-ingress.kubernetes_manifest.tlsstore_default[0]
[tofu]: 2025-07-15T20:13:59.562Z [DEBUG] skipping FixUpBlockAttrs
[tofu]: 2025-07-15T20:13:59.562Z [DEBUG] module.kubernetes-ingress.kubernetes_manifest.tlsstore_default[0]: applying the planned Create change
...
[tofu]: 2025-07-15T20:14:14.900Z [DEBUG] provider.terraform-provider-kubernetes: 2025/07/15 20:14:14 [TRACE] Waiting 10s before next try
[tofu]: ^[[0m^[[1mmodule.kubernetes-ingress.kubernetes_service.main: Still creating... [20s elapsed]^[[0m^[[0m
[tofu]: ^[[0m^[[1mmodule.kubernetes-ingress.kubernetes_deployment.main: Still creating... [20s elapsed]^[[0m^[[0m

Traefik-ingress service looks to be have replicas deployed successfully below (in the very long first log line):

[tofu]: 2025-07-15T20:14:34.926Z [DEBUG] provider.terraform-provider-kubernetes: 2025/07/15 20:14:34 [INFO] Received deployment: &v1.Deployment{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"nebari-traefik-ingress", GenerateName:"", Namespace:"dev", SelfLink:"", UID:"17169ab3-69bf-465d-bce3-044b87a90168", ResourceVersion:"2555", Generation:1, CreationTimestamp:time.Date(2025, time.July, 15, 20, 13, 59, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"deployment.kubernetes.io/revision":"1"}, OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:"HashiCorp", Operation:"Update", APIVersion:"apps/v1", Time:time.Date(2025, time.July, 15, 20, 13, 59, 0, time.Local), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc001cc9950), Subresource:""}, v1.ManagedFieldsEntry{Manager:"kube-controller-manager", Operation:"Update", APIVersion:"apps/v1", Time:time.Date(2025, time.July, 15, 20, 14, 26, 0, time.Local), FieldsType:"FieldsV1", FieldsV1:(*v1.FieldsV1)(0xc001cc9980), Subresource:"status"}}}, Spec:v1.DeploymentSpec{Replicas:(*int32)(0xc001ae10f0), Selector:(*v1.LabelSelector)(0xc001b2c460), Template:v1.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"app.kubernetes.io/component":"traefik-ingress"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v1.PodSpec{Volumes:[]v1.Volume{v1.Volume{Name:"acme-certificates", VolumeSource:v1.VolumeSource{HostPath:(*v1.HostPathVolumeSource)(nil), EmptyDir:(*v1.EmptyDirVolumeSource)(nil), GCEPersistentDisk:(*v1.GCEPersistentDiskVolumeSource)(nil), AWSElasticBlockStore:(*v1.AWSElasticBlockStoreVolumeSource)(nil), GitRepo:(*v1.GitRepoVolumeSource)(nil), Secret:(*v1.SecretVolumeSource)(nil), NFS:(*v1.NFSVolumeSource)(nil), ISCSI:(*v1.ISCSIVolumeSource)(nil), Glusterfs:(*v1.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*v1.PersistentVolumeClaimVolumeSource)(0xc001cc9a40), RBD:(*v1.RBDVolumeSource)(nil), FlexVolume:(*v1.FlexVolumeSource)(nil), Cinder:(*v1.CinderVolumeSource)(nil), CephFS:(*v1.CephFSVolumeSource)(nil), Flocker:(*v1.FlockerVolumeSource)(nil), DownwardAPI:(*v1.DownwardAPIVolumeSource)(nil), FC:(*v1.FCVolumeSource)(nil), AzureFile:(*v1.AzureFileVolumeSource)(nil), ConfigMap:(*v1.ConfigMapVolumeSource)(nil), VsphereVolume:(*v1.VsphereVirtualDiskVolumeSource)(nil), Quobyte:(*v1.QuobyteVolumeSource)(nil), AzureDisk:(*v1.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*v1.PhotonPersistentDiskVolumeSource)(nil), Projected:(*v1.ProjectedVolumeSource)(nil), PortworxVolume:(*v1.PortworxVolumeSource)(nil), ScaleIO:(*v1.ScaleIOVolumeSource)(nil), StorageOS:(*v1.StorageOSVolumeSource)(nil), CSI:(*v1.CSIVolumeSource)(nil), Ephemeral:(*v1.EphemeralVolumeSource)(nil)}}}, InitContainers:[]v1.Container(nil), Containers:[]v1.Container{v1.Container{Name:"nebari", Image:"traefik:2.9.1", Command:[]string(nil), Args:[]string{"--global.checknewversion=false", "--global.sendanonymoususage=false", "--api.insecure=true", "--api.dashboard=true", "--ping=true", "--providers.kubernetesingress=true", "--providers.kubernetesingress.namespaces=dev", "--providers.kubernetesingress.ingressclass=traefik", "--providers.kubernetescrd", "--providers.kubernetescrd.namespaces=dev", "--providers.kubernetescrd.throttleduration=2s", "--providers.kubernetescrd.allowcrossnamespace=false", "--entryPoints.web.address=:80", "--entryPoints.websecure.address=:443", "--entrypoints.ssh.address=:8022", "--entrypoints.sftp.address=:8023", "--entryPoints.tcp.address=:8786", "--entryPoints.traefik.address=:9000", "--entryPoints.minio.address=:9080", "--entrypoints.web.http.redirections.entryPoint.to=websecure", "--entrypoints.web.http.redirections.entryPoint.scheme=https", "--metrics.prometheus=true", "--log.level=WARN", "--entrypoints.websecure.http.tls.certResolver=default", "--entrypoints.minio.http.tls.certResolver=default"}, WorkingDir:"", Ports:[]v1.ContainerPort{v1.ContainerPort{Name:"http", HostPort:0, ContainerPort:80, Protocol:"TCP", HostIP:""}, v1.ContainerPort{Name:"https", HostPort:0, ContainerPort:443, Protocol:"TCP", HostIP:""}, v1.ContainerPort{Name:"ssh", HostPort:0, ContainerPort:8022, Protocol:"TCP", HostIP:""}, v1.ContainerPort{Name:"sftp", HostPort:0, ContainerPort:8023, Protocol:"TCP", HostIP:""}, v1.ContainerPort{Name:"tcp", HostPort:0, ContainerPort:8786, Protocol:"TCP", HostIP:""}, v1.ContainerPort{Name:"traefik", HostPort:0, ContainerPort:9000, Protocol:"TCP", HostIP:""}, v1.ContainerPort{Name:"minio", HostPort:0, ContainerPort:9080, Protocol:"TCP", HostIP:""}}, EnvFrom:[]v1.EnvFromSource(nil), Env:[]v1.EnvVar(nil), Resources:v1.ResourceRequirements{Limits:v1.ResourceList(nil), Requests:v1.ResourceList(nil), Claims:[]v1.ResourceClaim(nil)}, ResizePolicy:[]v1.ContainerResizePolicy(nil), RestartPolicy:(*v1.ContainerRestartPolicy)(nil), VolumeMounts:[]v1.VolumeMount{v1.VolumeMount{Name:"acme-certificates", ReadOnly:false, MountPath:"/mnt/acme-certificates", SubPath:"", MountPropagation:(*v1.MountPropagationMode)(0xc0024d86b0), SubPathExpr:""}}, VolumeDevices:[]v1.VolumeDevice(nil), LivenessProbe:(*v1.Probe)(0xc0024e2180), ReadinessProbe:(*v1.Probe)(0xc0024e21c0), StartupProbe:(*v1.Probe)(nil), Lifecycle:(*v1.Lifecycle)(nil), TerminationMessagePath:"/dev/termination-log", TerminationMessagePolicy:"File", ImagePullPolicy:"IfNotPresent", SecurityContext:(*v1.SecurityContext)(0xc0014c1ec0), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]v1.EphemeralContainer(nil), RestartPolicy:"Always", TerminationGracePeriodSeconds:(*int64)(0xc001ae1200), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"nebari-traefik-ingress", DeprecatedServiceAccount:"nebari-traefik-ingress", AutomountServiceAccountToken:(*bool)(0xc001ae120a), NodeName:"", HostNetwork:false, HostPID:false, HostIPC:false, ShareProcessNamespace:(*bool)(0xc001ae120b), SecurityContext:(*v1.PodSecurityContext)(0xc000312a10), ImagePullSecrets:[]v1.LocalObjectReference(nil), Hostname:"", Subdomain:"", Affinity:(*v1.Affinity)(0xc001cc9c50), SchedulerName:"default-scheduler", Tolerations:[]v1.Toleration(nil), HostAliases:[]v1.HostAlias(nil), PriorityClassName:"", Priority:(*int32)(nil), DNSConfig:(*v1.PodDNSConfig)(nil), ReadinessGates:[]v1.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), EnableServiceLinks:(*bool)(0xc001ae1227), PreemptionPolicy:(*v1.PreemptionPolicy)(nil), Overhead:v1.ResourceList(nil), TopologySpreadConstraints:[]v1.TopologySpreadConstraint(nil), SetHostnameAsFQDN:(*bool)(nil), OS:(*v1.PodOS)(nil), HostUsers:(*bool)(nil), SchedulingGates:[]v1.PodSchedulingGate(nil), ResourceClaims:[]v1.PodResourceClaim(nil)}}, Strategy:v1.DeploymentStrategy{Type:"RollingUpdate", RollingUpdate:(*v1.RollingUpdateDeployment)(0xc0024d86c0)}, MinReadySeconds:0, RevisionHistoryLimit:(*int32)(0xc001ae1240), Paused:false, ProgressDeadlineSeconds:(*int32)(0xc001ae1248)}, Status:v1.DeploymentStatus{ObservedGeneration:1, Replicas:1, UpdatedReplicas:1, ReadyReplicas:1, AvailableReplicas:1, UnavailableReplicas:0, Conditions:[]v1.DeploymentCondition{v1.DeploymentCondition{Type:"Available", Status:"True", LastUpdateTime:time.Date(2025, time.July, 15, 20, 14, 26, 0, time.Local), LastTransitionTime:time.Date(2025, time.July, 15, 20, 14, 26, 0, time.Local), Reason:"MinimumReplicasAvailable", Message:"Deployment has minimum availability."}, v1.DeploymentCondition{Type:"Progressing", Status:"True", LastUpdateTime:time.Date(2025, time.July, 15, 20, 14, 26, 0, time.Local), LastTransitionTime:time.Date(2025, time.July, 15, 20, 13, 59, 0, time.Local), Reason:"NewReplicaSetAvailable", Message:"ReplicaSet \"nebari-traefik-ingress-95bc76545\" has successfully progressed."}}, CollisionCount:(*int32)(nil)}}
[tofu]: 2025-07-15T20:14:34.935Z [WARN]  Provider "provider[\"registry.opentofu.org/hashicorp/kubernetes\"]" produced an unexpected new value for module.kubernetes-ingress.kubernetes_deployment.main, but we are tolerating it because it is using the legacy plugin SDK.
[tofu]:     The following problems may be the cause of any confusing errors from downstream operations:
[tofu]:       - .metadata[0].generate_name: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].metadata[0].generate_name: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].metadata[0].namespace: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].runtime_class_name: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].priority_class_name: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].subdomain: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].active_deadline_seconds: was null, but now cty.NumberIntVal(0)
[tofu]:       - .spec[0].template[0].spec[0].container[0].working_dir: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].container[0].readiness_probe[0].http_get[0].host: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].container[0].security_context[0].run_as_group: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].container[0].security_context[0].run_as_non_root: was null, but now cty.False
[tofu]:       - .spec[0].template[0].spec[0].container[0].security_context[0].run_as_user: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].container[0].liveness_probe[0].http_get[0].host: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[0].host_ip: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[0].host_port: was null, but now cty.NumberIntVal(0)
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[1].host_ip: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[1].host_port: was null, but now cty.NumberIntVal(0)
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[2].host_ip: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[2].host_port: was null, but now cty.NumberIntVal(0)
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[3].host_ip: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[3].host_port: was null, but now cty.NumberIntVal(0)
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[4].host_ip: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[4].host_port: was null, but now cty.NumberIntVal(0)
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[5].host_ip: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[5].host_port: was null, but now cty.NumberIntVal(0)
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[6].host_port: was null, but now cty.NumberIntVal(0)
[tofu]:       - .spec[0].template[0].spec[0].container[0].port[6].host_ip: was null, but now cty.StringVal("")
[tofu]:       - .spec[0].template[0].spec[0].container[0].volume_mount[0].sub_path: was null, but now cty.StringVal("")
[tofu]: ^[[0m^[[1mmodule.kubernetes-ingress.kubernetes_deployment.main: Creation complete after 36s [id=dev/nebari-traefik-ingress]^[[0m
[tofu]: 2025-07-15T20:14:34.938Z [DEBUG] states/remote: state read serial is: 0; serial is: 0
[tofu]: 2025-07-15T20:14:34.938Z [DEBUG] states/remote: state read lineage is: ; lineage is:
...
[tofu]: 2025-07-15T20:14:35.028Z [WARN]  failed to fetch state md5: invalid md5
[tofu]: 2025-07-15T20:14:35.028Z [DEBUG] states/remote: after refresh, state read serial is: 0; serial is: 0
[tofu]: 2025-07-15T20:14:35.028Z [DEBUG] states/remote: after refresh, state read lineage is: ; lineage is:

Then, there are a number of repeated checks against what look to be the LoadBalancer status, related to the kubernetes-ingress.kubernetes_service.main module deployment (these repeat until it times out after about 10 minutes and fails with the error above ).

[tofu]: 2025-07-15T20:15:44.967Z [DEBUG] provider.terraform-provider-kubernetes: 2025/07/15 20:15:44 [INFO] Received service status: v1.ServiceStatus{LoadBalancer:v1.LoadBalancerStatus{Ingress:[]v1.LoadBalancerIngress(nil)}, Conditions:[]v1.Condition(nil)}
[tofu]: 2025-07-15T20:15:44.967Z [DEBUG] provider.terraform-provider-kubernetes: 2025/07/15 20:15:44 [TRACE] Waiting 10s before next try
[tofu]: ^[[0m^[[1mmodule.kubernetes-ingress.kubernetes_service.main: Still creating... [1m50s elapsed]^[[0m^[[0m

1 reply

viniciusdc Jul 18, 2025
Maintainer

Yeah, the issues with the MD5 stuff and the others are pretty common. The main problem you're facing with the load balancer step is that Terraform waits for the external IP to become available so it can pass it as an output to the other deployments. Since the IP remains in a pending state, the process eventually times out.

The right course of action now is to figure out why the IP isn’t being assigned. That comes from AWS, so you'll need to check whether there are any error messages in the CloudTrail console related to the Kubernetes service, or if there's something unusual going on with the subnets.

I’ll try to replicate your deployment state next week—probably on Tuesday. If you could outline the main changes you made to your setup, that would help a lot.

mwengren · 2025-07-28T15:58:05Z

mwengren
Jul 28, 2025
Author

@viniciusdc My working branch is here: mwengren/nebari@main...mw-aws-no-public-ip-2.

There are quite a few changes in this branch. I was following @dcmcand's initial changes in this branch to enable public/private subnet deployment for Nebari with some of my own that I think are necessary to be able to use the nebari deploy command with my VPC set up - mainly, additional local variables and resource counts to reference existing VPC resources if available, rather than creating them as part of the deploy process.

I'll look into my AWS logs further. So far, I see some calls to CreateNetworkInterface, CreateNetworkInterfacePermission that return Client.DryRunOperation errors/message, which seems to suggest the cluster is testing obtaining an IP address but maybe not going further. Not really sure about that.

After reviewing the source code some more, however, should I be passing an available IP address from my public subnet in the nebari-config ingress -> terraform_overrides -> load-balancer-ip parameter that maps to

nebari/src/_nebari/stages/kubernetes_ingress/template/variables.tf

Lines 58 to 62 in d680ca8

    
           variable "load-balancer-ip" { 
        
             description = "IP Address of the load balancer" 
        
             type        = string 
        
             default     = null 
        
           }

rather than leaving blank?

I though I'd tried this before but it wasn't successful, but if that's the recommendation, it's probably worth trying again. I'm not actually sure the process behind how AWS would allocate an IP address here. In our case at least, we have a CIDR range of IPs for our public subnet that are managed outside of AWS, so I'm thinking that's outside of a typical deployment scenario.

0 replies

mwengren · 2025-08-15T20:44:00Z

mwengren
Aug 15, 2025
Author

@viniciusdc @dcmcand I'm happy to report that I was able to troubleshoot the remaining issues I was having deploying the load balancer and finally get through the remaining deploy phases to reach the wonderful

Nebari deployed successfully

output from nebari deploy, at long last !!!

The issue that was blocking the load balancer as mentioned above was that I needed to add the proper annotations to configure the load balancer, and to also not include a load-balancer-ip as I had been for a few deployments. This is what worked for me:

ingress: 
  terraform_overrides:
    load-balancer-ip:
    load_balancer_annotations:
      "service.beta.kubernetes.io/aws-load-balancer-internal": "true"
      "service.beta.kubernetes.io/aws-load-balancer-subnets": "subnet-1, subnet-2"

I also ran into some issues in the kubernetes_ingress stage check_ingress_dns function where I had the wrong domain set in my nebari-config.yaml file. This was from following the Deploy Nebari on AWS docs, which have a minor error in the URL examples at the bottom - I'll submit an issue and PR to address this.

The example in Deploy Nebari on AWS shows the resulting Nebari URL as a combination of project_name + domain, whereas it's actually just the domain instead. I realized this eventually by cross-referencing the example in Advanced Configuration, which shows the right approach (for AWS at least) - project_name and domain are unrelated, at least as far as DNS goes apparently. I'm assuming project_name could be anything, although I haven't actually tested this and instead followed the pattern below:

Correct example in General Configuration Settings:

### General configuration ###
project_name: demo
namespace: dev
provider: gcp
domain: demo.nebari.dev

Incorrect example in Deploy Nebari on AWS:

Services:
 - argo-workflows -> https://projectname.domain/argo/
 - conda_store -> https://projectname.domain/conda-store/
 - dask_gateway -> https://projectname.domain/gateway/
 - jupyterhub -> https://projectname.domain/
 - keycloak -> https://projectname.domain/auth/
 - monitoring -> https://projectname.domain/monitoring/

As a result, I was only using the TLD plus my second level domain in nebari-config.yaml domain, rather than the full domain name Nebari should deploy to, which caused DNS checks to fail.

Once resolving the above and maybe just the right number of retries... success!

I still have some issues to resolve related to the Load Balancer and my network set up:

No matter the load_balancer_annotations I pass, I get a classic load balancer from AWS, rather than an ALB or NLB. Adding some documentation about this, if there's a particular type that's preferrable to Nebari, to the docs would be helpful.
If I specify my public subnet(s) for the load balancer in the above service.beta.kubernetes.io/aws-load-balancer-subnets annotation, it isn't able to connect to the EKS cluster nodes, which are in my private subnets. At the moment, a workaround I used was to pass my one 'public' subnet, plus the private subnet in the opposite AZ of the EKS cluster, but this isn't a durable solution as the LB is only able to connect to instances in one of my two private subnets that EKS uses.

I'm going to look into solutions for this and what type of load balancer I should be using ideally, but any advice appreciated on either the LB type or how to resolve the load balancer subnet config for the public/private subnet use case I'm working with here!

I'm still working out SSL certificates so I can actually connect to my deployment and verify login etc, but I'm anticipating that should be easy and otherwise all looks pretty good. The traefik console once deployed:

0 replies

nebari-dev

More guidance/resources for troubleshooting a failed Nebari deployment? #3085

Uh oh!

Uh oh!

mwengren Jun 26, 2025

Replies: 12 comments · 8 replies

Uh oh!

verhulstm Jun 26, 2025

Uh oh!

verhulstm Jun 26, 2025

Uh oh!

verhulstm Jun 27, 2025

Uh oh!

mwengren Jun 27, 2025 Author

Uh oh!

Uh oh!

viniciusdc Jun 28, 2025 Maintainer

Uh oh!

mwengren Jul 3, 2025 Author

Uh oh!

Uh oh!

viniciusdc Jul 3, 2025 Maintainer

Uh oh!

Uh oh!

viniciusdc Jul 10, 2025 Maintainer

Uh oh!

viniciusdc Jul 10, 2025 Maintainer

Uh oh!

Uh oh!

mwengren Jul 10, 2025 Author

Uh oh!

viniciusdc Jul 14, 2025 Maintainer

Uh oh!

mwengren Jul 14, 2025 Author

Uh oh!

mwengren Jul 10, 2025 Author

Uh oh!

viniciusdc Jul 14, 2025 Maintainer

Uh oh!

Uh oh!

mwengren Jul 17, 2025 Author

Uh oh!

Uh oh!

viniciusdc Jul 18, 2025 Maintainer

Uh oh!

mwengren Jul 28, 2025 Author

Uh oh!

mwengren Aug 15, 2025 Author

mwengren
Jun 26, 2025

Replies: 12 comments 8 replies

verhulstm
Jun 26, 2025

verhulstm
Jun 26, 2025

verhulstm
Jun 27, 2025

mwengren
Jun 27, 2025
Author

viniciusdc
Jun 28, 2025
Maintainer

mwengren
Jul 3, 2025
Author

viniciusdc
Jul 3, 2025
Maintainer

viniciusdc Jul 10, 2025
Maintainer

viniciusdc Jul 10, 2025
Maintainer

mwengren Jul 10, 2025
Author

viniciusdc Jul 14, 2025
Maintainer

mwengren Jul 14, 2025
Author

mwengren
Jul 10, 2025
Author

viniciusdc
Jul 14, 2025
Maintainer

mwengren
Jul 17, 2025
Author

viniciusdc Jul 18, 2025
Maintainer

mwengren
Jul 28, 2025
Author

mwengren
Aug 15, 2025
Author