Replies: 12 comments 8 replies
-
I also use kubectl and k9s to interrogate the k8s cluster, plus the Kubernetes web UI dashboard: https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/
-
The error "Error: client rate limiter Wait returned an error: context deadline exceeded" is common. In our use case it often means the k8s namespace is not correct.
-
I think you might want to start by adding extra logging statements here: https://github.com/nebari-dev/nebari/blob/main/src/_nebari/subcommands/init.py
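As a rough illustration of that suggestion (the function and logger names below are hypothetical, not the actual contents of init.py), extra logging might look like:

```python
import logging

# Hypothetical sketch of the kind of logging statements suggested above;
# the real init.py in nebari has its own structure and function names.
logger = logging.getLogger("nebari.init")

def handle_init(namespace: str, region: str) -> dict:
    # Log the inputs the command receives so failures can be traced later.
    logger.debug("init called with namespace=%r region=%r", namespace, region)
    config = {"namespace": namespace, "region": region}
    logger.debug("resolved config: %r", config)
    return config

if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    handle_init("dev", "us-east-1")
```

Running with `logging.basicConfig(level=logging.DEBUG)` makes the inputs and resolved values visible on stderr without changing behavior.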
-
@verhulstm thanks for the suggestions! I'll look into those. In the meantime, I have some information to start troubleshooting from, gathered via both tofu commands and k9s. If anyone has insight, it would be much appreciated. Using tofu from my Nebari deploy machine, I get the following outputs from the Nebari stages I've been able to reach so far:
I think it's odd that there's no similar state available from stages 01 and 02; I'm not sure if that's something to do with Nebari or an issue with my Nebari conda deployment environment.
In my k8s cluster, I have two pods deployed to the
The best starting point I have is these error messages in the traefik pod shortly after I ran
I'm going to see what these turn up. No doubt the traefik errors are at least a factor in why my Nebari ingress is failing at: https://github.com/nebari-dev/nebari/blob/main/src/_nebari/stages/kubernetes_ingress/template/modules/kubernetes/ingress/main.tf#L114
-
Hey folks, great to see the discussion here. @mwengren, sorry for the delayed response. While I'm not certain why you're hitting this error, I recommend trying out the

As a first step, I suggest checking the status of the Traefik pod:

`kubectl get pods -A`

This will return a lot of output; look for the Traefik pod and check its status. If it doesn't look healthy, grab its logs:

`kubectl logs <pod-name>`

(Please review the logs before sharing to ensure there's no private information.)

Regarding your main challenge, if I understand correctly, you need clearer guidance on debugging variable migrations (input/output) during deployment and on understanding why your stage is failing.

A quick note: stages 1 and 2 differ in directory organization. If you run

Your stage file should be inside this deeper directory. If it's not, there may be a bug, and you'll also need to check your cloud provider's S3 bucket for the corresponding state.

On debugging outputs: for security reasons, outputs aren't printed to the terminal. If you need to inspect them, adding log or print statements, as @verhulstm suggested, is the quickest way. Since we introduced the stages mechanism, we've been working to improve this area of the documentation, but we've lacked direct user perspective on what was missing, so this discussion is valuable.
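The pod check described above can also be scripted. As a sketch, assuming you have captured `kubectl get pods -A -o json` output, something like this flags pods that are not in the `Running` phase (the sample JSON below is fabricated for illustration):

```python
import json

# Fabricated sample of `kubectl get pods -A -o json` output, for illustration
# only. In practice, feed in real output: kubectl get pods -A -o json > pods.json
# A pod's `status.phase` is one of Pending/Running/Succeeded/Failed/Unknown.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"namespace": "dev", "name": "nebari-traefik-ingress-abc"},
         "status": {"phase": "Pending"}},
        {"metadata": {"namespace": "dev", "name": "nebari-conda-store-xyz"},
         "status": {"phase": "Running"}},
    ]
})

def unhealthy_pods(kubectl_json: str) -> list:
    """Return 'namespace/name' for every pod whose phase is not Running."""
    data = json.loads(kubectl_json)
    return [
        f"{p['metadata']['namespace']}/{p['metadata']['name']}"
        for p in data["items"]
        if p["status"].get("phase") != "Running"
    ]

print(unhealthy_pods(SAMPLE))
```

Any pod it prints is a candidate for `kubectl logs` and `kubectl describe pod`.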
-
@viniciusdc thanks for your feedback. A couple comments:
I've already used k9s and found the traefik pod logs (there are only four lines of logs, clearly not starting up properly):
It seems to be looking for a secret named
For
Regarding the
I'll be doing some more redeploys today, and if I'm still not successful I'll look into the debug logging options. Thanks for the recommendations! If you have any insight into the causes of the traefik pod log errors, please let me know.
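If it helps while chasing the missing secret: `kubectl get secret <name> -o json` returns values base64-encoded, so a dump can be inspected offline. A small sketch (the secret below is fabricated for illustration):

```python
import base64
import json

# Fabricated example of a `kubectl get secret <name> -o json` dump;
# Kubernetes stores the values under `data` base64-encoded.
SAMPLE_SECRET = json.dumps({
    "metadata": {"name": "example-tls"},
    "data": {
        "tls.crt": base64.b64encode(b"---CERT---").decode(),
        "tls.key": base64.b64encode(b"---KEY---").decode(),
    },
})

def decode_secret(secret_json: str) -> dict:
    """Decode every base64 value in a secret's `data` field."""
    data = json.loads(secret_json).get("data", {})
    return {k: base64.b64decode(v).decode() for k, v in data.items()}

print(decode_secret(SAMPLE_SECRET))
```

Comparing the decoded keys against what Traefik expects (e.g. whether `tls.crt`/`tls.key` are present at all) can confirm whether the secret exists but is malformed, versus missing entirely.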
-
@mwengren These errors are expected and should not break the exposure of your services:
This issue occurs due to a legacy code line that was previously included for a TLSStore but has since been removed and refactored. I suggest you port-forward the Traefik web UI interface (in k9s, hover over the Traefik pod and type
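Once the port-forward is running, a quick way to confirm the forwarded port is reachable locally (port 9000 is an assumption here; adjust to whatever your forward uses):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # After `kubectl port-forward` (or k9s's port-forward), the dashboard is
    # served on localhost; 9000 is an assumed port, not verified here.
    print(port_open("127.0.0.1", 9000))
```

If this returns False, the forward itself (rather than Traefik) is the thing to debug first.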
-
For reference, here's my traefik dashboard. Not sure how to tell if everything is working, but there are no errors at least. No TCP services are listed, however, unlike in yours.
-
I think at this point I would try another deployment with tracing enabled, just to be sure where you're getting held up now.
-
@viniciusdc I ran another deployment with
I also modified the
I'll share a few seemingly relevant log lines below. I'm going to continue looking through the full log to see if anything jumps out, but I wanted to share some parts here, hopefully sanitized sufficiently, in case anything stands out to you. TIA. If you think trace-level logs are necessary, please let me know. I tried a
The traefik-ingress service looks to have its replicas deployed successfully (in the very long first log line):
Then there are a number of repeated checks against what looks to be the LoadBalancer status, related to the
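Those repeated checks are essentially polling for the Service's external address. As a sketch, the relevant field in a `kubectl get svc <name> -o json` dump can be read like this (the sample document below is fabricated):

```python
import json

# Fabricated sample of `kubectl get svc <name> -o json` for a Service of
# type LoadBalancer; a real dump would come from the cluster.
SAMPLE_SVC = json.dumps({
    "spec": {"type": "LoadBalancer"},
    "status": {"loadBalancer": {"ingress": [
        {"hostname": "abc123.elb.amazonaws.com"}
    ]}},
})

def lb_address(svc_json: str):
    """Return the provisioned LB hostname/IP, or None while still pending."""
    status = json.loads(svc_json)["status"]
    ingress = status.get("loadBalancer", {}).get("ingress", [])
    if not ingress:
        return None
    return ingress[0].get("hostname") or ingress[0].get("ip")

print(lb_address(SAMPLE_SVC))
```

While provisioning is stuck, `status.loadBalancer.ingress` stays empty, which is what the repeated checks in the log are waiting on.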
-
@viniciusdc My working branch is here: mwengren/nebari@main...mw-aws-no-public-ip-2. There are quite a few changes in this branch. I was following @dcmcand's initial changes in this branch to enable public/private subnet deployment for Nebari, with some of my own that I think are necessary to be able to use the
I'll look into my AWS logs further. So far, I see some calls to CreateNetworkInterface and CreateNetworkInterfacePermission that return
After reviewing the source code some more, however, should I be passing an available IP address from my public subnet in the nebari-config?
nebari/src/_nebari/stages/kubernetes_ingress/template/variables.tf (lines 58 to 62 in d680ca8)
I thought I'd tried this before without success, but if that's the recommendation, it's probably worth trying again. I'm not actually sure how AWS would allocate an IP address here. In our case at least, we have a CIDR range of IPs for our public subnet that are managed outside of AWS, so I'm thinking that's outside a typical deployment scenario.
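On the question of passing an available IP: Python's `ipaddress` module can at least enumerate candidate host addresses in a public-subnet CIDR (the CIDR and reserved address below are made up, and whether a load-balancer IP should be set at all is exactly the open question above):

```python
import ipaddress

def usable_hosts(cidr: str, reserved=frozenset()):
    """List host addresses in a CIDR, excluding known-reserved ones.

    `reserved` is whatever your externally managed allocation says is
    already taken; this sketch cannot know actual AWS-side usage.
    """
    net = ipaddress.ip_network(cidr)
    return [str(h) for h in net.hosts() if str(h) not in reserved]

# Example with a documentation-range CIDR (made up for illustration):
candidates = usable_hosts("203.0.113.0/29", reserved={"203.0.113.1"})
print(candidates)
```

Note this only enumerates addresses; it cannot tell you which ones AWS considers free, which is why the externally managed CIDR situation described above is awkward.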
-
@viniciusdc @dcmcand I'm happy to report that I was able to troubleshoot the remaining issues I was having deploying the load balancer and finally get through the remaining deploy phases to reach the wonderful
output from
The issue that was blocking the load balancer, as mentioned above, was that I needed to add the proper annotations to configure the load balancer, and also to not include a load-balancer-ip as I had been doing for a few deployments. This is what worked for me:
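The exact snippet referenced above wasn't preserved in this thread. For orientation only, commonly used AWS Service annotations for a LoadBalancer Service look like the following (the service name is hypothetical, and the right annotations depend on your controller and load balancer type):

```yaml
# Illustrative only, not the author's actual config.
apiVersion: v1
kind: Service
metadata:
  name: traefik-ingress   # hypothetical name
  annotations:
    # Commonly used annotations from the AWS cloud provider /
    # AWS Load Balancer Controller; verify against your own setup:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-internal: "false"
spec:
  type: LoadBalancer
```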
I also ran into some issues in the kubernetes_ingress stage. The example in Deploy Nebari on AWS shows the resulting Nebari URL as a combination of
Correct example in General Configuration Settings:
Incorrect example in Deploy Nebari on AWS:
As a result, I was only using the TLD plus my second-level domain in nebari-config.yaml. Once I resolved the above, and maybe with just the right number of retries... success! I still have some issues to resolve related to the Load Balancer and my network setup:
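If I'm reading this correctly, the distinction comes down to whether the `domain` field in nebari-config.yaml carries the full hostname or only the registered domain (the domain names below are placeholders):

```yaml
# Placeholder domains for illustration. The URL Nebari serves is taken
# from `domain`, so it should be the full hostname you created the DNS
# record for, not just the registered domain:
domain: nebari.example.com   # full hostname
# domain: example.com        # registered domain only; did not match
```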
I'm going to look into solutions for this and into what type of load balancer I should ideally be using, but any advice is appreciated on either the LB type or how to resolve the load balancer subnet config for the public/private subnet use case I'm working with! I'm still working out SSL certificates so I can actually connect to my deployment and verify login etc., but I'm anticipating that should be easy, and otherwise everything looks pretty good. The traefik console once deployed:
-
I'm working on adapting the Nebari TF to work for my particular AWS environment. Details on our set up can be found in this ongoing topic, specifically this comment for reference.
I'm starting a new discussion here because I've reached a point where the Nebari deployment is failing and I'm struggling to know where to look to diagnose the cause. I've looked through the docs, including Debug Nebari, Troubleshooting, and pretty much everywhere else, and while those are good, they don't really go to the level I need, so I could use some advice.
Does anyone have a recommendation for how I can understand what's happening during `nebari deploy`? If I've successfully deployed the k8s cluster, does this mean going to the pod logs? Or are there logs on the system where I'm running `nebari deploy` that I can look at?

My particular issue is that Nebari is failing in the 04-kubernetes-ingress stage with the following error:
My self-guided troubleshooting so far has involved:

- Running tofu commands (`tofu state list`, `tofu output`) to try to identify what's been created in each stage. Mixed results here: I can get tofu output from stages 03 and 04 but not 02-infrastructure, which seems odd to me since 02 is where most of the underlying AWS resources are deployed. When I run `tofu state list` from the 02-infrastructure dir, I get a `No state file found` error.
- Reviewing `stage_outputs` per the Nebari Stages documentation in order to understand which variables are being passed to each successive Nebari stage (this has been challenging; the code is a lot to understand). It's hard to troubleshoot a particular stage in the chain when you don't know what parameters it received from the previous stage(s) via the `stage_outputs` dict.

I'm sure most Nebari installations go much more smoothly than mine and are much less complicated, but I feel like the docs could benefit from some details on how to interrogate Nebari internals the way I'm trying to, or from more detail in the Nebari Stages page on troubleshooting the interactions and dependencies between stages, because that's where I feel like I'm struggling the most and where my issues may originate.
TIA for any advice or guidance!