HELP - I just lost all my data on the Search node! #9767
Replies: 2 comments 8 replies
-
This definitely sounds like an automated cleanup. How much disk space do you have on your search nodes, and how much data are you collecting per day? You can check the current amount of data in Elasticsearch by running "so-elasticsearch-indices-list" from the CLI on the search node.
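To make the disk-space versus daily-volume question concrete, here is a rough Python sketch of the arithmetic. The function name and parameters are hypothetical, and the 85% default is only an assumption taken from the utilization figure mentioned in the original post, not a confirmed Security Onion setting.

```python
def days_until_threshold(total_gb: float, used_gb: float,
                         daily_gb: float, threshold_pct: float = 85.0) -> float:
    """Estimate how many days of ingest fit before usage reaches threshold_pct.

    All inputs are values you measure yourself (for example from `df` and
    your own ingest estimates); nothing here is read from Security Onion.
    """
    if daily_gb <= 0:
        raise ValueError("daily_gb must be positive")
    # Space still available before the (assumed) cleanup threshold is reached.
    headroom_gb = total_gb * threshold_pct / 100.0 - used_gb
    # No headroom left means the threshold has already been crossed.
    return max(headroom_gb, 0.0) / daily_gb


if __name__ == "__main__":
    # Hypothetical numbers: 1 TB volume, 50 GB used, 100 GB/day of new data.
    print(days_until_threshold(1000, 50, 100))
```

If the estimate comes out to only a few days, that would be consistent with an automated cleanup kicking in sooner than expected.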
-
It looks like it's all internal traffic (AWS). Here's the top 10 weird log types for the last 24 hours: What would be your suggestion on tuning this? The search node whose /nsm was at 50% on Friday is now at 98%. I've made the change to global.sls that you mentioned and ran 'salt-call state.apply', but that didn't seem to have an effect. Is there a way to force an immediate run? Thank you again for your help. I really appreciate it!
-
Hi there,
I logged into my SO console this morning and it was throwing 502 errors. I'd noticed the Search node running high on CPU utilization recently, so I went to take a look in Grafana and saw that CPU utilization was at 0% and that disk utilization on the /nsm filesystem had dropped from 85% to 5%, both around the same time yesterday evening (~midnight UTC).
I rebooted the Search node so it's no longer throwing 502s and it's displaying new data but it seems all historical data is gone. It's not the end of the world as I'm currently doing a POC but if this were actually production, it would be bad.
Does anyone know what might have happened? Is there a cronjob that kicks off at midnight that could have this impact? Or a salt job that runs once storage hits 85% utilization? I saw this in yesterday's log in /opt/so/log/elasticsearch/ on the Search node:
So I'm guessing it blew everything away after hitting that threshold? I just want to make sure I understand what happened so I can ensure it doesn't happen again.
Thank you very much!
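For anyone wanting to watch for this ahead of time, below is a minimal Python sketch that checks how full a filesystem is and compares it to a threshold. The helper names are hypothetical, and the 85% figure is an assumption based on the utilization level described above, not a documented cleanup trigger.

```python
import shutil


def utilization_pct(path: str) -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(path: str, threshold_pct: float = 85.0) -> bool:
    """True when utilization at `path` is at or above threshold_pct (assumed value)."""
    return utilization_pct(path) >= threshold_pct


if __name__ == "__main__":
    # "/nsm" is the filesystem discussed in this thread; substitute any mounted path.
    path = "/nsm"
    print(f"{path}: {utilization_pct(path):.1f}% used, "
          f"over threshold: {over_threshold(path)}")
```

The percentage may differ slightly from what `df` reports, since `df` accounts for reserved blocks differently.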