Elastic Agent Status after reboot - 'Starting' #13285
-
Version: 2.4.70
Installation Method: Security Onion ISO image
Description: other (please provide detail below)
Installation Type: Standalone
Location: on-prem with Internet access
Hardware Specs: Meets minimum requirements
CPU: 8
RAM: 20
Storage for /: 200GB
Storage for /nsm: 200GB
Network Traffic Collection: span port
Network Traffic Speeds: 1Gbps to 10Gbps
Status: Yes, all services on all nodes are running OK
Salt Status: No, there are no failures
Logs: No, there are no additional clues

Detail:
Sorry, long post (and a long week), but hopefully there's an answer, and maybe this will help someone else too.

My environment is an ESXi host with SO installed from the 2.4.70 ISO. I had a few 'failed' attempts at installing this week, but eventually got everything working as expected. But there is a 'however'! I don't think the installs actually failed; I think I was being impatient.

Following the install (and I did select the correct monitoring NIC during setup), I was not seeing all traffic on the bond0 interface using tcpdump. Using tcpdump to look at the ens'xxx' interface I WAS seeing all traffic. Checking with 'ip addr', it looked like the monitoring interface was not part of bond0, so I added it using 'so-monitor-add'. The ESXi host had already been set up with the vSwitch for the monitoring interface configured with VLAN ID 4095 (I had also read that the MTU for the interface needed to be 4096, so I set that too).

However, after trying 'curl testmyids.com' from several hosts, no alerts were generated in SOC. I tried 'so-test' and replayed the test traffic; still no alerts in SOC. 'so-status' was all good and running, and there were no issues in Grid.

So I tried a fresh install and went for EVALUATION instead. Same issues as above (if I recall correctly). Next I tried another STANDALONE install and, again, the same as the first time. I gave up and decided to give Kali Linux a try with Stamus Networks' SELKS. I used the same vSwitches for the Kali box, got it up and running, installed SELKS, and bingo: IDS working, with the added bonus of OpenVAS working too.

Not one for giving up, I gave SO another go, again STANDALONE with the 2.4.70 ISO. All was apparently up and running (again, the NIC selected during setup was not added to bond0, so I added it manually). 'so-status' was all good and running, with no issues in Grid, but also no alerts! I ran 'so-test' again and replayed the data, but again no alerts in SOC. I went for lunch. About an hour later I came back, logged into SOC, and bingo! All the alerts were there. I tried 'curl testmyids.com' from a few hosts, and within 30 seconds or so these new alerts also appeared. I acknowledged all the alerts to clear them out, shut down the guest, and took a snapshot just in case.

I booted it back up and waited for 'so-status' to be GREEN and running. I tried a few more 'curl testmyids.com' requests from various hosts. None of the alerts appeared! I was quite disheartened, but started digging. After about an hour the alerts appeared. I rebooted the SO box again, waited for 'so-status' to be all good, and ran tests. No alerts. Then I checked the agent with 'elastic-agent status'. Hmmm: 'filestream-monitoring' was 'Starting' and 'log-default' was 'Starting'. After a while a few of these statuses changed to 'Degraded', and some were still 'Starting'. Sure enough, in SOC > Elastic Fleet, the host 'securityonion' is in an unhealthy state, while 'fleetserver_securityonion' is healthy. The log window in SOC > Elastic Fleet > Host 'securityonion' has over 8,000 rows in the space of one hour following the reboot.
After about an hour following the reboot, the host turns 'Healthy', and 'elastic-agent status' shows Healthy / Connected and Healthy / Running. I'm guessing it's not normal for it to take that long to turn healthy? Where should I start looking? And go easy on me: I'm quite new to Linux! I might take a snapshot and upgrade to 2.4.80 using soup to see if it's any different.
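In case it helps anyone following along, this is roughly how I was checking whether the monitoring NIC had made it into the bond. A sketch only: 'ens192' is a placeholder for whatever your ens'xxx' interface is actually called.

```
# Is the monitor NIC enslaved to bond0? ('ens192' is a placeholder name)
ip addr show ens192 | grep -i "master bond0"

# Compare what the bond sees against the raw NIC
sudo tcpdump -i bond0 -c 10 -nn
sudo tcpdump -i ens192 -c 10 -nn

# If the NIC is missing from the bond, Security Onion's helper adds it
sudo so-monitor-add
```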
-
Updated to 2.4.80 - same issue: it takes approximately 1 hour to get up and running after a reboot.
-
Wiped the VM and started again with a fresh 2.4.80 ISO. Same issue as above. A bit more in-depth research and I found this: #12475, which is similar to what I was seeing after a reboot. The progress I've made is that, after a reboot, I stop the elastic-agent service and simply restart it (commands below); a couple of minutes later it's all working as expected. So at least I'm now up and running in a few minutes rather than having to wait a full hour. I think my question has therefore changed to "Why must I restart my elastic-agent after a reboot?", which is not answered in the post I found earlier.
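For reference, the workaround, assuming elastic-agent is managed as a systemd service (it is on my install):

```
# After so-status shows everything running, bounce the agent
sudo systemctl restart elastic-agent

# Equivalent to the stop/start pair I've been using
sudo service elastic-agent stop
sudo service elastic-agent start

# Confirm it goes healthy within a couple of minutes
sudo elastic-agent status
```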
-
Have you tried increasing the RAM assigned to the VM to see if that makes any difference? How many other VMs are running on this ESXi host? Is it possible that the ESXi host is oversubscribed? What kind of storage does the ESXi host have (NVMe, SSD, or rotational)?
-
The ESXi box is a retired production server from a smallish environment, approximately 4 years old. It used to run several guest VMs, including a DC, SQL Server, RDS server (for 50 users), and a file server. To specifically answer the questions: it's a test environment now, and although it's got several guest VMs, at the times of testing only the Security Onion VM is running. I could increase the RAM by a couple of gig, but when monitoring the CPU, memory, and disk usage of the SO guest from the ESXi host, nothing is getting remotely pushed too hard. The disks are 6 x 600GB 15K in RAID10, so 6x read and 3x write. As previously mentioned, if I reboot and wait for everything to come up 'naturally' it takes about an hour. If I reboot, wait for 'so-status' to report everything running (about 5-10 minutes), then 'sudo service elastic-agent stop' and 'sudo service elastic-agent start', I can be up and running pretty much straight away. So I don't think it's hardware constraints.
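For what it's worth, these are the sort of in-guest checks I mean; a sketch using standard tools (iostat comes from the sysstat package, which may need installing):

```
free -h               # memory headroom and swap usage
uptime                # load averages
sudo iostat -x 5 3    # extended disk stats: 3 samples, 5 seconds apart (sysstat package)
```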
-
I would go ahead and increase the RAM in the guest as much as possible, just so that we can rule that out. Next, I would start looking in /opt/so/log/ for additional clues. Which option did you choose at this setup prompt? If hostname, then you might try IP, to rule out possible DNS issues. Another possibility might be a networking or routing issue if your environment uses the 172.17.0.0/16 range (Docker's default bridge network).
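A quick way to check for that overlap from the SO box itself; a sketch, assuming the standard iproute2 tools are present (the exact subdirectory layout under /opt/so/log/ varies by component):

```
# Does anything already route into Docker's default 172.17.0.0/16 space?
ip route | grep "172\.17\."

# Sweep recent Security Onion logs for errors and warnings
sudo grep -riE "error|warn" /opt/so/log/ --include="*.log" | tail -n 20
```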
-
I've decided to start afresh one last time. This time I've added some SSD storage for /nsm to keep it separate from the RAID10 ESXi datastore. Although /nsm was originally on a separate virtual disk, it was still on the same RAID10 array. Who knows; I'm clutching at straws now. Reviewing the logs did not really get me anywhere, so I decided to flatten it, start again, and check the logs on a fresh install.
-
Started afresh. Straight after the install completed, all was working fine. After a reboot, the same issue: stopping and starting the elastic-agent gets things going. Had a look through the logs and found this: "[WARN ][org.elasticsearch.ingest.common.GrokProcessor] character class has '-' without escape", which I googled, and found someone with exactly the same issue in a discussion on here. Unfortunately, the final comment is simply that restarting the elastic-agent fixes it. But that's more of a workaround at reboot than a fix. Any ideas?
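From what I could find while googling, the warning fires when a Grok pattern contains an unescaped hyphen in the middle of a regex character class. A hypothetical illustration, not a pattern taken from SO's pipelines:

```
# Illustration only:
#   [\w-.]     <- unescaped '-' mid-class; the regex engine can't tell
#                 whether a range was intended, so GrokProcessor warns
#   [\w\-.]    <- escaped '-' is unambiguous and silent
#   [-\w.]     <- a leading (or trailing) '-' is also unambiguous
```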
-
I'm away for a few days and will fire it back up when I get back. But no domain: this is an isolated test environment. I recall that the "[WARN ][org.elasticsearch.ingest.common.GrokProcessor] character class has '-' without escape" message repeats over and over in the log and clears up after the elastic-agent restart. There are no hyphens in any of the names; I just went with all the install defaults. I doubt it's anything to do with the default name 'fleet-server', or others would see this too, I guess. A little off-topic, but when it does all kick in and I connect to SO via SSH, an alert is thrown up about the recent OpenSSH vulnerability (CVE-2024-6387) being present on the SO server itself. I tried 'yum' and 'dnf' to patch after a bit of digging, but can't seem to patch. As I said at the start, I'm quite new to Linux, but learning fast.
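On that off-topic point: if I've understood correctly, SO 2.4 is built on Oracle Linux, which backports security fixes without bumping the upstream version string, so a scanner keying on the SSH banner can false-positive. A sketch of how to check whether the fix is actually present on an RPM-based install:

```
# The version string alone can mislead on distros that backport fixes
ssh -V

# Check the installed package's changelog for the specific CVE
rpm -q --changelog openssh | grep -i "CVE-2024-6387"

# Apply any pending OpenSSH updates
sudo dnf update openssh openssh-server
```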
-
/opt/so/log/elasticsearch/securityonion.log - log attached, and that's its location. Will take a look at the link on the SSH server - thanks.
-
I took some time to compare your elasticsearch log to my local EVAL and STANDALONE deployments and those messages appear to be normal and benign. I then tried rebooting my EVAL and STANDALONE installations and checking to see how long it took for elastic-agent to show as healthy. On both of them, so-status shows all containers running at a system uptime of 6 minutes. Within 10 minutes of that, elastic-agent is fully HEALTHY. For example, at a system uptime of 14 minutes, elastic-agent shows its fleet connection is FAILED:
At a system uptime of 15 minutes, elastic-agent is fully HEALTHY:
So if you're seeing elastic-agent take an hour to go fully HEALTHY, then that is definitely NOT normal. Would it be possible for you to do a test installation on a physical machine (not a VM), just so we can rule out your virtualization environment?
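If it helps with the timing comparison, here's a rough sketch for logging system uptime alongside the agent's state once a minute (nothing Security Onion-specific here):

```
# Record uptime and the agent's top-level state every 60 seconds
while true; do
  echo "=== $(uptime -p) ==="
  sudo elastic-agent status 2>&1 | head -n 5
  sleep 60
done
```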
-
Thanks for your time looking into this - it really is appreciated. Unfortunately, I don't have any other hardware to hand that comes close to the minimum specs for running SO standalone. I'm going to boot SO one more time, have a look through the other logs, and see if there are any more clues. I'll also take some more exact measurements of start-up times etc. and be back here soon. And I'll have a dig around some of the Elastic forums to see if there are any similar issues noted anywhere.
-
OK - I've been hammering at this all day; detail in the attached, but: 6 minutes after the SO standalone VM boots, all the containers are green and running - same as yours, Doug. Fleet displays similar errors to yours too, and the elastic-agent status is also similar. That's where the similarity ends. This time, instead of stopping and starting the agents, I waited. More or less one hour after boot (within 30-40 seconds or so), the elastic-agents are running and healthy. Why one hour? I checked the ESXi host time and it's correct (UTC). I checked the SO VM time and it's also correct UTC. But why one hour? Clutching at straws here, but where can I view the validity start dates (and, more importantly, times) of all the certificates in use? My 'real world' time here is UTC+1. Could it be that at boot the certificates are being created with a validity start time one hour ahead of UTC, hence after one hour it all fires up? Or that by restarting the elastic-agents the time zones have corrected themselves and the certs are created with the correct validity time?
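In case it helps anyone point me in the right direction, this is what I was planning to try with openssl. Port 8220 is, as far as I can tell, the standard Fleet Server port; the file path is just a placeholder, not SO's actual layout:

```
# Read the Fleet Server's certificate off the wire and print its validity window
# (openssl prints notBefore/notAfter in GMT, so a cert issued an hour
#  in the future should be easy to spot)
echo | openssl s_client -connect localhost:8220 2>/dev/null | openssl x509 -noout -dates

# Or inspect a certificate file directly (placeholder path)
sudo openssl x509 -in /path/to/cert.crt -noout -startdate -enddate
```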
-
Facepalm moment! The one-hour wait made me think. Although both the SO VM and the ESXi host showed the correct UTC time, I thought I'd tick the little checkbox in ESXi here: SO_Guest > Edit Settings > VM Options > VMware Tools > Synchronize Guest Time with Host. I rebooted the SO guest VM and started checking status. After 16 minutes, all is working fine:

[onion@securityonion ~]$ sudo elastic-agent status

I don't quite get why the time sync between host and guest would be essential in this case, as both the guest and the host WERE displaying the correct time (and date). All I can think is that some certificates were being created based on UTC+1 'somehow'??? Ideally, I'd like to untick the guest/host sync, recreate the problem, and look at the certificates to see if my hunch is correct. But then again, why does it matter whether the SO guest is time-synced with the host when the SO VM has 0.pool.ntp.org? I'll have another look to see if I can work out how to view the certificates in use by the SO VM and then try to recreate the issue.
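For anyone checking the same thing, here's how I've been inspecting the guest's clock state; a sketch, assuming open-vm-tools is installed (vmware-toolbox-cmd ships with it) and chrony is the NTP client:

```
# OS clock, time zone, and NTP sync state
timedatectl

# VMware Tools host/guest time sync state (requires open-vm-tools)
vmware-toolbox-cmd timesync status

# Enable it, matching the checkbox in the ESXi UI, from inside the guest
sudo vmware-toolbox-cmd timesync enable

# List NTP sources if chrony is in use (e.g. 0.pool.ntp.org)
chronyc sources
```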
-
I've added a note to our documentation in case anybody else runs into this:
https://docs.securityonion.net/en/dev/vmware.html#esxi