Elastic Agent Status after reboot - 'Starting' #13285

moonlightbob · 2024-06-29T12:13:35Z

moonlightbob
Jun 29, 2024

Version

2.4.70

Installation Method

Security Onion ISO image

Description

other (please provide detail below)

Installation Type

Standalone

Location

on-prem with Internet access

Hardware Specs

Meets minimum requirements

CPU

8

RAM

20

Storage for /

200gb

Storage for /nsm

200

Network Traffic Collection

span port

Network Traffic Speeds

1Gbps to 10Gbps

Status

Yes, all services on all nodes are running OK

Salt Status

No, there are no failures

Logs

No, there are no additional clues

Detail

Sorry, long post, (and a long week) but hopefully there's an answer and maybe this will help someone else too.

My environment is an ESXi host, with SO installed from 2.4.70 iso. A few 'failed' attempts at install this week but eventually got everything working as expected. But there is a 'However'! However, I don't think the installs were actually 'failed' I think I was being impatient.

Following the install, and I did select the correct monitoring NIC during setup, I was not seeing all traffic on the 'bond0' interface using tcpdump. Using 'tcpdump' to look at the ens'xxx' interface I WAS seeing all traffic. Checking with 'ip addr' it looked like the monitoring interface was not part of 'bond0' so I added it using 'so-monitor-add'. The ESXi host had already been set up with the vSwitch for the monitoring interface set up with VLAN ID 4095 (and also read that MTU for the interface needed to be 4096 so did that also). However after trying 'curl testmyids.com' from several hosts, no alerts were generated in SOC. Tried 'so-test' and replayed the test traffic. Still no alerts in SOC. 'SO-Status' - all good and running. No issues in GRID.

So, I tried a fresh install and went for EVALUATION instead. Same issues as above (if I recall correctly).

Next, tried another STANDALONE install and again, same as the first time - Gave up and decided to give KALI Linux a try with StamusNetworks - SELKS. Used the same vSwitches for the KALI box, got it up and running. Installed SELKS and bingo. IDS working and also the added bonus of OpenVAS working too.

Not one for giving up I gave SO another go. Again Standalone with the 2.4.70 iso. All apparently up and running (again the selected NIIC during setup was not added to 'bond0' so added it manually. 'SO-Status' - all good and running. No issues in GRID. But also no alerts! Ran 'so-test' again and replayed the data, but again, no alerts in SOC. Went for lunch. About an hour later I came back, logged into SOC and bingo! All the alerts were there. Tried 'curl testmyids.com' from a few hosts and within 30 seconds or so these new alerts also appeared. Acknowledged all the Alerts to 'clear them out', shut down the guest and took a snapshot just in case.

Booted it back up, waited for 'so-status' to be 'GREEEN' & Running. Tried a few more 'curl testmyids.com' from various hosts. None of the alerts appeared! Was quite disheartened, but started digging. After about an hour the alerts appeared. I rebooted the SO box again, waited for so-status to be all good and ran tests. No alerts. Then checked the elastic agent status with 'elastic-agent status' Hmmm..... 'Filestream-monitoring - 'Starting' & 'log-default - 'Starting'. After while a few of these statuses changed to 'Degraded' and some still 'Starting'.

Sure enough, in SOC > Elastic Fleet > Host 'securityonion' is in unhealthy state. 'fleetserver_securityonion' is healthy. The log window in SOC > Elastic Fleet > Host 'securityonion' has over 8000 rows in the space of one hour following the reboot. After about an hour following reboot the Host turns 'Healthy' 'elastic-agent status' is Healthy / Connected & Healthy / Running.

I'm guessing its not normal for it to take that long to turn healthy? Where to start looking? And go easy on me - I'm quite new to linux! Might take a snapshot and upgrade to 2.4.80 using soup to see if its any different.

Guidelines

I have read the discussion guidelines at Read before posting! #1720 and assert that I have followed the guidelines.

Answered by dougburks

Jul 30, 2024

I've added a note to our documentation in case anybody else runs into this:
https://docs.securityonion.net/en/dev/vmware.html#esxi

View full answer

moonlightbob · 2024-06-29T13:18:42Z

moonlightbob
Jun 29, 2024
Author

Updated to 2.4.80 - same issue approx 1 hour to get up and running after reboot.

0 replies

moonlightbob · 2024-07-02T11:30:32Z

moonlightbob
Jul 2, 2024
Author

Wiped the VM and started again with fresh 2.4.80 ISO. Same issue as above.

A bit more in depth research and I found this: #12475 which is similar to what I was seeing after a reboot.

The progress is that after a reboot, I stop the Elastic-Agent service and simply restart it. a couple of mins and its all working as expected. So at least I'm now up and running in a few minutes rather than having to wait a full hour.

I think my question has therefore changed to "Why must I restart my elastic-agent after a reboot?" which is not answered in the post I found earlier.

0 replies

dougburks · 2024-07-05T17:18:38Z

dougburks
Jul 5, 2024
Maintainer

Have you tried increasing the RAM assigned to the VM to see if that makes any difference?

How many other VMs are running on this ESXi host? Is it possible that the ESXi host is oversubscribed?

What kind of storage does the ESXi host have (NVME, SSD, or rotational)?

0 replies

moonlightbob · 2024-07-06T06:28:01Z

moonlightbob
Jul 6, 2024
Author

The ESXi box is a retired production server from a smallish environment approx 4 year old. Used to run several Guest VM's including a DC, SQL Server, RDS Server (for 50 users) and a file server. To specifically answer the questions, its a test environment now and although its got several guest VM's, at the times of testing only Security Onion VM is running. I could increase the ram by a couple of gig, but when monitoring the CPU, Memory & Disk usage from the EXSi host for the guest SO VM nothing is getting remotely pushed too hard. The disks are 6 x 600Gb 15K in a RAID10. so 6 x read and 3x write.

As previously mentioned, If I reboot and wait for everything to come up 'naturally' it takes about an hour. If I reboot, wait for SO-Status to report everything running (about 5-10 mins) then 'sudo service elastic-agent stop' & 'sudo service elastic-agent start' I can be up and running pretty much straight after. So I don't think its hardware constraints.

0 replies

dougburks · 2024-07-08T17:05:39Z

dougburks
Jul 8, 2024
Maintainer

I would go ahead and increase the RAM in the guest as much as possible just so that we can rule that out.

Next I would start looking in /opt/so/log/ for additional clues.

Which option did you choose at this Setup prompt? If hostname, then you might try IP to try to rule out possible DNS issues.

Another possibility might be a networking or routing issue if your environment uses the 172.17.0.0/16 range:
https://docs.securityonion.net/en/2.4/docker.html#networking-and-bridging

2 replies

moonlightbob Jul 9, 2024
Author

So I increased to 22b RAM with no noticeable difference. i.e about 1 hour to come up 'naturally' and the workaround / fast method by restarting the Elastic Agents after SO-STATUS is all 'OK. I tried REDUCING the RAM to 16GB and its exactly the same.

Any logs in particular to look at? Can I clear the logs somehow then do a reboot so its only recent data to trawl through?

At setup I chose the IP address method as per your screen grab.

And I'm using 192.168.1.0/24 subnet for this test environment.

Thanks for the advice so far.

dougburks Jul 10, 2024
Maintainer

I'd start by looking at /opt/so/log/elasticfleet/ as that's where the Elastic Agent logs are.

Depending on what you find there, you may also want to investigate:
/opt/so/log/elasticsearch/
/opt/so/log/logstash/
/opt/so/log/redis/
or others.

moonlightbob · 2024-07-13T16:00:06Z

moonlightbob
Jul 13, 2024
Author

I've decided to start afresh one last time. This time I've added some SSD storage for /nsm to keep it separate from the RAID10 ESXi datastore. Although originally the /nsm was on a separate virtual disk, its still on the same RAID10 array. Who knows, clutching at straws now. reviewing logs did not really get me anywhere so decided to flatten, start again & check logs on a fresh install.

0 replies

moonlightbob · 2024-07-14T12:09:23Z

moonlightbob
Jul 14, 2024
Author

Started afresh - Straight after the install completed, all working fine. After a reboot the same issue. stopping and starting the elastic-agent gets things going. Had a look through the logs and found this: "[WARN ][org.elasticsearch.ingest.common.GrokProcessor] character class has '-' without escape" Which I googled and found someone with exactly the same issue on a discussion on here:

#12475

Unfortunately the final comment is simply that restarting the Elastic-Agent fix's it. But that's more of a workaround at reboot than a fix. Any ideas?

1 reply

dougburks Jul 16, 2024
Maintainer

Is there any other context around that log entry to indicate where it's seeing that hyphen without escape?

Does your hostname or domain name have a hyphen in it?

moonlightbob · 2024-07-16T11:14:25Z

moonlightbob
Jul 16, 2024
Author

I'm away for a few days will fire it back up when I get back. But, no domain - this is in an isolated test environment. I recall that this:

"[WARN ][org.elasticsearch.ingest.common.GrokProcessor] character class has '-' without escape"

Iterates over and over in the log and clears up after the elastic agent restart. No hyphens in any of the names, just went with all the install defaults. I doubt its anything to do with the default name 'fleet-server' or others I guess would see this too.

A little off topic but when it does all kick in and I connect to SO via SSH an alert is thrown up about the latest OpenSSL vulnerability (CVE-2024-6387] being present on the SO server itself. Tried 'yum' and 'dnf' to patch after a bit of digging but can't seem to patch. As I said at the start I'm quite new to linux but learning fast.

1 reply

dougburks Jul 17, 2024
Maintainer

I'm away for a few days will fire it back up when I get back.
"[WARN ][org.elasticsearch.ingest.common.GrokProcessor] character class has '-' without escape"

When you get back, please let us know which log this came from (Elastic Agent or Elasticsearch) and share the full log if possible.

A little off topic but when it does all kick in and I connect to SO via SSH an alert is thrown up about the latest OpenSSL vulnerability (https://github.com/advisories/GHSA-2x8c-95vh-gfv4] being present on the SO server itself. Tried 'yum' and 'dnf' to patch after a bit of digging but can't seem to patch. As I said at the start I'm quite new to linux but learning fast.

Please see our blog post at https://blog.securityonion.net/2024/07/security-onion-and-regresshion-cve-2024.html.

If you have further questions or problems not related to the title of this discussion, then please start a new discussion with appropriate title.

moonlightbob · 2024-07-23T09:45:23Z

moonlightbob
Jul 23, 2024
Author

securityonion.log

/opt/so/log/elasticsearch/securityonion.log

Log attached and location.

Will take a look at the link on the SSHServer - thanks.

0 replies

dougburks · 2024-07-24T11:45:10Z

dougburks
Jul 24, 2024
Maintainer

I took some time to compare your elasticsearch log to my local EVAL and STANDALONE deployments and those messages appear to be normal and benign.

I then tried rebooting my EVAL and STANDALONE installations and checking to see how long it took for elastic-agent to show as healthy. On both of them, so-status shows all containers running at a system uptime of 6 minutes. Within 10 minutes of that, elastic-agent is fully HEALTHY.

For example, at a system uptime of 14 minutes, elastic-agent shows its fleet connection is FAILED:

 11:23:20 up 14 min,  1 user,  load average: 0.86, 2.62, 2.31
┌─ fleet
│  └─ status: (FAILED) fail to checkin to fleet-server: all hosts failed: 2 errors occurred:
│         * requester 0/2 to host https://standalone.hq.acmeonions.com:8220/ errored: Post "https://standalone.hq.acmeonions.com:8220/api/fleet/agents/c3795e4a-4979-40f9-919a-08cd479ca0f7/checkin?": dial tcp 192.168.199.120:8220: connect: no route to host
│         * requester 1/2 to host https://standalone:8220/ errored: Post "https://standalone:8220/api/fleet/agents/c3795e4a-4979-40f9-919a-08cd479ca0f7/checkin?": dial tcp 127.0.0.1:8220: connect: connection refused
│     
│     
└─ elastic-agent
   └─ status: (HEALTHY) Running

At a system uptime of 15 minutes, elastic-agent is fully HEALTHY:

 11:24:20 up 15 min,  1 user,  load average: 2.37, 2.72, 2.37
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   └─ status: (HEALTHY) Running

So if you're seeing elastic-agent take an hour to go fully HEALTHY, then that is is definitely NOT normal.

Would it possible for you to do a test installation on a physical machine (not VM) just so we can rule out your virtualization environment?

0 replies

moonlightbob · 2024-07-24T12:47:58Z

moonlightbob
Jul 24, 2024
Author

Thanks for your time looking into this - it really is appreciated.

Unfortunately I don't have any other hardware at hand that comes close to the minimum specs for running SO standalone.

I'm just going to boot SO one more time and have a look through other logs and see if there are any more clues. Will also take some more exact time measurements of start-up times etc and be back here soon. I'll also have a dig around some of the Elastic forums etc to see if there are any similar issues noted anywhere.

0 replies

moonlightbob · 2024-07-24T15:05:47Z

moonlightbob
Jul 24, 2024
Author

OK - I've been hammering at this all day, detail in the attached but:

6 mins after SO Standalone VM boots, all the containers are 'Green and Running' - Same as yours Doug.

Fleet displays similar errors to yours also Doug, and Elastic-Agent status is also similar.

That's where the similarity ends.

This time instead of stopping and starting the agents, I waited. More or less (within 30-40 seconds or so) after one hour post boot the Elastic-Agents are running & healthy. Why one hour? I checked the ESXi host time and its correct (UTC). Checked the SO VM time and its also correct UTC. But why one hour?

Clutching at straws here, but where can I view the all the certificates validity start dates (and more importantly times) that are in use? My 'real world' time here is UTC+1, could it be that at boot certificates used are being created with a validity start time one hour ahead of UTC? Hence after one hour it all fires up? Or by restarting Elastic-Agents time zones have corrected themselves and certs are created with the correct validity time?

status over time.txt

0 replies

moonlightbob · 2024-07-24T16:13:06Z

moonlightbob
Jul 24, 2024
Author

Facepalm Moment!

The 1 hour wait bit made me think. Although both the SO VM and the ESXi host both showed the correct UTC time I thought I'd hit the little checkbox in ESXi here: SO_Guest > Edit Settings > VM Options > VMWare Tools > Synchronise Guest Time with Host. Reboot the SO Guest VM and start checking status. After 16 min all is working fine:

[onion@securityonion ~]$ sudo elastic-agent status
┌─ fleet
│ └─ status: (HEALTHY) Connected
└─ elastic-agent
└─ status: (HEALTHY) Running

I don't quite get why the time sync between host & guest would be essential in this case as both the guest and the host WERE displaying the correct time (and date). All I can think is that some certificates were being created based on UTC+1 'somehow'???

Ideally I'd like to untick the sync Guest & Host and recreate the problem / look at certificates to see if my hunch is correct? But then again, why does it matter if the SO Guest is synced for time with the host when the SO VM has 0.pool.ntp.org
1.pool.ntp.org in there as default for its time anyway?

I'll start having another look to see if I can work out how to view certificates in use by the SO VM and then try and recreate the issue.

7 replies

moonlightbob Jul 26, 2024
Author

Unchecked the ESXi option to sync guest time with host.
Disconnected the VM from management and sniffing interfaces, so essentially both interfaces unplugged.
Booted the SO VM straight into BIOS to see what the BIOS clock thinks the time is. Its UTC (my time -1hr)
Exit BIOS and let SO Boot - as soon as I get console, checked the time. Its also UTC
Elastic-Agents not started after 30 mins, so shut down.
Reconnected the management and sniffing interfaces and boot up the SO VM.
Elastic-Agents not started after 30 mins, so shut down.
Re-tick the sync host <> Guest time ESXi option & Boot up
After circa 16min, all SO-Status green and Elastic-agent & Fleet all good.

I'm baffled by this. Makes no sense as UTC time seems correct in all boot instances.

But it works with the SYNC ticked! Just don't like a mystery though.

dougburks Jul 29, 2024
Maintainer

If you haven't already, it might be good to check the output of the following just to verify that chronyd is working as expected and is not perhaps changing the time in some unexpected way:

sudo systemctl status chronyd

You might also try changing the NTP configuration to a local NTP server if you have one just to see if that makes any difference.
https://docs.securityonion.net/en/2.4/ntp.html

moonlightbob Jul 30, 2024
Author

Regarding NTP, nothing internal. Tend to use pool.ntp.org. For 'sudo systemctl status chronyd' all looks good, reports enabled, running and time is correct .

I think I'm going to say 'Ticking the Sync Time ESXi Option' fixed this and get on with what I started out to do and have a good look around SO and check out the features etc.

dougburks Jul 30, 2024
Maintainer

I've added a note to our documentation in case anybody else runs into this:
https://docs.securityonion.net/en/dev/vmware.html#esxi

Answer selected by moonlightbob

Elastic Agent Status after reboot - 'Starting' #13285

Uh oh!

Uh oh!

moonlightbob Jun 29, 2024

Version

Installation Method

Description

Installation Type

Location

Hardware Specs

CPU

RAM

Storage for /

Storage for /nsm

Network Traffic Collection

Network Traffic Speeds

Status

Salt Status

Logs

Detail

Guidelines

Replies: 13 comments · 11 replies

Uh oh!

moonlightbob Jun 29, 2024 Author

Uh oh!

moonlightbob Jul 2, 2024 Author

Uh oh!

dougburks Jul 5, 2024 Maintainer

Uh oh!

moonlightbob Jul 6, 2024 Author

Uh oh!

dougburks Jul 8, 2024 Maintainer

Uh oh!

moonlightbob Jul 9, 2024 Author

Uh oh!

dougburks Jul 10, 2024 Maintainer

Uh oh!

moonlightbob Jul 13, 2024 Author

Uh oh!

moonlightbob Jul 14, 2024 Author

Uh oh!

dougburks Jul 16, 2024 Maintainer

Uh oh!

moonlightbob Jul 16, 2024 Author

Uh oh!

dougburks Jul 17, 2024 Maintainer

Uh oh!

moonlightbob Jul 23, 2024 Author

Uh oh!

dougburks Jul 24, 2024 Maintainer

Uh oh!

moonlightbob Jul 24, 2024 Author

Uh oh!

moonlightbob Jul 24, 2024 Author

Uh oh!

moonlightbob Jul 24, 2024 Author

Uh oh!

moonlightbob Jul 26, 2024 Author

Uh oh!

dougburks Jul 29, 2024 Maintainer

Uh oh!

moonlightbob Jul 30, 2024 Author

Uh oh!

dougburks Jul 30, 2024 Maintainer

moonlightbob
Jun 29, 2024

Replies: 13 comments 11 replies

moonlightbob
Jun 29, 2024
Author

moonlightbob
Jul 2, 2024
Author

dougburks
Jul 5, 2024
Maintainer

moonlightbob
Jul 6, 2024
Author

dougburks
Jul 8, 2024
Maintainer

moonlightbob Jul 9, 2024
Author

dougburks Jul 10, 2024
Maintainer

moonlightbob
Jul 13, 2024
Author

moonlightbob
Jul 14, 2024
Author

dougburks Jul 16, 2024
Maintainer

moonlightbob
Jul 16, 2024
Author

dougburks Jul 17, 2024
Maintainer

moonlightbob
Jul 23, 2024
Author

dougburks
Jul 24, 2024
Maintainer

moonlightbob
Jul 24, 2024
Author

moonlightbob
Jul 24, 2024
Author

moonlightbob
Jul 24, 2024
Author

moonlightbob Jul 26, 2024
Author

dougburks Jul 29, 2024
Maintainer

moonlightbob Jul 30, 2024
Author

dougburks Jul 30, 2024
Maintainer