-
Version: 2.4.100
Installation Method: Security Onion ISO image
Description: other (please provide detail below)
Installation Type: Distributed
Location: on-prem with Internet access
Hardware Specs: Meets minimum requirements
CPU: 8
RAM: 16
Storage for /: 211GB
Storage for /nsm: 427GB
Network Traffic Collection: span port
Network Traffic Speeds: 1Gbps to 10Gbps
Status: No, one or more services are failed (please provide detail below)
Salt Status: No, there are no failures
Logs: No, there are no additional clues

Detail:
Hello, I currently have 7 of 36 sensors reporting a fault in our grid; however, I can ping, SSH, and run salt commands against these nodes from the master without issue, and `so-status` shows these sensors are in good working condition. I have changed the NTP configuration from the SOC and verified the change has propagated to all nodes, so NTP is no longer an issue in our cluster, nor should any local or network firewalls be blocking traffic to the affected nodes. I have also tried rebooting the affected nodes, to no avail. Below are the outputs from the SOC and `so-status`. I am unsure what can be done to refresh the sensors' status in the grid. Please let me know what additional details I can provide, as I'm sure I'm leaving something out.
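For reference, the kind of from-the-manager checks described above can be sketched as below. This is illustrative only: the `try` wrapper and the `'*'` salt target are my own placeholders, and the wrapper just prints the plan when the tools are not on PATH, so the sketch is safe to run off a Security Onion box.

```shell
# Minimal sketch of health checks to run from the manager when sensors show
# a fault in SOC but respond normally over the network. The "try" helper is
# a hypothetical guard: it runs the real command only if it exists locally.
try() { if command -v "$1" >/dev/null 2>&1; then sudo "$@"; else echo "would run: $*"; fi; }

try salt '*' test.ping            # confirm every minion answers over the salt bus
try salt-run manage.status        # list which minions salt considers up vs. down
try so-status                     # per-node container/service status
```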
Replies: 2 comments 5 replies
-
Try `so-telegraf-restart` and then run `so-checkin` and see if that works. If `so-checkin` takes longer than it normally does, stop it, run `salt-call saltutil.kill_all_jobs`, then try `so-checkin` again.
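The suggested recovery order can be sketched as below. The `run` wrapper is my own addition so the sketch is safe anywhere: it executes the real Security Onion command only where it exists and otherwise just prints the plan.

```shell
# Sketch of the suggested recovery sequence on an affected sensor.
# "run" is a hypothetical guard, not a Security Onion command.
run() { if command -v "$1" >/dev/null 2>&1; then sudo "$@"; else echo "would run: $*"; fi; }

run so-telegraf-restart               # restart telegraf, which reports node health to the grid
run so-checkin                        # force an immediate salt check-in with the manager

# If so-checkin hangs far longer than usual: interrupt it, clear any stuck
# salt jobs, then check in again.
run salt-call saltutil.kill_all_jobs
run so-checkin
```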
-
The issue was with DNS; more specifically, with connectivity to DNS. The master node had not previously had a name registered in DNS, and our nodes just used the entry placed in /etc/hosts during setup. I gave the master node a name in our internal DNS and added TLS/SSL certificates to it for a better end-user experience. This change surfaced the fact that some of our nodes cannot connect to our internal DNS servers. It wasn't a problem until now because Security Onion adds static names to /etc/hosts, so the nodes never really needed to reach our DNS servers. I was able to prove this by adding the new DNS name as an alias in /etc/hosts, after which one of the problem nodes was able to reconnect to the manager. (See screenshots below)
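The /etc/hosts workaround described above can be sketched as below. The IP and hostnames (`so-manager`, `so-manager.corp.example.com`, `10.0.0.10`) are illustrative placeholders, not values from this grid, and the sketch operates on a throwaway copy of the hosts file so it is safe to run anywhere; on a real node you would edit /etc/hosts itself.

```shell
# Hypothetical sketch: make the manager's new DNS name resolvable on a node
# that cannot reach internal DNS, by adding a static hosts entry/alias.
# Works on a sandbox copy so the example has no side effects.
cp /etc/hosts /tmp/hosts.sandbox 2>/dev/null || printf '127.0.0.1 localhost\n' > /tmp/hosts.sandbox
HOSTS=/tmp/hosts.sandbox

IP="10.0.0.10"                         # placeholder manager IP
SHORT="so-manager"                     # placeholder short name from setup
FQDN="so-manager.corp.example.com"     # placeholder new DNS name

# Add an entry carrying the new name as an alias if it is not present yet.
if ! grep -q "$FQDN" "$HOSTS"; then
  printf '%s %s %s\n' "$IP" "$SHORT" "$FQDN" >> "$HOSTS"
fi

grep "$FQDN" "$HOSTS"                  # show the resulting entry
```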