Support eventdb to record reported alarms #5

Merged
jjin62 merged 3 commits into 202411_otn from 202411_otn_alarm_eventdb_upstream on Mar 20, 2026

@dudu579 dudu579 commented Mar 17, 2026

Why I did it

Work item tracking

How I did it

  • Each ONIE_PLATFORM (device) owns its own alarm list: sonic-buildimage/device/molex/x86_64-otn-kvm_x86_64-r0/default.json
  • The eventd container installs the sonic-eventd-otn-profile Debian package to apply the device-level alarm list.
  • sonic-eventd-otn-profile copies the ONIE_PLATFORM alarm list and the syslog plugin to eventd.

Architecture

orchagent (docker-swss)
  │  SWSS_LOG_NOTICE → syslog tag: swss#orchagent
  ▼
Host rsyslogd
  │  matches programname "orchagent" (partial match via otn_events.conf)
  │  action: omprog → rsyslog_plugin -r /etc/rsyslog.d/otn_regex.json -m sonic-events-otn
  ▼
rsyslog_plugin  (host process)
  │  regex extracts: severity, resource, type-id, action
  │  publishes JSON to ZMQ tcp://127.0.0.1:5570
  ▼
eventd (docker-eventd) — ZMQ proxy XSUB:5570 → XPUB:5571
  ▼
eventdb / eventconsume.cpp (docker-eventd) — subscribes 5571
  │  looks up type-id in /etc/evprofile/default.json  (inside container)
  │  writes EVENT table
  │  action=RAISE/CLEAR → writes ALARM table
  ▼
Redis EVENT_DB (DB 19)
  ├── EVENT|<id>   { type-id, text, time-created, action, resource, severity }
  ├── EVENT_STATS
  ├── ALARM|<id>   { type-id, text, time-created, acknowledged, resource, severity }
  └── ALARM_STATS
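
The last stage of the diagram (profile lookup, EVENT write, ALARM write on RAISE/CLEAR) can be sketched roughly as below. This is only an illustration in Python, not the code in eventconsume.cpp: the in-memory dicts stand in for Redis EVENT_DB, and `make_db`, `consume`, and the profile shape (a dict keyed by type-id) are hypothetical names. Two behaviours are assumptions on my side: type-ids missing from the profile are ignored, and CLEAR removes the active alarm with the same type-id and resource.

```python
import itertools
import time


def make_db():
    """Hypothetical in-memory stand-in for the EVENT and ALARM tables."""
    return {"EVENT": {}, "ALARM": {}, "next_id": itertools.count(1)}


def consume(db, profile, msg):
    """Record one published message; return the new EVENT id, or None if ignored."""
    entry = profile.get(msg["type-id"])
    if entry is None:
        return None  # assumption: unknown type-ids are ignored

    event_id = str(next(db["next_id"]))
    row = {
        "type-id": msg["type-id"],
        "text": msg.get("text", ""),
        "time-created": str(time.time_ns()),
        "resource": msg["resource"],
        "severity": entry["severity"],  # severity comes from the profile entry
    }

    # Every message is written to EVENT; the action is kept only when present.
    event_row = dict(row)
    action = msg.get("action")
    if action:
        event_row["action"] = action
    db["EVENT"][event_id] = event_row

    if action == "RAISE":
        db["ALARM"][event_id] = dict(row, acknowledged="false")
    elif action == "CLEAR":
        # assumption: CLEAR drops active alarms with the same type-id and resource
        for alarm_id, alarm in list(db["ALARM"].items()):
            if (alarm["type-id"], alarm["resource"]) == (row["type-id"], row["resource"]):
                del db["ALARM"][alarm_id]
    return event_id
```

A message without an action only lands in EVENT, which matches the split between events and alarms in the diagram.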

Bug fix and changes

  • eventdb does not start inside eventd: priority.
  • Add OTN-OA min and max gain range to the device config.
  • Possible failure during parallel compilation (events test timeout).

@dudu579 dudu579 requested review from jjin62, oplklum and pkable March 17, 2026 17:38
@dudu579 dudu579 self-assigned this Mar 17, 2026

dudu579 commented Mar 17, 2026

Improvements:

  • Try to use a general type-id with a description (description -> type-id, from default.json, in sonic-swss (orchagent))
  • Record description in eventdb with text
  • lower-case
  • Desired alarm-syslog format: timestamp, alarm-severity, nodename:entity, type-id, description, state
  • An alarm does not always have an action. Think about how to decide this
  • Make sure alarm flapping will be blocked by eventdb. (Found this bug in SONiC...)


dudu579 commented Mar 18, 2026

@jjin62 There is a concern with the description -> type-id mapping. If we do this, the reported severity is dropped automatically, because every alarm that shares a type-id uses that type-id's severity. Here is my experiment:

root@sonic:~# logger -t "swss#orchagent" "2026-03-17T22:46:16Z, CRITICAL, sonic:OA0-0, Out of Gain Range, RAISE"

root@sonic:~# redis-cli -n 19 hgetall "ALARM|10"
 1) "time-created"
 2) "1773797133627957758"
 3) "action"
 4) "RAISE"
 5) "resource"
 6) "sonic:OA0-0"
 7) "type-id"
 8) "Out of Gain Range"
 9) "severity"
10) "MINOR"
11) "id"
12) "10"
13) "acknowledged"
14) "false" 

I tried to raise a CRITICAL alarm, but the stored alarm ended up with the severity defined for Out of Gain Range in default.json (MINOR). I think we need to keep type-id = alarm-id, with each alarm using its own severity (same as the SAI definition).
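
To make the coupling in the experiment above concrete, here is a small Python sketch. It only models the observed behaviour (the stored severity is looked up by type-id and the reported one is ignored); `stored_severity`, the profile dicts, and the two alarm ids `HYPOTHETICAL_GAIN_HIGH` / `HYPOTHETICAL_GAIN_LOW` are made up for illustration and do not exist in default.json.

```python
def stored_severity(profile, type_id, reported_severity):
    """Return the severity that ends up in the ALARM row.

    reported_severity is deliberately unused: in the experiment the value
    from the syslog line was replaced by the profile's severity.
    """
    return profile[type_id]["severity"]


# Description used as type-id: one entry, so one severity for every alarm
# that carries this description.
by_description = {"Out of Gain Range": {"severity": "MINOR"}}

# type-id = alarm-id: each (hypothetical) alarm keeps its own severity.
by_alarm_id = {
    "HYPOTHETICAL_GAIN_HIGH": {"severity": "CRITICAL"},
    "HYPOTHETICAL_GAIN_LOW": {"severity": "MINOR"},
}

print(stored_severity(by_description, "Out of Gain Range", "CRITICAL"))
print(stored_severity(by_alarm_id, "HYPOTHETICAL_GAIN_HIGH", "CRITICAL"))
```

With the description-keyed profile the CRITICAL report is stored as MINOR; with one entry per alarm id the severity is still taken from the profile, but it can differ per alarm.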


dudu579 commented Mar 18, 2026

| Improvements | Conflicts |
| --- | --- |
| General type-id with description (description -> type-id: default.json) | The severity tight coupling described above |
| Record description in eventdb with text | No conflicts |
| lower-case | Do we need to follow the naming rules in sonic-yang? They define action/severity as enumerations (all upper case). If not, we can change it. |
| Desired alarm-syslog format | No conflicts |
| Alarm does not always have actions. Think about how to decide it | Based on the consumer code, alarms and events are distinguished by action: an alarm has an action, an event does not. We could follow that rule. Alarms are a subset of events. |
| Make sure alarm flapping will be blocked by eventdb. (Found this bug in SONiC...) | Molex should not be responsible for fixing this, or it may not be a bug at all. SAI or the driver should never raise the same alarm repeatedly. |
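
For reference, if the repeated-raise case from the flapping item were ever handled on the consumer side, one possible guard could look like the sketch below. This is not something eventdb does today as far as this thread shows; `should_record_raise`, `record_clear`, and the `active_alarms` set are hypothetical, and the guard treats an alarm as a duplicate when the same (type-id, resource) pair is already active.

```python
def should_record_raise(active_alarms, type_id, resource):
    """Return True only for the first RAISE of a (type-id, resource) pair."""
    key = (type_id, resource)
    if key in active_alarms:
        return False  # already raised and not yet cleared: skip the duplicate
    active_alarms.add(key)
    return True


def record_clear(active_alarms, type_id, resource):
    """Forget the pair so that a later RAISE is recorded again."""
    active_alarms.discard((type_id, resource))


active = set()
print(should_record_raise(active, "Out of Gain Range", "sonic:OA0-0"))
print(should_record_raise(active, "Out of Gain Range", "sonic:OA0-0"))
record_clear(active, "Out of Gain Range", "sonic:OA0-0")
print(should_record_raise(active, "Out of Gain Range", "sonic:OA0-0"))
```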

@dudu579 dudu579 force-pushed the 202411_otn_alarm_eventdb_upstream branch 2 times, most recently from 7c6e83b to 6e94484 Compare March 18, 2026 22:32

dudu579 commented Mar 19, 2026

I tried using sonic-event.yang to generate the command lines, and they are working now. https://github.com/sonic-molex/sonic-swss currently supports only events, not alarms. (Out of Gain Range is not a good demo case: eventdb will treat an event whose syslog line carries an action as an alarm.)

| Type | syslog format | Example |
| --- | --- | --- |
| Event | timestamp, alarm-severity, nodename:entity, type-id, description | logger -t "swss#orchagent" "2026-03-05T22:47:16Z, MINOR, sonic:OA0-0, Out of Gain Range, test" |
| Alarm | timestamp, alarm-severity, nodename:entity, type-id, description, action | logger -t "swss#orchagent" "2026-03-17T22:46:16Z, CRITICAL, sonic:OA0-0, Out of Gain Range, Out of Gain Range, RAISE" |
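
The two formats above differ only in the trailing action field, so a parser can split off the first four fields and then check whether the remainder ends in an action. The Python sketch below illustrates that idea; it is not the regex from otn_regex.json, and `parse_otn_syslog` and `ACTIONS` are hypothetical names. Only RAISE and CLEAR are treated as actions, since those are the two mentioned in this PR. Splitting this way also keeps commas inside the description intact.

```python
ACTIONS = {"RAISE", "CLEAR"}  # the only actions mentioned in this PR


def parse_otn_syslog(message):
    """Split one syslog message body into event/alarm fields."""
    parts = [p.strip() for p in message.split(",", 4)]
    if len(parts) != 5:
        raise ValueError("expected at least 5 comma-separated fields")
    timestamp, severity, resource, type_id, rest = parts

    record = {
        "timestamp": timestamp,
        "severity": severity,
        "resource": resource,
        "type-id": type_id,
        "text": rest,
    }
    # An alarm carries a trailing action; an event does not.
    head, sep, tail = rest.rpartition(",")
    if sep and tail.strip() in ACTIONS:
        record["text"] = head.strip()
        record["action"] = tail.strip()
    return record


event = parse_otn_syslog(
    "2026-03-05T22:47:16Z, MINOR, sonic:OA0-0, Out of Gain Range, test")
alarm = parse_otn_syslog(
    "2026-03-17T22:46:16Z, CRITICAL, sonic:OA0-0, Out of Gain Range, "
    "Out of Gain Range, RAISE")
print("action" in event, alarm["action"])
```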

admin@sonic:~$ show event
ID    EVENT STATE
----  -------------
admin@sonic:~$ show event-stats
ID    EVENTS    RAISED    ACKED    CLEARED
----  --------  --------  -------  ---------
admin@sonic:~$ config otn-oa update OA0-0 --target-gain 40
Root privileges are required for this operation
admin@sonic:~$ sudo -i
root@sonic:~# config otn-oa update OA0-0 --target-gain 40
sonic_yang(6):Note: Below table(s) have no YANG models: OTN_OCM, OTN_OCM_CHANNEL, OTN_WSS, OTN_WSS_SPEC_POWER
sonic_yang(6):Note: Below table(s) have no YANG models: OTN_OCM, OTN_OCM_CHANNEL, OTN_WSS, OTN_WSS_SPEC_POWER
root@sonic:~# cat /usr/bin/yang_auto_cli.sh ^C
root@sonic:~# ^C
root@sonic:~# show event
  ID  EVENT STATE
----  -----------------------------------------------------------------
   1  resource:      sonic:OA0-0
      text:          Target gain value 4000 is out of range [300, 2000]
      time-created:  1773879207694653367
      type-id:       Out of Gain Range
      severity:      MINOR
root@sonic:~# show event-stats
ID       EVENTS    RAISED    ACKED    CLEARED
-----  --------  --------  -------  ---------
state         1         0        0          0

Support sonic-event.yang CLI generation in yang_auto_cli.sh
@dudu579 dudu579 force-pushed the 202411_otn_alarm_eventdb_upstream branch from 6e94484 to ad1948e Compare March 19, 2026 00:31

dudu579 commented Mar 19, 2026

Open questions:

  • default.json mentions: "use 'event profile ' command to apply that profile without having to send SIGINT to eventd." But right now I cannot run the event command in SONiC at all. I believe it is hidden in some PRs, such as Event Management support (sonic-net/sonic-mgmt-framework#85). With that command we could avoid installing the Debian package.
  • Make sure alarm flapping will be blocked by eventdb. (Found this bug in SONiC...)
  • sonic-alarm.yang is not extended with sonic-yang. If we want to enable show alarm in SONiC, we need to change sonic-alarm.yang.

@jjin62 jjin62 merged commit e131fdc into 202411_otn Mar 20, 2026
1 check passed
@jjin62 jjin62 deleted the 202411_otn_alarm_eventdb_upstream branch March 20, 2026 18:08