sharanch/operations-autobot

sre-alarm-cli

A command-line tool for SRE on-call automation. Given a Jira ticket number, it fetches the ticket, identifies the alarm type from the title, extracts the target node from the description, and runs the matching Ansible playbook. It then posts a mitigation comment on the ticket, assigns it to the on-call engineer, and transitions it to In Progress, all in one command.

python cli.py --ticket OPS-1234

How it works

ticket number
    │
    ▼
Jira API (fetch ticket)
    │
    ├── summary  → alarm type  (keyword match)
    └── description → node hostname (regex parse)
                │
                ▼
        alarm_handler.py (switch/case)
                │
                ▼
        ansible-playbook <playbook> -i <node>,
                │
                ▼
        Jira: post comment + assign + transition → In Progress
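
The playbook step in the diagram can be sketched as follows. This is a minimal illustration, not the actual contents of ansible_runner.py: the function names build_ansible_command and run_playbook are hypothetical. The trailing comma after the node tells Ansible to treat the -i value as an inline host list rather than a path to an inventory file.

```python
import subprocess


def build_ansible_command(playbook: str, node: str) -> list[str]:
    """Build the argv list for running one playbook against one host."""
    # "<node>," (with the trailing comma) is an inline inventory of one host.
    return ["ansible-playbook", playbook, "-i", f"{node},"]


def run_playbook(playbook: str, node: str) -> int:
    """Run the playbook and return its exit code (0 means success)."""
    cmd = build_ansible_command(playbook, node)
    # Passing a list (no shell=True) avoids shell interpretation of the
    # hostname, which comes from ticket text.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode
```

For example, build_ansible_command("playbooks/remediate_high_cpu.yml", "prod-web-07.internal") yields the four-element argv shown in the diagram, with "prod-web-07.internal," as the inventory argument.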

Supported alarm types

Alarm Type       Trigger keywords (in ticket title)                      Playbook
HIGH_CPU         high cpu, cpu spike, cpu utilization                    remediate_high_cpu.yml
DISK_FULL        disk full, disk usage, low disk, no space left          remediate_disk_full.yml
SERVICE_DOWN     service down, service unavailable, process not running  remediate_service_down.yml
OOM_KILL         oom kill, out of memory, memory killed                  remediate_oom_kill.yml
NTP_DRIFT        ntp drift, clock skew, time drift                       remediate_ntp_drift.yml
HIGH_MEMORY      high memory, memory usage, memory pressure              remediate_high_memory.yml
NETWORK_LATENCY  network latency, packet loss, high latency              remediate_network_latency.yml
SSL_EXPIRY       ssl expiry, certificate expir, tls expir                remediate_ssl_expiry.yml
LOAD_AVERAGE     load average, high load, load avg                       remediate_load_average.yml
SWAP_USAGE       swap usage, swap full, high swap                        remediate_swap_usage.yml

Ticket format expectations

The tool parses two fields from the Jira ticket:

Summary (title) — must contain a recognisable alarm keyword:

[ALERT] High CPU utilization on prod-web-07 — threshold 90%

Description — must contain the target node on its own line:

Node: prod-web-07.internal
Runbook: https://wiki.example.com/runbooks/high-cpu

Accepted prefixes: Node:, Host:, Server:, Target:, Affected Node:, Affected Host:


Setup

Prerequisites

  • Python 3.10+
  • Ansible installed and in $PATH
  • SSH access from your machine to target nodes

Install

git clone https://github.com/yourusername/sre-alarm-cli.git
cd sre-alarm-cli
pip install -r requirements.txt
cp env.example .env

Edit .env with your Jira credentials and Ansible settings.


Usage

# Run full automation
python cli.py --ticket OPS-1234

# Dry run — fetch, parse and log what would happen, but don't execute anything
python cli.py --ticket OPS-1234 --dry-run
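
The two flags above could be declared with argparse roughly as follows. The flag names are taken from the commands shown; the function name build_parser and the help strings are illustrative assumptions, not the contents of cli.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create a parser for the --ticket and --dry-run flags."""
    parser = argparse.ArgumentParser(prog="cli.py")
    parser.add_argument("--ticket", required=True, help="Jira ticket key, e.g. OPS-1234")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Fetch, parse, and log the plan without executing anything",
    )
    return parser


args = build_parser().parse_args(["--ticket", "OPS-1234", "--dry-run"])
# argparse exposes --dry-run as the attribute dry_run.
if args.dry_run:
    print(f"[dry run] would process {args.ticket}")
```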

Example output

2024-01-15 14:23:01  INFO     sre-alarm-cli — Starting automation for ticket: OPS-1234
2024-01-15 14:23:01  INFO     sre-alarm-cli — Fetched ticket — Summary: [ALERT] High CPU utilization on prod-web-07
2024-01-15 14:23:01  INFO     sre-alarm-cli — Alarm type : HIGH_CPU
2024-01-15 14:23:01  INFO     sre-alarm-cli — Target node: prod-web-07.internal
2024-01-15 14:23:01  INFO     sre-alarm-cli — Playbook   : playbooks/remediate_high_cpu.yml
2024-01-15 14:23:01  INFO     ansible_runner — Running: ansible-playbook playbooks/remediate_high_cpu.yml ...
2024-01-15 14:23:08  INFO     ansible_runner — Playbook completed successfully.
2024-01-15 14:23:08  INFO     jira_updater — Comment posted to OPS-1234
2024-01-15 14:23:09  INFO     jira_updater — Ticket OPS-1234 assigned to jane.doe
2024-01-15 14:23:09  INFO     jira_updater — Ticket OPS-1234 transitioned to In Progress.
2024-01-15 14:23:09  INFO     sre-alarm-cli — Done. Ticket OPS-1234 updated and set to In Progress.
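
The three jira_updater log lines correspond to three Jira REST calls. The sketch below only builds the requests (method, URL, JSON body) without sending them, assuming the Jira Cloud REST API v3 endpoints for comments, assignee, and transitions. The function name build_jira_updates is hypothetical, and identifying the assignee by accountId is an assumption based on Jira Cloud; the tool itself may resolve the on-call engineer differently.

```python
def build_jira_updates(
    base_url: str,
    ticket: str,
    comment: str,
    account_id: str,
    transition_id: str,
) -> list[tuple[str, str, dict]]:
    """Return (method, url, json_body) for comment, assign, and transition."""
    issue_url = f"{base_url}/rest/api/3/issue/{ticket}"
    # API v3 expects comment bodies in Atlassian Document Format (ADF).
    comment_body = {
        "body": {
            "type": "doc",
            "version": 1,
            "content": [
                {"type": "paragraph", "content": [{"type": "text", "text": comment}]}
            ],
        }
    }
    return [
        ("POST", f"{issue_url}/comment", comment_body),
        ("PUT", f"{issue_url}/assignee", {"accountId": account_id}),
        ("POST", f"{issue_url}/transitions", {"transition": {"id": transition_id}}),
    ]
```

Each tuple can then be sent with any HTTP client using basic auth (email plus API token), in the order listed so the ticket only moves to In Progress after the comment and assignment succeed.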

Project structure

sre-alarm-cli/
├── cli.py                  # Entry point and orchestration
├── jira_client.py          # Jira REST API auth and ticket fetching
├── alarm_handler.py        # Alarm type resolution and playbook mapping
├── ansible_runner.py       # subprocess wrapper for ansible-playbook
├── jira_updater.py         # Comment, assign, and transition via Jira API
├── config.py               # Environment variable loading
├── playbooks/
│   ├── remediate_high_cpu.yml
│   ├── remediate_disk_full.yml
│   ├── remediate_service_down.yml
│   ├── remediate_oom_kill.yml
│   ├── remediate_ntp_drift.yml
│   ├── remediate_high_memory.yml
│   ├── remediate_network_latency.yml
│   ├── remediate_ssl_expiry.yml
│   ├── remediate_load_average.yml
│   └── remediate_swap_usage.yml
├── env.example
├── requirements.txt
└── README.md

Extending the tool

Adding a new alarm type

  1. Add a new entry to ALARM_KEYWORDS in alarm_handler.py:
     "MY_ALARM": ["keyword one", "keyword two"],
  2. Add it to ALARM_PLAYBOOK_MAP:
     "MY_ALARM": "remediate_my_alarm.yml",
  3. Create playbooks/remediate_my_alarm.yml.

Finding your Jira transition ID

curl -u your@email.com:your_api_token \
  https://your-org.atlassian.net/rest/api/3/issue/OPS-1/transitions \
  | python3 -m json.tool | grep -A2 '"name"'

Set the id for "In Progress" as JIRA_TRANSITION_IN_PROGRESS in your .env.


License

MIT

About

An anonymized version of a Python CLI tool I built during my tenure at Oracle.
