Merged
47 changes: 47 additions & 0 deletions .github/workflows/build-and-push-uptime-service.yml
@@ -0,0 +1,47 @@
---
name: Build and push Uptime Service docker image

on:
  workflow_call:
    inputs:
      ref:
        description: "git ref: hash, branch, tag to build uptime-service files from"
        type: string
        required: true

jobs:
  main:
    name: Build Uptime Service
    runs-on: ubuntu-24.04
    steps:
      - name: Checkout source code
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref }}
          fetch-depth: 0

      - name: Call action get-ref-properties
        id: get-ref-properties
        uses: Cardinal-Cryptography/github-actions/get-ref-properties@v7

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Public Amazon ECR
        uses: docker/login-action@v3
        with:
          registry: ${{ vars.ECR_PUBLIC_HOST }}
          username: ${{ secrets.AWS_MAINNET_ECR_CC_ACCESS_KEY_ID }}
          password: ${{ secrets.AWS_MAINNET_ECR_CC_ACCESS_KEY }}

      - name: Build and push docker image
        id: build-image
        uses: docker/build-push-action@v6
        with:
          context: ./ts/uptime-service
          file: ./ts/uptime-service/Dockerfile
          push: true
          # yamllint disable rule:line-length
          tags: |
            ${{ vars.ECR_CC_RES_PUBLIC_REGISTRY }}uptime-service:${{ steps.get-ref-properties.outputs.sha }}
            ${{ github.ref == 'refs/heads/main' && format('{0}uptime-service:latest', vars.ECR_CC_RES_PUBLIC_REGISTRY) || '' }}
1 change: 1 addition & 0 deletions ts/pnpm-workspace.yaml
@@ -6,3 +6,4 @@ packages:
- "shielder-sdk"
- "shielder-sdk-tests"
- "!shielder-sdk-crypto-mobile"
- "!uptime-service"
27 changes: 27 additions & 0 deletions ts/uptime-service/.env.example
@@ -0,0 +1,27 @@
# Port for the metrics HTTP server
PORT=9615

# Interval between health check probes in milliseconds
PROBE_INTERVAL=10000

# HTTP request timeout in milliseconds
TIMEOUT=5000

# List of endpoints to monitor (JSON array format)
# Each endpoint should have:
# - name: Unique identifier for the service
# - url: Full URL of the health endpoint
# - method: HTTP method (optional, defaults to GET)
# - expectedStatus: Expected HTTP status code (optional, defaults to 200)
ENDPOINTS='[
  {
    "name": "example-api",
    "url": "http://localhost:3000/health",
    "method": "GET",
    "expectedStatus": 200
  },
  {
    "name": "example-database",
    "url": "http://localhost:5432/health"
  }
]'
7 changes: 7 additions & 0 deletions ts/uptime-service/.gitignore
@@ -0,0 +1,7 @@
node_modules/
.env
*.log
.DS_Store
dist/
build/
coverage/
41 changes: 41 additions & 0 deletions ts/uptime-service/Dockerfile
@@ -0,0 +1,41 @@
FROM oven/bun:1.1.38-slim AS builder

WORKDIR /app

# Copy package files
COPY package.json bun.lock* ./

# Install dependencies
RUN bun install --frozen-lockfile --production

# Copy source code
COPY src ./src

FROM oven/bun:1.1.38-slim

WORKDIR /app

# Install ca-certificates for HTTPS requests
RUN apt-get update && \
    apt-get install -y ca-certificates && \
    rm -rf /var/lib/apt/lists/*

# Copy dependencies and source from builder
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./package.json
COPY --from=builder /app/src ./src

# Create non-root user
RUN useradd -r -s /bin/false appuser && \
    chown -R appuser:appuser /app

USER appuser

# Expose metrics port
EXPOSE 9615

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD bun run -e "fetch('http://localhost:9615/health').then(r => r.ok ? process.exit(0) : process.exit(1)).catch(() => process.exit(1))"
Comment on lines +35 to +39

⚠️ Potential issue | 🔴 Critical

Port mismatch between the image and the runtime default. The container exposes and probes port 9615, yet `loadConfig` leaves the service listening on 9090 by default, so unless `PORT` is set, every healthcheck will report "unhealthy." Align the two by exporting the expected port inside the image or by updating the runtime default.

```diff
+# Ensure the runtime listens on the probed port
+ENV PORT=9615
 # Expose metrics port
 EXPOSE 9615
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-EXPOSE 9615
-# Health check
-HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
-    CMD bun run -e "fetch('http://localhost:9615/health').then(r => r.ok ? process.exit(0) : process.exit(1)).catch(() => process.exit(1))"
+# Ensure the runtime listens on the probed port
+ENV PORT=9615
+EXPOSE 9615
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD bun run -e "fetch('http://localhost:9615/health').then(r => r.ok ? process.exit(0) : process.exit(1)).catch(() => process.exit(1))"
```

🤖 Prompt for AI Agents

In ts/uptime-service/Dockerfile around lines 35 to 39, the HEALTHCHECK and
EXPOSE use port 9615 while the app defaults to port 9090, causing healthchecks
to fail; fix by making the image and runtime agree — either export the expected
PORT (e.g., add ENV PORT=9615) so the service listens on 9615, or change the
EXPOSE and HEALTHCHECK to 9090 to match the app default; apply only one approach
and ensure the chosen port is documented in the container runtime config.

Comment on lines +38 to +39

⚠️ Potential issue | 🔴 Critical

The healthcheck misfires: the script calls `bun run -e`, but Bun evaluates inline code with plain `bun -e`, so the probe fails before it begins. Switching to `bun -e` lets the healthcheck report status as intended.

```diff
-  CMD bun run -e "fetch('http://localhost:9615/health').then(r => r.ok ? process.exit(0) : process.exit(1)).catch(() => process.exit(1))"
+  CMD bun -e "fetch('http://localhost:9615/health').then(r => r.ok ? process.exit(0) : process.exit(1)).catch(() => process.exit(1))"
```

📝 Committable suggestion

Suggested change

```diff
 HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
-    CMD bun run -e "fetch('http://localhost:9615/health').then(r => r.ok ? process.exit(0) : process.exit(1)).catch(() => process.exit(1))"
+    CMD bun -e "fetch('http://localhost:9615/health').then(r => r.ok ? process.exit(0) : process.exit(1)).catch(() => process.exit(1))"
```

🤖 Prompt for AI Agents

In ts/uptime-service/Dockerfile around lines 38 to 39, the HEALTHCHECK uses "bun
run -e" which is incorrect for evaluating inline JS with Bun; change the command
to use "bun -e" (i.e., replace "bun run -e" with "bun -e") so the healthcheck
executes the inline fetch expression correctly and returns proper exit codes.


ENTRYPOINT ["bun", "run", "src/index.js"]
181 changes: 181 additions & 0 deletions ts/uptime-service/README.md
@@ -0,0 +1,181 @@
# Uptime Monitoring Service

A lightweight Node.js/Bun service that monitors health endpoints and exposes metrics in Prometheus format for Grafana dashboards and alerting.

## Prerequisites

- [Bun](https://bun.sh/) installed on your system
- Services with health check endpoints to monitor

## Installation

1. Clone or download this repository
2. Install dependencies:

```bash
bun install
```

3. Create a `.env` file based on `.env.example`:

```bash
cp .env.example .env
```

4. Configure your endpoints in the `.env` file

## Configuration

All configuration is done via environment variables:

| Variable | Description | Default | Required |
| ---------------- | ----------------------------------- | ------- | -------- |
| `PORT` | Port for the metrics server | `9090` | No |
| `PROBE_INTERVAL` | Interval between health checks (ms) | `30000` | No |
| `TIMEOUT` | HTTP request timeout (ms) | `5000` | No |
| `ENDPOINTS` | JSON array of endpoints to monitor | - | Yes |
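The config loader itself is not part of this diff, so as a sketch only, reading these variables with the documented defaults might look like the following (the name `loadConfig` and the returned shape are assumptions):

```javascript
// Hypothetical sketch of config loading; the real loadConfig in src/index.js
// is not shown in this PR, so names and structure here are illustrative only.
function loadConfig(env) {
  if (!env.ENDPOINTS) {
    throw new Error("ENDPOINTS is required"); // the only required variable
  }
  return {
    port: Number(env.PORT ?? 9090),                      // metrics server port
    probeInterval: Number(env.PROBE_INTERVAL ?? 30000),  // ms between probes
    timeout: Number(env.TIMEOUT ?? 5000),                // per-request timeout, ms
    endpoints: JSON.parse(env.ENDPOINTS),                // JSON array of targets
  };
}

const cfg = loadConfig({
  ENDPOINTS: '[{"name":"api","url":"http://localhost:3000/health"}]',
});
console.log(cfg.port, cfg.probeInterval, cfg.endpoints.length); // 9090 30000 1
```

Note that the default of 9090 here matters for the Dockerfile review comments above: the image probes 9615, so `PORT` must be set explicitly in the container.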

### Endpoint Configuration

The `ENDPOINTS` variable should contain a JSON array with the following structure:

```json
[
  {
    "name": "api-service",
    "url": "http://api.example.com/health",
    "method": "GET",
    "expectedStatus": 200
  },
  {
    "name": "database",
    "url": "http://localhost:5432/health"
  }
]
```

**Endpoint fields:**

- `name` (required): Unique identifier for the service
- `url` (required): Full URL of the health endpoint
- `method` (optional): HTTP method, defaults to `GET`
- `expectedStatus` (optional): Expected HTTP status code, defaults to `200`
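Applying the defaults above to one entry can be sketched as follows (the helper name `normalizeEndpoint` is hypothetical; the actual code is not in this diff):

```javascript
// Illustrative normalization of a single endpoint entry.
function normalizeEndpoint(ep) {
  if (!ep.name || !ep.url) {
    throw new Error("endpoint requires both 'name' and 'url'");
  }
  return {
    name: ep.name,
    url: ep.url,
    method: ep.method ?? "GET",               // optional, defaults to GET
    expectedStatus: ep.expectedStatus ?? 200, // optional, defaults to 200
  };
}

const ep = normalizeEndpoint({
  name: "database",
  url: "http://localhost:5432/health",
});
console.log(ep.method, ep.expectedStatus); // GET 200
```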

### Example Configuration

```env
PORT=9090
PROBE_INTERVAL=30000
TIMEOUT=5000
ENDPOINTS='[
  {"name":"frontend","url":"http://localhost:3000/health"},
  {"name":"backend-api","url":"http://localhost:8080/health","expectedStatus":200},
  {"name":"redis","url":"http://localhost:6379/health"}
]'
```
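How `TIMEOUT` and `expectedStatus` combine in a single probe can be sketched as below; the real probe logic lives in `src/index.js`, which this diff does not show, so the function name and result shape are assumptions. `fetchImpl` is injected so the logic can be exercised without a live service:

```javascript
// Illustrative single health probe: aborts slow requests and compares the
// HTTP status against expectedStatus (default 200).
async function probe(endpoint, timeoutMs, fetchImpl = fetch) {
  const started = Date.now();
  try {
    const res = await fetchImpl(endpoint.url, {
      method: endpoint.method ?? "GET",
      signal: AbortSignal.timeout(timeoutMs), // abort on timeout
    });
    return {
      up: res.status === (endpoint.expectedStatus ?? 200) ? 1 : 0,
      seconds: (Date.now() - started) / 1000,
    };
  } catch {
    // Network errors and timeouts both count as "down".
    return { up: 0, seconds: (Date.now() - started) / 1000 };
  }
}

// Stubbed fetch returning HTTP 200, so the probe reports the service as up.
const fakeFetch = async () => ({ status: 200 });
probe({ name: "api", url: "http://localhost:3000/health" }, 5000, fakeFetch)
  .then((r) => console.log(r.up)); // 1
```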

## Running the Service

### Development Mode

```bash
bun run dev
```

This runs the service with auto-reload on file changes.

### Production Mode

```bash
bun start
```

Or run directly:

```bash
bun run src/index.js
```

## Exposed Endpoints

The service exposes the following HTTP endpoints:

- **`/metrics`** - Prometheus metrics endpoint (for scraping)
- **`/health`** - Health check for the service itself
- **`/`** - Service information and available endpoints

## Prometheus Metrics

The service exposes the following metrics:

### `service_up`

**Type:** Gauge
**Description:** Service availability status (1 = up, 0 = down)
**Labels:** `service_name`, `endpoint`

### `service_response_time_seconds`

**Type:** Histogram
**Description:** Service response time in seconds
**Labels:** `service_name`, `endpoint`
**Buckets:** 0.001, 0.01, 0.1, 0.5, 1, 2, 5, 10

### `service_last_probe_timestamp`

**Type:** Gauge
**Description:** Unix timestamp of the last probe attempt
**Labels:** `service_name`, `endpoint`
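A scrape of `/metrics` returns these samples in the Prometheus text exposition format. The service presumably uses a client library for this; a hand-rolled sketch of one gauge line, for illustration only:

```javascript
// Illustrative formatter for a single gauge sample in Prometheus text format.
// A real service would typically use a library such as prom-client instead.
function renderGauge(name, labels, value) {
  const labelStr = Object.entries(labels)
    .map(([k, v]) => `${k}="${v}"`)
    .join(",");
  return `${name}{${labelStr}} ${value}`;
}

console.log(
  renderGauge(
    "service_up",
    { service_name: "api-service", endpoint: "http://api.example.com/health" },
    1,
  ),
);
// service_up{service_name="api-service",endpoint="http://api.example.com/health"} 1
```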

## Grafana Dashboard

### Example Queries

**Current Uptime Status:**

```promql
service_up
```

**Uptime Percentage (last 24h):**

```promql
avg_over_time(service_up[24h]) * 100
```

**Average Response Time:**

```promql
rate(service_response_time_seconds_sum[5m]) / rate(service_response_time_seconds_count[5m])
```

### Alert Rules

**Service Down Alert:**

```yaml
groups:
  - name: uptime_alerts
    rules:
      - alert: ServiceDown
        expr: service_up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.service_name }} is down"
          description: "{{ $labels.service_name }} has been down for more than 2 minutes"
```

**High Response Time Alert:**

```yaml
      - alert: HighResponseTime
        expr: rate(service_response_time_seconds_sum[5m]) / rate(service_response_time_seconds_count[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time for {{ $labels.service_name }}"
          description: "{{ $labels.service_name }} response time is above 1 second"
```