Commit d6c2561

Pipeline regression: cancel stale runs; refresh optimization tracker
1 parent 68f2c88 commit d6c2561

2 files changed

+57
-65
lines changed

.github/workflows/pipeline_regression.yml

Lines changed: 4 additions & 0 deletions
@@ -6,6 +6,10 @@ on:
     branches:
       - main
 
+concurrency:
+  group: pipeline-regression-${{ github.ref }}
+  cancel-in-progress: true
+
 jobs:
   regression:
     runs-on: ubuntu-latest
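
For readers unfamiliar with the `concurrency` key, the following is a toy Python model of the behaviour this hunk opts into: runs that share a group key supersede one another, and with `cancel-in-progress: true` the older in-progress run is cancelled when a newer one starts. The `Run` and `ConcurrencyModel` names are hypothetical and exist only for illustration; this is a sketch of the semantics, not GitHub's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Run:
    """Hypothetical stand-in for one workflow run."""
    run_id: int
    ref: str
    status: str = "in_progress"


@dataclass
class ConcurrencyModel:
    """Toy model of `cancel-in-progress: true` (illustration only)."""
    active: Dict[str, Run] = field(default_factory=dict)

    def start(self, run: Run) -> None:
        # Mirrors the group expression in the workflow:
        # pipeline-regression-${{ github.ref }}
        group = f"pipeline-regression-{run.ref}"
        previous = self.active.get(group)
        if previous is not None and previous.status == "in_progress":
            # A newer run in the same group cancels the stale one.
            previous.status = "cancelled"
        self.active[group] = run


model = ConcurrencyModel()
first = Run(1, "refs/heads/main")
second = Run(2, "refs/heads/main")
other = Run(3, "refs/pull/7/merge")
for r in (first, second, other):
    model.start(r)
```

Because the group key includes the ref, two pushes to `main` compete with each other, while a run on a different ref is left alone.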
Lines changed: 53 additions & 65 deletions
@@ -1,93 +1,81 @@
 # CleanApp Optimization Status (Rolling)
 
 This is the rolling task/status tracker for the “big upgrade push” workstream.
-
-Source of truth inputs:
-- Latest prod xray snapshot: `/Users/anon16/Downloads/cleanapp_back_end_v2/xray/prod/2026-02-09-postdlq2/`
-- Latest prod digest manifest: `/Users/anon16/Downloads/cleanapp_back_end_v2/platform_blueprint/manifests/prod/2026-02-09-postdlq2.json`
+Last updated: 2026-02-11 (UTC).
 
 ## 1) Network Hardening + Surface Area Reduction
 
-Status: **In progress**
+Status: **Mostly complete**
 
 Done:
-- Most internal service ports on prod are now bound to `127.0.0.1` (reduces exposure even if firewall rules are permissive).
-- Prod host ports `3000` (cleanapp_web) and `8090` (cleanapp_pipelines) are now bound to `127.0.0.1` (external access removed even if firewall rules still allow them).
-- Pinned RabbitMQ image in prod compose (stopped relying on `rabbitmq:latest`).
-- Dev VM (`cleanapp-dev`) now binds host ports `3000`/`8080`/`8090` to `127.0.0.1` and its `allow-*` firewall tags have been removed (only `http-server/https-server` remain).
-- `cleanapp-prod2` had `allow-3000/8090/8091` tags removed (it now only retains `allow-8080` + `http-server/https-server`).
-- Deleted unused GCE firewall rules: `allow-3000`, `allow-8090`, `allow-8091`.
-
-Next:
-- Rotate/replace AMQP creds (stop relying on defaults).
-- Reduce GCE firewall tags/rules to only what must be public.
-- Decide what to do with `allow-8080` long-term (prod currently uses `api.cleanapp.io:8080` and it is actively hit by mobile clients).
+- Internal prod ports are bound to `127.0.0.1` for most backend services.
+- Removed unused world-open firewall rules (`allow-3000`, `allow-8090`, `allow-8091`).
+- Removed matching `allow-*` instance tags from dev/prod2.
 
-Evidence:
-- `xray/prod/2026-02-09-postdlq2/ss_listening.txt`
-- `xray/prod/2026-02-09-postdlq2/gcloud_firewall_rules_relevant.txt`
+Remaining:
+- Final review/closure plan for legacy `:8080` exposure path after client migration.
 
-## 2) Deterministic Deploys (Digest-Pinned By Default)
+## 2) Deterministic Deploys (Digest-Pinned)
 
-Status: **In progress**
+Status: **Complete (operationalized)**
 
 Done:
-- Captured redacted prod deploy config into the blueprint:
-  - `platform_blueprint/deploy/prod/docker-compose.yml`
-  - `platform_blueprint/deploy/prod/nginx_conf_d/`
-- Captured and committed digest pins from prod:
-  - `platform_blueprint/manifests/prod/2026-02-09-postdlq2.json`
-- Generated and committed a digest-pinned compose overlay:
-  - `platform_blueprint/deploy/prod/digests/2026-02-09-postdlq2.digests.yml`
-
-Next:
-- Decide whether we want the manifest to cover *only running containers* or *all compose services* (including stopped ones), and adjust xray capture accordingly.
+- Prod deploy blueprint is captured/redacted under `platform_blueprint/deploy/prod/`.
+- VM helper in place and used: `platform_blueprint/deploy/prod/vm/deploy_with_digests.sh`.
+- `docker-compose.digests.current.yml` on prod is valid YAML (no escaped newline artifact) and passes `docker compose config`.
+- Version drift closed for key services via pinned rollout (`/version` endpoints aligned by git SHA).
 
-## 3) RabbitMQ Pipeline Reliability (Backpressure, Ack Semantics, DLQs)
+## 3) RabbitMQ Pipeline Reliability
 
-Status: **Mostly complete (core safety)**
+Status: **Complete (core path)**
 
 Done:
-- Bounded concurrency + correct ack/nack semantics for key Rust consumers (no per-message goroutine spawning, ack only after success).
-- DLQs enabled on prod (DLX `cleanapp-dlx` + `<queue>.dlq` + policies) for:
-  - `report-tags-queue`
-  - `report-renderer-queue`
-  - `twitter-reply-queue`
-  - `report-analysis-queue` (policy + DLQ queue present for future)
+- Rust consumers: bounded concurrency + ack-after-success + reconnect hardening.
+- Go consumers: same hardening applied on active paths.
+- DLQ + retry topology present in prod (`cleanapp-dlx`, `*.dlq`, `*.retry`).
+- Analyzer reconnect and watchdog self-heal prevent silent post-restart pipeline stalls.
 
-Next:
-- Add retry queues / max redelivery policy (so transient errors don’t spin forever).
-- Add a DLQ replay/runbook (how to inspect + requeue after fixes).
+## 4) Observability -> Alerting
+
+Status: **In progress (strong partial)**
+
+Done:
+- Prometheus + Alertmanager installed on prod (localhost-only).
+- Analyzer `/metrics` live and scraped.
+- RabbitMQ exporter added and scraped.
+- Alert rules active for analyzer disconnect and queue-missing/retry-surge signals.
+- Watchdog now supports shared webhook fallback (`CLEANAPP_ALERT_WEBHOOK_URL`).
 
-Evidence:
-- `xray/prod/2026-02-09-postdlq2/rabbitmq_policies.txt`
-- `xray/prod/2026-02-09-postdlq2/rabbitmq_queues.tsv`
+Remaining:
+- Wire real external webhook destination in prod (`CLEANAPP_ALERT_WEBHOOK_URL`) and test end-to-end delivery with a synthetic alert.
 
-## 4) Observability + Debuggability (Correlation IDs, Metrics)
+## 5) Integration Harness / Regression Gate
 
-Status: **Early**
+Status: **In progress (advanced)**
 
 Done:
-- `/version` endpoints broadly deployed; xray captures include provenance.
+- Analyzer golden-path CI workflow is passing.
+- New full pipeline CI workflow added:
+  - `platform_blueprint/tests/ci/pipeline/run.sh`
+  - `.github/workflows/pipeline_regression.yml`
+  - Validates analysis + tags + renderer side effects and RabbitMQ restart resilience.
 
-Next:
-- Standardize structured logs + correlation id propagation.
-- Minimal metrics for queue depth/lag + consumer health.
+Remaining:
+- Keep `pipeline-regression` green on `main` and tune runtime/flakiness as needed.
 
-## 5) Platform Integration Harness (Contracts + Smoke + Golden Paths)
+## 6) Backup Hardening + Restore Confidence
 
-Status: **In progress**
+Status: **Mostly complete**
 
 Done:
-- Public smoke checks exist (nginx endpoints):
-  - `platform_blueprint/tests/smoke/smoke_prod.sh`
-  - `platform_blueprint/tests/smoke/capture_prod_public.sh`
-- v4 OpenAPI contract snapshot is stored:
-  - `platform_blueprint/contracts/openapi/api_v4_openapi.json`
-- Prod VM-local smoke checks exist (localhost ports + RabbitMQ invariants):
-  - `platform_blueprint/tests/smoke/smoke_prod_vm.sh`
-- v4 contract checks (quick) are exercised in the public smoke:
-  - `platform_blueprint/tests/smoke/smoke_prod.sh`
-
-Next:
-- Optionally make the contract smoke OpenAPI-driven (validate endpoint coverage/schema drift).
+- PR #115 merged (backup script + schedule + metadata + docs).
+- Daily backup cron active on prod (`/home/deployer/backup.sh -e prod`).
+- Watchdog verifies backup freshness from `/home/deployer/backups/backup.log`.
+- Restore drill script improved for realistic online-backup drift tolerance:
+  - `platform_blueprint/ops/db_backup/restore_drill_prod_vm.sh`
+  - `ROW_COUNT_TOLERANCE_PCT` (default `0.2%`)
+- Restore drill result captured:
+  - `xray/prod/2026-02-11/restore_drill_result.md`
+
+Remaining:
+- Optional: run another full timed drill during a low-write window to reduce count drift even further.
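
The `ROW_COUNT_TOLERANCE_PCT` setting named in the tracker (default `0.2%`) suggests the restore drill accepts small row-count differences between the live database and the restored copy, since an online backup is taken while writes continue. A minimal Python sketch of such a check follows. The function name, its signature, and the choice of the source count as the denominator are assumptions made for illustration; the actual logic in `restore_drill_prod_vm.sh` is not shown in this commit.

```python
def within_tolerance(source_rows: int, restored_rows: int,
                     tolerance_pct: float = 0.2) -> bool:
    """Return True if the restored row count drifts from the source
    count by no more than tolerance_pct percent.

    Assumption: drift is measured relative to the source count; the
    real drill script may define it differently.
    """
    if source_rows == 0:
        # Avoid division by zero: an empty table must restore as empty.
        return restored_rows == 0
    drift_pct = abs(source_rows - restored_rows) / source_rows * 100
    return drift_pct <= tolerance_pct


# 0.1% drift passes under the default 0.2% tolerance; 0.3% does not.
ok = within_tolerance(1_000_000, 999_000)
too_far = within_tolerance(1_000_000, 997_000)
```

Running the drill during a low-write window, as the remaining item proposes, would shrink the drift this check has to absorb.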
