# CleanApp Optimization Status (Rolling)

This is the rolling task/status tracker for the “big upgrade push” workstream.

Last updated: 2026-02-11 (UTC).

## 1) Network Hardening + Surface Area Reduction

Status: **Mostly complete**

Done:
- Internal prod ports are bound to `127.0.0.1` for most backend services.
- Removed unused world-open firewall rules (`allow-3000`, `allow-8090`, `allow-8091`).
- Removed matching `allow-*` instance tags from dev/prod2.

Remaining:
- Final review/closure plan for the legacy `:8080` exposure path after client migration.
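
The loopback binding above is a one-line change per service in the compose file. A minimal sketch (the `cleanapp_web`/`3000` pair matches the earlier hardening work; the exact stanza on prod may differ):

```yaml
services:
  cleanapp_web:
    ports:
      # Host side binds to loopback only: reachable via nginx or SSH tunnel
      # on the VM, but not from the internet even if a firewall rule is open.
      - "127.0.0.1:3000:3000"
```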

## 2) Deterministic Deploys (Digest-Pinned)

Status: **Complete (operationalized)**

Done:
- Prod deploy blueprint is captured/redacted under `platform_blueprint/deploy/prod/`.
- VM helper in place and used: `platform_blueprint/deploy/prod/vm/deploy_with_digests.sh`.
- `docker-compose.digests.current.yml` on prod is valid YAML (no escaped newline artifact) and passes `docker compose config`.
- Version drift closed for key services via pinned rollout (`/version` endpoints aligned by git SHA).
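
A digest-pinned overlay is a small compose file layered over the base one, replacing mutable tags with immutable content digests. Illustrative sketch only; the registry path and digest below are placeholders, not real prod values:

```yaml
services:
  cleanapp_web:
    # `@sha256:…` pins exact image content; a re-deploy cannot silently pick
    # up a newer build published under the same tag.
    image: registry.example.com/cleanapp_web@sha256:0000000000000000000000000000000000000000000000000000000000000000
```

Such an overlay is typically applied as `docker compose -f docker-compose.yml -f docker-compose.digests.current.yml up -d`.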

## 3) RabbitMQ Pipeline Reliability

Status: **Complete (core path)**

Done:
- Rust consumers: bounded concurrency + ack-after-success + reconnect hardening.
- Go consumers: same hardening applied on active paths.
- DLQ + retry topology present in prod (`cleanapp-dlx`, `*.dlq`, `*.retry`).
- Analyzer reconnect and watchdog self-heal prevent silent post-restart pipeline stalls.
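
The ack-after-success + bounded-concurrency pattern applied to the Rust and Go consumers can be sketched in Python. This is an offline illustration, not the prod code: `RecordingChannel` stands in for a real AMQP channel, and a real consumer would also set the broker prefetch (`basic_qos`) to the same bound.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 8  # bounded concurrency; set broker prefetch to match

class RecordingChannel:
    """Stand-in for an AMQP channel so the ack/nack flow can run offline."""
    def __init__(self):
        self.acked, self.nacked = [], []
    def ack(self, tag):
        self.acked.append(tag)
    def nack(self, tag, requeue=False):
        self.nacked.append((tag, requeue))

def consume(channel, deliveries, process):
    """Ack only after `process` succeeds; on failure, nack without requeue so
    the broker dead-letters the message to the DLX instead of redelivering
    it forever."""
    def handle(tag, body):
        try:
            process(body)
            channel.ack(tag)
        except Exception:
            channel.nack(tag, requeue=False)
    with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
        for tag, body in deliveries:
            pool.submit(handle, tag, body)

def process(body):
    if body == "boom":
        raise RuntimeError("simulated transient failure")

ch = RecordingChannel()
consume(ch, [(1, "ok"), (2, "boom"), (3, "ok")], process)
print(sorted(ch.acked), ch.nacked)  # [1, 3] [(2, False)]
```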

## 4) Observability -> Alerting

Status: **In progress (strong partial)**

Done:
- Prometheus + Alertmanager installed on prod (localhost-only).
- Analyzer `/metrics` live and scraped.
- RabbitMQ exporter added and scraped.
- Alert rules active for analyzer-disconnect and queue-missing/retry-surge signals.
- Watchdog now supports a shared webhook fallback (`CLEANAPP_ALERT_WEBHOOK_URL`).

Remaining:
- Wire the real external webhook destination in prod (`CLEANAPP_ALERT_WEBHOOK_URL`) and test end-to-end delivery with a synthetic alert.
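
One way to drive that end-to-end test is to fire a short-lived synthetic alert through Alertmanager's v2 API. A minimal sketch; the alert name is a placeholder, and the payload is only built here (posting it is shown in the note below):

```python
import json
from datetime import datetime, timedelta, timezone

def synthetic_alert(name="SyntheticDeliveryTest"):
    """Build an Alertmanager v2 payload (for POST /api/v2/alerts) that fires a
    short-lived test alert so webhook delivery can be verified end to end."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": "info"},
        "annotations": {"summary": "synthetic alert: verify webhook delivery"},
        "startsAt": now.isoformat(),
        # Auto-resolve after 5 minutes so the test alert does not linger.
        "endsAt": (now + timedelta(minutes=5)).isoformat(),
    }]

payload = json.dumps(synthetic_alert())
print(json.loads(payload)[0]["labels"]["alertname"])  # SyntheticDeliveryTest
```

On the VM this could then be posted with something like `curl -s -X POST -H 'Content-Type: application/json' -d "$payload" http://127.0.0.1:9093/api/v2/alerts` (9093 is Alertmanager's default port), then delivery confirmed at the webhook destination.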

## 5) Integration Harness / Regression Gate

Status: **In progress (advanced)**

Done:
- Analyzer golden-path CI workflow is passing.
- New full pipeline CI workflow added:
  - `platform_blueprint/tests/ci/pipeline/run.sh`
  - `.github/workflows/pipeline_regression.yml`
  - Validates analysis + tags + renderer side effects and RabbitMQ restart resilience.

Remaining:
- Keep `pipeline-regression` green on `main` and tune runtime/flakiness as needed.
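
For reference, the overall shape of such a workflow, as a minimal sketch (trigger and job names are assumptions; the real `pipeline_regression.yml` may differ):

```yaml
name: pipeline-regression
on:
  push:
    branches: [main]
  pull_request:
jobs:
  pipeline-regression:
    runs-on: ubuntu-latest
    timeout-minutes: 30   # keep a hung/flaky run from blocking the queue
    steps:
      - uses: actions/checkout@v4
      # Spins up the pipeline services and validates analysis + tags +
      # renderer side effects, including a RabbitMQ restart.
      - run: platform_blueprint/tests/ci/pipeline/run.sh
```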

## 6) Backup Hardening + Restore Confidence

Status: **Mostly complete**

Done:
- PR #115 merged (backup script + schedule + metadata + docs).
- Daily backup cron active on prod (`/home/deployer/backup.sh -e prod`).
- Watchdog verifies backup freshness from `/home/deployer/backups/backup.log`.
- Restore drill script improved for realistic online-backup drift tolerance:
  - `platform_blueprint/ops/db_backup/restore_drill_prod_vm.sh`
  - `ROW_COUNT_TOLERANCE_PCT` (default `0.2%`)
- Restore drill result captured:
  - `xray/prod/2026-02-11/restore_drill_result.md`

Remaining:
- Optional: run another full timed drill during a low-write window to further reduce count drift.
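
The idea behind the tolerance check can be sketched as follows (an illustration of the concept, not the drill script itself, which is shell; `within_tolerance` is a hypothetical name):

```python
def within_tolerance(live_rows, restored_rows, tolerance_pct=0.2):
    """Row-count check in the spirit of ROW_COUNT_TOLERANCE_PCT: a backup
    taken online while writes continue will legitimately trail the live table
    by a small margin, so the drill compares counts within a percentage
    tolerance instead of requiring exact equality."""
    if live_rows == 0:
        return restored_rows == 0
    drift_pct = abs(live_rows - restored_rows) / live_rows * 100
    return drift_pct <= tolerance_pct

print(within_tolerance(100_000, 99_850))  # 0.15% drift -> True
print(within_tolerance(100_000, 99_500))  # 0.50% drift -> False
```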