| 1 | | -# SMAnalyzer |
| 1 | +# SMAnalyzer |
| 2 | + |
| 3 | +Scans your current cluster to check for anomalies in the L7 networking of your Kubernetes Services. |
| 4 | + |
| 5 | + |
| 6 | + |
| 7 | +## Why It Makes Sense |
| 8 | + |
| 9 | +There are three key aspects to a Service Mesh: |
| 10 | +1. Encryption of traffic from service to service (east/west traffic) |
| 11 | +2. Traffic routing to/from the service |
| 12 | +3. Network observability (performance, circuit breaking, retries, timeouts, load balancing) |
| 13 | + |
| 14 | +Although numbers 1 and 2 are critically important, number 3 is the difference between an application performing as expected and angry customers, whether external or internal (your teammates).
| 15 | + |
| 16 | +All applications should perform as expected, and poor performance often stems from a networking issue (unless it's a specific app/code issue).
| 17 | + |
| 18 | +## How Does The ML Piece Work? |
| 19 | + |
| 20 | +SMAnalyzer uses a K-means clustering algorithm, which automatically sorts things into groups based on how similar they are to each other. |
| 21 | + |
| 22 | +For example, imagine you have a bunch of different colored dots scattered on a piece of paper, and you want to organize them into groups where similar colors are together. K-means does this automatically.
| 23 | + |
| 24 | +The engine is designed to identify patterns in time series data by grouping similar behavioral segments based on statistical features. |
| 25 | + |
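To make the clustering idea concrete, here is a minimal K-means sketch in Go. It is not the code in `pkg/ml/`; the function names (`kMeans`, `nearest`, `euclidean`), the deterministic seeding from the first k points, and the sample data are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// euclidean returns the straight-line distance between two equal-length vectors.
func euclidean(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// nearest returns the index of the centroid closest to p.
func nearest(p []float64, centroids [][]float64) int {
	best, bestDist := 0, math.Inf(1)
	for i, c := range centroids {
		if d := euclidean(p, c); d < bestDist {
			best, bestDist = i, d
		}
	}
	return best
}

// kMeans groups points into k clusters (requires len(points) >= k >= 1).
// Seeding from the first k points keeps this sketch deterministic; real
// implementations usually pick random or k-means++ seeds.
func kMeans(points [][]float64, k, maxIter int) ([][]float64, []int) {
	centroids := make([][]float64, k)
	for i := range centroids {
		centroids[i] = append([]float64(nil), points[i]...)
	}
	assign := make([]int, len(points))
	for iter := 0; iter < maxIter; iter++ {
		// Assignment step: attach every point to its nearest centroid.
		changed := false
		for i, p := range points {
			if c := nearest(p, centroids); c != assign[i] {
				assign[i], changed = c, true
			}
		}
		if !changed && iter > 0 {
			break // converged: no point switched cluster
		}
		// Update step: move each centroid to the mean of its members.
		sums := make([][]float64, k)
		for c := range sums {
			sums[c] = make([]float64, len(points[0]))
		}
		counts := make([]int, k)
		for i, p := range points {
			counts[assign[i]]++
			for d, v := range p {
				sums[assign[i]][d] += v
			}
		}
		for c := range centroids {
			if counts[c] == 0 {
				continue // leave an empty cluster's centroid where it is
			}
			for d := range centroids[c] {
				centroids[c][d] = sums[c][d] / float64(counts[c])
			}
		}
	}
	return centroids, assign
}

func main() {
	// Two visibly separate groups of 2-D points.
	points := [][]float64{{1, 1}, {9, 9}, {1, 2}, {8, 9}, {2, 1}, {9, 8}}
	centroids, assign := kMeans(points, 2, 10)
	fmt.Println(assign) // prints [0 1 0 1 0 1]
	fmt.Println(centroids)
}
```

In SMAnalyzer's setting, each point would be a feature vector extracted from a window of metrics rather than a 2-D coordinate.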
| 26 | +## Why K-means? |
| 27 | + |
| 28 | +K-means is used here because it's well-suited for baseline behavior pattern learning in service mesh environments. |
| 29 | + |
| 30 | + 1. Automatic Pattern Discovery: K-means finds natural groupings in service |
| 31 | + behavior without requiring predefined categories. Services naturally exhibit |
| 32 | + different "behavioral modes" (normal load, peak traffic, maintenance periods, |
| 33 | + etc.) |
| 34 | + |
| 35 | + 2. Baseline Establishment: The algorithm learns what "normal" looks like by |
| 36 | + clustering historical metric patterns. This creates behavioral baselines for |
| 37 | + anomaly detection. |
| 38 | + |
| 39 | + 3. Multi-dimensional Analysis: Service mesh metrics have multiple dimensions |
| 40 | + (error rates, latency, throughput, etc.). K-means handles this |
| 41 | + multi-dimensional feature space effectively by clustering on the extracted |
| 42 | + features (mean, std dev, trend, volatility). |
| 43 | + |
| 44 | + 4. Unsupervised Learning: No manual labeling of "good" vs "bad" behavior is |
| 45 | + needed. The algorithm discovers patterns automatically from the data. |
| 46 | + |
| 47 | + 5. Computational Efficiency: K-means is fast enough for real-time monitoring |
| 48 | + scenarios where the system needs to continuously analyze incoming metrics. |
| 49 | + |
| 50 | + The clustering results help the anomaly detection engine distinguish between |
| 51 | + genuinely anomalous behavior versus normal variations in service performance |
| 52 | + patterns, reducing false positives in service mesh monitoring. |
| 53 | + |
| 54 | +## Core Components |
| 55 | + 1. CLI Framework (cmd/) - Cobra-based commands: scan, learn, monitor, status |
| 56 | + 2. Kubernetes Client (pkg/k8s/) - Simple kubeconfig-based cluster connection |
| 57 | + 3. Istio Discovery (pkg/istio/) - Service mesh metrics collection and service |
| 58 | + discovery |
| 59 | + 4. Time Series Storage (pkg/timeseries/) - In-memory storage for metric data |
| 60 | + points |
| 61 | + 5. ML Clustering (pkg/ml/) - K-means algorithm for behavior pattern learning |
| 62 | + 6. Anomaly Detection (pkg/anomaly/) - Hybrid detection engine (rule-based + ML) |
| 63 | + 7. Output Formatting (pkg/output/) - CLI-friendly output (text, table, JSON) |
| 64 | + 8. Configuration (pkg/config/) - Centralized configuration management |
| 65 | + |
| 66 | +## Key Features |
| 67 | + |
| 68 | +- Multi-modal detection: Static thresholds + ML clustering for comprehensive |
| 69 | +anomaly detection |
| 70 | +- Service mesh focus: Specifically designed for Istio environments |
| 71 | +- Learning capability: Establishes baseline behavior patterns through clustering |
| 72 | +- Real-time monitoring: Continuous scanning with configurable intervals |
| 73 | +- Multiple output formats: Human-readable and machine-parseable outputs |
| 74 | +- Configurable thresholds: Adjustable sensitivity and detection parameters |
| 75 | + |
| 76 | +## Usage |
| 77 | + |
| 78 | +You'll see four subcommands under the `smanalyzer` command:
| 79 | +1. Scan |
| 80 | +2. Learn |
| 81 | +3. Monitor |
| 82 | +4. Status |
| 83 | + |
| 84 | +`cmd/scan.go` |
| 85 | + |
| 86 | + Implements the main scan command with flags for: |
| 87 | + - `--namespace` - target specific K8s namespace
| 88 | + - `--duration` - how long to monitor
| 89 | + - `--learn` - learning mode vs detection mode
| 90 | + - Basic scan workflow placeholder |
| 91 | + |
| 92 | +`pkg/k8s/client.go` |
| 93 | + |
| 94 | + Simple Kubernetes client wrapper that uses the standard kubeconfig from the |
| 95 | + user's environment. |
| 96 | + |
| 97 | +`pkg/istio/discovery.go` |
| 98 | + |
| 99 | + This file handles service mesh discovery and metrics collection: |
| 100 | + |
| 101 | + - ServiceDiscovery struct: Wraps the Kubernetes client to find Istio-enabled |
| 102 | + services |
| 103 | + - ServiceMeshMetrics struct: Defines the data structure for all metrics we care |
| 104 | + about (request counts, error rates, response times, circuit breaker status, |
| 105 | + retries, timeouts) |
| 106 | + - DiscoverServices(): Finds services with Istio sidecars by checking labels |
| 107 | + - CollectMetrics(): Gathers real-time metrics from Prometheus/Envoy (currently |
| 108 | + uses mock data) |
| 109 | + - hasIstioSidecar(): Helper to identify services that are part of the mesh |
| 110 | + |
| 111 | +The core idea is: scan → discover services → collect metrics → analyze patterns → detect anomalies. |
| 112 | + |
| 113 | +`pkg/timeseries/storage.go` |
| 114 | + |
| 115 | + This file provides in-memory time series data storage: |
| 116 | + |
| 117 | + - DataPoint struct: Single metric measurement with timestamp, value, and labels |
| 118 | + - TimeSeries struct: Collection of data points for a specific service/metric |
| 119 | + combination |
| 120 | + - Storage struct: Thread-safe storage managing multiple time series with mutex |
| 121 | + protection |
| 122 | + - Store(): Adds new data points to time series |
| 123 | + - GetSeries(): Retrieves a specific time series |
| 124 | + - GetTimeRange(): Gets data points within a time window for analysis |
| 125 | + - GetLatestN(): Gets the most recent N data points for real-time monitoring |
| 126 | + |
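A rough sketch of how mutex-protected in-memory storage along these lines could look. The field names, the string key, and the method signatures are assumptions, not the actual contents of `pkg/timeseries/storage.go`; `GetLatestN` also assumes points are stored in chronological order.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// DataPoint is a single measurement (field names are assumed).
type DataPoint struct {
	Timestamp time.Time
	Value     float64
	Labels    map[string]string
}

// Storage keeps one slice of points per series key, e.g. "service/metric".
type Storage struct {
	mu     sync.RWMutex
	series map[string][]DataPoint
}

// NewStorage returns an empty, ready-to-use store.
func NewStorage() *Storage {
	return &Storage{series: make(map[string][]DataPoint)}
}

// Store appends a point to the named series under the write lock.
func (s *Storage) Store(key string, p DataPoint) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.series[key] = append(s.series[key], p)
}

// GetLatestN returns a copy of the last n points, or fewer if the series is shorter.
func (s *Storage) GetLatestN(key string, n int) []DataPoint {
	s.mu.RLock()
	defer s.mu.RUnlock()
	pts := s.series[key]
	if n < 0 {
		n = 0
	}
	if n > len(pts) {
		n = len(pts)
	}
	return append([]DataPoint(nil), pts[len(pts)-n:]...)
}

// GetTimeRange returns the points whose timestamps fall within [from, to].
func (s *Storage) GetTimeRange(key string, from, to time.Time) []DataPoint {
	s.mu.RLock()
	defer s.mu.RUnlock()
	var out []DataPoint
	for _, p := range s.series[key] {
		if !p.Timestamp.Before(from) && !p.Timestamp.After(to) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	s := NewStorage()
	start := time.Now()
	for i := 0; i < 5; i++ {
		s.Store("redis/error_rate", DataPoint{
			Timestamp: start.Add(time.Duration(i) * time.Second),
			Value:     float64(i),
		})
	}
	for _, p := range s.GetLatestN("redis/error_rate", 3) {
		fmt.Println(p.Value)
	}
}
```

Returning copies rather than the internal slices means callers can analyze a window without holding the lock.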
| 127 | +`pkg/ml/clustering.go` |
| 128 | + |
| 129 | + This file implements machine learning clustering for behavior |
| 130 | + pattern analysis: |
| 131 | + |
| 132 | + - ClusterPoint struct: Wraps data points with extracted |
| 133 | + feature vectors |
| 134 | + - Cluster struct: Groups similar behavior patterns with |
| 135 | + centroids |
| 136 | + - KMeansConfig: Configuration for the K-means clustering |
| 137 | + algorithm |
| 138 | + - ExtractFeatures(): Converts time series data into feature |
| 139 | + vectors (mean, std dev, trend, volatility) |
| 140 | + - KMeans(): Core clustering algorithm that groups similar |
| 141 | + network behavior patterns |
| 142 | + - Statistical functions: Calculate mean, standard deviation, |
| 143 | + trend, and volatility from time windows |
| 144 | + - Distance calculations: Euclidean distance for clustering |
| 145 | + similarity measurements |
| 146 | + |
| 147 | +This enables the system to learn "normal" traffic patterns and identify when services deviate from expected behavior. |
| 148 | + |
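The feature extraction step can be sketched as below. This README does not define trend or volatility precisely, so the example assumes trend is the least-squares slope over the sample index and volatility is the mean absolute change between consecutive samples. The lowercase `extractFeatures` name marks it as illustrative rather than the real `ExtractFeatures()`.

```go
package main

import (
	"fmt"
	"math"
)

// extractFeatures turns a window of metric values into the vector
// [mean, stdDev, trend, volatility]. An empty window yields all zeros.
func extractFeatures(values []float64) [4]float64 {
	if len(values) == 0 {
		return [4]float64{}
	}
	n := float64(len(values))

	var sum float64
	for _, v := range values {
		sum += v
	}
	mean := sum / n

	// Population standard deviation.
	var sq float64
	for _, v := range values {
		sq += (v - mean) * (v - mean)
	}
	stdDev := math.Sqrt(sq / n)

	// Trend (assumed definition): least-squares slope against the index.
	xMean := (n - 1) / 2
	var num, den float64
	for i, v := range values {
		dx := float64(i) - xMean
		num += dx * (v - mean)
		den += dx * dx
	}
	trend := 0.0
	if den > 0 {
		trend = num / den
	}

	// Volatility (assumed definition): mean absolute step between samples.
	volatility := 0.0
	if len(values) > 1 {
		for i := 1; i < len(values); i++ {
			volatility += math.Abs(values[i] - values[i-1])
		}
		volatility /= n - 1
	}

	return [4]float64{mean, stdDev, trend, volatility}
}

func main() {
	// A steadily rising latency window: mean 3, trend 1, volatility 1.
	fmt.Println(extractFeatures([]float64{1, 2, 3, 4, 5}))
}
```

Because the four features live on different scales, a real pipeline would likely normalize them before computing Euclidean distances; that step is omitted here.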
| 149 | +`pkg/anomaly/detector.go` |
| 150 | + |
| 151 | + This file implements the core anomaly detection engine: |
| 152 | + |
| 153 | + - AnomalyType constants: Different types of service mesh issues (traffic spikes, |
| 154 | + high error rates, latency, circuit breaker trips, retry storms, timeouts) |
| 155 | + - Anomaly struct: Complete anomaly information including type, severity, |
| 156 | + description, metrics |
| 157 | + - DetectionConfig: Configurable thresholds and sensitivity settings |
| 158 | + - LearnBaseline(): Establishes normal behavior patterns using clustering |
| 159 | + - DetectAnomalies(): Two-pronged detection approach: |
| 160 | + - Static detection: Rule-based thresholds for obvious issues |
| 161 | + - ML detection: Compares current behavior against learned baseline clusters |
| 162 | + - Severity calculation: Quantifies how severe each anomaly is |
| 163 | + - Dynamic thresholds: Adapts sensitivity based on historical variance in the data |
| 164 | + |
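The two-pronged approach can be illustrated with a small sketch: a static rule on the error rate, plus a check of how far the current feature vector sits from the nearest learned centroid. The struct fields, the `MaxDistance` setting, and the `behavior_deviation` type name are hypothetical; only `error_rate_high` appears in the sample scan output later in this README.

```go
package main

import (
	"fmt"
	"math"
)

// Anomaly describes one finding (fields are assumed for this sketch).
type Anomaly struct {
	Service     string
	Type        string
	Description string
}

// DetectionConfig holds the thresholds used by both checks.
type DetectionConfig struct {
	ErrorRateThreshold float64 // percent, e.g. 5 means 5%
	MaxDistance        float64 // hypothetical: allowed distance from a baseline centroid
}

// distance is the Euclidean distance between two equal-length vectors.
func distance(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// detectAnomalies applies a static rule first, then compares the current
// feature vector against the learned baseline centroids.
func detectAnomalies(service string, errorRate float64, features []float64,
	baseline [][]float64, cfg DetectionConfig) []Anomaly {
	var found []Anomaly

	// Static detection: a fixed threshold for obvious issues.
	if errorRate > cfg.ErrorRateThreshold {
		found = append(found, Anomaly{
			Service:     service,
			Type:        "error_rate_high",
			Description: fmt.Sprintf("error rate %.2f%% exceeds %.2f%%", errorRate, cfg.ErrorRateThreshold),
		})
	}

	// ML detection: flag behavior that is far from every known cluster.
	if len(baseline) > 0 {
		minDist := math.Inf(1)
		for _, c := range baseline {
			if d := distance(features, c); d < minDist {
				minDist = d
			}
		}
		if minDist > cfg.MaxDistance {
			found = append(found, Anomaly{
				Service:     service,
				Type:        "behavior_deviation", // hypothetical type name
				Description: fmt.Sprintf("distance %.2f from nearest baseline cluster", minDist),
			})
		}
	}
	return found
}

func main() {
	cfg := DetectionConfig{ErrorRateThreshold: 5, MaxDistance: 2}
	baseline := [][]float64{{1, 1}, {5, 5}}
	for _, a := range detectAnomalies("redis", 10, []float64{10, 10}, baseline, cfg) {
		fmt.Printf("%s [%s]: %s\n", a.Service, a.Type, a.Description)
	}
}
```

A dynamic threshold could replace the fixed `MaxDistance` by scaling it with the historical spread of distances, which is one plausible reading of the "Dynamic thresholds" bullet above.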
| 165 | +`cmd/learn.go` |
| 166 | + |
| 167 | + This command trains the baseline behavior model: |
| 168 | + |
| 169 | + - learn command: Separate CLI command for establishing normal behavior patterns |
| 170 | + - Duration flag: Specifies how much historical data to analyze for training |
| 171 | + - Output flag: Option to save the learned model to disk for later use |
| 172 | + - performLearning(): Placeholder for the actual learning process (connects to |
| 173 | + cluster, discovers services, collects metrics, trains model) |
| 174 | + |
| 175 | +`cmd/monitor.go` |
| 176 | + |
| 177 | + This command provides continuous monitoring: |
| 178 | + |
| 179 | + - monitor command: Long-running process for real-time anomaly detection |
| 180 | + - Interval flag: How often to check for anomalies (30s, 1m, etc.) |
| 181 | + - Model flag: Load a previously learned baseline model |
| 182 | + - Format flag: Choose output format for detected anomalies |
| 183 | + - performMonitoring(): Continuous loop that checks for anomalies at regular |
| 184 | + intervals and reports findings |
| 185 | + |
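A continuous check loop of this kind is commonly written with a `time.Ticker` and a cancellable context. The sketch below assumes that shape; the `monitor` and `demo` functions are illustrative stand-ins, not the actual `performMonitoring()`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// monitor runs check once per interval until ctx is cancelled and returns
// the number of checks that ran.
func monitor(ctx context.Context, interval time.Duration, check func()) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	runs := 0
	for {
		select {
		case <-ctx.Done():
			return runs
		case <-ticker.C:
			if ctx.Err() != nil {
				return runs // cancelled while a tick was already pending
			}
			check()
			runs++
		}
	}
}

// demo stops the loop after three checks so the example terminates.
func demo() int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	seen := 0
	return monitor(ctx, 5*time.Millisecond, func() {
		seen++
		fmt.Printf("check %d: no anomalies\n", seen)
		if seen == 3 {
			cancel()
		}
	})
}

func main() {
	fmt.Println("checks run:", demo())
}
```

In a long-running CLI the context would more likely be tied to an interrupt signal, for example via `signal.NotifyContext`, so that Ctrl+C ends the loop cleanly.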
| 186 | +`cmd/status.go` |
| 187 | + |
| 188 | + This command provides system status overview: |
| 189 | + |
| 190 | + - status command: Quick health check and overview of the entire system |
| 191 | + - Cluster connection: Shows if connected to Kubernetes and basic cluster info |
| 192 | + - Service mesh status: Istio version, number of services with sidecars |
| 193 | + - AI model status: Whether baseline is trained, when last updated, training |
| 194 | + duration |
| 195 | + - Recent activity: Anomaly counts over different time periods |
| 196 | + - Configuration: Current detection thresholds and settings |
| 197 | + |
| 198 | +### Build Binary |
| 199 | + |
| 200 | +``` |
| 201 | +go build . |
| 202 | +``` |
| 203 | + |
| 204 | +### Run Commands |
| 205 | + |
| 206 | +- `smanalyzer scan` - One-time anomaly scan
| 207 | +- `smanalyzer learn` - Train baseline behavior model
| 208 | +- `smanalyzer monitor` - Continuous real-time monitoring
| 209 | +- `smanalyzer status` - System health and configuration overview
| 210 | + |
| 211 | + |
| 212 | +### Examples |
| 213 | + |
| 214 | +``` |
| 215 | +./smanalyzer scan |
| 216 | +
| 217 | +Starting Service Mesh scan... |
| 218 | +Scanning all namespaces |
| 219 | +Duration: 5m0s |
| 220 | +Learning mode: false |
| 221 | +Connecting to Kubernetes cluster... |
| 222 | +✓ Connected to Kubernetes cluster |
| 223 | +Discovering Services in Mesh... |
| 224 | +✓ Found 12 services with Istio sidecars |
| 225 | +Collecting service mesh metrics... |
| 226 | +
| 227 | +Found 1 anomalies: |
| 228 | +
| 229 | +1. High error rate: 104500.00% [CRITICAL] |
| 230 | + Service: redis. |
| 231 | + Type: error_rate_high |
| 232 | + Time: 2025-08-17T10:47:25-04:00 |
| 233 | + Metrics: |
| 234 | + error_rate: 1045.00 |
| 235 | +``` |
| 236 | + |
| 237 | +``` |
| 238 | +./smanalyzer status |
| 239 | +Service Mesh Analyzer Status |
| 240 | +============================ |
| 241 | +
| 242 | +🔍 Cluster Connection: |
| 243 | + Status: Connected |
| 244 | + Cluster: kind-kind |
| 245 | + Namespaces: 12 |
| 246 | +
| 247 | +🕸️ Service Mesh: |
| 248 | + Istio Version: 1.20.0 |
| 249 | + Services with sidecars: 15 |
| 250 | + Gateway services: 2 |
| 251 | +
| 252 | +🤖 AI Model: |
| 253 | + Baseline Status: Trained |
| 254 | + Last Updated: 2024-01-15 14:30:00 |
| 255 | + Training Data: 24h |
| 256 | +
| 257 | +📊 Recent Activity: |
| 258 | + Anomalies (last 1h): 2 |
| 259 | + Anomalies (last 24h): 12 |
| 260 | + Services monitored: 15 |
| 261 | +
| 262 | +⚙️ Configuration: |
| 263 | + Error rate threshold: 5% |
| 264 | + Traffic spike threshold: 2x |
| 265 | + Sensitivity level: 2.0 |
| 266 | +``` |