
Commit b2f4b4e

initial commit
1 parent 6077bae commit b2f4b4e

20 files changed (+2178 −1 lines)

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -30,3 +30,7 @@ go.work.sum
 # Editor/IDE
 # .idea/
 # .vscode/
+
+*.dccache
+
+*smanalyzer*

README.md

Lines changed: 266 additions & 1 deletion

# SMAnalyzer

Scans your current cluster to check for anomalies within your L7 networking Kubernetes Services.

![](images/showcase.gif)

## Why It Makes Sense

There are three key aspects to a Service Mesh:
1. Encryption of traffic from service to service (east/west traffic)
2. Traffic routing to/from the service
3. Network observability (performance, circuit breaking, retries, timeouts, load balancing)

Although numbers 1 and 2 are critically important, number 3 is the make-or-break between an application performing as expected and angry customers, whether external or internal (your teammates).

All applications should perform as expected, and bad performance typically stems from a networking issue (unless it's a specific app/code issue).

## How Does The ML Piece Work?

SMAnalyzer uses a K-means clustering algorithm, which automatically sorts things into groups based on how similar they are to each other.

For example, imagine you have a bunch of different colored dots scattered on a piece of paper, and you want to organize them into groups where similar colors are together. K-means does this automatically.

The engine is designed to identify patterns in time series data by grouping similar behavioral segments based on statistical features.

## Why K-means?

K-means is used here because it's well-suited for baseline behavior pattern learning in service mesh environments.

1. Automatic Pattern Discovery: K-means finds natural groupings in service behavior without requiring predefined categories. Services naturally exhibit different "behavioral modes" (normal load, peak traffic, maintenance periods, etc.).
2. Baseline Establishment: The algorithm learns what "normal" looks like by clustering historical metric patterns. This creates behavioral baselines for anomaly detection.
3. Multi-dimensional Analysis: Service mesh metrics have multiple dimensions (error rates, latency, throughput, etc.). K-means handles this multi-dimensional feature space effectively by clustering on the extracted features (mean, std dev, trend, volatility).
4. Unsupervised Learning: No manual labeling of "good" vs "bad" behavior is needed. The algorithm discovers patterns automatically from the data.
5. Computational Efficiency: K-means is fast enough for real-time monitoring scenarios where the system needs to continuously analyze incoming metrics.

The clustering results help the anomaly detection engine distinguish genuinely anomalous behavior from normal variations in service performance patterns, reducing false positives in service mesh monitoring.
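
Below is a minimal, self-contained Go sketch of the K-means loop described above: assign each feature vector to its nearest centroid by Euclidean distance, recompute centroids, and repeat. It illustrates the general technique only; it is not the code in `pkg/ml/clustering.go`, and the two-dimensional points stand in for the richer (mean, std dev, trend, volatility) features described later.

```
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// euclidean returns the Euclidean distance between two feature vectors.
func euclidean(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// kmeans groups feature vectors into k clusters and returns the cluster
// assignment for each point. Textbook algorithm, not the project's code.
func kmeans(points [][]float64, k, iterations int) []int {
	centroids := make([][]float64, k)
	for i := range centroids {
		centroids[i] = append([]float64(nil), points[rand.Intn(len(points))]...)
	}
	assign := make([]int, len(points))

	for iter := 0; iter < iterations; iter++ {
		// Assignment step: each point joins its nearest centroid.
		for i, p := range points {
			best, bestDist := 0, math.MaxFloat64
			for c, cen := range centroids {
				if d := euclidean(p, cen); d < bestDist {
					best, bestDist = c, d
				}
			}
			assign[i] = best
		}
		// Update step: recompute each centroid as the mean of its members.
		for c := range centroids {
			sum := make([]float64, len(points[0]))
			count := 0
			for i, p := range points {
				if assign[i] != c {
					continue
				}
				for j, v := range p {
					sum[j] += v
				}
				count++
			}
			if count == 0 {
				continue // empty cluster: keep the old centroid
			}
			for j := range sum {
				centroids[c][j] = sum[j] / float64(count)
			}
		}
	}
	return assign
}

func main() {
	// Two obvious behavioral groups: low-error/low-latency vs. high-error/high-latency.
	points := [][]float64{{0.01, 50}, {0.02, 55}, {0.9, 400}, {0.95, 420}}
	fmt.Println(kmeans(points, 2, 10))
}
```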

## Core Components
1. CLI Framework (cmd/) - Cobra-based commands: scan, learn, monitor, status
2. Kubernetes Client (pkg/k8s/) - Simple kubeconfig-based cluster connection
3. Istio Discovery (pkg/istio/) - Service mesh metrics collection and service discovery
4. Time Series Storage (pkg/timeseries/) - In-memory storage for metric data points
5. ML Clustering (pkg/ml/) - K-means algorithm for behavior pattern learning
6. Anomaly Detection (pkg/anomaly/) - Hybrid detection engine (rule-based + ML)
7. Output Formatting (pkg/output/) - CLI-friendly output (text, table, JSON)
8. Configuration (pkg/config/) - Centralized configuration management

## Key Features

- Multi-modal detection: Static thresholds + ML clustering for comprehensive anomaly detection
- Service mesh focus: Specifically designed for Istio environments
- Learning capability: Establishes baseline behavior patterns through clustering
- Real-time monitoring: Continuous scanning with configurable intervals
- Multiple output formats: Human-readable and machine-parseable outputs
- Configurable thresholds: Adjustable sensitivity and detection parameters

## Usage

You'll see four use cases within the `smanalyzer` command:
1. Scan
2. Learn
3. Monitor
4. Status

`cmd/scan.go`

Implements the main scan command with flags for:
- --namespace - target specific K8s namespace
- --duration - how long to monitor
- --learn - learning mode vs detection mode
- Basic scan workflow placeholder
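
For reference, this is roughly how a Cobra command with the flags above is wired up. It reuses the same `rootCmd` pattern as `cmd/learn.go` in this commit, but the variable names, shorthand letters, defaults, and Run body are illustrative assumptions, not the actual `cmd/scan.go`.

```
// Illustrative only: a Cobra "scan" command with the flags listed above.
package cmd

import (
	"fmt"
	"time"

	"github.com/spf13/cobra"
)

var (
	scanNamespace string
	scanDuration  time.Duration
	scanLearn     bool
)

var scanCmd = &cobra.Command{
	Use:   "scan",
	Short: "Run a one-time anomaly scan of the service mesh",
	Run: func(cmd *cobra.Command, args []string) {
		fmt.Printf("Scanning namespace %q for %v (learning mode: %v)\n",
			scanNamespace, scanDuration, scanLearn)
		// Placeholder: connect to the cluster, discover services,
		// collect metrics, and run detection.
	},
}

func init() {
	rootCmd.AddCommand(scanCmd)
	scanCmd.Flags().StringVarP(&scanNamespace, "namespace", "n", "", "Target a specific Kubernetes namespace (default: all)")
	scanCmd.Flags().DurationVarP(&scanDuration, "duration", "d", 5*time.Minute, "How long to monitor")
	scanCmd.Flags().BoolVar(&scanLearn, "learn", false, "Run in learning mode instead of detection mode")
}
```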

`pkg/k8s/client.go`

Simple Kubernetes client wrapper that uses the standard kubeconfig from the user's environment.
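
The standard client-go pattern for this looks roughly like the sketch below (kubeconfig path, REST config, clientset). The exact wrapper in `pkg/k8s/client.go` may differ; the namespace listing at the end is just a connectivity check for illustration.

```
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	home, _ := os.UserHomeDir()
	kubeconfig := filepath.Join(home, ".kube", "config")

	// Build a REST config from the default kubeconfig location.
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Quick connectivity check: list namespaces, as the status command does.
	nsList, err := clientset.CoreV1().Namespaces().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("Connected; %d namespaces visible\n", len(nsList.Items))
}
```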

`pkg/istio/discovery.go`

This file handles service mesh discovery and metrics collection:

- ServiceDiscovery struct: Wraps the Kubernetes client to find Istio-enabled services
- ServiceMeshMetrics struct: Defines the data structure for all metrics we care about (request counts, error rates, response times, circuit breaker status, retries, timeouts)
- DiscoverServices(): Finds services with Istio sidecars by checking labels
- CollectMetrics(): Gathers real-time metrics from Prometheus/Envoy (currently uses mock data)
- hasIstioSidecar(): Helper to identify services that are part of the mesh

The core idea is: scan → discover services → collect metrics → analyze patterns → detect anomalies.
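
As a rough illustration of the `hasIstioSidecar()` idea: injected pods conventionally run a container named `istio-proxy` and carry the `sidecar.istio.io/status` annotation, so a helper can key off either. The actual discovery code may check different labels; this sketch is not taken from `pkg/istio/discovery.go`.

```
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// hasIstioSidecar reports whether a pod appears to be part of the mesh.
func hasIstioSidecar(pod corev1.Pod) bool {
	// Injected pods are annotated by the sidecar injector.
	if _, ok := pod.Annotations["sidecar.istio.io/status"]; ok {
		return true
	}
	// Fallback: look for the conventional sidecar container name.
	for _, c := range pod.Spec.Containers {
		if c.Name == "istio-proxy" {
			return true
		}
	}
	return false
}

func main() {
	pod := corev1.Pod{}
	pod.Spec.Containers = []corev1.Container{{Name: "app"}, {Name: "istio-proxy"}}
	fmt.Println(hasIstioSidecar(pod)) // true
}
```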

`pkg/timeseries/storage.go`

This file provides in-memory time series data storage:

- DataPoint struct: Single metric measurement with timestamp, value, and labels
- TimeSeries struct: Collection of data points for a specific service/metric combination
- Storage struct: Thread-safe storage managing multiple time series with mutex protection
- Store(): Adds new data points to time series
- GetSeries(): Retrieves a specific time series
- GetTimeRange(): Gets data points within a time window for analysis
- GetLatestN(): Gets the most recent N data points for real-time monitoring
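
A minimal sketch of that storage layout is shown below. Field names and method signatures are assumptions drawn from the bullet list (only `Store()` and `GetLatestN()` are shown), not the actual `pkg/timeseries/storage.go`.

```
package main

import (
	"fmt"
	"sync"
	"time"
)

// DataPoint is a single metric measurement.
type DataPoint struct {
	Timestamp time.Time
	Value     float64
	Labels    map[string]string
}

// TimeSeries holds the points for one service/metric combination.
type TimeSeries struct {
	Service string
	Metric  string
	Points  []DataPoint
}

// Storage is a thread-safe map of series keyed by "service/metric".
type Storage struct {
	mu     sync.RWMutex
	series map[string]*TimeSeries
}

func NewStorage() *Storage {
	return &Storage{series: map[string]*TimeSeries{}}
}

// Store appends a data point to the series for the given service/metric.
func (s *Storage) Store(service, metric string, p DataPoint) {
	s.mu.Lock()
	defer s.mu.Unlock()
	key := service + "/" + metric
	ts, ok := s.series[key]
	if !ok {
		ts = &TimeSeries{Service: service, Metric: metric}
		s.series[key] = ts
	}
	ts.Points = append(ts.Points, p)
}

// GetLatestN returns the most recent n points for a series.
func (s *Storage) GetLatestN(service, metric string, n int) []DataPoint {
	s.mu.RLock()
	defer s.mu.RUnlock()
	ts, ok := s.series[service+"/"+metric]
	if !ok || n <= 0 {
		return nil
	}
	if n > len(ts.Points) {
		n = len(ts.Points)
	}
	return ts.Points[len(ts.Points)-n:]
}

func main() {
	st := NewStorage()
	st.Store("redis", "error_rate", DataPoint{Timestamp: time.Now(), Value: 0.02})
	fmt.Println(st.GetLatestN("redis", "error_rate", 1))
}
```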

`pkg/ml/clustering.go`

This file implements machine learning clustering for behavior pattern analysis:

- ClusterPoint struct: Wraps data points with extracted feature vectors
- Cluster struct: Groups similar behavior patterns with centroids
- KMeansConfig: Configuration for the K-means clustering algorithm
- ExtractFeatures(): Converts time series data into feature vectors (mean, std dev, trend, volatility)
- KMeans(): Core clustering algorithm that groups similar network behavior patterns
- Statistical functions: Calculate mean, standard deviation, trend, and volatility from time windows
- Distance calculations: Euclidean distance for clustering similarity measurements

This enables the system to learn "normal" traffic patterns and identify when services deviate from expected behavior.
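
The sketch below shows one way to turn a window of metric values into the four features mentioned above. The exact definitions used by `ExtractFeatures()` are not visible in this commit, so trend is computed here as the simple first-to-last slope and volatility as the standard deviation of successive differences; treat both as assumptions.

```
package main

import (
	"fmt"
	"math"
)

func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func stddev(xs []float64) float64 {
	m := mean(xs)
	sum := 0.0
	for _, x := range xs {
		sum += (x - m) * (x - m)
	}
	return math.Sqrt(sum / float64(len(xs)))
}

// extractFeatures turns a time window into a 4-dimensional feature vector
// (mean, std dev, trend, volatility) suitable for K-means clustering.
func extractFeatures(window []float64) []float64 {
	trend := (window[len(window)-1] - window[0]) / float64(len(window)-1)

	diffs := make([]float64, 0, len(window)-1)
	for i := 1; i < len(window); i++ {
		diffs = append(diffs, window[i]-window[i-1])
	}
	volatility := stddev(diffs)

	return []float64{mean(window), stddev(window), trend, volatility}
}

func main() {
	window := []float64{50, 52, 49, 51, 120, 125} // latency-ish values with a spike
	fmt.Println(extractFeatures(window))
}
```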

`pkg/anomaly/detector.go`

This file implements the core anomaly detection engine:

- AnomalyType constants: Different types of service mesh issues (traffic spikes, high error rates, latency, circuit breaker trips, retry storms, timeouts)
- Anomaly struct: Complete anomaly information including type, severity, description, metrics
- DetectionConfig: Configurable thresholds and sensitivity settings
- LearnBaseline(): Establishes normal behavior patterns using clustering
- DetectAnomalies(): Two-pronged detection approach:
  - Static detection: Rule-based thresholds for obvious issues
  - ML detection: Compares current behavior against learned baseline clusters
- Severity calculation: Quantifies how severe each anomaly is
- Dynamic thresholds: Adapts sensitivity based on historical variance in the data
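
A toy version of the two-pronged check might look like this: a static rule against a fixed error-rate threshold plus a dynamic rule that flags values more than `sensitivity` standard deviations from the learned baseline. The numbers mirror the Configuration section shown later (5% error rate, sensitivity 2.0), but the types and function shapes are illustrative, not the actual `pkg/anomaly/detector.go`.

```
package main

import (
	"fmt"
	"math"
)

// Baseline summarizes what "normal" looked like during learning.
type Baseline struct {
	Mean   float64
	StdDev float64
}

// staticAnomaly flags obvious issues against a fixed threshold.
func staticAnomaly(errorRate, threshold float64) bool {
	return errorRate > threshold
}

// dynamicAnomaly flags values that deviate too far from the baseline,
// adapting to how noisy the metric historically was.
func dynamicAnomaly(value float64, b Baseline, sensitivity float64) bool {
	if b.StdDev == 0 {
		return value != b.Mean
	}
	return math.Abs(value-b.Mean) > sensitivity*b.StdDev
}

func main() {
	baseline := Baseline{Mean: 0.02, StdDev: 0.01} // learned from history
	current := 0.45                                // 45% error rate right now

	fmt.Println("static (>5%):", staticAnomaly(current, 0.05))
	fmt.Println("dynamic (2.0 sigma):", dynamicAnomaly(current, baseline, 2.0))
}
```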

`cmd/learn.go`

This command trains the baseline behavior model:

- learn command: Separate CLI command for establishing normal behavior patterns
- Duration flag: Specifies how much historical data to analyze for training
- Output flag: Option to save the learned model to disk for later use
- performLearning(): Placeholder for the actual learning process (connects to cluster, discovers services, collects metrics, trains model)

`cmd/monitor.go`

This command provides continuous monitoring:

- monitor command: Long-running process for real-time anomaly detection
- Interval flag: How often to check for anomalies (30s, 1m, etc.)
- Model flag: Load a previously learned baseline model
- Format flag: Choose output format for detected anomalies
- performMonitoring(): Continuous loop that checks for anomalies at regular intervals and reports findings
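
The monitoring loop is essentially a ticker at the configured interval; a minimal sketch is below. `checkOnce` stands in for the real discover/collect/detect cycle and is not the actual `performMonitoring()`.

```
package main

import (
	"context"
	"fmt"
	"time"
)

func checkOnce(ctx context.Context) {
	// Placeholder for: discover services, collect metrics, run detection,
	// and print any anomalies in the chosen output format.
	fmt.Println("checking for anomalies at", time.Now().Format(time.RFC3339))
}

func monitor(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	checkOnce(ctx) // run immediately, then on every tick
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			checkOnce(ctx)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	monitor(ctx, 1*time.Second)
}
```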

`cmd/status.go`

This command provides a system status overview:

- status command: Quick health check and overview of the entire system
- Cluster connection: Shows if connected to Kubernetes and basic cluster info
- Service mesh status: Istio version, number of services with sidecars
- AI model status: Whether baseline is trained, when last updated, training duration
- Recent activity: Anomaly counts over different time periods
- Configuration: Current detection thresholds and settings

### Build Binary

```
go build .
```

### Run Commands

- smanalyzer scan - One-time anomaly scan
- smanalyzer learn - Train baseline behavior model
- smanalyzer monitor - Continuous real-time monitoring
- smanalyzer status - System health and configuration overview

### Examples

```
./smanalyzer scan

Starting Service Mesh scan...
Scanning all namespaces
Duration: 5m0s
Learning mode: false
Connecting to Kubernetes cluster...
✓ Connected to Kubernetes cluster
Discovering Services in Mesh...
✓ Found 12 services with Istio sidecars
Collecting service mesh metrics...

Found 1 anomalies:

1. High error rate: 104500.00% [CRITICAL]
   Service: redis.
   Type: error_rate_high
   Time: 2025-08-17T10:47:25-04:00
   Metrics:
     error_rate: 1045.00
```

```
./smanalyzer status
Service Mesh Analyzer Status
============================

🔍 Cluster Connection:
   Status: Connected
   Cluster: kind-kind
   Namespaces: 12

🕸️ Service Mesh:
   Istio Version: 1.20.0
   Services with sidecars: 15
   Gateway services: 2

🤖 AI Model:
   Baseline Status: Trained
   Last Updated: 2024-01-15 14:30:00
   Training Data: 24h

📊 Recent Activity:
   Anomalies (last 1h): 2
   Anomalies (last 24h): 12
   Services monitored: 15

⚙️ Configuration:
   Error rate threshold: 5%
   Traffic spike threshold: 2x
   Sensitivity level: 2.0
```

cmd/learn.go

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+package cmd
+
+import (
+	"context"
+	"fmt"
+	"log"
+	"time"
+
+	"github.com/spf13/cobra"
+)
+
+var learnCmd = &cobra.Command{
+	Use:   "learn",
+	Short: "Learn baseline behavior patterns from service mesh traffic",
+	Long: `Analyzes historical service mesh traffic to establish baseline behavior patterns.
+This creates a model of normal operations that will be used for anomaly detection.`,
+	Run: runLearn,
+}
+
+var (
+	learnDuration time.Duration
+	learnOutput   string
+)
+
+func init() {
+	rootCmd.AddCommand(learnCmd)
+
+	learnCmd.Flags().DurationVarP(&learnDuration, "duration", "d", 24*time.Hour, "Duration of historical data to analyze (e.g., 24h, 7d)")
+	learnCmd.Flags().StringVarP(&learnOutput, "output", "o", "", "Save learned model to file")
+}
+
+func runLearn(cmd *cobra.Command, args []string) {
+	ctx := context.Background()
+
+	fmt.Printf("Learning baseline patterns from service mesh traffic...\n")
+	fmt.Printf("Duration: %v\n", learnDuration)
+
+	if learnOutput != "" {
+		fmt.Printf("Model will be saved to: %s\n", learnOutput)
+	}
+
+	if err := performLearning(ctx); err != nil {
+		log.Fatalf("Learning failed: %v", err)
+	}
+
+	fmt.Println("✓ Baseline learning completed successfully")
+}
+
+func performLearning(ctx context.Context) error {
+	fmt.Println("Connecting to Kubernetes cluster...")
+	fmt.Println("Discovering services in mesh...")
+	fmt.Println("Collecting historical metrics...")
+	fmt.Println("Extracting behavior features...")
+	fmt.Println("Training clustering model...")
+
+	time.Sleep(2 * time.Second)
+
+	return nil
+}

0 commit comments