11# GoSecretScanv2
22
3- GoSecretScanv2 is an engineering-focused security scanner that detects secrets, API keys, credentials, and common security misconfigurations using deterministic analysis plus optional LLM-based verification.
3+ GoSecretScanv2 is a fast secret scanner for code. It uses deterministic patterns with entropy and light context. LLM verification is optional.
44
5- ## Features
5+ ## Overview
66
7- ### Core Detection
7+ - Detects credentials, API keys, private keys, and connection strings
8+ - CLI and GitHub Actions support
9+ - Sensible defaults; no services required
10+ - Optional local LLM verification for triage
811
9- - **70+ Detection Patterns**: Comprehensive regex patterns for detecting:
10-   - Cloud provider credentials (AWS, Azure, GCP)
11-   - API keys and tokens (GitHub, Slack, JWT)
12-   - Private keys (SSH, RSA, PGP)
13-   - Database connection strings
14-   - Basic authentication credentials
15-   - Security vulnerabilities (XSS, SQL injection patterns)
1612
17- ### Advanced Intelligence
18-
19- - **Shannon Entropy Analysis**:
20-   - Calculates randomness of detected strings
21-   - Identifies high-entropy secrets vs low-entropy false positives
22-   - Entropy scoring (0-8 bits) for each finding
23-
24- - **Context-Aware Detection**:
25-   - Automatically detects test files, mocks, and examples
26-   - Identifies comments, documentation, and templates
27-   - Recognizes placeholders and environment variable templates
28-   - Filters false positives from regex pattern definitions
29-
30- - **Confidence Scoring System**:
31-   - Every finding rated: Critical, High, Medium, or Low
32-   - Combines entropy analysis + context detection + pattern matching
33-   - Only reports medium confidence or higher (low confidence filtered out)
34-   - Prioritizes critical findings first
35-
36- - **Smart Filtering**:
37-   - Skips false positives automatically
38-   - Handles large files and minified code (1MB line buffer)
39-   - Pattern definition detection
40-
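The context-aware detection listed above can be illustrated with a small classifier. This is a minimal sketch under stated assumptions, not the scanner's actual code: `classifyContext`, `looksLikeTestPath`, the returned labels, and every heuristic below are hypothetical and deliberately crude.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// looksLikeTestPath reports whether any path segment suggests test, mock,
// or example code. The prefixes and infixes are assumptions of this sketch.
func looksLikeTestPath(p string) bool {
	for _, seg := range strings.Split(strings.ToLower(filepath.ToSlash(p)), "/") {
		if strings.HasPrefix(seg, "test") || strings.HasPrefix(seg, "mock") ||
			strings.HasPrefix(seg, "example") ||
			strings.Contains(seg, "_test.") || strings.Contains(seg, ".test.") {
			return true
		}
	}
	return false
}

// classifyContext returns a coarse label for where a match was found.
// Checks run from the most to the least specific signal.
func classifyContext(path, line string) string {
	trimmed := strings.TrimSpace(line)
	upper := strings.ToUpper(trimmed)
	lowerPath := strings.ToLower(path)

	switch {
	case strings.Contains(trimmed, "${") || strings.Contains(upper, "YOUR_"):
		return "placeholder"
	case strings.Contains(upper, "REPLACE_ME") || strings.Contains(upper, "CHANGE_ME"):
		return "template"
	case looksLikeTestPath(path):
		return "test_file"
	case strings.HasSuffix(lowerPath, ".md") || strings.HasSuffix(lowerPath, ".rst"):
		return "documentation"
	case strings.HasPrefix(trimmed, "//") || strings.HasPrefix(trimmed, "#") ||
		strings.HasPrefix(trimmed, "/*"):
		return "comment"
	default:
		return "code"
	}
}

func main() {
	fmt.Println(classifyContext("/path/to/test.js", `const testPassword = "TestPass123"`))
	fmt.Println(classifyContext("config.go", `key := "${API_KEY}"`))
	fmt.Println(classifyContext("main.go", "// token = abc"))
}
```

A real implementation would likely need language-aware comment detection; prefix matching misses trailing and multi-line comments.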
41- ### LLM-Powered Verification (beta)
42-
43- - **LLM Verification**:
44-   - Uses IBM Granite 4.0 Micro (GGUF, Q4 quantized, ~450MB)
45-   - Provides structured reasoning for each decision
46-
47- - **Semantic Embedding Search**:
48-   - Generates embeddings for each finding
49-   - Searches for similar patterns across the codebase
50-   - Reuses historical verifications for similar matches
51-
52- - **Vector Store**:
53-   - SQLite-based vector database
54-   - Caches verified findings
55-   - Enables incremental learning
56-   - Fast similarity search
57-
58- - **Code Context Analysis**:
59-   - Parses code structure (functions, imports)
60-   - Understands programming language syntax
61-   - Gathers surrounding code for context
62-   - Identifies test vs production code
63-
64- **Enabling LLM Verification**:
13+ ### Optional: LLM verification
6514
6615``` bash
67- # Download the model first (one-time setup)
6816./scripts/download-models.sh
69-
70- # Start the llama.cpp HTTP server (runs on :8080 by default)
71- ./scripts/run-llama-server.sh
72-
73- # In a different terminal, run with LLM verification
17+ ./scripts/run-llama-server.sh # exposes http://localhost:8080
7418./gosecretscanner --llm
7519
76- # Custom model path
77- ./gosecretscanner --llm --model-path=/path/to/granite-4.0-micro.Q4_K_M.gguf
78-
79- # Point to a remote llama.cpp endpoint
80- ./gosecretscanner --llm --llm-endpoint=http://localhost:8080
81-
82- # Run the llama.cpp server in the background via Docker
83- DETACH=true PORT=8080 HOST_NETWORK=true SERVER_PORT=8080 ./scripts/run-llama-server.sh
84-
85- # Adjust similarity threshold for vector search
86- ./gosecretscanner --llm --similarity=0.9
87- ```
88-
89- **Environment Variables**:
90-
91- ``` bash
92- # Enable LLM verification
93- export GOSECRETSCANNER_LLM_ENABLED=true
94-
95- # Set model path
96- export GOSECRETSCANNER_MODEL_PATH=.gosecretscanner/models/granite-4.0-micro.Q4_K_M.gguf
97-
98- # Override the llama.cpp endpoint (defaults to http://localhost:8080)
20+ # Optionally point to a remote/local endpoint
9921export GOSECRETSCANNER_LLM_ENDPOINT=http://localhost:8080
100-
101- # Launch llama.cpp in detached mode with a custom image/port
102- DETACH=true LLAMA_CPP_IMAGE=ghcr.io/ggerganov/llama.cpp:full HOST_NETWORK=true PORT=8080 ./scripts/run-llama-server.sh
103-
104- # Set vector database path
105- export GOSECRETSCANNER_DB_PATH=.gosecretscanner/findings.db
10622```
10723
108- ### Performance
109-
110- - **Runtime characteristics**:
111-   - Pre-compiled regex patterns for fast scanning
112-   - Concurrent file processing using goroutines
113-   - Thread-safe result aggregation
114-   - Fallback paths that avoid external dependencies when optional components are unavailable
115-
116- - **Operational notes**:
117-   - Minimal configuration required for local runs
118-   - Color-coded terminal output with confidence levels
119-   - Automatic recursive directory scanning with ignore rules
120-   - Results grouped by severity to aid triage
121-
12224## Installation
12325
12426### From Source
@@ -151,7 +53,7 @@ docker run --rm -v /path/to/scan:/workspace gosecretscanner
15153
15254### GitHub Actions
15355
154- The bundled `action.yml` now supports full LLM verification. Key inputs:
56+ Action inputs (when using `enable-llm`):
15557
15658- `enable-llm`: set to `'true'` to download Granite, launch llama.cpp via Docker, and run the scan with `--llm`.
15759- `model-path`: overrides the GGUF path (relative to the action directory by default).
@@ -187,93 +89,7 @@ The scanner will:
187893. Report any secrets found with file location and line numbers
188904. Exit with code 1 if secrets are found, 0 otherwise
18991
190- ### Example Output
191-
192- ```
193- ------------------------------------------------------------------------
194- Secrets found:
195-
196- === CRITICAL FINDINGS ===
197-
198- File: /path/to/config.go (Secret)
199- Line Number: 42
200- Confidence: CRITICAL (Entropy: 4.85)
201- Context: code
202- Pattern: (?i)_(AWS_Key):[\\s'\"=]A[KS]IA[0-9A-Z]{16}[\\s'\"]
203- Line: const awsKey = "AKIAIOSFODNN7EXAMPLE"
204-
205- === HIGH CONFIDENCE ===
206-
207- File: /path/to/auth.py (Secret)
208- Line Number: 15
209- Confidence: HIGH (Entropy: 4.52)
210- Context: code
211- Pattern: (?i)api_key(?:\s*[:=]\s*|\s*["'\s])?([a-zA-Z0-9_\-]{32,})
212- Line: api_key = "sk_live_51a8f9c2e3b4d5f6g7h8"
213-
214- === MEDIUM CONFIDENCE ===
215-
216- File: /path/to/test.js (Secret)
217- Line Number: 89
218- Confidence: MEDIUM (Entropy: 3.91)
219- Context: test_file
220- Pattern: (?i)password(?:\s*[:=]\s*|\s*["'\s])?([a-zA-Z0-9!@#$%^&*()_+]{8,})
221- Line: const testPassword = "TestPass123"
222-
223- ------------------------------------------------------------------------
224- Summary: 3 secrets found (Critical: 1, High: 1, Medium: 1)
225- Please review and remove them before committing your code.
226- ```
227-
228- **Output details:**
229- - Results grouped by confidence level (Critical → High → Medium)
230- - Entropy score shows randomness (higher = more likely real secret)
231- - Context indicates where the secret was found (code, test_file, comment, etc.)
232- - Low confidence findings are automatically filtered out
233-
234- ## Detected Patterns
23592
236- ### Cloud Provider Credentials
237-
238- - **AWS**:
239-   - Access Key IDs (AKIA...)
240-   - Secret Access Keys
241-   - STS Tokens
242-
243- - **Azure**:
244-   - Client IDs and Secrets
245-   - Tenant IDs
246-   - Subscription IDs
247-   - Access Keys
248-
249- - **Google Cloud Platform**:
250-   - API Keys (AIza...)
251-   - Application Credentials
252-   - Service Account Keys
253-   - Client IDs and Secrets
254-
255- ### Private Keys
256-
257- - SSH Private Keys
258- - RSA Private Keys
259- - PGP Private Keys
260- - Generic Private Keys (PEM format)
261-
262- ### Authentication & Secrets
263-
264- - Basic Authentication tokens
265- - API Keys
266- - Bearer tokens
267- - JWT tokens
268- - Passwords and credentials
269- - Database connection strings
270-
271- ### Security Vulnerabilities
272-
273- - Cross-Site Scripting (XSS) patterns
274- - SQL Injection patterns
275- - Hardcoded IP addresses
276- - S3 Bucket URLs
27793
27894## Integration with CI/CD
27995
@@ -312,32 +128,6 @@ jobs:
312128 fail-on-secrets: 'true'
313129```
314130
315- #### Action Inputs
316-
317- - `scan-path`: Directory path to scan (default: `.`)
318- - `fail-on-secrets`: Fail the workflow if secrets are found (default: `true`)
319-
320- #### Action Outputs
321-
322- - `secrets-found`: Number of secrets detected
323- - `scan-status`: Status of the scan (`success`, `failed`, or `error`)
324-
325- #### Advanced Usage
326-
327- ```yaml
328- - name: Run Secret Scanner with outputs
329-   id: scan
330-   uses: m1rl0k/GoSecretScanv2@main
331-   with:
332-     scan-path: './src'
333-     fail-on-secrets: 'false'
334-
335- - name: Report results
336-   if: always()
337-   run: |
338-     echo "Secrets found: ${{ steps.scan.outputs.secrets-found }}"
339-     echo "Status: ${{ steps.scan.outputs.scan-status }}"
340- ```
341131
342132## Development
343133
@@ -359,69 +149,6 @@ go test ./...
359149gofmt -w .
360150```
361151
362- ## How It Works
363-
364- ### Scanning Pipeline
365-
366- 1. **Pattern Compilation**: On startup, all 70+ regex patterns are pre-compiled for optimal performance
367- 2. **Directory Walking**: Uses `filepath.Walk` to recursively traverse the directory tree
368- 3. **Concurrent Scanning**: Each file is scanned in a separate goroutine for parallel processing
369- 4. **Smart Filtering**: Regex pattern definitions and binary content are skipped
370- 5. **Pattern Matching**: Each line is checked against all compiled patterns
371- 6. **Entropy Analysis**: Shannon entropy calculated for each match
372- 7. **Context Detection**: File path and line content analyzed for context
373- 8. **Confidence Scoring**: Multi-factor scoring combines entropy + context + pattern type
374- 9. **Result Filtering**: Only medium+ confidence findings are reported
375- 10. **Priority Grouping**: Results grouped by confidence level (Critical → High → Medium)
376- 11. **Thread-Safe Results**: Uses mutex locks to safely collect results from concurrent scans
377-
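The walking, concurrent scanning, and mutex-guarded collection steps above can be sketched as follows. This is an illustration only, not the project's source: `Finding`, `scanFile`, `scanTree`, and the single `awsKeyID` pattern are hypothetical names, and a fixed-size worker pool is assumed in place of one goroutine per file. Entropy, context, and confidence steps are omitted.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"sync"
)

// Finding is a hypothetical result record for this sketch.
type Finding struct {
	File string
	Line int
	Text string
}

// awsKeyID is an illustrative pattern, compiled once at startup.
var awsKeyID = regexp.MustCompile(`A[KS]IA[0-9A-Z]{16}`)

// scanFile streams a file line by line, allowing lines up to 1 MB.
func scanFile(path string) []Finding {
	f, err := os.Open(path)
	if err != nil {
		return nil // unreadable files are skipped in this sketch
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	var out []Finding
	for n := 1; sc.Scan(); n++ {
		if awsKeyID.MatchString(sc.Text()) {
			out = append(out, Finding{File: path, Line: n, Text: sc.Text()})
		}
	}
	return out
}

// scanTree walks root and fans regular files out to a pool of workers.
func scanTree(root string, workers int) ([]Finding, error) {
	paths := make(chan string)
	var (
		mu      sync.Mutex
		results []Finding
		wg      sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range paths {
				found := scanFile(p)
				mu.Lock() // guard the shared slice
				results = append(results, found...)
				mu.Unlock()
			}
		}()
	}
	err := filepath.Walk(root, func(p string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.IsDir() {
			if info.Name() == ".git" || info.Name() == "node_modules" {
				return filepath.SkipDir
			}
			return nil
		}
		if info.Mode().IsRegular() {
			paths <- p
		}
		return nil
	})
	close(paths)
	wg.Wait()
	return results, err
}

func main() {
	root := "."
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	findings, err := scanTree(root, 8)
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk error:", err)
	}
	for _, f := range findings {
		fmt.Printf("%s:%d: %s\n", f.File, f.Line, f.Text)
	}
	if len(findings) > 0 {
		os.Exit(1) // non-zero exit when secrets are found
	}
}
```

The unbuffered channel gives natural backpressure: the walker blocks until a worker is free, so memory use stays flat regardless of tree size.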
378- ### Advanced Algorithms
379-
380- #### Shannon Entropy Calculation
381-
382- ```
383- H(X) = -Σ P(x) * log₂(P(x))
384- ```
385-
386- - Measures randomness of detected strings
387- - High entropy (>4.5): Likely a real secret (random characters)
388- - Low entropy (<3.5): Likely a false positive (repeated patterns)
389-
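The entropy formula above translates directly into Go. The sketch below assumes entropy is computed over bytes, which bounds the result at 8 bits and matches the 0-8 range mentioned earlier; `shannonEntropy` is a hypothetical name, and the scanner may tokenize differently.

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns H(X) = -Σ P(x) * log2(P(x)) over the bytes of s,
// in bits per byte. An empty string has entropy 0.
func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	n := float64(len(s))
	h := 0.0
	for _, c := range counts {
		if c == 0 {
			continue // absent symbols contribute nothing
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	for _, s := range []string{"aaaaaaaa", "TestPass123", "AKIAIOSFODNN7EXAMPLE"} {
		fmt.Printf("%-24q %.2f\n", s, shannonEntropy(s))
	}
}
```

Short strings cap out quickly: a string of length n cannot exceed log2(n) bits, so fixed thresholds such as 4.5 are only reachable for matches longer than about 23 characters.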
390- #### Confidence Scoring Algorithm
391-
392- ```
393- Base Score: 50
394-
395- Entropy Adjustments:
396- + 30 if entropy > 4.5 (very random)
397- + 20 if entropy > 4.0 (quite random)
398- + 10 if entropy > 3.5 (moderately random)
399- - 10 if entropy <= 3.5 (low randomness)
400-
401- Context Adjustments:
402- - 50 for placeholders (${VAR}, YOUR_KEY)
403- - 45 for templates (REPLACE_ME, CHANGE_ME)
404- - 40 for test files
405- - 35 for documentation
406- - 30 for comments
407- + 10 for actual code
408-
409- Pattern Adjustments:
410- + 15 for AWS keys, private keys (critical patterns)
411-
412- Final Mapping:
413- ≥ 80: Critical
414- ≥ 60: High
415- ≥ 40: Medium
416- < 40: Low (filtered out)
417- ```
418-
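Read literally, the scoring table above can be sketched as two small functions. The names `confidenceScore` and `confidenceLevel`, the context labels used as map keys, and the choice to treat the entropy tiers as mutually exclusive are all assumptions of this sketch; the scanner's real scorer may weigh factors differently, and this literal reading does not reproduce every rating shown in the example output.

```go
package main

import "fmt"

// contextAdjust mirrors the "Context Adjustments" table.
// The string keys are labels assumed for this sketch.
var contextAdjust = map[string]int{
	"placeholder":   -50,
	"template":      -45,
	"test_file":     -40,
	"documentation": -35,
	"comment":       -30,
	"code":          +10,
}

// confidenceScore combines entropy, context, and pattern type into a score.
func confidenceScore(entropy float64, context string, criticalPattern bool) int {
	score := 50 // base score

	// Entropy tiers: only the first matching tier applies.
	switch {
	case entropy > 4.5:
		score += 30
	case entropy > 4.0:
		score += 20
	case entropy > 3.5:
		score += 10
	default:
		score -= 10
	}

	score += contextAdjust[context] // unknown contexts contribute 0

	if criticalPattern {
		score += 15 // e.g. AWS keys, private keys
	}
	return score
}

// confidenceLevel maps a numeric score to its reporting bucket.
func confidenceLevel(score int) string {
	switch {
	case score >= 80:
		return "CRITICAL"
	case score >= 60:
		return "HIGH"
	case score >= 40:
		return "MEDIUM"
	default:
		return "LOW" // filtered out of the report
	}
}

func main() {
	s := confidenceScore(4.85, "code", true)
	fmt.Println(s, confidenceLevel(s)) // prints: 105 CRITICAL
}
```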
419- ## Performance Characteristics
420-
421- - Regex patterns are compiled once during startup.
422- - Files are scanned concurrently using a bounded worker pool.
423- - Common directories such as `.git` and `node_modules` are skipped automatically.
424- - Files are streamed line-by-line to limit memory usage.
425152
426153## Current Limitations
427154