Commit f185242

fix
1 parent 26943fd commit f185242

3 files changed: 23 additions and 283 deletions

README.md

Lines changed: 10 additions & 283 deletions
@@ -1,124 +1,26 @@
 # GoSecretScanv2
 
-GoSecretScanv2 is an engineering-focused security scanner that detects secrets, API keys, credentials, and common security misconfigurations using deterministic analysis plus optional LLM-based verification.
+GoSecretScanv2 is a fast secret scanner for code. It uses deterministic patterns with entropy and light context. LLM verification is optional.
 
-## Features
+## Overview
 
-### Core Detection
+- Detects credentials, API keys, private keys, and connection strings
+- CLI and GitHub Actions support
+- Sensible defaults; no services required
+- Optional local LLM verification for triage
 
-- **70+ Detection Patterns**: Comprehensive regex patterns for detecting:
-  - Cloud provider credentials (AWS, Azure, GCP)
-  - API keys and tokens (GitHub, Slack, JWT)
-  - Private keys (SSH, RSA, PGP)
-  - Database connection strings
-  - Basic authentication credentials
-  - Security vulnerabilities (XSS, SQL injection patterns)
 
-### Advanced Intelligence
-
-- **Shannon Entropy Analysis**:
-  - Calculates randomness of detected strings
-  - Identifies high-entropy secrets vs low-entropy false positives
-  - Entropy scoring (0-8 bits) for each finding
-
-- **Context-Aware Detection**:
-  - Automatically detects test files, mocks, and examples
-  - Identifies comments, documentation, and templates
-  - Recognizes placeholders and environment variable templates
-  - Filters false positives from regex pattern definitions
-
-- **Confidence Scoring System**:
-  - Every finding rated: Critical, High, Medium, or Low
-  - Combines entropy analysis + context detection + pattern matching
-  - Only reports medium confidence or higher (low confidence filtered out)
-  - Prioritizes critical findings first
-
-- **Smart Filtering**:
-  - Skips false positives automatically
-  - Handles large files and minified code (1MB line buffer)
-  - Pattern definition detection
-
-### LLM-Powered Verification (beta)
-
-- **LLM Verification**:
-  - Uses IBM Granite 4.0 Micro (GGUF, Q4 quantized, ~450MB)
-  - Provides structured reasoning for each decision
-
-- **Semantic Embedding Search**:
-  - Generates embeddings for each finding
-  - Searches for similar patterns across the codebase
-  - Reuses historical verifications for similar matches
-
-- **Vector Store**:
-  - SQLite-based vector database
-  - Caches verified findings
-  - Enables incremental learning
-  - Fast similarity search
-
-- **Code Context Analysis**:
-  - Parses code structure (functions, imports)
-  - Understands programming language syntax
-  - Gathers surrounding code for context
-  - Identifies test vs production code
-
-**Enabling LLM Verification**:
+### Optional: LLM verification
 
 ```bash
-# Download the model first (one-time setup)
 ./scripts/download-models.sh
-
-# Start the llama.cpp HTTP server (runs on :8080 by default)
-./scripts/run-llama-server.sh
-
-# In a different terminal, run with LLM verification
+./scripts/run-llama-server.sh # exposes http://localhost:8080
 ./gosecretscanner --llm
 
-# Custom model path
-./gosecretscanner --llm --model-path=/path/to/granite-4.0-micro.Q4_K_M.gguf
-
-# Point to a remote llama.cpp endpoint
-./gosecretscanner --llm --llm-endpoint=http://localhost:8080
-
-# Run the llama.cpp server in the background via Docker
-DETACH=true PORT=8080 HOST_NETWORK=true SERVER_PORT=8080 ./scripts/run-llama-server.sh
-
-# Adjust similarity threshold for vector search
-./gosecretscanner --llm --similarity=0.9
-```
-
-**Environment Variables**:
-
-```bash
-# Enable LLM verification
-export GOSECRETSCANNER_LLM_ENABLED=true
-
-# Set model path
-export GOSECRETSCANNER_MODEL_PATH=.gosecretscanner/models/granite-4.0-micro.Q4_K_M.gguf
-
-# Override the llama.cpp endpoint (defaults to http://localhost:8080)
+# Optionally point to a remote/local endpoint
 export GOSECRETSCANNER_LLM_ENDPOINT=http://localhost:8080
-
-# Launch llama.cpp in detached mode with a custom image/port
-DETACH=true LLAMA_CPP_IMAGE=ghcr.io/ggerganov/llama.cpp:full HOST_NETWORK=true PORT=8080 ./scripts/run-llama-server.sh
-
-# Set vector database path
-export GOSECRETSCANNER_DB_PATH=.gosecretscanner/findings.db
 ```
 
-### Performance
-
-- **Runtime characteristics**:
-  - Pre-compiled regex patterns for fast scanning
-  - Concurrent file processing using goroutines
-  - Thread-safe result aggregation
-  - Fallback paths that avoid external dependencies when optional components are unavailable
-
-- **Operational notes**:
-  - Minimal configuration required for local runs
-  - Color-coded terminal output with confidence levels
-  - Automatic recursive directory scanning with ignore rules
-  - Results grouped by severity to aid triage
-
 ## Installation
 
 ### From Source
@@ -151,7 +53,7 @@ docker run --rm -v /path/to/scan:/workspace gosecretscanner
 
 ### GitHub Actions
 
-The bundled `action.yml` now supports full LLM verification. Key inputs:
+Action inputs (when using `enable-llm`):
 
 - `enable-llm`: set to `'true'` to download Granite, launch llama.cpp via Docker, and run the scan with `--llm`.
 - `model-path`: overrides the GGUF path (relative to the action directory by default).
@@ -187,93 +89,7 @@ The scanner will:
 3. Report any secrets found with file location and line numbers
 4. Exit with code 1 if secrets are found, 0 otherwise
 
-### Example Output
-
-```
-------------------------------------------------------------------------
-Secrets found:
-
-=== CRITICAL FINDINGS ===
-
-File: /path/to/config.go (Secret)
-Line Number: 42
-Confidence: CRITICAL (Entropy: 4.85)
-Context: code
-Pattern: (?i)_(AWS_Key):[\\s'\"=]A[KS]IA[0-9A-Z]{16}[\\s'\"]
-Line: const awsKey = "AKIAIOSFODNN7EXAMPLE"
-
-=== HIGH CONFIDENCE ===
-
-File: /path/to/auth.py (Secret)
-Line Number: 15
-Confidence: HIGH (Entropy: 4.52)
-Context: code
-Pattern: (?i)api_key(?:\s*[:=]\s*|\s*["'\s])?([a-zA-Z0-9_\-]{32,})
-Line: api_key = "sk_live_51a8f9c2e3b4d5f6g7h8"
-
-=== MEDIUM CONFIDENCE ===
-
-File: /path/to/test.js (Secret)
-Line Number: 89
-Confidence: MEDIUM (Entropy: 3.91)
-Context: test_file
-Pattern: (?i)password(?:\s*[:=]\s*|\s*["'\s])?([a-zA-Z0-9!@#$%^&*()_+]{8,})
-Line: const testPassword = "TestPass123"
-
-------------------------------------------------------------------------
-Summary: 3 secrets found (Critical: 1, High: 1, Medium: 1)
-Please review and remove them before committing your code.
-```
-
-**Output details:**
-- Results grouped by confidence level (Critical → High → Medium)
-- Entropy score shows randomness (higher = more likely real secret)
-- Context indicates where the secret was found (code, test_file, comment, etc.)
-- Low confidence findings are automatically filtered out
-
-## Detected Patterns
 
-### Cloud Provider Credentials
-
-- **AWS**:
-  - Access Key IDs (AKIA...)
-  - Secret Access Keys
-  - STS Tokens
-
-- **Azure**:
-  - Client IDs and Secrets
-  - Tenant IDs
-  - Subscription IDs
-  - Access Keys
-
-- **Google Cloud Platform**:
-  - API Keys (AIza...)
-  - Application Credentials
-  - Service Account Keys
-  - Client IDs and Secrets
-
-### Private Keys
-
-- SSH Private Keys
-- RSA Private Keys
-- PGP Private Keys
-- Generic Private Keys (PEM format)
-
-### Authentication & Secrets
-
-- Basic Authentication tokens
-- API Keys
-- Bearer tokens
-- JWT tokens
-- Passwords and credentials
-- Database connection strings
-
-### Security Vulnerabilities
-
-- Cross-Site Scripting (XSS) patterns
-- SQL Injection patterns
-- Hardcoded IP addresses
-- S3 Bucket URLs
 
 ## Integration with CI/CD
 
@@ -312,32 +128,6 @@ jobs:
 fail-on-secrets: 'true'
 ```
 
-#### Action Inputs
-
-- `scan-path`: Directory path to scan (default: `.`)
-- `fail-on-secrets`: Fail the workflow if secrets are found (default: `true`)
-
-#### Action Outputs
-
-- `secrets-found`: Number of secrets detected
-- `scan-status`: Status of the scan (`success`, `failed`, or `error`)
-
-#### Advanced Usage
-
-```yaml
-- name: Run Secret Scanner with outputs
-  id: scan
-  uses: m1rl0k/GoSecretScanv2@main
-  with:
-    scan-path: './src'
-    fail-on-secrets: 'false'
-
-- name: Report results
-  if: always()
-  run: |
-    echo "Secrets found: ${{ steps.scan.outputs.secrets-found }}"
-    echo "Status: ${{ steps.scan.outputs.scan-status }}"
-```
 
 ## Development
 
@@ -359,69 +149,6 @@ go test ./...
 gofmt -w .
 ```
 
-## How It Works
-
-### Scanning Pipeline
-
-1. **Pattern Compilation**: On startup, all 70+ regex patterns are pre-compiled for optimal performance
-2. **Directory Walking**: Uses `filepath.Walk` to recursively traverse the directory tree
-3. **Concurrent Scanning**: Each file is scanned in a separate goroutine for parallel processing
-4. **Smart Filtering**: Regex pattern definitions and binary content are skipped
-5. **Pattern Matching**: Each line is checked against all compiled patterns
-6. **Entropy Analysis**: Shannon entropy calculated for each match
-7. **Context Detection**: File path and line content analyzed for context
-8. **Confidence Scoring**: Multi-factor scoring combines entropy + context + pattern type
-9. **Result Filtering**: Only medium+ confidence findings are reported
-10. **Priority Grouping**: Results grouped by confidence level (Critical → High → Medium)
-11. **Thread-Safe Results**: Uses mutex locks to safely collect results from concurrent scans
-
-### Advanced Algorithms
-
-#### Shannon Entropy Calculation
-
-```
-H(X) = -Σ P(x) * log₂(P(x))
-```
-
-- Measures randomness of detected strings
-- High entropy (>4.5): Likely a real secret (random characters)
-- Low entropy (<3.5): Likely a false positive (repeated patterns)
-
-#### Confidence Scoring Algorithm
-
-```
-Base Score: 50
-
-Entropy Adjustments:
-  + 30 if entropy > 4.5 (very random)
-  + 20 if entropy > 4.0 (quite random)
-  + 10 if entropy > 3.5 (moderately random)
-  - 10 if entropy <= 3.5 (low randomness)
-
-Context Adjustments:
-  - 50 for placeholders (${VAR}, YOUR_KEY)
-  - 45 for templates (REPLACE_ME, CHANGE_ME)
-  - 40 for test files
-  - 35 for documentation
-  - 30 for comments
-  + 10 for actual code
-
-Pattern Adjustments:
-  + 15 for AWS keys, private keys (critical patterns)
-
-Final Mapping:
-  ≥ 80: Critical
-  ≥ 60: High
-  ≥ 40: Medium
-  < 40: Low (filtered out)
-```
-
-## Performance Characteristics
-
-- Regex patterns are compiled once during startup.
-- Files are scanned concurrently using a bounded worker pool.
-- Common directories such as `.git` and `node_modules` are skipped automatically.
-- Files are streamed line-by-line to limit memory usage.
 
 ## Current Limitations

main.go

Lines changed: 11 additions & 0 deletions
@@ -523,6 +523,17 @@ func detectContext(path, line string) string {
 	pathLower := strings.ToLower(path)
 	lineUpper := strings.ToUpper(line)
 
+	// Documentation/examples should not be treated as real secrets
+	docExts := []string{".md", ".rst", ".adoc", ".txt"}
+	for _, ext := range docExts {
+		if strings.HasSuffix(pathLower, ext) {
+			return "documentation"
+		}
+	}
+	if strings.Contains(pathLower, "/docs/") || strings.Contains(pathLower, "\\docs\\") {
+		return "documentation"
+	}
+
 	// Test file detection (treat real test scaffolding as tests; do not down-rank examples/demo)
 	testPatterns := []string{"test", "spec", "mock", "fixture"}
 	for _, pattern := range testPatterns {
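
Note on the added check: as written, the `/docs/` test needs a separator before `docs`, so a relative path such as `docs/setup.go` would match neither it nor the extension list. The `docs/setup.md` test case passes only because of its `.md` suffix. A standalone sketch of one way to cover that case follows. `isDocumentationPath` is a hypothetical helper, not a function in main.go, and normalizing separators up front is an assumption about how the check could be tightened.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// isDocumentationPath is a hypothetical helper that mirrors the check added to
// detectContext: a doc-like extension or a "docs" directory in the path.
func isDocumentationPath(name string) bool {
	// Normalize Windows separators so one slash-based check covers both styles.
	p := strings.ToLower(strings.ReplaceAll(name, "\\", "/"))

	switch path.Ext(p) {
	case ".md", ".rst", ".adoc", ".txt":
		return true
	}
	// Prefix a slash so a relative path starting with "docs/" also matches.
	return strings.Contains("/"+p, "/docs/")
}

func main() {
	for _, p := range []string{
		"README.md",
		"docs/setup.go",
		"pkg\\docs\\x.go",
		"pkg/service/foo.go",
	} {
		fmt.Printf("%-20s %v\n", p, isDocumentationPath(p))
	}
}
```

The sketch uses `path.Ext` rather than `strings.HasSuffix` so that only the final extension is compared; both approaches agree on the four extensions in the diff.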

main_test.go

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,8 @@ func TestDetectContext(t *testing.T) {
 		{"placeholder dollar", "pkg/service/foo.go", "token := \"$SECRET_TOKEN\"", "placeholder"},
 		{"code with percent formatting", "pkg/service/foo.go", "fmt.Printf(\"token=%s\", token)", "code"},
 		{"pointer code not comment", "pkg/service/foo.go", "value := foo * bar", "code"},
+		{"markdown documentation", "docs/setup.md", "Example TOKEN=foo", "documentation"},
+		{"readme file", "README.md", "Set API_KEY=foo", "documentation"},
 	}
 
 	for _, tc := range cases {
