fix

m1rl0k · m1rl0k · commit f185242e5201 · 2025-11-14T08:49:51.000-05:00
diff --git a/README.md b/README.md
@@ -1,124 +1,26 @@
 # GoSecretScanv2
 
-GoSecretScanv2 is an engineering-focused security scanner that detects secrets, API keys, credentials, and common security misconfigurations using deterministic analysis plus optional LLM-based verification.
+GoSecretScanv2 is a fast secret scanner for code. It uses deterministic patterns with entropy and light context. LLM verification is optional.
 
-## Features
+## Overview
 
-### Core Detection
+- Detects credentials, API keys, private keys, and connection strings
+- CLI and GitHub Actions support
+- Sensible defaults; no services required
+- Optional local LLM verification for triage
 
-- **70+ Detection Patterns**: Comprehensive regex patterns for detecting:
-  - Cloud provider credentials (AWS, Azure, GCP)
-  - API keys and tokens (GitHub, Slack, JWT)
-  - Private keys (SSH, RSA, PGP)
-  - Database connection strings
-  - Basic authentication credentials
-  - Security vulnerabilities (XSS, SQL injection patterns)
 
-### Advanced Intelligence
-
-- **Shannon Entropy Analysis**:
-  - Calculates randomness of detected strings
-  - Identifies high-entropy secrets vs low-entropy false positives
-  - Entropy scoring (0-8 bits) for each finding
-
-- **Context-Aware Detection**:
-  - Automatically detects test files, mocks, and examples
-  - Identifies comments, documentation, and templates
-  - Recognizes placeholders and environment variable templates
-  - Filters false positives from regex pattern definitions
-
-- **Confidence Scoring System**:
-  - Every finding rated: Critical, High, Medium, or Low
-  - Combines entropy analysis + context detection + pattern matching
-  - Only reports medium confidence or higher (low confidence filtered out)
-  - Prioritizes critical findings first
-
-- **Smart Filtering**:
-  - Skips false positives automatically
-  - Handles large files and minified code (1MB line buffer)
-  - Pattern definition detection
-
-### LLM-Powered Verification (beta)
-
-- **LLM Verification**:
-  - Uses IBM Granite 4.0 Micro (GGUF, Q4 quantized, ~450MB)
-  - Provides structured reasoning for each decision
-
-- **Semantic Embedding Search**:
-  - Generates embeddings for each finding
-  - Searches for similar patterns across the codebase
-  - Reuses historical verifications for similar matches
-
-- **Vector Store**:
-  - SQLite-based vector database
-  - Caches verified findings
-  - Enables incremental learning
-  - Fast similarity search
-
-- **Code Context Analysis**:
-  - Parses code structure (functions, imports)
-  - Understands programming language syntax
-  - Gathers surrounding code for context
-  - Identifies test vs production code
-
-**Enabling LLM Verification**:
+### Optional: LLM verification
 
 ```bash
-# Download the model first (one-time setup)
 ./scripts/download-models.sh
-
-# Start the llama.cpp HTTP server (runs on :8080 by default)
-./scripts/run-llama-server.sh
-
-# In a different terminal, run with LLM verification
+./scripts/run-llama-server.sh   # exposes http://localhost:8080
 ./gosecretscanner --llm
 
-# Custom model path
-./gosecretscanner --llm --model-path=/path/to/granite-4.0-micro.Q4_K_M.gguf
-
-# Point to a remote llama.cpp endpoint
-./gosecretscanner --llm --llm-endpoint=http://localhost:8080
-
-# Run the llama.cpp server in the background via Docker
-DETACH=true PORT=8080 HOST_NETWORK=true SERVER_PORT=8080 ./scripts/run-llama-server.sh
-
-# Adjust similarity threshold for vector search
-./gosecretscanner --llm --similarity=0.9
-```
-
-**Environment Variables**:
-
-```bash
-# Enable LLM verification
-export GOSECRETSCANNER_LLM_ENABLED=true
-
-# Set model path
-export GOSECRETSCANNER_MODEL_PATH=.gosecretscanner/models/granite-4.0-micro.Q4_K_M.gguf
-
-# Override the llama.cpp endpoint (defaults to http://localhost:8080)
+# Optionally point to a remote/local endpoint
 export GOSECRETSCANNER_LLM_ENDPOINT=http://localhost:8080
-
-# Launch llama.cpp in detached mode with a custom image/port
-DETACH=true LLAMA_CPP_IMAGE=ghcr.io/ggerganov/llama.cpp:full HOST_NETWORK=true PORT=8080 ./scripts/run-llama-server.sh
-
-# Set vector database path
-export GOSECRETSCANNER_DB_PATH=.gosecretscanner/findings.db
 ```
 
-### Performance
-
-- **Runtime characteristics**:
-  - Pre-compiled regex patterns for fast scanning
-  - Concurrent file processing using goroutines
-  - Thread-safe result aggregation
-  - Fallback paths that avoid external dependencies when optional components are unavailable
-
-- **Operational notes**:
-  - Minimal configuration required for local runs
-  - Color-coded terminal output with confidence levels
-  - Automatic recursive directory scanning with ignore rules
-  - Results grouped by severity to aid triage
-
 ## Installation
 
 ### From Source
@@ -151,7 +53,7 @@ docker run --rm -v /path/to/scan:/workspace gosecretscanner
 
 ### GitHub Actions
 
-The bundled `action.yml` now supports full LLM verification. Key inputs:
+Action inputs (when using `enable-llm`):
 
 - `enable-llm`: set to `'true'` to download Granite, launch llama.cpp via Docker, and run the scan with `--llm`.
 - `model-path`: overrides the GGUF path (relative to the action directory by default).
@@ -187,93 +89,7 @@ The scanner will:
 3. Report any secrets found with file location and line numbers
 4. Exit with code 1 if secrets are found, 0 otherwise
 
-### Example Output
-
-```
-------------------------------------------------------------------------
-Secrets found:
-
-=== CRITICAL FINDINGS ===
-
-File: /path/to/config.go (Secret)
-Line Number: 42
-Confidence: CRITICAL (Entropy: 4.85)
-Context: code
-Pattern: (?i)_(AWS_Key):[\\s'\"=]A[KS]IA[0-9A-Z]{16}[\\s'\"]
-Line: const awsKey = "AKIAIOSFODNN7EXAMPLE"
-
-=== HIGH CONFIDENCE ===
-
-File: /path/to/auth.py (Secret)
-Line Number: 15
-Confidence: HIGH (Entropy: 4.52)
-Context: code
-Pattern: (?i)api_key(?:\s*[:=]\s*|\s*["'\s])?([a-zA-Z0-9_\-]{32,})
-Line: api_key = "sk_live_51a8f9c2e3b4d5f6g7h8"
-
-=== MEDIUM CONFIDENCE ===
-
-File: /path/to/test.js (Secret)
-Line Number: 89
-Confidence: MEDIUM (Entropy: 3.91)
-Context: test_file
-Pattern: (?i)password(?:\s*[:=]\s*|\s*["'\s])?([a-zA-Z0-9!@#$%^&*()_+]{8,})
-Line: const testPassword = "TestPass123"
-
-------------------------------------------------------------------------
-Summary: 3 secrets found (Critical: 1, High: 1, Medium: 1)
-Please review and remove them before committing your code.
-```
-
-**Output details:**
-- Results grouped by confidence level (Critical → High → Medium)
-- Entropy score shows randomness (higher = more likely real secret)
-- Context indicates where the secret was found (code, test_file, comment, etc.)
-- Low confidence findings are automatically filtered out
-
-## Detected Patterns
 
-### Cloud Provider Credentials
-
-- **AWS**:
-  - Access Key IDs (AKIA...)
-  - Secret Access Keys
-  - STS Tokens
-
-- **Azure**:
-  - Client IDs and Secrets
-  - Tenant IDs
-  - Subscription IDs
-  - Access Keys
-
-- **Google Cloud Platform**:
-  - API Keys (AIza...)
-  - Application Credentials
-  - Service Account Keys
-  - Client IDs and Secrets
-
-### Private Keys
-
-- SSH Private Keys
-- RSA Private Keys
-- PGP Private Keys
-- Generic Private Keys (PEM format)
-
-### Authentication & Secrets
-
-- Basic Authentication tokens
-- API Keys
-- Bearer tokens
-- JWT tokens
-- Passwords and credentials
-- Database connection strings
-
-### Security Vulnerabilities
-
-- Cross-Site Scripting (XSS) patterns
-- SQL Injection patterns
-- Hardcoded IP addresses
-- S3 Bucket URLs
 
 ## Integration with CI/CD
 
@@ -312,32 +128,6 @@ jobs:
           fail-on-secrets: 'true'
 ```
 
-#### Action Inputs
-
-- `scan-path`: Directory path to scan (default: `.`)
-- `fail-on-secrets`: Fail the workflow if secrets are found (default: `true`)
-
-#### Action Outputs
-
-- `secrets-found`: Number of secrets detected
-- `scan-status`: Status of the scan (`success`, `failed`, or `error`)
-
-#### Advanced Usage
-
-```yaml
-- name: Run Secret Scanner with outputs
-  id: scan
-  uses: m1rl0k/GoSecretScanv2@main
-  with:
-    scan-path: './src'
-    fail-on-secrets: 'false'
-
-- name: Report results
-  if: always()
-  run: |
-    echo "Secrets found: ${{ steps.scan.outputs.secrets-found }}"
-    echo "Status: ${{ steps.scan.outputs.scan-status }}"
-```
 
 ## Development
 
@@ -359,69 +149,6 @@ go test ./...
 gofmt -w .
 ```
 
-## How It Works
-
-### Scanning Pipeline
-
-1. **Pattern Compilation**: On startup, all 70+ regex patterns are pre-compiled for optimal performance
-2. **Directory Walking**: Uses `filepath.Walk` to recursively traverse the directory tree
-3. **Concurrent Scanning**: Each file is scanned in a separate goroutine for parallel processing
-4. **Smart Filtering**: Regex pattern definitions and binary content are skipped
-5. **Pattern Matching**: Each line is checked against all compiled patterns
-6. **Entropy Analysis**: Shannon entropy calculated for each match
-7. **Context Detection**: File path and line content analyzed for context
-8. **Confidence Scoring**: Multi-factor scoring combines entropy + context + pattern type
-9. **Result Filtering**: Only medium+ confidence findings are reported
-10. **Priority Grouping**: Results grouped by confidence level (Critical → High → Medium)
-11. **Thread-Safe Results**: Uses mutex locks to safely collect results from concurrent scans
-
-### Advanced Algorithms
-
-#### Shannon Entropy Calculation
-
-```
-H(X) = -Σ P(x) * log₂(P(x))
-```
-
-- Measures randomness of detected strings
-- High entropy (>4.5): Likely a real secret (random characters)
-- Low entropy (<3.5): Likely a false positive (repeated patterns)
-
-#### Confidence Scoring Algorithm
-
-```
-Base Score: 50
-
-Entropy Adjustments:
-+ 30 if entropy > 4.5 (very random)
-+ 20 if entropy > 4.0 (quite random)
-+ 10 if entropy > 3.5 (moderately random)
-- 10 if entropy <= 3.5 (low randomness)
-
-Context Adjustments:
-- 50 for placeholders (${VAR}, YOUR_KEY)
-- 45 for templates (REPLACE_ME, CHANGE_ME)
-- 40 for test files
-- 35 for documentation
-- 30 for comments
-+ 10 for actual code
-
-Pattern Adjustments:
-+ 15 for AWS keys, private keys (critical patterns)
-
-Final Mapping:
-≥ 80: Critical
-≥ 60: High
-≥ 40: Medium
-< 40: Low (filtered out)
-```
-
-## Performance Characteristics
-
-- Regex patterns are compiled once during startup.
-- Files are scanned concurrently using a bounded worker pool.
-- Common directories such as `.git` and `node_modules` are skipped automatically.
-- Files are streamed line-by-line to limit memory usage.
 
 ## Current Limitations
 
diff --git a/main.go b/main.go
@@ -523,6 +523,17 @@ func detectContext(path, line string) string {
 	pathLower := strings.ToLower(path)
 	lineUpper := strings.ToUpper(line)
 
+	// Documentation/examples should not be treated as real secrets
+	docExts := []string{".md", ".rst", ".adoc", ".txt"}
+	for _, ext := range docExts {
+		if strings.HasSuffix(pathLower, ext) {
+			return "documentation"
+		}
+	}
+	if strings.Contains(pathLower, "/docs/") || strings.Contains(pathLower, "\\docs\\") {
+		return "documentation"
+	}
+
 	// Test file detection (treat real test scaffolding as tests; do not down-rank examples/demo)
 	testPatterns := []string{"test", "spec", "mock", "fixture"}
 	for _, pattern := range testPatterns {
diff --git a/main_test.go b/main_test.go
@@ -16,6 +16,8 @@ func TestDetectContext(t *testing.T) {
 		{"placeholder dollar", "pkg/service/foo.go", "token := \"$SECRET_TOKEN\"", "placeholder"},
 		{"code with percent formatting", "pkg/service/foo.go", "fmt.Printf(\"token=%s\", token)", "code"},
 		{"pointer code not comment", "pkg/service/foo.go", "value := foo * bar", "code"},
+		{"markdown documentation", "docs/setup.md", "Example TOKEN=foo", "documentation"},
+		{"readme file", "README.md", "Set API_KEY=foo", "documentation"},
 	}
 
 	for _, tc := range cases {

Original file line number	Diff line number	Diff line change
`@@ -16,6 +16,8 @@ func TestDetectContext(t *testing.T) {`
`16`	`16`	`{"placeholder dollar", "pkg/service/foo.go", "token := \"$SECRET_TOKEN\"", "placeholder"},`
`17`	`17`	`{"code with percent formatting", "pkg/service/foo.go", "fmt.Printf(\"token=%s\", token)", "code"},`
`18`	`18`	`{"pointer code not comment", "pkg/service/foo.go", "value := foo * bar", "code"},`
	`19`	`+ {"markdown documentation", "docs/setup.md", "Example TOKEN=foo", "documentation"},`
	`20`	`+ {"readme file", "README.md", "Set API_KEY=foo", "documentation"},`
`19`	`21`	`}`
`20`	`22`
`21`	`23`	`for _, tc := range cases {`