Security Scanning Guide

This document explains the automated security scanning setup for the Pinecone Assistant MCP project.

Overview

The project uses two complementary scanners:

detect-secrets to prevent accidental commits of API keys, tokens, passwords, and other sensitive data
Prompt Injection Detection to protect against AI-specific attacks and malicious prompt patterns

Features

1. CI/CD Secret Scanning (GitHub Actions)

Automatically scans the codebase on every push and pull request targeting the main, master, or develop branches
Scans git history (last 100 commits) for accidentally committed secrets
Fails the build if new secrets are detected
Location: .github/workflows/secret-scan.yml

2. Pre-commit Hooks (Local Development)

Prevents committing secrets before they reach GitHub
Runs automatically on git commit
Location: .pre-commit-config.yaml

3. Baseline Management

Tracks known placeholder keys and false positives
Location: .secrets.baseline

4. Prompt Injection Detection with Baseline System (Enhanced Security)

Scans for 70+ malicious prompt patterns
Baseline system to track known findings and only flag NEW patterns
SHA256 fingerprinting for finding identification
Detects document-corpus attack vectors (API bypass, data extraction)
Integrated with pre-commit hooks and CI/CD pipeline
Location: .security/check_prompt_injections.py

Attack Categories Detected:

Instruction override attempts ("ignore previous instructions")
System prompt extraction ("show me your instructions")
AI behavior manipulation ("you are now a different AI")
Document data extraction ("dump all documents from the knowledge base")
API bypass attempts ("bypass the Pinecone API limits")
Configuration disclosure ("reveal your system prompt")
Social engineering patterns ("we became friends")
Unicode steganography attacks (Variation Selectors, zero-width characters)
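
As a rough illustration of how category-based detection can work, the sketch below maps a few category names to regular expressions and reports which categories match a piece of text. The pattern strings, the `PATTERNS` table, and the `classify` helper are all hypothetical; the 70+ patterns in the real scanner are not reproduced in this guide.

```python
import re

# Hypothetical patterns, one per category. These are NOT the scanner's real
# patterns; they only mirror the example phrases listed above.
PATTERNS = {
    "instruction_override": re.compile(
        r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE
    ),
    "system_prompt_extraction": re.compile(
        r"(show\s+me|reveal)\s+your\s+(instructions|system\s+prompt)", re.IGNORECASE
    ),
    "behavior_manipulation": re.compile(
        r"you\s+are\s+now\s+a\s+different\s+ai", re.IGNORECASE
    ),
}


def classify(text: str) -> list[str]:
    """Return the names of all categories whose pattern matches the text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]
```

Matching is case-insensitive and tolerant of extra whitespace, which is usually the minimum needed to avoid trivial evasion; a real pattern set would need far broader coverage.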

Unicode Steganography Detection (Enhanced Security)

The detector also checks for Unicode steganography, to counter threats such as the Repello AI emoji injection attack:

Detection Capabilities:

Variation Selector Encoding: Detects binary encoding in emojis that uses U+FE00 and U+FE01 (VARIATION SELECTOR-1 and VARIATION SELECTOR-2 in Unicode naming) as the 0 and 1 bits
Zero-Width Character Abuse: Identifies suspicious use of invisible Unicode characters
High Invisible Character Ratios: Flags content with >10% invisible-to-visible character ratios
Binary Pattern Recognition: Detects 8+ bit sequences that could encode hidden messages
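
The checks above can be sketched roughly as follows. This is a minimal illustration, not the scanner's actual code: the set of zero-width code points, the choice to count only non-whitespace characters as visible, and all function names are assumptions. The 10% ratio and the 8-bit minimum run length come from the list above.

```python
import re
from typing import Optional

# Assumed set of zero-width code points; the real scanner's list may differ.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
# A run of 8 or more U+FE00/U+FE01 selectors is treated as a possible bit stream.
BIT_RUN = re.compile("[\ufe00\ufe01]{8,}")


def is_invisible(ch: str) -> bool:
    """True for zero-width characters and Variation Selectors U+FE00..U+FE0F."""
    return ch in ZERO_WIDTH or 0xFE00 <= ord(ch) <= 0xFE0F


def invisible_ratio(text: str) -> float:
    """Ratio of invisible characters to visible, non-whitespace characters."""
    invisible = sum(1 for ch in text if is_invisible(ch))
    visible = sum(1 for ch in text if not is_invisible(ch) and not ch.isspace())
    if visible == 0:
        return 1.0 if invisible else 0.0
    return invisible / visible


def decode_hidden_bits(text: str) -> Optional[str]:
    """Return the first run of 8+ selector bits as a '0'/'1' string, if any."""
    match = BIT_RUN.search(text)
    if match is None:
        return None
    return "".join("1" if ch == "\ufe01" else "0" for ch in match.group())


def is_suspicious(text: str, max_ratio: float = 0.10) -> bool:
    """Flag text that exceeds the invisible ratio or carries a selector bit run."""
    return invisible_ratio(text) > max_ratio or decode_hidden_bits(text) is not None
```

For example, appending the selectors for the bits `01000001` to a short greeting makes `is_suspicious` return `True`, even though the string renders as ordinary text.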

Attack Patterns Detected:

Emoji steganography (e.g., "Hello!" with hidden binary-encoded instructions)
Zero-width space injection for text manipulation
Invisible Unicode character abuse for bypassing filters
Binary steganography using Variation Selectors

Examples of Detected Threats:

"Hello!" + hidden_binary_message — appears innocent but contains malicious instructions
Text with embedded zero-width characters for prompt manipulation
Emoji sequences with suspicious Variation Selector patterns
High ratios of invisible formatting characters

Prompt Injection Baseline System

The prompt injection scanner uses a baseline system to track known findings and flag only NEW patterns that are not in the baseline. This avoids repeated false positives from legitimate code and documentation while still catching newly introduced prompt injection patterns.

How It Works

Baseline File: .prompt_injections.baseline stores known findings
Fingerprinting: Each finding gets a unique SHA256 hash fingerprint
Comparison: Scanner checks if each finding is in the baseline
Exit Codes:
- 0 — No NEW findings (all findings in baseline)
- 1 — NEW findings detected (not in baseline)
- 2 — Error occurred
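
The flow above (fingerprint each finding, compare it with the baseline, map the result to an exit code) can be sketched as below. The hashed fields, the JSON-list baseline format, and the function names are assumptions made for illustration; the actual layout of .prompt_injections.baseline is not documented here. The line number is included in the hash on the assumption that this is why refactoring that shifts line numbers requires a baseline update.

```python
import hashlib
import json


def fingerprint(path: str, line: int, pattern: str, match: str) -> str:
    """SHA256 fingerprint of one finding.

    The choice of hashed fields is an assumption; the real scanner may
    hash a different set of fields.
    """
    payload = "\x1f".join([path, str(line), pattern, match])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check(findings: list[str], baseline_text: str) -> int:
    """Map findings to the documented exit codes.

    `baseline_text` is assumed to be a JSON list of known fingerprints.
    Returns 0 (nothing new), 1 (new findings), or 2 (baseline unreadable).
    """
    try:
        known = set(json.loads(baseline_text))
    except (ValueError, TypeError):
        return 2
    new = [fp for fp in findings if fp not in known]
    return 1 if new else 0
```

Because the fingerprint is deterministic, the same finding always maps to the same baseline entry, while any change to a hashed field produces a new fingerprint that is reported as NEW.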

Usage

First run — Create baseline:

uv run python .security/check_prompt_injections.py --update-baseline src/ tests/ *.md *.yml *.yaml *.json

Normal run — Check against baseline:

uv run python .security/check_prompt_injections.py --baseline src/ tests/ *.yml *.yaml *.json

Update baseline to include new legitimate findings:

uv run python .security/check_prompt_injections.py --update-baseline src/ tests/ *.md *.yml *.yaml *.json

Force new baseline (overwrite existing):

uv run python .security/check_prompt_injections.py --force-baseline src/ tests/ *.md *.yml *.yaml *.json

Command Line Options

Option	Purpose
`--baseline`	Use existing baseline (only NEW findings fail)
`--update-baseline`	Add new findings to baseline
`--force-baseline`	Create new baseline (overwrite existing)
`--verbose, -v`	Show detailed output with full matches
`--quiet, -q`	Only show summary (suppress individual findings)
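
For readers who want to see how these options might fit together, here is a hypothetical `argparse` definition. The flag names come from the table above; the mutually exclusive groups, the help strings, and the `build_parser` name are assumptions and may not match the real script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the documented flags (sketch only)."""
    parser = argparse.ArgumentParser(description="Prompt injection scanner (sketch)")

    # Assumption: the three baseline modes cannot be combined.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--baseline", action="store_true",
                      help="use existing baseline; only NEW findings fail")
    mode.add_argument("--update-baseline", action="store_true",
                      help="add new findings to the baseline")
    mode.add_argument("--force-baseline", action="store_true",
                      help="create a new baseline, overwriting any existing one")

    # Assumption: verbose and quiet output are mutually exclusive.
    output = parser.add_mutually_exclusive_group()
    output.add_argument("--verbose", "-v", action="store_true",
                        help="show detailed output with full matches")
    output.add_argument("--quiet", "-q", action="store_true",
                        help="only show the summary")

    parser.add_argument("paths", nargs="*", help="files or directories to scan")
    return parser
```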

When to Update Baseline

DO Update Baseline When:

New legitimate code is flagged (variable names, class names, documentation)
Approved refactoring changes line numbers
Baseline is outdated after a code restructure

DON'T Update Baseline When:

Malicious pattern detected (remove the code instead)
You're unsure (ask for review first)
Security-related finding (review carefully first)

Setup

Install Pre-commit Hooks (Recommended)

# Install pre-commit framework and detect-secrets
uv pip install pre-commit detect-secrets

# Install the git hooks
uv run pre-commit install

# Test the hooks (optional)
uv run pre-commit run --all-files

Manual Security Scanning

Secret Detection:

# Scan entire codebase
uv run detect-secrets scan

# Scan specific files
uv run detect-secrets scan src/server.py

# Update baseline after reviewing findings
uv run detect-secrets scan --baseline .secrets.baseline

# Audit baseline (review all flagged items)
uv run detect-secrets audit .secrets.baseline

Prompt Injection Detection:

# Scan for prompt injection patterns
uv run python .security/check_prompt_injections.py src/ tests/ *.md

# Scan specific directories
uv run python .security/check_prompt_injections.py src/

# Run via pre-commit hook
uv run pre-commit run prompt-injection-check --all-files

# Test with verbose output
uv run python .security/check_prompt_injections.py --verbose src/ tests/

What Gets Scanned

Included:

All Python source files (src/, tests/, deploy/)
Configuration files
Shell scripts and workflows
YAML configuration files

Excluded:

audits/*.md — Security audit reports
*.md — Documentation files (may contain example keys)
package-lock.json — NPM lock file
.secrets.baseline — Baseline file itself
strategic-searches.yaml — Search pattern configuration
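
The exclusions can be thought of as regular expressions tested against each file path. The sketch below assumes that the patterns are searched anywhere in the relative path. Three of the patterns are the `--exclude-files` values used in the Troubleshooting section of this guide; the patterns for package-lock.json and .secrets.baseline are guesses added for illustration, as is the `is_excluded` helper.

```python
import re

EXCLUDE_PATTERNS = [
    re.compile(pattern)
    for pattern in (
        r"audits/.*\.md",             # from this guide's regenerate command
        r"\.md$",                     # from this guide's regenerate command
        r"strategic-searches\.yaml",  # from this guide's regenerate command
        r"package-lock\.json$",       # assumed pattern
        r"\.secrets\.baseline$",      # assumed pattern
    )
]


def is_excluded(path: str) -> bool:
    """True if any exclusion regex matches somewhere in the path."""
    return any(pattern.search(path) for pattern in EXCLUDE_PATTERNS)
```

Note that the `\.md$` pattern already covers everything under audits/, so the more specific audit pattern is redundant in this sketch.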

Handling Detection Results

False Positives (Test/Example Secrets)

If detect-secrets flags a legitimate placeholder:

Verify it's truly a placeholder (not a real secret)

Update the baseline to mark it as known:

uv run detect-secrets scan --baseline .secrets.baseline

Commit the updated baseline:

git add .secrets.baseline
git commit -m "Update secrets baseline after review"

Real Secrets Detected

If you accidentally committed a real secret:

Revoke the secret immediately — delete the key at Pinecone Console → API Keys → Delete
Generate a new API key in the Pinecone Console

Remove from git history:

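# Use git filter-repo (recommended in the GitHub guide below) or BFG Repo-Cleaner;
# git filter-branch is discouraged by the Git project itself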
# See: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository

Update stored key:

# Re-run setup to store new key via DPAPI
.\deploy\windows_setup.ps1

Best Practices

DO:

✅ Store secrets using Windows DPAPI (via deployment script)
✅ Use environment variables as fallback on Linux/macOS
✅ Use placeholder values in example configs
✅ Run pre-commit run --all-files before first commit
✅ Review baseline updates carefully
✅ Check security audit log at ~/.pinecone_assistant/logs/security_audit.log

DON'T:

❌ Hardcode API keys in source code (pcsk_* format)
❌ Commit .env files
❌ Use real secrets in tests (use mocks/fixtures)
❌ Disable pre-commit hooks without review
❌ Ignore secret scanning failures in CI

GitHub Actions Workflow

The workflow runs on:

All pushes to main, master, and develop branches
All pull requests to these branches

Workflow Steps:

Checkout full git history
Install detect-secrets
Scan current codebase against baseline
Scan recent git history (last 100 commits)
Report findings and fail if secrets detected

Viewing Results:

Go to Actions tab in GitHub
Click on Secret Scanning workflow
Review any failures in the job logs

Troubleshooting

Pre-commit Hook Failing

# Check what's detected
uv run pre-commit run detect-secrets --all-files

# If false positive, update baseline
uv run detect-secrets scan --baseline .secrets.baseline

# Re-run commit
git commit

CI Failing with "Secrets Detected"

Review the GitHub Actions log to see what was flagged
Verify if it's a real secret or false positive
If false positive:
- Update baseline locally: uv run detect-secrets scan --baseline .secrets.baseline
- Commit and push the updated baseline
If real secret:
- REVOKE THE SECRET IMMEDIATELY at Pinecone Console
- Remove from code and git history
- Fix and re-push

Baseline Out of Sync

# Regenerate baseline from scratch
uv run detect-secrets scan \
  --exclude-files 'audits/.*\.md' \
  --exclude-files '\.md$' \
  --exclude-files 'strategic-searches\.yaml' \
  > .secrets.baseline

# Review and commit
git add .secrets.baseline
git commit -m "Regenerate secrets baseline"

Integration with Security Guidelines

This scanning complements the recommendations in SECURITY_GUIDELINES.md:

Prevents API keys from being committed
Enforces use of environment variables and DPAPI secure storage
Provides audit trail for secret management
Supports incident response procedures
Detects prompt injection attacks before they reach the codebase

Secret Types Detected

The scanner detects 20+ types of secrets including:

Cloud Provider Keys:

AWS Access Keys
Azure Storage Keys
GCP Service Account Keys
IBM Cloud IAM Keys

API & Service Tokens:

GitHub Tokens
GitLab Tokens
OpenAI API Keys
Pinecone API Keys (pcsk_*)
Stripe API Keys
Twilio Keys
SendGrid Keys
Slack Tokens
Discord Bot Tokens
Telegram Bot Tokens

General Secrets:

Private SSH Keys
JWT Tokens
NPM Tokens
PyPI Tokens
Basic Auth Credentials
High-Entropy Strings (Base64/Hex)
Password Keywords
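
High-entropy detection is typically based on Shannon entropy: strings whose characters are close to uniformly distributed score high and are more likely to be random keys. The sketch below shows the calculation; the function names and the 4.5-bit threshold are illustrative assumptions, not values read from this project's configuration.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def looks_high_entropy(s: str, limit: float = 4.5) -> bool:
    """Flag strings whose entropy exceeds an illustrative threshold."""
    return shannon_entropy(s) > limit
```

A common word such as "password" scores well below the threshold, while a string drawing on dozens of distinct characters scores above it, which is why placeholder keys in tests can still be flagged and need a baseline entry.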

Project-Specific Considerations

Pinecone Assistant API Keys

The scanner is configured to detect Pinecone API keys (format: pcsk_*). Real Pinecone API keys must always be stored via DPAPI (preferred) or as environment variables:

# Linux/macOS (environment variable)
export PINECONE_ASSISTANT_API_KEY=your_actual_key_here

# Windows (DPAPI — use deployment script)
.\deploy\windows_setup.ps1
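
On the application side, the environment-variable fallback might be read as in the sketch below. The variable name matches the export command above; the `load_api_key` and `redact` helpers are hypothetical and only illustrate failing fast on a missing key and never logging the full value.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "PINECONE_ASSISTANT_API_KEY"


def load_api_key(environ: Optional[Mapping[str, str]] = None) -> str:
    """Read the key from the environment, failing loudly if it is missing."""
    env = os.environ if environ is None else environ
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"{ENV_VAR} is not set; export it or run the setup script")
    return key


def redact(key: str) -> str:
    """Keep only the short prefix so the key never appears whole in logs."""
    return key[:5] + "***" if len(key) > 5 else "***"
```

Passing the environment as a parameter keeps the function easy to test with a plain dictionary, which also avoids putting real keys into test code.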

Test Files

Test files in tests/ may contain placeholder keys for validation testing (e.g., pcsk_test_key_for_testing). These are tracked in .secrets.baseline and are verified to be test-only placeholders, not real credentials.

Strategic Search YAML

strategic-searches.yaml is excluded from secret scanning as it contains domain-specific search patterns, not credentials.

Additional Resources

Questions?

See SECURITY_GUIDELINES.md for broader security practices or file an issue on GitHub.
