
Our GitHub tool is in development, with many new features in the pipeline. If you are passionate about exposing corruption and have technical skills, we need your help! Contact us to find out more.


Signal-Watch/SignalWatch


SignalWatch 🔍

SignalWatch is a comprehensive Python-based tool for scanning and analyzing UK company data from Companies House (with optional OpenCorporates support). It detects name/date mismatches in filing documents and maps director-linked company networks.

🌟 Features

  • Data Extraction: Fetch company profiles, filing history, and download related PDFs
  • Smart Analysis: AI-powered OCR text extraction and name/date parsing
  • Mismatch Detection: Compare extracted data against official records
  • Network Discovery: Map companies through shared directors with iterative expansion
  • Rate Limit Management: Intelligent batching (600 requests / 5 minutes)
  • Resume Capability: Checkpoint system for interrupted scans
  • Multi-User Support: Users can provide their own API keys
  • Web Interface: Astra-style themed interface for easy interaction
  • Export Options: CSV, JSON, and embeddable HTML reports

📋 Prerequisites

  • Python 3.9+
  • Companies House API Key (Get one here)
  • Tesseract OCR (for PDF text extraction)

Install Tesseract OCR

Windows:

# Download installer from: https://github.com/UB-Mannheim/tesseract/wiki
# Or use chocolatey:
choco install tesseract

Linux:

sudo apt-get install tesseract-ocr

macOS:

brew install tesseract

🚀 Installation

  1. Clone the repository:
git clone https://github.com/yourusername/signalwatch.git
cd signalwatch
  2. Create and activate a virtual environment:
python -m venv venv
.\venv\Scripts\Activate.ps1
# On Linux/macOS: source venv/bin/activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Configure environment:
# Copy example environment file
copy .env.example .env
# On Linux/macOS: cp .env.example .env

# Edit .env and add your API key
notepad .env
  5. Set up directories:
python -c "from config import Config; Config.ensure_directories()"

🔑 API Key Setup

Companies House API Key (REQUIRED)

⚠️ You MUST provide your own API key - the application will not work without it.

  1. Register at Companies House Developer Hub
  2. Create an application to get your API key
  3. Enter your API key in the web interface when performing a scan

Note: For development/testing, you can add your key to the .env file:

COMPANIES_HOUSE_API_KEY=your_key_here
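
As a rough illustration of how the key is used, the Companies House public API accepts HTTP Basic authentication with the API key as the username and an empty password. The sketch below only builds a request and does not send it; the function name `build_profile_request` is hypothetical and is not taken from SignalWatch's `core/api_client.py`.

```python
import base64
import urllib.request

# Public Companies House API host.
API_BASE = "https://api.company-information.service.gov.uk"


def build_profile_request(company_number: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for a company profile.

    Hypothetical helper for illustration only.
    """
    # Basic auth: the API key is the username and the password is empty.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        f"{API_BASE}/company/{company_number}",
        headers={"Authorization": f"Basic {token}"},
    )


request = build_profile_request("00000006", "your_key_here")
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would perform the call, which counts against your rate limit.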

GitHub Token (Optional - for result caching)

SignalWatch can use GitHub as a distributed cache, sharing scan results between users to avoid duplicate API calls:

  1. Create a Personal Access Token at GitHub Settings → Tokens
  2. Required permissions: repo (Full control of private repositories)
  3. Add to the .env file:
GITHUB_TOKEN=ghp_your_token_here
  4. Ensure the target repo exists (default: https://github.com/Signal-Watch/signal-watch.git)

Benefits:

  • ⚡ Instant results for previously scanned companies
  • 💰 Reduces API usage across all users
  • 📦 Automatic archiving with timestamps
  • 🔄 Seamless fallback to a fresh scan on a cache miss

XAI/OpenAI API Key (Required if using AI extraction)

⚠️ Required when "Use AI Extraction" option is enabled.

XAI (Grok) - Recommended:

  1. Get API key from XAI Console
  2. More cost-effective and faster than OpenAI
  3. Enter your API key in the web interface when enabling AI extraction

OpenAI (Alternative):

  1. Get API key from OpenAI Platform
  2. Add to .env file:
OPENAI_API_KEY=your_openai_key_here

💻 Usage

Web Interface

Start the web server:

python app.py

Open browser: http://localhost:5000

Command Line Interface

Scan a single company:

python cli.py scan --company 00000006

Scan multiple companies:

python cli.py scan --companies 00000006,00000007,00000008

Scan with director network expansion:

python cli.py scan --company 00000006 --expand-network --max-depth 2

Resume from checkpoint:

python cli.py resume --checkpoint-file ./data/checkpoint_20250114_120000.json
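
To make the checkpoint idea concrete, here is a minimal sketch of saving and reloading scan progress. Only the `checkpoint_YYYYMMDD_HHMMSS.json` filename pattern comes from the command above; the JSON layout (`pending` / `completed`) and the function names are assumptions, not SignalWatch's actual checkpoint format.

```python
import json
from datetime import datetime
from pathlib import Path


def save_checkpoint(data_dir, pending, completed, now=None):
    """Write remaining and finished company numbers to a timestamped file."""
    now = now or datetime.now()
    path = Path(data_dir) / f"checkpoint_{now:%Y%m%d_%H%M%S}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Assumed layout: two lists of company numbers.
    state = {"pending": list(pending), "completed": list(completed)}
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return path


def load_checkpoint(path):
    """Return (pending, completed) from a checkpoint written above."""
    state = json.loads(Path(path).read_text(encoding="utf-8"))
    return state["pending"], state["completed"]
```

A resumed scan would then iterate over `pending` only, moving each company number to `completed` as it finishes.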

Export results:

python cli.py export --results ./data/results.json --format csv
python cli.py export --results ./data/results.json --format html
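
For readers who want to post-process results themselves, the sketch below flattens mismatch records into CSV using only the standard library. It assumes results shaped like the "Mismatch Detection" example in this README; the column order and the function name `mismatches_to_csv` are illustrative and may differ from what `exporters/csv_exporter.py` produces.

```python
import csv
import io

# Assumed column set, based on the sample mismatch record in this README.
FIELDS = ["company_number", "type", "expected", "found", "document", "confidence"]


def mismatches_to_csv(results: list[dict]) -> str:
    """Return one CSV row per mismatch across all scanned companies."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for result in results:
        for mismatch in result.get("mismatches", []):
            row = {key: mismatch.get(key, "") for key in FIELDS[1:]}
            row["company_number"] = result["company_number"]
            writer.writerow(row)
    return buffer.getvalue()


sample = [{
    "company_number": "00000006",
    "mismatches": [{
        "type": "name_mismatch",
        "expected": "EXAMPLE LTD",
        "found": "EXAMPLE LIMITED",
        "document": "AA000001.pdf",
        "confidence": 0.95,
    }],
}]
print(mismatches_to_csv(sample))
```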

📦 GitHub Cache Feature

SignalWatch can leverage GitHub as a distributed result cache:

How it works:

  1. Before scanning, checks if company data exists in GitHub repo
  2. If found, loads instantly (no API calls)
  3. If not found, performs fresh scan and pushes results to GitHub
  4. Results stored in /results/{company_number}/latest.json with timestamped archives

Storage structure:

results/
├── 00081701/
│   ├── latest.json              # Current scan results
│   ├── 20250119_143022.json     # Historical archives
│   └── 20250118_091245.json
└── 00146575/
    └── latest.json
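
The check-then-scan flow and the storage layout above can be sketched as follows. A plain dictionary stands in for the GitHub repository, and `cache_paths`, `get_results`, and the `scan` callback are hypothetical names; the real tool talks to GitHub rather than to an in-memory store.

```python
from datetime import datetime


def cache_paths(company_number, scanned_at):
    """Return the latest.json path and the timestamped archive path."""
    base = f"results/{company_number}"
    return f"{base}/latest.json", f"{base}/{scanned_at:%Y%m%d_%H%M%S}.json"


def get_results(company_number, store, scan, now=None):
    """Return (results, from_cache); scan and store on a cache miss.

    `store` is a dict standing in for the remote repository (assumption).
    """
    latest, archive = cache_paths(company_number, now or datetime.now())
    if latest in store:
        return store[latest], True  # cache hit: no API calls needed
    results = scan(company_number)  # cache miss: fresh scan
    store[latest] = results
    store[archive] = results  # timestamped copy for history
    return results, False
```

Unchecking the cache option in the UI corresponds to skipping the `latest in store` check and always calling `scan`.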

API Endpoints:

  • GET /api/github/available-companies - List all cached companies
  • GET /api/github/company/<number> - Get specific company data
  • Automatic push after successful scans

UI Integration:

  • Toggle "Check GitHub Cache First" on scan form (checked by default)
  • Visual indicator when data loaded from cache
  • Force refresh by unchecking cache option

📁 Project Structure

signalwatch/
├── app.py                      # Flask web application
├── cli.py                      # Command-line interface
├── config.py                   # Configuration management
├── requirements.txt            # Python dependencies
├── .env.example               # Environment template
│
├── core/                      # Core functionality
│   ├── __init__.py
│   ├── api_client.py         # Companies House API wrapper
│   ├── pdf_processor.py      # PDF download & text extraction
│   ├── mismatch_detector.py  # Name/date comparison logic
│   ├── network_scanner.py    # Director network expansion
│   ├── batch_processor.py    # Scalable processing engine
│   └── rate_limiter.py       # Rate limit management
│
├── parsers/                   # Data parsing modules
│   ├── __init__.py
│   ├── name_parser.py        # Extract company names
│   ├── date_parser.py        # Extract dates
│   └── document_parser.py    # PDF document analysis
│
├── exporters/                 # Export functionality
│   ├── __init__.py
│   ├── csv_exporter.py       # CSV generation
│   ├── json_exporter.py      # JSON generation
│   └── html_exporter.py      # HTML report generation
│
├── templates/                 # Web interface templates
│   ├── base.html
│   ├── index.html
│   ├── results.html
│   └── report.html
│
├── static/                    # CSS, JS, images
│   ├── css/
│   │   └── astra-theme.css
│   ├── js/
│   │   └── main.js
│   └── images/
│
├── data/                      # Processing data (gitignored)
├── cache/                     # API response cache (gitignored)
├── exports/                   # Generated reports (gitignored)
│
└── tests/                     # Unit tests
    ├── __init__.py
    ├── test_api_client.py
    ├── test_mismatch_detector.py
    └── test_network_scanner.py

🔧 Configuration

Edit .env file to customize:

# API Keys
COMPANIES_HOUSE_API_KEY=your_key_here

# Rate Limiting (600 requests per 5 minutes default)
RATE_LIMIT_REQUESTS=600
RATE_LIMIT_PERIOD=300

# Server Configuration
FLASK_PORT=5000
FLASK_DEBUG=False

# Data Storage
DATA_DIR=./data
CACHE_DIR=./cache
EXPORTS_DIR=./exports
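
To show how `RATE_LIMIT_REQUESTS` and `RATE_LIMIT_PERIOD` could drive request pacing, here is a minimal sliding-window limiter. It is a sketch under stated assumptions: the class and function names are hypothetical, and `core/rate_limiter.py` may use a different strategy such as batching.

```python
import os
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `period` seconds."""

    def __init__(self, max_requests, period, clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # timestamps of recent requests, oldest first

    def _evict(self):
        # Forget requests that have left the window.
        now = self._clock()
        while self._sent and now - self._sent[0] >= self.period:
            self._sent.popleft()

    def acquire(self) -> float:
        """Block until a request slot is free; return seconds waited."""
        waited = 0.0
        self._evict()
        while len(self._sent) >= self.max_requests:
            delay = self.period - (self._clock() - self._sent[0])
            self._sleep(delay)
            waited += delay
            self._evict()
        self._sent.append(self._clock())
        return waited


def limiter_from_env(env=os.environ, **kwargs):
    """Build a limiter from the .env-style settings shown above."""
    return SlidingWindowLimiter(
        int(env.get("RATE_LIMIT_REQUESTS", "600")),
        float(env.get("RATE_LIMIT_PERIOD", "300")),
        **kwargs,
    )
```

Calling `limiter.acquire()` before each API request keeps the client inside the configured window; the injectable `clock` and `sleep` arguments exist only to make the class easy to test.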

📊 Output Examples

Mismatch Detection

{
  "company_number": "00000006",
  "mismatches": [
    {
      "type": "name_mismatch",
      "expected": "EXAMPLE LTD",
      "found": "EXAMPLE LIMITED",
      "document": "AA000001.pdf",
      "confidence": 0.95
    }
  ]
}
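
A record like the one above could come from a comparison along these lines. This is only a sketch: the normalisation rule and the `similarity` score (from `difflib`) are assumptions and are not the `confidence` value that SignalWatch reports, whose calculation is not documented here.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Upper-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.upper().split())


def compare_names(expected: str, found: str, document: str):
    """Return a mismatch record, or None when the names agree."""
    a, b = normalise(expected), normalise(found)
    if a == b:
        return None
    return {
        "type": "name_mismatch",
        "expected": expected,
        "found": found,
        "document": document,
        # Illustrative string similarity, not the tool's confidence score.
        "similarity": round(SequenceMatcher(None, a, b).ratio(), 2),
    }


print(compare_names("EXAMPLE LTD", "EXAMPLE LIMITED", "AA000001.pdf"))
```

Note that this sketch treats "LTD" and "LIMITED" as different, matching the sample output; a production detector might instead normalise such suffixes before comparing.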

Director Network

{
  "seed_company": "00000006",
  "network": [
    {
      "director": "John Smith",
      "companies": ["00000006", "00000007", "00000008"],
      "depth": 1
    }
  ]
}
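
The iterative expansion behind this output can be sketched as a depth-limited breadth-first search over shared directors. The `officers` mapping below is invented sample data and `expand_network` is a hypothetical function; the real scanner fetches officers from the API and may match directors on more than their names.

```python
from collections import defaultdict


def expand_network(seed, officers, max_depth=2):
    """Walk company -> director -> company links up to `max_depth` hops.

    `officers` maps company number -> list of director names (assumed input).
    """
    by_director = defaultdict(set)
    for company, directors in officers.items():
        for director in directors:
            by_director[director].add(company)

    seen_companies, seen_directors = {seed}, set()
    frontier, network = [seed], []
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for company in frontier:
            for director in officers.get(company, []):
                if director in seen_directors:
                    continue  # each director is reported once
                seen_directors.add(director)
                companies = sorted(by_director[director])
                network.append({"director": director, "companies": companies, "depth": depth})
                for linked in companies:
                    if linked not in seen_companies:
                        seen_companies.add(linked)
                        next_frontier.append(linked)
        frontier = next_frontier
    return {"seed_company": seed, "network": network}


officers = {
    "00000006": ["John Smith"],
    "00000007": ["John Smith", "Jane Doe"],
    "00000008": ["John Smith"],
    "00000009": ["Jane Doe"],
}
print(expand_network("00000006", officers, max_depth=2))
```

Here `max_depth` plays the role of the `--max-depth` CLI option: each extra level follows the directors of companies discovered at the previous level.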

🧪 Testing

Run tests:

pytest tests/

With coverage:

pytest --cov=core --cov=parsers tests/

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Disclaimer

This tool is for legitimate research and compliance purposes only. Users must:

  • Comply with Companies House API terms of service
  • Respect rate limits and usage guidelines
  • Ensure proper data handling and privacy compliance
  • Use responsibly and ethically

🆘 Support

🙏 Acknowledgments

  • Companies House for providing the API
  • Astra theme for design inspiration
  • Open source community for excellent libraries

Built with ❤️ for transparency and due diligence

📋 Legal Disclaimer

  1. All data has been pulled from official sources such as Companies House. SignalWatch does not accept any responsibility for the accuracy of records or data. We simply present what is available at the time.

  2. The vulnerabilities the tool detects are severe, as they can enable crime and cause other systemic issues. SignalWatch does not carry out any kind of investigation or law enforcement activity. We fulfil our obligation to report any reasonable suspicion of crime to the relevant authorities.

  3. Any claims of criminality must be proven in the relevant court, and SignalWatch takes precautions to avoid making any defamatory remarks.

  4. All data is open source and publicly available on official databases. We urge users to keep data protection laws in mind.
