SignalWatch is a comprehensive Python-based tool for scanning and analyzing UK company data from Companies House (with optional OpenCorporates support). It detects name/date mismatches in filing documents and maps director-linked company networks.
- Data Extraction: Fetch company profiles, filing history, and download related PDFs
- Smart Analysis: AI-powered OCR text extraction and name/date parsing
- Mismatch Detection: Compare extracted data against official records
- Network Discovery: Map companies through shared directors with iterative expansion
- Rate Limit Management: Intelligent batching (600 requests / 5 minutes)
- Resume Capability: Checkpoint system for interrupted scans
- Multi-User Support: Users can provide their own API keys
- Web Interface: Astra-style themed interface for easy interaction
- Export Options: CSV, JSON, and embeddable HTML reports
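The rate limit above (600 requests per 5 minutes) can be pictured as a sliding-window limiter. The following is a minimal sketch, not SignalWatch's actual rate_limiter.py; the class name SlidingWindowLimiter and its methods are hypothetical and chosen only for illustration.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any window of `period` seconds.

    Hypothetical helper for illustration; the defaults mirror the
    600 requests / 5 minutes figure quoted in the feature list.
    """

    def __init__(self, max_requests=600, period=300.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.period = period
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._stamps = deque()   # timestamps of requests still in the window

    def wait_time(self):
        """Seconds to wait before the next request is allowed (0.0 if none)."""
        now = self._clock()
        # Drop timestamps that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) < self.max_requests:
            return 0.0
        return self.period - (now - self._stamps[0])

    def acquire(self):
        """Block until a request slot is free, then record the request."""
        delay = self.wait_time()
        while delay > 0:
            self._sleep(delay)
            delay = self.wait_time()
        self._stamps.append(self._clock())
```

Calling acquire() before each API request keeps a scan inside the quota without fixed pauses between requests.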
- Python 3.9+
- Companies House API key (available from the Companies House Developer Hub)
- Tesseract OCR (for PDF text extraction)
Windows:
# Download installer from: https://github.com/UB-Mannheim/tesseract/wiki
# Or use chocolatey:
choco install tesseract
Linux:
sudo apt-get install tesseract-ocr
macOS:
brew install tesseract
- Clone the repository:
git clone https://github.com/yourusername/signalwatch.git
cd signalwatch
- Create virtual environment:
python -m venv venv
.\venv\Scripts\Activate.ps1
- Install dependencies:
pip install -r requirements.txt
- Configure environment:
# Copy example environment file
copy .env.example .env
# Edit .env and add your API key
notepad .env
- Set up directories:
python -c "from config import Config; Config.ensure_directories()"
- Register at the Companies House Developer Hub
- Create an application to get your API key
- Enter your API key in the web interface when performing a scan
Note: For development/testing, you can add your key to the .env file:
COMPANIES_HOUSE_API_KEY=your_key_here
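As an illustration of how the key might be used once it is set: the Companies House API authenticates with HTTP Basic auth, taking the API key as the username and an empty password. The sketch below assumes the .env file has already been loaded into the process environment (for example by a dotenv loader); the function names are hypothetical and are not taken from SignalWatch's api_client.py.

```python
import base64
import os


def load_api_key(env=os.environ):
    """Read the key from the environment (hypothetical helper).

    Assumes the .env file has already been loaded into the environment.
    """
    key = env.get("COMPANIES_HOUSE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("COMPANIES_HOUSE_API_KEY is not set")
    return key


def companies_house_auth_header(api_key):
    """Build the Basic auth header: API key as username, empty password."""
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

The returned dictionary can be passed as request headers by whichever HTTP client the project uses.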
SignalWatch can use GitHub as a distributed cache, sharing scan results between users to avoid duplicate API calls:
- Create a Personal Access Token at GitHub Settings → Tokens
- Required permissions: repo (Full control of private repositories)
- Add to the .env file:
GITHUB_TOKEN=ghp_your_token_here
- Ensure the target repo exists (default: https://github.com/Signal-Watch/signal-watch.git)
Benefits:
- Instant results for previously scanned companies
- Reduces API usage across all users
- Automatic archiving with timestamps
- Seamless fallback to a fresh scan on a cache miss
XAI (Grok) - Recommended:
- Get API key from XAI Console
- More cost-effective and faster than OpenAI
- Enter your API key in the web interface when enabling AI extraction
OpenAI (Alternative):
- Get API key from OpenAI Platform
- Add to the .env file:
OPENAI_API_KEY=your_openai_key_here
Start the web server:
python app.py
Open browser: http://localhost:5000
Scan a single company:
python cli.py scan --company 00000006
Scan multiple companies:
python cli.py scan --companies 00000006,00000007,00000008
Scan with director network expansion:
python cli.py scan --company 00000006 --expand-network --max-depth 2
Resume from checkpoint:
python cli.py resume --checkpoint-file ./data/checkpoint_20250114_120000.json
Export results:
python cli.py export --results ./data/results.json --format csv
python cli.py export --results ./data/results.json --format html
SignalWatch can leverage GitHub as a distributed result cache:
How it works:
- Before scanning, checks if company data exists in GitHub repo
- If found, loads instantly (no API calls)
- If not found, performs fresh scan and pushes results to GitHub
- Results are stored in /results/{company_number}/latest.json, with timestamped archives
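The steps above amount to a cache-first lookup. Here is a minimal sketch in which a plain dictionary stands in for the GitHub repository and `scan` stands in for the real Companies House scan; both stand-ins, and the function name get_company_results, are assumptions made for illustration.

```python
def get_company_results(company_number, cache, scan, use_cache=True):
    """Return (results, source), where source is "cache" or "fresh".

    `cache` is any dict-like store keyed by company number (a stand-in
    for the GitHub repo); `scan` is a callable that performs a fresh scan.
    """
    if use_cache:
        cached = cache.get(company_number)
        if cached is not None:
            return cached, "cache"       # cache hit: no API calls needed
    results = scan(company_number)       # cache miss or forced refresh
    cache[company_number] = results      # push results for later users
    return results, "fresh"
```

Passing use_cache=False skips the lookup but still writes the new results back, so a forced refresh also updates the shared copy.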
Storage structure:
results/
├── 00081701/
│   ├── latest.json              # Current scan results
│   ├── 20250119_143022.json     # Historical archives
│   └── 20250118_091245.json
└── 00146575/
    └── latest.json
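To make the layout concrete, the sketch below writes the same structure to a local directory: one latest.json per company plus a timestamped archive copy. It is only an illustration; the real tool stores results in a GitHub repository, and the function name store_results and the use of UTC for the timestamp are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def store_results(root, company_number, results, now=None):
    """Write latest.json and a timestamped archive; return both paths."""
    now = now or datetime.now(timezone.utc)   # UTC is an assumption
    company_dir = Path(root) / "results" / company_number
    company_dir.mkdir(parents=True, exist_ok=True)

    payload = json.dumps(results, indent=2)
    # Archive names follow the YYYYMMDD_HHMMSS pattern shown above.
    archive = company_dir / f"{now:%Y%m%d_%H%M%S}.json"
    latest = company_dir / "latest.json"
    archive.write_text(payload, encoding="utf-8")
    latest.write_text(payload, encoding="utf-8")
    return latest, archive
```

Each call overwrites latest.json and adds a new archive file, so older scans stay available alongside the current one.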
API Endpoints:
- GET /api/github/available-companies - List all cached companies
- GET /api/github/company/<number> - Get specific company data
- Automatic push after successful scans
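A small client for the two GET endpoints might look like the sketch below. It assumes the web server is running locally on port 5000 (the address used in the quick start) and that the endpoints return JSON; the response shape is not documented here, so the code only decodes it. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumed local development server


def cache_url(company_number=None, base_url=BASE_URL):
    """Build the URL for the company list, or for one cached company."""
    if company_number is None:
        return f"{base_url}/api/github/available-companies"
    return f"{base_url}/api/github/company/{company_number}"


def fetch_json(url, timeout=10):
    """GET a URL and decode the body as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, fetch_json(cache_url("00081701")) would request the cached data for that company from a running server.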
UI Integration:
- Toggle "Check GitHub Cache First" on scan form (checked by default)
- Visual indicator when data loaded from cache
- Force refresh by unchecking cache option
signalwatch/
├── app.py                    # Flask web application
├── cli.py                    # Command-line interface
├── config.py                 # Configuration management
├── requirements.txt          # Python dependencies
├── .env.example              # Environment template
│
├── core/                     # Core functionality
│   ├── __init__.py
│   ├── api_client.py         # Companies House API wrapper
│   ├── pdf_processor.py      # PDF download & text extraction
│   ├── mismatch_detector.py  # Name/date comparison logic
│   ├── network_scanner.py    # Director network expansion
│   ├── batch_processor.py    # Scalable processing engine
│   └── rate_limiter.py       # Rate limit management
│
├── parsers/                  # Data parsing modules
│   ├── __init__.py
│   ├── name_parser.py        # Extract company names
│   ├── date_parser.py        # Extract dates
│   └── document_parser.py    # PDF document analysis
│
├── exporters/                # Export functionality
│   ├── __init__.py
│   ├── csv_exporter.py       # CSV generation
│   ├── json_exporter.py      # JSON generation
│   └── html_exporter.py      # HTML report generation
│
├── templates/                # Web interface templates
│   ├── base.html
│   ├── index.html
│   ├── results.html
│   └── report.html
│
├── static/                   # CSS, JS, images
│   ├── css/
│   │   └── astra-theme.css
│   ├── js/
│   │   └── main.js
│   └── images/
│
├── data/                     # Processing data (gitignored)
├── cache/                    # API response cache (gitignored)
├── exports/                  # Generated reports (gitignored)
│
└── tests/                    # Unit tests
    ├── __init__.py
    ├── test_api_client.py
    ├── test_mismatch_detector.py
    └── test_network_scanner.py
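The director network expansion listed under core/network_scanner.py, with its --max-depth limit, can be pictured as a breadth-first traversal over shared directors. The sketch below is not the project's implementation: the two lookup callables, directors_of and companies_of, are hypothetical stand-ins for API calls.

```python
from collections import deque


def expand_network(seed, directors_of, companies_of, max_depth=2):
    """Breadth-first expansion from a seed company.

    directors_of(company) -> iterable of director identifiers
    companies_of(director) -> iterable of company numbers
    Returns {company_number: depth at which it was first reached}.
    """
    depths = {seed: 0}
    queue = deque([seed])
    while queue:
        company = queue.popleft()
        depth = depths[company]
        if depth >= max_depth:
            continue                      # do not expand beyond the limit
        for director in directors_of(company):
            for linked in companies_of(director):
                if linked not in depths:  # visit each company only once
                    depths[linked] = depth + 1
                    queue.append(linked)
    return depths
```

Tracking visited companies in the depths dictionary keeps the traversal from looping when two companies share more than one director.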
Edit the .env file to customize:
# API Keys
COMPANIES_HOUSE_API_KEY=your_key_here
# Rate Limiting (600 requests per 5 minutes default)
RATE_LIMIT_REQUESTS=600
RATE_LIMIT_PERIOD=300
# Server Configuration
FLASK_PORT=5000
FLASK_DEBUG=False
# Data Storage
DATA_DIR=./data
CACHE_DIR=./cache
EXPORTS_DIR=./exports
{
"company_number": "00000006",
"mismatches": [
{
"type": "name_mismatch",
"expected": "EXAMPLE LTD",
"found": "EXAMPLE LIMITED",
"document": "AA000001.pdf",
"confidence": 0.95
}
]
}
{
"seed_company": "00000006",
"network": [
{
"director": "John Smith",
"companies": ["00000006", "00000007", "00000008"],
"depth": 1
}
]
}
Run tests:
pytest tests/
With coverage:
pytest --cov=core --cov=parsers tests/
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
This tool is for legitimate research and compliance purposes only. Users must:
- Comply with Companies House API terms of service
- Respect rate limits and usage guidelines
- Ensure proper data handling and privacy compliance
- Use responsibly and ethically
- Issues: GitHub Issues
- Documentation: Wiki
- Companies House for providing the API
- Astra theme for design inspiration
- Open source community for excellent libraries
Built with ❤️ for transparency and due diligence
Legal Disclaimer
- All data has been pulled from official sources such as Companies House. SignalWatch does not accept any responsibility for the accuracy of records or data. We simply present what is available at the time.
- The vulnerabilities the tool detects are severe, as they can enable crime and cause other systemic issues. SignalWatch does not carry out any kind of investigation or law enforcement activities. We fulfil our obligations of reporting any reasonable suspicion of crime to the relevant authorities.
- Any claims of criminality must be proven in the relevant court, and SignalWatch takes precautions to avoid making any defamatory remarks.
- All data is open source and publicly available on official databases. We urge users to keep data protection laws in mind.