Automated fuzzy matching system for business records using RapidFuzz and Playwright.
- Data Cleaning: Normalize business names, addresses, phone numbers
- Fuzzy Matching: RapidFuzz-powered matching with configurable thresholds
- Multiple Strategies: Phone-based exact matching + token-set name matching
- Web Scraping: Playwright-based browser automation for public directories
- Confidence Scoring: Transparent scoring based on match quality
# Using uv (recommended)
uv venv
source .venv/bin/activate
uv pip install -e .
# Install Playwright browsers
playwright install chromium

# Run fuzzy matching on sample data
python src/main.py \
  --source data/input/sample_data_source1.csv \
  --target data/input/sample_data_source2.csv \
  --output data/output/matched_results.csv \
  --threshold 80

business-matcher/
├── src/
│   ├── data_cleaner.py    # Normalization utilities
│   ├── fuzzy_matcher.py   # Matching engine
│   ├── web_scraper.py     # Browser automation
│   └── main.py            # Workflow orchestration
├── data/
│   ├── input/             # Source CSV files
│   └── output/            # Match results
├── tests/                 # Unit tests
└── pyproject.toml         # Project configuration
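
The `src/main.py` entry point is invoked with `--source`, `--target`, `--output`, and `--threshold`. As a rough sketch of how those flags could be parsed with `argparse` (the function name `build_parser`, the help text, the required/optional split, and the float type for the threshold are assumptions, not taken from the project):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the sample command."""
    parser = argparse.ArgumentParser(
        description="Fuzzy-match business records from two CSV files."
    )
    parser.add_argument("--source", required=True, help="CSV of records to match")
    parser.add_argument("--target", required=True, help="CSV of records to match against")
    parser.add_argument("--output", required=True, help="Where to write match results")
    # Assumed default: 80, the value used in the sample command.
    parser.add_argument("--threshold", type=float, default=80.0,
                        help="Minimum name similarity (0-100) to accept a fuzzy match")
    return parser
```

Calling `build_parser().parse_args()` at the top of a `main()` function would then yield an object with `source`, `target`, `output`, and `threshold` attributes.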
Normalization handles:
- Legal entity suffixes (Inc, LLC, Corp, Ltd)
- Common abbreviations (& → and, Intl → International)
- Special characters and whitespace
- Phone number formatting (digits only)
- Address standardization (Street → St, Avenue → Ave)
- State abbreviations (California → CA)
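
To make the normalization rules above concrete, here is a minimal sketch using only the standard library. It is not the code in `data_cleaner.py`: the function names, the lookup tables (which contain only the examples listed above), and the choice to lowercase all output are assumptions made for illustration.

```python
import re

# Illustrative lookup tables; a real cleaner would need much longer lists.
LEGAL_SUFFIXES = {"inc", "llc", "corp", "ltd"}
ABBREVIATIONS = {"intl": "international"}
STREET_TYPES = {"street": "st", "avenue": "ave"}
STATES = {"california": "ca"}


def normalize_name(name: str) -> str:
    """Lowercase, expand abbreviations, drop legal suffixes and punctuation."""
    text = name.lower().replace("&", " and ")
    text = re.sub(r"[^\w\s]", " ", text)  # special characters become spaces
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    tokens = [tok for tok in tokens if tok not in LEGAL_SUFFIXES]
    return " ".join(tokens)  # split/join also collapses repeated whitespace


def normalize_phone(phone: str) -> str:
    """Keep digits only, so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", phone)


def normalize_address(address: str) -> str:
    """Lowercase, strip punctuation, shorten street types and state names."""
    text = re.sub(r"[^\w\s]", " ", address.lower())
    text = " ".join(STREET_TYPES.get(tok, tok) for tok in text.split())
    for full_name, abbreviation in STATES.items():
        # Word-boundary match so multi-word state names could be added later.
        text = re.sub(rf"\b{re.escape(full_name)}\b", abbreviation, text)
    return text
```

Under these assumptions, `normalize_phone("(555) 123-4567")` and `normalize_phone("555.123.4567")` both reduce to the same digit string, which is what makes the exact phone comparison possible.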
Uses RapidFuzz with two strategies:
- Exact Phone Match (95% confidence)
  - Fastest, highest confidence
  - Normalized digits-only comparison
- Token-Set Name Match (variable confidence)
  - Handles word order variations
  - "ABC Corp Inc" matches "Inc ABC Corporation"
  - Confidence based on similarity score

Confidence levels:
- High (80-100): Strong match, safe to auto-merge
- Medium (60-79): Probable match, review recommended
- Low (<60): Weak match, manual review required
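
The two strategies and the confidence levels can be sketched as follows. This is an illustration, not the code in `fuzzy_matcher.py`: `match_record`, `confidence_band`, the record layout (`name` and `phone` keys, already normalized), and reporting the name similarity as the confidence of an unmatched pair are all assumptions. The only RapidFuzz call used is `fuzz.token_set_ratio`. If RapidFuzz is not installed, the sketch falls back to a rough `difflib` stand-in that is not equivalent and will give different scores.

```python
from difflib import SequenceMatcher

try:
    from rapidfuzz import fuzz

    def name_similarity(a: str, b: str) -> float:
        """Token-set similarity on a 0-100 scale."""
        return fuzz.token_set_ratio(a, b)
except ImportError:
    def name_similarity(a: str, b: str) -> float:
        """Rough stand-in only: compares sorted unique tokens with difflib."""
        left = " ".join(sorted(set(a.split())))
        right = " ".join(sorted(set(b.split())))
        return 100.0 * SequenceMatcher(None, left, right).ratio()

PHONE_MATCH_CONFIDENCE = 95.0  # fixed confidence for an exact phone match


def confidence_band(score: float) -> str:
    """Map a 0-100 score onto the high / medium / low levels."""
    if score >= 80:
        return "high"
    if score >= 60:
        return "medium"
    return "low"


def match_record(source: dict, target: dict, threshold: float = 80.0) -> dict:
    """Try the phone strategy first, then fall back to the name strategy."""
    similarity = name_similarity(source["name"], target["name"])
    if source["phone"] and source["phone"] == target["phone"]:
        status, confidence = "phone", PHONE_MATCH_CONFIDENCE
    elif similarity >= threshold:
        status, confidence = "fuzzy_name", similarity
    else:
        # Assumption: an unmatched pair still reports its name similarity.
        status, confidence = "unmatched", similarity
    return {
        "match_status": status,
        "confidence_score": confidence,
        "name_similarity": similarity,
        "band": confidence_band(confidence),
    }
```

Checking the phone first keeps the cheap, high-confidence comparison ahead of the string similarity test. The empty-phone guard prevents two records with missing numbers from being treated as a phone match.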
Browser automation with Playwright:
from src.web_scraper import BusinessDirectoryScraper

scraper = BusinessDirectoryScraper(headless=True)
results = scraper.scrape_yellowpages(
    search_term="restaurants",
    location="New York, NY",
    max_results=10,
)

CSV with columns:
- match_status: phone | fuzzy_name | unmatched
- confidence_score: 0-100
- name_similarity: 0-100
- phone_match: Yes/No
- source_*: Original source record fields
- target_*: Matched target record fields
MATCH SUMMARY
==============================================================
Total records: 10
Matched: 9 (90.0%)
Unmatched: 1
Match Types:
  Phone matches: 6
  Fuzzy name matches: 3
Confidence Distribution:
  High (80-100): 8
  Medium (60-79): 1
  Low (<60): 1
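
A summary like this could be derived from the output rows roughly as sketched below. The sketch relies only on the `match_status` and `confidence_score` columns. The function names and the shape of the returned dictionary are assumptions, and the project may compute its summary differently.

```python
from collections import Counter


def confidence_band(score: float) -> str:
    """Map a 0-100 score onto the high / medium / low levels."""
    if score >= 80:
        return "high"
    if score >= 60:
        return "medium"
    return "low"


def summarize(rows: list[dict]) -> dict:
    """Count match types and confidence bands across result rows."""
    total = len(rows)
    statuses = Counter(row["match_status"] for row in rows)
    matched = statuses["phone"] + statuses["fuzzy_name"]
    # CSV values arrive as strings, so convert the score before banding.
    bands = Counter(confidence_band(float(row["confidence_score"])) for row in rows)
    return {
        "total": total,
        "matched": matched,
        "matched_pct": round(100.0 * matched / total, 1) if total else 0.0,
        "unmatched": statuses["unmatched"],
        "phone_matches": statuses["phone"],
        "fuzzy_name_matches": statuses["fuzzy_name"],
        "bands": dict(bands),
    }
```

The rows can come from `csv.DictReader` opened on the results file, for example `data/output/matched_results.csv` from the sample command.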
# Install dev dependencies
uv pip install -e ".[dev]"
# Format code
black src/
# Lint
ruff check src/
# Run tests
pytest tests/

- CRM deduplication
- Vendor database consolidation
- Lead enrichment workflows
- Multi-source data merging
- Business directory matching
- Language: Python 3.10+
- Fuzzy Matching: RapidFuzz 3.5+ (C++ backend for speed)
- Web Automation: Playwright 1.40+
- Data Processing: Pandas 2.1+
MIT