Scraper for OpenGovUS business registrations. Collects company name, address, category, and registration date into CSVs.

OpenGovUS Business Scraper

Structured business listings → clean CSVs, with post-processing & summaries.

A production-style Python scraper that collects business registration data from OpenGovUS and writes tidy CSVs. It renders pages with Playwright, parses with BeautifulSoup, and includes optional post-processing and summary steps so clients can use the data immediately.


🔍 Key Features

  • Dynamic rendering with Playwright (Chromium) to handle JS.
  • Structured fields exported to CSV: Business Name, Address, Category, Date Registered.
  • Pagination across multiple result pages.
  • Basic stealth tactics to reduce trivial bot detection.
  • Post-processing script to dedupe, clean, and sort records.
  • Summary generator (plain text + Markdown) for quick insights.
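The dedupe, clean, and sort step listed above could look roughly like the following. This is a minimal standard-library sketch, not the contents of postprocess.py (which, per the tech stack, uses pandas). The `clean_records` name and the choice of "same name plus same address" as the duplicate key are assumptions made for illustration.

```python
import re

COLUMNS = ["Business Name", "Address", "Category", "Date Registered"]


def clean_records(rows):
    """Trim whitespace, drop rows without a name, dedupe, and sort."""
    seen = set()
    cleaned = []
    for row in rows:
        # Collapse runs of whitespace and trim each known field.
        record = {
            col: re.sub(r"\s+", " ", row.get(col) or "").strip()
            for col in COLUMNS
        }
        if not record["Business Name"]:
            continue
        # Assumption: identical name + address (ignoring case) means a duplicate.
        key = (record["Business Name"].casefold(), record["Address"].casefold())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    # Sort by name, then by registration date as a tie-breaker.
    return sorted(
        cleaned,
        key=lambda r: (r["Business Name"].casefold(), r["Date Registered"]),
    )
```

Keeping the first occurrence of each duplicate is one possible policy; keeping the most recently registered row would be an equally reasonable choice.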

⚙️ Quick Start

Prerequisites

  • Python 3.10+
  • Git
  • Playwright browsers (install step below)

Installation

# 1) Clone
git clone https://github.com/mdugan8186/opengovus-scraper.git
cd opengovus-scraper

# 2) (optional) Virtual environment
python -m venv .venv
# macOS/Linux:
source .venv/bin/activate
# Windows:
.venv\Scripts\activate

# 3) Install dependencies
pip install -r requirements.txt

# 4) Install Playwright browsers (first run only)
python -m playwright install chromium

Run the Scraper

python script.py
  • Writes the raw CSV to: output/opengovus_listings.csv
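For readers who want to see what the CSV-writing step amounts to, here is a small sketch using only the standard library. It assumes each scraped listing is a dict keyed by the four column names; `write_listings` is a hypothetical helper, and script.py may structure this differently.

```python
import csv
from pathlib import Path

COLUMNS = ["Business Name", "Address", "Category", "Date Registered"]


def write_listings(rows, path="output/opengovus_listings.csv"):
    """Write listing dicts to a CSV file, creating the folder if needed."""
    out_path = Path(path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with out_path.open("w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" silently drops keys outside COLUMNS.
        writer = csv.DictWriter(f, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Addresses that contain commas are quoted automatically by the `csv` module, so they stay in a single column.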

Optional: Post-process & Summarize

# Clean & sort, save to samples/cleaned_listings.csv
python postprocess.py

# Create text + markdown summaries from the cleaned CSV
python summarize_data.py
  • Outputs:
    • samples/cleaned_listings.csv
    • output/summary.txt
    • samples/summary.md
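As an illustration of what a Markdown summary might contain, the sketch below counts records per category and renders a small table. The README does not specify which statistics summarize_data.py reports, so the `summarize` function, the heading text, and the "Unknown" label are all assumptions.

```python
from collections import Counter


def summarize(rows, top_n=5):
    """Build a Markdown summary: total count plus the top categories."""
    # Treat a missing or empty category as "Unknown" (an assumed convention).
    categories = Counter(r.get("Category") or "Unknown" for r in rows)
    lines = [
        "# Listing Summary",
        "",
        f"Total records: {len(rows)}",
        "",
        "| Category | Count |",
        "| --- | --- |",
    ]
    for category, count in categories.most_common(top_n):
        lines.append(f"| {category} | {count} |")
    return "\n".join(lines) + "\n"
```

The returned string can be written directly to a `.md` file or printed to the terminal for a quick look.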

📁 Output

  • Primary CSV: output/opengovus_listings.csv
  • Cleaned CSV (optional): samples/cleaned_listings.csv

Columns

Business Name, Address, Category, Date Registered
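When consuming the CSV downstream, it can help to confirm the header before trusting the rows. The following is a minimal sketch; `load_listings` is a hypothetical helper and is not part of the repository.

```python
import csv

EXPECTED_COLUMNS = ["Business Name", "Address", "Category", "Date Registered"]


def load_listings(source):
    """Read listing rows from an open text file, checking the header first."""
    reader = csv.DictReader(source)
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"CSV is missing expected columns: {missing}")
    return list(reader)
```

Typical usage would be `with open("output/opengovus_listings.csv", newline="", encoding="utf-8") as f: rows = load_listings(f)`.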

🧩 Configuration & Selectors

  • CSS selectors and parsing logic live in the code (script.py). If the site HTML changes, update the selectors there.
  • For long-term maintainability, the selectors could be moved out of the code into a JSON file under config/ (future enhancement, not implemented yet).
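One way the selector-config idea could work is sketched below. Everything here is hypothetical: the file name `config/selectors.json`, the keys, and the default CSS selectors are placeholders and do not reflect the selectors actually used in script.py.

```python
import json
from pathlib import Path

# Placeholder defaults; the real selectors live in script.py.
DEFAULT_SELECTORS = {
    "row": "table tbody tr",
    "name": "td:nth-child(1)",
    "address": "td:nth-child(2)",
    "category": "td:nth-child(3)",
    "date_registered": "td:nth-child(4)",
}


def load_selectors(path="config/selectors.json"):
    """Return the defaults, overridden by any values found in the JSON file."""
    config_path = Path(path)
    if not config_path.exists():
        return dict(DEFAULT_SELECTORS)
    with config_path.open(encoding="utf-8") as f:
        overrides = json.load(f)
    # Reject misspelled keys early instead of silently ignoring them.
    unknown = sorted(set(overrides) - set(DEFAULT_SELECTORS))
    if unknown:
        raise KeyError(f"Unknown selector keys: {unknown}")
    return {**DEFAULT_SELECTORS, **overrides}
```

With this layout, a site markup change would only require editing the JSON file rather than the scraper code.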

🎥 Demo

Example of the scraper output:

[Screenshot: OpenGovUS Output]

The full dataset is saved as a CSV: output/opengovus_listings.csv


🧪 Testing & Dev Notes

See TESTING.md for a step-by-step sanity flow (render → extract → clean → summarize), selector maintenance notes, and data-quality checks.


🛠️ Tech Stack

  • Playwright (Python) for rendering
  • BeautifulSoup for parsing
  • pandas for cleaning & summaries
  • CSV outputs for easy analysis

⚖️ Legal & Ethical Use

This scraper uses basic measures (delays, browser automation) to reduce trivial blocking and keep data collection reliable.
It is provided for educational and demonstration purposes only. Review and comply with the target site's terms of service and robots.txt before running it, especially at scale.
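A robots.txt check and a polite delay can both be done with the standard library. The sketch below is illustrative only: the helper names, the `my-bot` user agent, the example.com URLs, and the delay bounds are assumptions, and the scraper's own delay logic may differ.

```python
import random
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse robots.txt content that has already been downloaded as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random interval between requests and return its length."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


if __name__ == "__main__":
    # Example rules; a real run would fetch the target site's robots.txt.
    rules = build_robot_rules("User-agent: *\nDisallow: /private/\n")
    for url in ("https://example.com/listings", "https://example.com/private/x"):
        if rules.can_fetch("my-bot", url):
            print("allowed:", url)
            polite_pause(0.1, 0.2)
        else:
            print("skipped (disallowed):", url)
```

Passing `sleep` as a parameter keeps the delay easy to stub out in tests.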


📄 License

This project is licensed under the MIT License. See LICENSE.


👤 About

Mike Dugan — Python Web Scraper & Automation Developer
