Persian Privacy Monitor is a research-oriented tool to collect and analyze privacy-related information from Persian-language websites. The project currently contains two main components:
- crawler: visits websites with Selenium, attempts to interact with cookie consent banners, scrolls pages, and collects cookies to store in MongoDB.
- scraper: extracts pages (privacy policies and other content) using Playwright/requests, readability and BeautifulSoup, then stores results for analysis.
This README documents how the code is organized, what dependencies are required, and how to run each component.
- crawler/: Selenium-based crawler that visits pages and saves cookies to MongoDB.
- scraper/: Content extraction tools (Playwright + readability + BeautifulSoup) and the scraping entry point scraper/main.py.
- utils/mongo_driver.py: Lightweight MongoDB helper used by both components (defaults to mongodb://localhost:27017, DB scraperdb).
- requirements.txt: Python dependencies.
- run.sh: Convenience script with common example commands.
- urls.txt: List of target URLs (one per line) used by the scraper and crawler.
- extractor/: Helper that reads scraped text from MongoDB, sends prompts to an LLM API, and stores responses. Entry point: extractor/main.py; implementation in extractor/prompt_api.py.
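The one-URL-per-line format of urls.txt can be read with a few lines of Python. The following is a minimal sketch, not the project's actual loader: the function name `load_urls` is hypothetical, and skipping blank lines and `#` comment lines is an assumption about how such a file might be treated.

```python
from pathlib import Path


def load_urls(path: str) -> list[str]:
    """Return the URLs listed in `path`, one per line.

    Assumption (not confirmed by the project code): blank lines and
    lines starting with '#' are ignored.
    """
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls


if __name__ == "__main__":
    # Prints each target URL found in urls.txt in the current directory.
    for url in load_urls("urls.txt"):
        print(url)
```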
- Description of code components: Done
- Installation and environment setup: Done
- Dependencies and notes (Playwright + MongoDB): Done
- How to run scraper and crawler from project root (to avoid import issues): Done
- Python 3.10+ (the project was developed with Python 3.11/3.13-style features). Use a virtual environment.
- MongoDB running locally or accessible over the network. The default URI used by MongoDriver is mongodb://localhost:27017 and the default DB is scraperdb.
- For the scraper component you will likely need Playwright browsers installed (Chromium). See the installation step below.
- A modern shell (the instructions below use zsh/bash).
- Clone the repository and enter its directory:
git clone https://github.com/<your-username>/Persian-Privacy-Monitor.git
cd Persian-Privacy-Monitor
- Create and activate a virtual environment (zsh/bash):
python -m venv .venv
source .venv/bin/activate
- Install Python dependencies:
pip install -r requirements.txt
- If you plan to run the scraper (Playwright), install browsers:
python -m playwright install chromium
- Ensure MongoDB is running and reachable. For local testing you can use the default URI and DB name; otherwise set the connection in code or pass environment variables/config (not currently implemented).
NOTE: After cloning, you can run run.sh instead of following the steps above.
See requirements.txt. Key packages used by the codebase include:
- playwright: browser automation used in scraper.
- selenium: browser automation used in crawler.
- pymongo: MongoDB client.
- beautifulsoup4, lxml, readability-lxml: HTML parsing and main-text extraction.
- httpx / requests: HTTP clients.
Note: requirements.txt contains the full pinned/unpinned list. Playwright requires separate browser installation (see above).
Important: run commands from the project root (the cloned Persian-Privacy-Monitor directory). Running individual .py files directly from their subfolders may cause import errors such as "No module named 'utils'". Run modules as python -m package.module from the project root so Python can resolve package imports.
Examples (all commands assume your virtualenv is activated and you are in the project root):
- Run the scraper (concurrent Playwright workers):
python -m scraper.main --input urls.txt --out result --parallel 3
- Run the crawler (Selenium-based cookie collector):
python -m crawler.main --input urls.txt
- Run the extractor (send prompts to LLM over scraped pages):
python -m extractor.main --input urls.txt --prompt data-practices-LLM-result/prompt/sample1.txt
- A convenience example is included in run.sh. You can inspect and run it from the project root:
bash run.sh
If you get "ModuleNotFoundError: No module named 'utils'" when running crawler/crawler.py directly, use the -m option from project root or add the project root to PYTHONPATH.
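For cases where a file has to be run directly, the "add the project root to the import path" remedy can also be applied from inside the script. This is a hedged sketch only: the helper name `add_project_root` is hypothetical and is not part of the repository, and it assumes the script sits one directory below the project root (as crawler/crawler.py does). Running with python -m from the project root remains the simpler option.

```python
import sys
from pathlib import Path


def add_project_root(script_path: str, levels_up: int = 1) -> Path:
    """Put the project root on sys.path so `import utils` can resolve.

    `levels_up=1` assumes the script lives one folder below the root,
    e.g. <root>/crawler/crawler.py.
    """
    root = Path(script_path).resolve().parents[levels_up]
    if str(root) not in sys.path:
        # Insert at the front so project packages take precedence.
        sys.path.insert(0, str(root))
    return root


if __name__ == "__main__":
    # In a real script this call would come before `from utils... import ...`.
    print(add_project_root(__file__))
```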
Security note: extractor/prompt_api.py currently contains a hard-coded API key string. Storing secrets in source is insecure. Replace this with an environment variable (for example, LLM_API_KEY) and load it at runtime instead.
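One way to apply this is sketched below. It is an illustration, not the current contents of extractor/prompt_api.py: the function name `load_api_key` is hypothetical, and LLM_API_KEY is the example variable name suggested above.

```python
import os


def load_api_key(var_name: str = "LLM_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the extractor."
        )
    return key
```

With this in place, the key would be provided in the shell before running, for example with `export LLM_API_KEY=...`, and never committed to the repository.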
The provided utils/mongo_driver.py defaults to:
- uri: mongodb://localhost:27017
- db name: scraperdb
- collection: scraped_pages (used by scraper) or cookies (crawler code inserts cookie documents; check code for exact collection name used)
Start a local MongoDB instance before running or update MongoDriver to point to your hosted MongoDB.
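Since configuring the connection through environment variables is not currently implemented, the sketch below shows one possible way to resolve the settings while keeping the documented defaults. The variable names MONGO_URI and MONGO_DB, the `MongoSettings` class, and the `mongo_settings_from_env` function are all assumptions made for illustration; none of them exist in the project.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MongoSettings:
    uri: str
    db_name: str


def mongo_settings_from_env(env=None) -> MongoSettings:
    """Resolve connection settings, falling back to the documented defaults.

    `env` is any mapping; it defaults to os.environ and is a parameter
    only to make the function easy to try out with a plain dict.
    """
    env = os.environ if env is None else env
    return MongoSettings(
        uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
        db_name=env.get("MONGO_DB", "scraperdb"),
    )


if __name__ == "__main__":
    print(mongo_settings_from_env())
```

The resulting `uri` could then be passed to `pymongo.MongoClient(uri)` wherever the client is created.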
- crawler/crawler.py: utilities for visiting pages with Selenium: navigation, cookie-banner handling, scrolling, extracting cookies, and saving cookie documents to MongoDB. It expects to import utils.mongo_driver.MongoDriver.
- scraper/main.py: CLI entry for scraping pages. Loads URLs from a file and runs Scraper (in scraper/scraper_core.py) asynchronously with Playwright.
- utils/mongo_driver.py: small wrapper around pymongo.MongoClient providing already_scraped() and insert_doc() helpers.
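To give a feel for what a wrapper with already_scraped() and insert_doc() helpers might look like, here is a minimal sketch. It is not the project's MongoDriver: the class name `PageStore`, the method signatures, and the use of a `url` field as the lookup key are assumptions. The collection is injected so the example runs without a database; `InMemoryCollection` is a hypothetical stand-in that mimics only the two pymongo collection methods used here (`find_one` and `insert_one`).

```python
class InMemoryCollection:
    """Tiny stand-in for a pymongo collection, for offline experimentation."""

    def __init__(self):
        self._docs = []

    def find_one(self, query: dict):
        # Return the first document whose fields match every query item.
        for doc in self._docs:
            if all(doc.get(k) == v for k, v in query.items()):
                return doc
        return None

    def insert_one(self, doc: dict) -> None:
        self._docs.append(dict(doc))


class PageStore:
    """Hypothetical wrapper; the real MongoDriver may differ."""

    def __init__(self, collection):
        # `collection` can be a pymongo collection or the stand-in above.
        self.collection = collection

    def already_scraped(self, url: str) -> bool:
        # Assumes each stored document carries the page URL under "url".
        return self.collection.find_one({"url": url}) is not None

    def insert_doc(self, doc: dict) -> None:
        self.collection.insert_one(doc)


if __name__ == "__main__":
    store = PageStore(InMemoryCollection())
    store.insert_doc({"url": "https://example.com", "text": "privacy policy"})
    print(store.already_scraped("https://example.com"))
```

Against a real database, the same class could be given `pymongo.MongoClient(uri)["scraperdb"]["scraped_pages"]` as its collection.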
- Module import errors: always run from project root and prefer python -m package.module.
- Playwright errors: ensure you installed browsers via python -m playwright install.
- MongoDB connection errors: verify Mongo is running and reachable at configured URI.
- Add CLI flags or environment variables for Mongo URI and database names.
- Add unit tests for core helpers (cookie extraction, content extraction).
- Add a small sample urls.txt containing a couple of known sites for quick smoke tests.
This project includes code from extracting-cookies-using-webcrawler
by MSaeidSedighi, licensed under the MIT License.
We are grateful for this contribution, which powers the cookie extraction component of our system.
This project's source code is distributed under the terms of its LICENSE file. Check LICENSE for details.