Releases: privacy-tech-lab/gpc-web-crawler
August/September/October 2025 Crawl (partial)
Small note on data:
- Part of this crawl (CA batches 1-8, CO batch 1) was performed with the prior release (up through August 24th), which is why both this release and the August 2025 release are partial.
- A small number of domains in the analysis.json files had to be entered manually because they were not saved properly. This has since been fixed and should not affect the data (see #192 for more details).
- We discovered that URL classification strings were being truncated at 5,000 characters. This does not affect most websites, but where it does, some URL classification data is lost. We will fix this in an upcoming release (see #199 for more details).
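Records affected by the truncation can be flagged after the fact: a value whose stored length sits exactly at the limit was most likely cut off. A minimal sketch (the function name and the idea of flagging by exact length are ours, not part of the crawler):

```python
TRUNCATION_LIMIT = 5000  # character limit observed in this release (see #199)

def is_possibly_truncated(url_classification: str) -> bool:
    """Flag a urlClassification string that may have been cut off.

    A value whose length reaches the storage limit was most likely
    truncated; shorter values were stored in full.
    """
    return len(url_classification) >= TRUNCATION_LIMIT
```

This only detects candidates; a string that happened to be exactly 5,000 characters long would be flagged as well.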
To pull the exact image versions used in this release:
docker pull ghcr.io/privacy-tech-lab/crawl-driver@sha256:df3f24c27cf8e1f551ef98a993176d987a48abc9ea0cb73b477c6f1d5cb7d636
docker pull ghcr.io/privacy-tech-lab/well-known-crawl@sha256:c207d6905f6b54bdfbfbf45b2fb76422a50972948b4a77c177ae1faf5ec612f5
docker pull ghcr.io/privacy-tech-lab/rest-api@sha256:d0386ef8d04f98ad52fd87ba0611316af7c26e156af29becbf47072b87af6026
docker pull ghcr.io/privacy-tech-lab/mariadb-custom@sha256:5b4bb8ecd383c26f14f808f0d99f9aa4619d62e619a17cc79b8e2905b2a0262e
August 2025 Crawl (partial)
Differences from May 2025 Crawl:
- The REST API Dockerfile now uses node:18 instead of node:16, fixing a bug caused by an archived Debian version. The August 2025 crawl itself was performed with node:16, but this change should not affect the collected data (see #182 for more details).
- This crawl comprises CA batches 1-8 and CO batch 1. The rest of the crawl was performed with the fix, as reflected in the August/September/October 2025 release.
To pull the exact image versions used in this release:
docker pull ghcr.io/privacy-tech-lab/crawl-driver@sha256:0a304d6a105da5a01e45ea6462255187d0b1bab3f5ba2571489815958b425c31
docker pull ghcr.io/privacy-tech-lab/well-known-crawl@sha256:6b3cf17d156566159826eeebde0f495d7d096d13e09d29b3922eefc6f21c4469
docker pull ghcr.io/privacy-tech-lab/rest-api@sha256:a9bcac0bbfc35b05bfa6f6c536b1d20fa4d46b7d3fa6b432f6c0dc035de3a509
docker pull ghcr.io/privacy-tech-lab/mariadb-custom@sha256:24b7cb1fb51433b2149043de27182dd66c14329c7620ac6c417c52e9b235acf8
May 2025 Crawl
Differences from Feb/Mar 2025 Crawl:
- Updated the well-known Python script to write "None" to the well-known-data.csv file rather than None, to prevent blank cells.
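The fix matters because Python's csv module serializes None as an empty field, which leaves a blank cell; writing the literal string keeps the value visible. A minimal sketch of the difference (the helper is illustrative, not the actual script):

```python
import csv
import io

def write_row(value):
    """Write a single CSV row to a string and return it."""
    buf = io.StringIO()
    csv.writer(buf).writerow(["example.com", value])
    return buf.getvalue().strip()

# None becomes an empty cell, while the string "None" is preserved.
blank = write_row(None)    # "example.com,"
kept = write_row("None")   # "example.com,None"
```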
To pull the exact image versions used in this release:
docker pull ghcr.io/privacy-tech-lab/crawl-driver@sha256:1955a5a8e9dd06a84e92e87c44eecaaa248d44ce03c1a155888b00a82b1833df
docker pull ghcr.io/privacy-tech-lab/well-known-crawl@sha256:06110d2c0c56a6e214a03c93d9e88f0ea4c25953318587f8fa9d939789cad191
docker pull ghcr.io/privacy-tech-lab/rest-api@sha256:932cd43ebb38d4ff9e806822c5ecd8e886484e77bdf0361c3749a44f7ba7daf8
docker pull ghcr.io/privacy-tech-lab/mariadb-custom@sha256:0f08aa392b890a33f8f747c123d45577be2184115461385729c7fcab86691ab0
February/March 2025 Crawl
Introduced Docker containerization for the web crawler.
- Dockerization: The crawler is now fully containerized, including the addition of Dockerfiles for MariaDB and the complete isolation of containers from the local machine
- Improved Efficiency: Reduced wait times and added a retry policy for persistent services, enhancing crawl speed and reliability
- Enhanced Error Handling and Cleanup: More robust error handling implemented, along with automatic container shutdown and cleanup of volumes to ensure a cleaner environment after crawls
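The retry policy for persistent services can be pictured as a bounded wait loop: the crawler re-checks readiness (e.g. a database ping) a fixed number of times before giving up. A sketch under assumed names; the check, retry count, and delay are illustrative, not the actual Docker configuration:

```python
import time

def wait_for_service(check, retries=5, delay=2.0):
    """Retry a readiness check until it succeeds or retries run out.

    check: a zero-argument callable returning True once the service
    (e.g. MariaDB) is ready to accept connections.
    """
    for _ in range(retries):
        if check():
            return True
        time.sleep(delay)  # back off before the next attempt
    return False
```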
June 2024 Crawl
Differences from April 2024 Crawl:
- added a GPP version field that identifies whether a site uses GPP v1.0 or v1.1
April 2024 Crawl
Differences from February 2024 crawl:
- well-known data is no longer collected by the crawler. We use a Python script instead, which is also included in this repo.
- longer database values are now stored as TEXT instead of varchar
- addition of OneTrustWPCCPAGoogleOptOut and OTGPPConsent cookies
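The well-known script queries each site's GPC well-known resource (`/.well-known/gpc.json`, as defined by the GPC specification, with a boolean `gpc` field and a `lastUpdate` timestamp). A minimal sketch of the parsing step only; the network code is omitted and the handling of malformed responses is our assumption, not the script's actual behavior:

```python
import json

def parse_gpc_well_known(body: str):
    """Parse a /.well-known/gpc.json response body.

    Returns the "gpc" support flag and "lastUpdate" timestamp,
    or (None, None) when the body is not a valid JSON object.
    """
    try:
        data = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        return None, None
    if not isinstance(data, dict):
        return None, None
    return data.get("gpc"), data.get("lastUpdate")
```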
February 2024 Crawl
This is largely the same as the December 2023 crawl code.
Differences:
- well-known data is collected by the crawler
- column values in the debugging table are capped at 4,000 characters, matching the limit specified in our table schema
- one new human check regular expression
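Capping a value before insertion keeps it within a 4,000-character column like the one described above. A sketch of the idea (the function name is hypothetical; only the limit comes from the note above):

```python
DEBUG_COLUMN_LIMIT = 4000  # matches the debugging table's column size

def cap_for_debug_table(value: str) -> str:
    """Truncate a value to fit the debugging table's column limit."""
    return value[:DEBUG_COLUMN_LIMIT]
```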
December 2023 Crawl
This is the code we used to perform our crawl on 11,708 sites in December 2023.
The extension collects data from Firefox's urlClassification object to determine whether a site is subject to the CCPA. It collects the US Privacy String (USPS), the GPP string, and the OptanonConsent cookie to determine whether sites recognize GPC signals. This version uses a SQL database to store the data.
Firefox-analysis-mode-crawler
The Firefox-analysis-mode-crawler is used to crawl the top 1,000 sites of the US Privacy String Test Set.