Releases: privacy-tech-lab/gpc-web-crawler

August/September/October 2025 Crawl (partial)

15 Nov 20:03

Notes on the data:

  • Part of this crawl (CA batches 1-8 and CO batch 1) was performed with the prior release, up through August 24th. This is why both this release and the August 2025 release are partial.
  • A small number of domains in the analysis.json files had to be entered manually because they weren't saved properly. This has since been fixed and shouldn't affect any of the data (see #192 for more details).
  • We discovered that URL classification strings were being truncated at 5,000 characters. Most websites are unaffected, but the affected ones lose some URL classification data. We'll fix this in an upcoming release (see #199 for more details).
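
One way to gauge which records the truncation may have touched is to flag any stored string that reaches the 5,000-character cap. The sketch below is illustrative only: the record layout and the field names `domain` and `url_classification` are hypothetical, not taken from the crawler's schema, and a value that reaches the cap is a hint of truncation rather than proof.

```python
from typing import Iterable

TRUNCATION_LIMIT = 5000  # cap reported in the note above


def possibly_truncated(
    records: Iterable[dict],
    field: str = "url_classification",  # hypothetical field name
    limit: int = TRUNCATION_LIMIT,
) -> list[str]:
    """Return the domains whose `field` value is at least `limit` characters long."""
    flagged = []
    for record in records:
        value = record.get(field)
        if isinstance(value, str) and len(value) >= limit:
            flagged.append(record.get("domain", "<unknown>"))
    return flagged


# Made-up sample records, not real crawl data.
sample = [
    {"domain": "short.example", "url_classification": "x" * 120},
    {"domain": "long.example", "url_classification": "x" * 5000},
]
print(possibly_truncated(sample))
```

Records flagged this way would still need manual review, since a complete value can also happen to be exactly 5,000 characters long.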

To pull the exact image versions used in this release:

docker pull ghcr.io/privacy-tech-lab/crawl-driver@sha256:df3f24c27cf8e1f551ef98a993176d987a48abc9ea0cb73b477c6f1d5cb7d636
docker pull ghcr.io/privacy-tech-lab/well-known-crawl@sha256:c207d6905f6b54bdfbfbf45b2fb76422a50972948b4a77c177ae1faf5ec612f5
docker pull ghcr.io/privacy-tech-lab/rest-api@sha256:d0386ef8d04f98ad52fd87ba0611316af7c26e156af29becbf47072b87af6026
docker pull ghcr.io/privacy-tech-lab/mariadb-custom@sha256:5b4bb8ecd383c26f14f808f0d99f9aa4619d62e619a17cc79b8e2905b2a0262e

August 2025 Crawl (partial)

24 Aug 20:34
303aa7a

Differences from May 2025 Crawl:

  • The REST API Dockerfile now uses node:18 instead of node:16, fixing a bug caused by an archived version of Debian. The actual crawl for August 2025 was performed with node:16, but this change should not affect the data collected (see #182 for more details).
  • This crawl comprises CA batches 1-8 and CO batch 1. The rest of the crawl was performed with the fix and is covered by the August/September/October 2025 release.

To pull the exact image versions used in this release:

docker pull ghcr.io/privacy-tech-lab/crawl-driver@sha256:0a304d6a105da5a01e45ea6462255187d0b1bab3f5ba2571489815958b425c31
docker pull ghcr.io/privacy-tech-lab/well-known-crawl@sha256:6b3cf17d156566159826eeebde0f495d7d096d13e09d29b3922eefc6f21c4469
docker pull ghcr.io/privacy-tech-lab/rest-api@sha256:a9bcac0bbfc35b05bfa6f6c536b1d20fa4d46b7d3fa6b432f6c0dc035de3a509
docker pull ghcr.io/privacy-tech-lab/mariadb-custom@sha256:24b7cb1fb51433b2149043de27182dd66c14329c7620ac6c417c52e9b235acf8

May 2025 Crawl

05 Aug 18:40
f70fe6a

Differences from Feb/Mar 2025 Crawl:

  • Updated the well-known Python script to write the string "None" to the well-known-data.csv file rather than the value None, to prevent blank cells.
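
The difference matters because Python's standard `csv` module writes the value `None` as an empty string. The sketch below demonstrates this behavior, assuming the script writes through `csv.writer`. The helper `write_row` and the sample domain are hypothetical and are not taken from the script itself.

```python
import csv
import io


def write_row(value):
    """Write one two-column CSV row to an in-memory buffer and return the raw text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(["example.com", value])
    return buffer.getvalue()


print(repr(write_row(None)))    # 'example.com,\r\n'      -> the second cell is blank
print(repr(write_row("None")))  # 'example.com,None\r\n'  -> the second cell reads None
```

Writing the string keeps the cell populated, so a missing well-known value cannot be mistaken for a row that failed to write.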

To pull the exact image versions used in this release:

docker pull ghcr.io/privacy-tech-lab/crawl-driver@sha256:1955a5a8e9dd06a84e92e87c44eecaaa248d44ce03c1a155888b00a82b1833df
docker pull ghcr.io/privacy-tech-lab/well-known-crawl@sha256:06110d2c0c56a6e214a03c93d9e88f0ea4c25953318587f8fa9d939789cad191
docker pull ghcr.io/privacy-tech-lab/rest-api@sha256:932cd43ebb38d4ff9e806822c5ecd8e886484e77bdf0361c3749a44f7ba7daf8
docker pull ghcr.io/privacy-tech-lab/mariadb-custom@sha256:0f08aa392b890a33f8f747c123d45577be2184115461385729c7fcab86691ab0

February/March 2025 Crawl

14 Mar 21:44

Introduced Docker containerization for the web crawler.

  • Dockerization: The crawler is now fully containerized, including the addition of Dockerfiles for MariaDB and the complete isolation of containers from the local machine
  • Improved Efficiency: Reduced wait times and added a retry policy for persistent services, enhancing crawl speed and reliability
  • Enhanced Error Handling and Cleanup: More robust error handling implemented, along with automatic container shutdown and cleanup of volumes to ensure a cleaner environment after crawls

June 2024 Crawl

10 Jun 04:34
24e732e

Differences from April 2024 Crawl:

  • Added collection of the GPP version, which identifies whether a site is using GPP v1.0 or v1.1.

April 2024 Crawl

18 Apr 18:29
b302a2a

Differences from February 2024 crawl:

  • Well-known data is no longer collected by the crawler. We use a Python script instead, which is also included in this repo.
  • Longer database values are now stored as TEXT instead of VARCHAR.
  • Addition of the OneTrustWPCCPAGoogleOptOut and OTGPPConsent cookies.

February 2024 Crawl

13 Feb 02:01
d2545de

This is largely the same as the December 2023 crawl code.

Differences:

  • well-known data is collected by the crawler
  • column values in the debugging table are capped at 4,000 characters, which is the limit specified in our table definition
  • one new human check regular expression

December 2023 Crawl

03 Jan 17:14
72f5d87

This is the code we used to perform our crawl on 11,708 sites in December 2023.

The extension collects data from Firefox's urlClassification object to determine whether a site is subject to the CCPA. To determine whether sites recognize GPC signals, it collects the US Privacy String (USPS), the GPP string, and the OptanonConsent cookie. This version uses a SQL database to store the data.
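
To illustrate how a collected US Privacy String can indicate whether a site honored GPC, the sketch below parses the four-character format defined by the IAB specification (version, notice given, opt-out of sale, LSPA coverage) and compares the opt-out flag before and after a GPC signal is sent. The names `parse_usps` and `opted_out_after_gpc` are hypothetical, and the before/after comparison is an assumption about how such data might be analyzed, not the crawler's confirmed logic.

```python
import re
from dataclasses import dataclass

# One digit for the version, then three flags that are each Y, N, or "-" (not applicable).
_USPS_RE = re.compile(r"(\d)([YN-])([YN-])([YN-])")


@dataclass
class UsPrivacy:
    version: int
    notice_given: str
    opt_out_sale: str
    lspa_covered: str


def parse_usps(value: str) -> UsPrivacy:
    """Split a US Privacy String such as '1YNN' into its four components."""
    match = _USPS_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid US Privacy String: {value!r}")
    version, notice, opt_out, lspa = match.groups()
    return UsPrivacy(int(version), notice, opt_out, lspa)


def opted_out_after_gpc(before: str, after: str) -> bool:
    """True if the opt-out-of-sale flag is 'Y' only after the GPC signal was sent."""
    return (
        parse_usps(before).opt_out_sale != "Y"
        and parse_usps(after).opt_out_sale == "Y"
    )


print(opted_out_after_gpc("1YNN", "1YYN"))  # True: the flag flipped from N to Y
```

A string of `1---` indicates that the site considers the CCPA not applicable, so a real analysis would likely treat that case separately.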

Firefox-analysis-mode-crawler

19 Aug 18:48
7badcbe

The Firefox-analysis-mode-crawler is used to crawl the top 1,000 sites of the US Privacy String Test Set.