Commit 092d17a

overhauled documentation in README and made saving slightly friendlier
1 parent 39cfd29

File tree: 7 files changed (+27, −20 lines)


Makefile

Lines changed: 3 additions & 0 deletions
```diff
@@ -15,6 +15,9 @@ clean:
 custom:
 	CUSTOM_CRAWL=true sh run-crawlers.sh

+check-if-up:
+	docker compose ls | grep -q "gpc-web-crawler.*running" && echo "true" || echo "false"
+
 help:
 	@echo "Available commands:"
 	@echo "  make start - Starts crawler on all 8 batches of sites with debug mode turned off"
```
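The new `check-if-up` target just greps the `docker compose ls` listing. Here is a minimal sketch of how that pipeline classifies output; the sample listing below is an assumption for illustration, not real `docker compose` output:

```shell
# Simulated `docker compose ls` output; the exact columns are an assumption.
sample='NAME              STATUS       CONFIG FILES
gpc-web-crawler   running(11)  /app/compose.yaml'

# Same pattern as the Makefile target: prints true if a running
# gpc-web-crawler project appears anywhere in the listing.
echo "$sample" | grep -q "gpc-web-crawler.*running" && echo "true" || echo "false"
```

Because `grep -q` only sets an exit status, the `&&`/`||` chain converts it into the `true`/`false` string the Makefile target emits.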

README.md

Lines changed: 0 additions & 17 deletions
````diff
@@ -242,7 +242,6 @@ Here are the steps for doing so:
 python3 well-known-adhoc.py
 ```

-Running this script requires three input files: `selenium-optmeowt-crawler/full-crawl-set.csv`, which is in the repo, `redo-original-sites.csv`, and `redo-sites.csv`. The latter two files are not found in the repo and should be created for that crawl based on the [instructions in our Wiki](https://github.com/privacy-tech-lab/gpc-web-crawler/wiki/Instructions-for-Lab-Members-Performing-Crawls#saving-crawl-data-when-crawling-our-8-batch-dataset). As explained in `selenium-optmeowt-crawler/well-known-collection.py`, the output is a csv called `well-known-data.csv` with three columns (Site URL, request status, json data) as well as an error json file called `well-known-errors.json` that logs all errors. To run this script on a csv file of sites without accounting for redo sites, comment out all lines between line 27 and line 40 except for line 34.

 #### Details of the .well-known Analysis

````

```diff
@@ -261,22 +260,6 @@ Analyze the full crawl set with the redo sites replaced, i.e., using the full se
 - Status Codes (HTTP Responses)
   - In general, we expect a 404 status code (Not Found) when a site does not have a .well-known/gpc.json (output: Site_URL, 404, None)
   - Other possible status codes signaling that the .well-known data is not found include but are not limited to: 403 (Forbidden: the server understands the request but refuses to authorize it), 500 (Internal Server Error: the server encountered an unexpected condition that prevented it from fulfilling the request), 406 (Not Acceptable: the server cannot produce a response matching the list of acceptable values defined), 429 (Too Many Requests)
-- `well-known-collection.py` Code Rundown
-
-  1. First, the file reads in the full site set, i.e., original sites and redo sites
-     - `sites_df.index(redo_original_sites[idx])`: get the index of the site we want to change
-     - `sites_list[x] = redo_new_sites[idx]`: replace the site with the new site
-  2. `r = requests.get(sites_df[site_idx] + '/.well-known/gpc.json', timeout=35)`: the request runs with a timeout of 35 seconds (to stay consistent with crawler timeouts)
-     (i) if there is json data, it logs all three columns (Site URL, request status, json data)
-     (ii) if there is no json data, it logs only the status and site
-     (iii) if r.json() does not contain valid json data, the "Expecting value: line 1 column 1 (char 0)" error appears in the error logging, and the error logs site and status
-     (iv) if the requests.get does not finish within 35 seconds, it stores the error and logs only the site
-
-- Important Code Documentation
-
-  - `file1.write(sites_df[site_idx] + "," + str(r.status_code) + ',"' + str(r.json()) + '"\n')`: writes data to a file with three columns (site, status, and json data)
-  - `errors[sites_df[site_idx]] = str(e)`: stores errors keyed by the original links
-  - `with open("well-known-errors.json", "w") as outfile: json.dump(errors, outfile)`: converts the errors object to JSON and writes it to a file

 ## 9. Thank You!

```
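The rundown removed from the README describes how each `.well-known/gpc.json` response is classified before logging. A hedged, stdlib-only sketch of that classification step follows; the function name `classify_response` and the row shape are illustrative assumptions, not the repo's exact code:

```python
import json

# Hedged sketch of the response classification described in the removed
# rundown; function name and row shape are illustrative assumptions.
def classify_response(site, status_code, body):
    """Return a (site, status, data) row plus an optional parse error."""
    try:
        # Valid gpc.json: log all three columns (Site URL, status, json data).
        return (site, status_code, json.loads(body)), None
    except json.JSONDecodeError as e:
        # e.g. a 200 that served a generic HTML error page instead of JSON:
        # log only site and status, and record the parse error.
        return (site, status_code, None), str(e)

row, err = classify_response("https://example.com", 200, '{"gpc": true, "version": 1}')
print(row, err)
```

The `JSONDecodeError` branch is what produces the "Expecting value: line 1 column 1 (char 0)" entries mentioned in the rundown, since that is the error `json` raises when a response body starts with HTML rather than JSON.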

compose.yaml

Lines changed: 1 addition & 0 deletions
```diff
@@ -111,6 +111,7 @@ services:
     privileged: true
     ports:
       - "4444:4444"
+    ### Delete the following lines to start the crawler with the VNC environment started
     environment:
       - SE_START_VNC=false
     volumes:
```

merge_well_known_data.sh

Lines changed: 21 additions & 0 deletions
```sh
#!/bin/bash

# Output file for the merged results
output="crawl_results/well-known/well-known-data.csv"

# Make the well-known directory (no-op if it already exists)
mkdir -p crawl_results/well-known

# Ensure we're starting clean
rm -f "$output"

# Copy pt1 to the output file
cat crawl_results/pt1/well-known-data.csv >> "$output"

# Append pt2 to pt8
for i in {2..8}
do
	tail -n +0 crawl_results/pt${i}/Extra/well-known-data.csv >> "$output"
done

echo "Files have been merged into $output"
```
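One subtlety in the loop above: `tail -n +0` starts at "line 0", so it emits the whole file, header line included, effectively behaving like `cat`. A small demo of that behavior; the CSV columns here are an illustrative assumption:

```shell
# Create a throwaway part file with a header row (illustrative columns).
tmp=$(mktemp)
printf 'Site URL,status,json\nexample.com,404,None\n' > "$tmp"

# tail -n +0 prints the entire file, header included.
tail -n +0 "$tmp"
rm -f "$tmp"
```

If the per-part files each carry a header row, this means the merged CSV will contain repeated header lines; `tail -n +2` would skip them, should that matter downstream.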

selenium-optmeowt-crawler/run_crawl.sh

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
```diff
@@ -12,7 +12,7 @@ if [ "$TEST_CRAWL" = "true" ]; then
 		curl -o "$SAVE_PATH"/debug.json "http://rest_api:8080/debug"
 	fi
 else
-	SAVE_PATH=./crawl_results/CRAWLSET"$CRAWL_ID"-"$TIMESTAMP"
+	SAVE_PATH=./crawl_results/pt"$CRAWL_ID"
 	mkdir -p "$SAVE_PATH"/error-logging
 	touch "$SAVE_PATH"/error-logging/error-logging.json

```
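The effect of this change is that crawl output now lands in a stable per-batch directory instead of a new timestamped one on every run. A quick sketch, where the `CRAWL_ID` value is an illustrative assumption:

```shell
# Old scheme: ./crawl_results/CRAWLSET3-<timestamp> (new dir every run).
# New scheme: deterministic, so reruns of batch 3 reuse the same path.
CRAWL_ID=3
SAVE_PATH=./crawl_results/pt"$CRAWL_ID"
echo "$SAVE_PATH"   # ./crawl_results/pt3
```

This is what lets `merge_well_known_data.sh` above find each batch's results at a predictable `crawl_results/pt<N>` location.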

well-known-crawl/well-known-adhoc.py

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
```diff
@@ -15,7 +15,6 @@
 # was 200 (i.e. site exists and loaded) but it didn't find a json
 # this happens when sites send all incorrect links to a generic error page
 # instead of not serving the page. Also, it seems like human check error sites are the ones that time out
-
 import requests
 import pandas as pd
 import json
```

well-known-crawl/well-known-crawl.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
```diff
@@ -49,7 +49,7 @@
 if TEST_CRAWL == "true":
     save_path = f"./crawl_results/CUSTOMCRAWL-{TIMESTAMP}"
 else:
-    save_path = f"./crawl_results/CRAWLSET{CRAWL_ID}-{TIMESTAMP}"
+    save_path = f"./crawl_results/pt{CRAWL_ID}"
 data_save_path = save_path + "/well-known-data.csv"
 error_save_path = save_path + "/well-known-errors.json"
 with open(data_save_path, "a") as f:
```
