Commit 092d17a

overhauled documentation in README and made saving slightly friendlier
1 parent 39cfd29

File tree: 7 files changed (+27, −20 lines)


Makefile

Lines changed: 3 additions & 0 deletions
```diff
@@ -15,6 +15,9 @@ clean:
 custom:
 	CUSTOM_CRAWL=true sh run-crawlers.sh

+check-if-up:
+	docker compose ls | grep -q "gpc-web-crawler.*running" && echo "true" || echo "false"
+
 help:
 	@echo "Available commands:"
 	@echo "  make start - Starts crawler on all 8 batches of sites with debug mode turned off"
```
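The new `check-if-up` target just greps the `docker compose ls` listing. Here is a minimal sketch of how that pipeline classifies output; the sample listing below is an assumption for illustration, not real `docker compose` output:

```shell
# Simulated `docker compose ls` output; the exact columns are an assumption.
sample='NAME              STATUS       CONFIG FILES
gpc-web-crawler   running(11)  /app/compose.yaml'

# Same pattern as the Makefile target: prints true if a running
# gpc-web-crawler project appears anywhere in the listing.
echo "$sample" | grep -q "gpc-web-crawler.*running" && echo "true" || echo "false"
```

Because `grep -q` only sets an exit status, the `&&`/`||` chain converts it into the `true`/`false` string the Makefile target emits.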

README.md

Lines changed: 0 additions & 17 deletions
````diff
@@ -242,7 +242,6 @@ Here are the steps for doing so:
 python3 well-known-adhoc.py
 ```

-Running this script requires three input files: `selenium-optmeowt-crawler/full-crawl-set.csv`, which is in the repo, `redo-original-sites.csv`, and `redo-sites.csv`. The latter two files are not found in the repo and should be created for that crawl based on the [instructions in our Wiki](https://github.com/privacy-tech-lab/gpc-web-crawler/wiki/Instructions-for-Lab-Members-Performing-Crawls#saving-crawl-data-when-crawling-our-8-batch-dataset). As explained in `selenium-optmeowt-crawler/well-known-collection.py`, the output is a csv called `well-known-data.csv` with three columns (Site URL, request status, json data) as well as an error json file called `well-known-errors.json` that logs all errors. To run this script on a csv file of sites without accounting for redo sites, comment out all lines between line 27 and line 40 except for line 34.

 #### Details of the .well-known Analysis

````

```diff
@@ -261,22 +260,6 @@ Analyze the full crawl set with the redo sites replaced, i.e., using the full se
 - Status Codes (HTTP Responses)
   - In general, we expect a 404 status code (Not Found) when a site does not have a .well-known/gpc.json (output: Site_URL, 404, None)
   - Other possible status codes signaling that the .well-known data is not found include but are not limited to: 403 (Forbidden: the server understands the request but refuses to authorize it), 500 (Internal Server Error: the server encountered an unexpected condition that prevented it from fulfilling the request), 406 (Not Acceptable: the server cannot produce a response matching the list of acceptable values defined), 429 (Too Many Requests)
-- `well-known-collection.py` Code Rundown
-
-  1. First, the file reads in the full site set, i.e., original sites and redo sites
-     - `sites_df.index(redo_original_sites[idx])`: get the index of the site we want to change
-     - `sites_list[x] = redo_new_sites[idx]`: replace the site with the new site
-  2. `r = requests.get(sites_df[site_idx] + '/.well-known/gpc.json', timeout=35)`: the request runs with a timeout of 35 seconds (to stay consistent with crawler timeouts)
-     (i) if there is json data, it logs all three columns (Site URL, request status, json data)
-     (ii) if there is no json data, it logs only the status and site
-     (iii) if r.json() does not contain valid json data, the "Expecting value: line 1 column 1 (char 0)" error appears in the error logging, and the error logs site and status
-     (iv) if the requests.get does not finish within 35 seconds, it stores the error and logs only the site
-
-- Important Code Documentation
-
-  - `file1.write(sites_df[site_idx] + "," + str(r.status_code) + ',"' + str(r.json()) + '"\n')`: writes data to a file with three columns (site, status, and json data)
-  - `errors[sites_df[site_idx]] = str(e)`: stores errors keyed by the original links
-  - `with open("well-known-errors.json", "w") as outfile: json.dump(errors, outfile)`: converts the errors object to JSON and writes it to a file

 ## 9. Thank You!

```
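The rundown removed from the README describes how each `.well-known/gpc.json` response is classified before logging. A hedged, stdlib-only sketch of that classification step follows; the function name `classify_response` and the row shape are illustrative assumptions, not the repo's exact code:

```python
import json

# Hedged sketch of the response classification described in the removed
# rundown; function name and row shape are illustrative assumptions.
def classify_response(site, status_code, body):
    """Return a (site, status, data) row plus an optional parse error."""
    try:
        # Valid gpc.json: log all three columns (Site URL, status, json data).
        return (site, status_code, json.loads(body)), None
    except json.JSONDecodeError as e:
        # e.g. a 200 that served a generic HTML error page instead of JSON:
        # log only site and status, and record the parse error.
        return (site, status_code, None), str(e)

row, err = classify_response("https://example.com", 200, '{"gpc": true, "version": 1}')
print(row, err)
```

The `JSONDecodeError` branch is what produces the "Expecting value: line 1 column 1 (char 0)" entries mentioned in the rundown, since that is the error `json` raises when a response body starts with HTML rather than JSON.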

compose.yaml

Lines changed: 1 addition & 0 deletions
```diff
@@ -111,6 +111,7 @@ services:
     privileged: true
     ports:
       - "4444:4444"
+    ### Delete the following lines to start the crawler with the VNC environment started
     environment:
       - SE_START_VNC=false
     volumes:
```

merge_well_known_data.sh

Lines changed: 21 additions & 0 deletions
```sh
#!/bin/bash

# Output file for the merged results
output="crawl_results/well-known/well-known-data.csv"

# Make the well-known directory (no-op if it already exists)
mkdir -p crawl_results/well-known

# Ensure we're starting clean
rm -f "$output"

# Copy pt1 to the output file
cat crawl_results/pt1/well-known-data.csv >> "$output"

# Append pt2 to pt8
for i in {2..8}
do
	tail -n +0 crawl_results/pt${i}/Extra/well-known-data.csv >> "$output"
done

echo "Files have been merged into $output"
```
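One subtlety in the loop above: `tail -n +0` starts at "line 0", so it emits the whole file, header line included, effectively behaving like `cat`. A small demo of that behavior; the CSV columns here are an illustrative assumption:

```shell
# Create a throwaway part file with a header row (illustrative columns).
tmp=$(mktemp)
printf 'Site URL,status,json\nexample.com,404,None\n' > "$tmp"

# tail -n +0 prints the entire file, header included.
tail -n +0 "$tmp"
rm -f "$tmp"
```

If the per-part files each carry a header row, this means the merged CSV will contain repeated header lines; `tail -n +2` would skip them, should that matter downstream.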

selenium-optmeowt-crawler/run_crawl.sh

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
```diff
@@ -12,7 +12,7 @@ if [ "$TEST_CRAWL" = "true" ]; then
 		curl -o "$SAVE_PATH"/debug.json "http://rest_api:8080/debug"
 	fi
 else
-	SAVE_PATH=./crawl_results/CRAWLSET"$CRAWL_ID"-"$TIMESTAMP"
+	SAVE_PATH=./crawl_results/pt"$CRAWL_ID"
 	mkdir -p "$SAVE_PATH"/error-logging
 	touch "$SAVE_PATH"/error-logging/error-logging.json

```
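The effect of this change is that crawl output now lands in a stable per-batch directory instead of a new timestamped one on every run. A quick sketch, where the `CRAWL_ID` value is an illustrative assumption:

```shell
# Old scheme: ./crawl_results/CRAWLSET3-<timestamp> (new dir every run).
# New scheme: deterministic, so reruns of batch 3 reuse the same path.
CRAWL_ID=3
SAVE_PATH=./crawl_results/pt"$CRAWL_ID"
echo "$SAVE_PATH"   # ./crawl_results/pt3
```

This is what lets `merge_well_known_data.sh` above find each batch's results at a predictable `crawl_results/pt<N>` location.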

well-known-crawl/well-known-adhoc.py

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
```diff
@@ -15,7 +15,6 @@
 # was 200 (i.e. site exists and loaded) but it didn't find a json
 # this happens when sites send all incorrect links to a generic error page
 # instead of not serving the page. Also, it seems like human check error sites are the ones that time out
-
 import requests
 import pandas as pd
 import json
```

well-known-crawl/well-known-crawl.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
```diff
@@ -49,7 +49,7 @@
 if TEST_CRAWL == "true":
     save_path = f"./crawl_results/CUSTOMCRAWL-{TIMESTAMP}"
 else:
-    save_path = f"./crawl_results/CRAWLSET{CRAWL_ID}-{TIMESTAMP}"
+    save_path = f"./crawl_results/pt{CRAWL_ID}"
 data_save_path = save_path + "/well-known-data.csv"
 error_save_path = save_path + "/well-known-errors.json"
 with open(data_save_path, "a") as f:
```
