<div align="center">
<h1>The Common Crawl</h1>
<a href="index.html">Go Back to Home Page</a><br><br>
<p>The Common Crawl is a non-profit dedicated to providing a copy of the Internet to anyone who wants to study its contents
in more depth. According to their website, "We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone."
<a href="https://commoncrawl.org/">Common Crawl Website</a></p><br>
<p>Here is a link to an article I published about it:</p>
<a href="https://medium.com/@bazill_theG/measuring-internet-links-accessing-the-common-crawl-dataset-using-emr-and-pyspark-in-aws-fcf5eb26afd9">Measuring Internet Links: Accessing the Common Crawl Dataset Using EMR and Pyspark in AWS on Medium</a><br>
<p>Here is a link to a video explaining the results of the project:
<a href="https://vimeo.com/427917497">Common Crawl Data Access Using PySpark</a></p><br>
</div>