<div align="center">
<h1>The Common Crawl</h1>
<a href="index.html">Go Back to Home Page</a><br><br>

<p>The Common Crawl is a non-profit dedicated to providing a copy of the Internet to anyone who wants to study its contents
  in more depth. According to its website, "We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone."
  <a href="https://commoncrawl.org/">Common Crawl Website</a></p><br>

<p>Here is a link to an article I published about it:</p>
<a href="https://medium.com/@bazill_theG/measuring-internet-links-accessing-the-common-crawl-dataset-using-emr-and-pyspark-in-aws-fcf5eb26afd9">Measuring Internet Links: Accessing the Common Crawl Dataset Using EMR and Pyspark in AWS on Medium</a><br>
<p>Here is a link to a video explaining the results of the project:</p>
<a href="https://vimeo.com/427917497">Common Crawl Data Access Using PySpark</a><br>
</div>