55 changes: 37 additions & 18 deletions README.md
@@ -1,22 +1,26 @@
# cc-web-graph-neo4j

This repo contains documentation and code related to the Common Crawl Foundation's [Web Graphs](https://commoncrawl.org/web-graphs),
stored in a [Neo4j graph database](https://neo4j.com/).
We have been computing these web graphs since 2018, and currently every crawl has
a web graph covering the previous 3 crawls.

These graphs are computed by the [WebGraph Framework](https://webgraph.di.unimi.it/). Historically, CCF distributed these graphs only in a
format that is not commonly used.
This repo contains both instructions for using the graphs in Neo4j form and code to convert from WebGraph
Framework format to Neo4j.

## Status

This project is in beta testing. Please give it a try with the one Web Graph we've converted; we provide both the domain and host versions.

The host Web Graph contains each hostname as a separate node, with the links between hosts as edges.
The domain Web Graph is built by aggregating the host graph at the pay-level domain (PLD) level, based on the public suffix
list maintained at publicsuffix.org.

> [!TIP]
> We are collecting feedback on the instructions and the code, and will be making improvements based on your needs and suggestions.
> Eventually we will provide all of our web graphs in Neo4j format.

## Motivation

@@ -31,8 +35,7 @@ These papers give good examples of what web graphs are useful for:

## Hardware Requirements

We recommend 2–4 CPU cores or more, 16–32 GB of memory, and ample storage: 512 GB to 1 TB.
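
As a rough pre-flight check, the sketch below compares the local machine against the low end of these recommendations. It is only an illustration: it assumes a Linux host (`nproc`, `/proc/meminfo`), the `check` helper and the `MIN_*` variable names are hypothetical, and it measures free space on the filesystem that holds the current directory.

```shell
#!/bin/sh
# Hypothetical pre-flight check. Thresholds are the low end of the
# recommendation above; adjust them to taste.
MIN_CORES=2
MIN_MEM_GB=16
MIN_DISK_GB=512

# Print OK or LOW for a measured value against a minimum.
# $1 = label, $2 = measured value, $3 = minimum
check() {
    if [ "$2" -ge "$3" ]; then
        echo "OK   $1: $2 (minimum $3)"
    else
        echo "LOW  $1: $2 (minimum $3)"
    fi
}

cores=$(nproc)
# MemTotal is reported in kB on Linux; round to the nearest GB.
mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024 + 0.5)}' /proc/meminfo)
# Available space, in GB, on the filesystem holding the current directory.
disk_gb=$(df -Pk . | awk 'NR == 2 {print int($4 / 1024 / 1024)}')

check "CPU cores" "$cores" "$MIN_CORES"
check "memory (GB)" "$mem_gb" "$MIN_MEM_GB"
check "free disk (GB)" "$disk_gb" "$MIN_DISK_GB"
```

Run it from the directory where you plan to keep `data/`, so that the disk figure refers to the right filesystem.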

## Docker container

@@ -57,7 +60,7 @@ sudo useradd -u 7474 -g 7474 neo4j;
sudo chown -R neo4j:neo4j data logs
```

You could also add your own user to the `neo4j` group for simplified access.

At this point you have a container (with Neo4j not running yet) that you can stop and start and run commands in.
For example,
@@ -68,20 +71,22 @@ docker exec web-graph-neo4j ls /data
docker stop web-graph-neo4j
```

Also, note that there are 3 special directories on the local disk: one for the Neo4j database, one for incoming files,
and one for files created by running commands in the container. These are:

- data/neo4j_db
- data/import
- data/export

## Download and use an existing Neo4j web graph

Our pre-made Neo4j-format web graphs are stored as Neo4j dump files.
To use them, download the dumps and then load them.

### Download

#### Domain Web Graph

```
wget https://data.commoncrawl.org/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-domain-system.dump
wget https://data.commoncrawl.org/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-domain-neo4j.dump
@@ -94,11 +99,25 @@ s3://commoncrawl/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-domain-s
s3://commoncrawl/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-domain-neo4j.dump
```

#### Host Web Graph

```
wget https://data.commoncrawl.org/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-host-system.dump
wget https://data.commoncrawl.org/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-host-neo4j.dump
```

or from inside AWS:

```
s3://commoncrawl/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-host-system.dump
s3://commoncrawl/projects/web-graph-testing/v1/cc-main-2025-oct-nov-dec-host-neo4j.dump
```
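
The four HTTPS file names above follow one pattern: graph level (`domain` or `host`), then database (`system` or `neo4j`). A minimal sketch that generates the URLs from that pattern is shown below; the `dump_urls` function and the variable names are hypothetical helpers, not part of this repo.

```shell
# Hypothetical helper: print the HTTPS URLs of both dump files for one graph level.
BASE_HTTPS="https://data.commoncrawl.org/projects/web-graph-testing/v1"
PREFIX="cc-main-2025-oct-nov-dec"

# $1 = graph level: "domain" or "host"
dump_urls() {
    for db in system neo4j; do
        echo "$BASE_HTTPS/$PREFIX-$1-$db.dump"
    done
}

# Example (not run here): download both host dumps, resuming partial files.
#   dump_urls host | xargs -n 1 wget -c
# Inside AWS, copying the matching s3:// paths with `aws s3 cp <s3-uri> .`
# is the usual equivalent, assuming the AWS CLI is installed and configured.
```

`wget -c` resumes an interrupted download instead of starting over, which helps with files of this size.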

### Load

This step turns the dump files into a Neo4j database. Note that the database will be about 2.5 times the size of the dump.
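
To get a feel for how much disk the loaded database may need, the sketch below sums the sizes of the dump files and applies the rough 2.5 factor mentioned above. The `estimate_db_bytes` function is a hypothetical helper, and the result is only as accurate as that approximate ratio.

```shell
# Hypothetical helper: estimate the loaded database size, in bytes,
# from one or more dump files, using the approximate 2.5x factor.
estimate_db_bytes() {
    total=0
    for f in "$@"; do
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    # Integer arithmetic: 2.5 * x == x * 5 / 2
    echo $((total * 5 / 2))
}

# Example (not run here):
#   estimate_db_bytes cc-main-2025-oct-nov-dec-domain-system.dump \
#                     cc-main-2025-oct-nov-dec-domain-neo4j.dump
```

Remember that the dump files themselves also occupy disk space until you delete them.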

Move the dumps into the import directory:

```shell
mv cc-main-2025-oct-nov-dec-domain-system.dump data/import/system.dump
@@ -109,7 +128,7 @@ mv cc-main-2025-oct-nov-dec-domain-neo4j.dump data/import/neo4j.dump
> Load and dump operations should always be performed with Neo4j in offline mode, or stopped.
> You can check using `docker exec web-graph-neo4j neo4j status`.

Load the `system` and `neo4j` databases:

```shell
docker start web-graph-neo4j
@@ -123,7 +142,7 @@ in import/

### Use

The container is configured to sleep indefinitely. After starting it, you can use `docker exec` to start Neo4j:

```shell
docker start web-graph-neo4j
@@ -132,7 +151,7 @@ docker exec web-graph-neo4j neo4j start

Afterwards, you can access it with a browser at https://localhost:7474/

If you want to run scripts against Neo4j, write the output into /export.

The web dashboard looks like this:
<p align="center">