Commit dbebdcf

Merge pull request #49 from sourcifyeth/doc-parquet-v2
Add Parquet export v2 documentation
2 parents f81ab3b + 5567bc6 commit dbebdcf

6 files changed

+172
-97
lines changed

Lines changed: 6 additions & 78 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Sourcify Database
22

3-
Sourcify Database is the main storage backend for Sourcify. It is a PostgreSQL database that follows the [Verified Alliance Schema](https://github.com/verifier-alliance/database-specs) as its base with few modifications.
3+
Sourcify Database is the main storage backend for Sourcify. It is a PostgreSQL database that follows the [Verifier Alliance Schema](https://github.com/verifier-alliance/database-specs) as its base with few modifications.
44

55
On a high level, these modifications are:
66

@@ -65,84 +65,12 @@ Other known inconsistencies in the data below (not planned to fix) are documente
6565

6666
## Download
6767

68-
:::warning Deprecation Notice
69-
The current parquet download format will be deprecated. A new `/v2` endpoint will be introduced with an updated format. Documentation for the new format will be added once it is live. Feel free to use the export in its current form, but be aware that it will be replaced.
70-
:::
68+
See [Download the Dataset](/docs/repository/download-dataset/) for instructions on downloading the database in Parquet format.
7169

72-
We dump the whole database daily in [Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) format and upload it to a Cloudflare R2 storage. You can access the manifest file at https://export.sourcify.dev ( `.dev` redirects to `.app` domain, which also belongs to Sourcify). The script that does the dump is at [sourcifyeth/parquet-export](https://github.com/sourcifyeth/parquet-export).
73-
74-
[export.sourcify.dev](https://export.sourcify.dev) will redirect to a `manifest.json` file:
75-
76-
<details>
77-
<summary>manifest.json</summary>
78-
79-
```json
80-
{
81-
"timestamp": 1726030203254,
82-
"dateStr": "2024-09-11T04:50:03.254904Z",
83-
"files": {
84-
"code": [
85-
"code/code_0_100000.parquet",
86-
"code/code_100000_200000.parquet",
87-
...
88-
"code/code_2700000_2800000.parquet"
89-
],
90-
"contracts": [
91-
"contracts/contracts_0_1000000.parquet",
92-
...
93-
"contracts/contracts_4000000_5000000.parquet"
94-
],
95-
"contract_deployments": [
96-
"contract_deployments/contract_deployments_0_1000000.parquet",
97-
...
98-
"contract_deployments/contract_deployments_5000000_6000000.parquet"
99-
],
100-
"compiled_contracts": [
101-
"compiled_contracts/compiled_contracts_0_5000.parquet",
102-
...
103-
"compiled_contracts/compiled_contracts_815000_820000.parquet"
104-
],
105-
"verified_contracts": [
106-
"verified_contracts/verified_contracts_0_1000000.parquet",
107-
...
108-
"verified_contracts/verified_contracts_5000000_6000000.parquet"
109-
],
110-
"sourcify_matches": [
111-
"sourcify_matches/sourcify_matches_0_100000.parquet",
112-
...
113-
"sourcify_matches/sourcify_matches_5300000_5400000.parquet"
114-
]
115-
}
116-
}
117-
```
118-
119-
</details>
120-
121-
You can download all the files and use a parquet client to query, inspect, or process the data.
122-
123-
1. Download the manifest file (`-L` to follow redirects):
124-
125-
```bash
126-
curl -L -O https://export.sourcify.dev/manifest.json
127-
```
128-
129-
2. Download all the tables listed in the manifest:
130-
```bash
131-
jq -r '.files | keys[] as $k | .[$k][]' manifest.json | xargs -I {} curl -L -O https://export.sourcify.dev/{}
132-
```
133-
134-
For example you can install the [`parquet-cli`](https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md) to do basic inspection:
135-
136-
```bash
137-
brew install parquet-cli
138-
139-
parquet meta compiled_contracts_0_5000.parquet
140-
```
141-
142-
alternatively use your favorite data processing tool or import this data into a database.
143-
144-
## BigQuery Datasets
70+
## BigQuery Dataset
14571

14672
We also provide a public BigQuery dataset for convenient querying and exploration:
14773

148-
[Sourcify production dataset](https://console.cloud.google.com/bigquery/analytics-hub/exchanges/projects/1019539084286/locations/europe-west1/dataExchanges/sourcify_19a0c79ef3a/listings/sourcify_19a0c7d0be2?project=tranquil-petal-125711)
74+
[Sourcify BigQuery dataset](https://console.cloud.google.com/bigquery/analytics-hub/exchanges/projects/1019539084286/locations/europe-west1/dataExchanges/sourcify_19a0c79ef3a/listings/sourcify_19a0c7d0be2?project=tranquil-petal-125711)
75+
76+
The dataset is updated continuously as new contracts are verified. You need a Google account to access it.
Lines changed: 146 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,146 @@
1+
# Download the Dataset
2+
3+
:::warning
4+
5+
The previous Parquet export format v1 is now deprecated. See the [note](/docs/repository/download-dataset/#legacy-format-v1) below. Please follow the [instructions](/docs/repository/download-dataset/#export-v2-format) for the new v2 format.
6+
7+
:::
8+
9+
The entire Sourcify Database is exported continuously as [Parquet](https://github.com/apache/parquet-format) files, a modern columnar data format. Parquet files are compressed, efficient to query, and widely supported by data tools (see this [quick tutorial](https://www.datacamp.com/tutorial/apache-parquet)).
10+
11+
The export is hosted on Google Cloud Storage and accessible via an S3-compatible API at [export.sourcify.dev](https://export.sourcify.dev/). The export is based on the structure of the [Verifier Alliance database export](https://verifieralliance.org/docs/download).
12+
13+
## Export v2 Format
14+
15+
The export format has undergone a redesign to make it more efficient and easier to use. The v2 format follows these principles:
16+
17+
- New data is uploaded **daily**.
18+
- Each database **table** is stored as a set of Parquet files.
19+
- Files are partitioned by row ranges and **ordered** by `created_at` timestamps. Exception: the `sourcify_matches` table is ordered by `updated_at` timestamps; see the [note](/docs/repository/download-dataset/#note-on-sourcify_matches) below.
20+
- **Append-only** pattern: New data is added to new files; existing files are not modified. Only the most recent file for each table may be updated while it is not full yet.
21+
- **File metadata** (checksums, sizes, timestamps) is provided directly by the Google Cloud Storage API.
22+
- Files use **zstd compression** built into the Parquet format.
23+
24+
The dataset is available at [export.sourcify.dev](https://export.sourcify.dev/). All files of the v2 format are stored under the `v2/` prefix.
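The partitioning described above shows up in the object keys. The following is a minimal sketch of how a key could be split into table name and row range. It assumes the naming pattern `v2/<table>/<table>_<start>_<end>.parquet`, which is inferred only from the example keys in this document; `ExportFile` and `parse_export_key` are hypothetical helper names, not part of any Sourcify tooling.

```python
import re
from typing import NamedTuple


class ExportFile(NamedTuple):
    """Table name and row range encoded in an export object key."""
    table: str
    start: int
    end: int


# Assumed pattern (inferred from example keys, not a documented contract):
#   v2/<table>/<table>_<start>_<end>.parquet
_KEY_RE = re.compile(
    r"^v2/(?P<dir>[a-z_]+)/(?P<table>[a-z_]+)_(?P<start>\d+)_(?P<end>\d+)\.parquet$"
)


def parse_export_key(key: str) -> ExportFile:
    """Split an object key into (table, start, end); raise ValueError otherwise."""
    match = _KEY_RE.match(key)
    if match is None or match["dir"] != match["table"]:
        raise ValueError(f"not a recognised v2 export key: {key!r}")
    return ExportFile(match["table"], int(match["start"]), int(match["end"]))


def sort_export_keys(keys: list[str]) -> list[str]:
    """Order keys by table, then by the numeric start of their row range."""
    return sorted(keys, key=lambda k: parse_export_key(k)[:2])
```

Sorting by the numeric start avoids the lexicographic trap where `code_1000000_...` would sort before `code_200000_...`.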
25+
26+
### Downloading and Syncing the Dataset
27+
28+
To download the entire dataset, you can run this command:
29+
30+
```bash
31+
curl -s 'https://export.sourcify.dev/?prefix=v2/' | \
32+
grep -oP '(?<=<Key>)[^<]+' | \
33+
xargs -I {} curl -L -O https://export.sourcify.dev/{}
34+
```
35+
36+
Alternatively, the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#getting-started-install-instructions) makes it easy to download and keep the dataset in sync. The following command downloads the entire dataset on the first run, and on subsequent runs only downloads new or modified files:
37+
38+
```bash
39+
aws s3 sync s3://sourcify-parquet-export/v2/ ./sourcify-dataset --endpoint-url https://storage.googleapis.com --no-sign-request
40+
```
41+
42+
### Note on `sourcify_matches`
43+
44+
The `sourcify_matches` table is the only table that is not append-only and can be updated in the underlying Sourcify Database.
45+
Therefore, its rows are ordered by `updated_at` timestamps when exported.
46+
This means that rows with the same `id` may appear multiple times in the export files.
47+
48+
When working with the `sourcify_matches` table from the export, please only consider the row with the most recent `updated_at` for each `id` as the current state. For importing the `sourcify_matches` parquet files into a database, an **upsert** operation should be used.
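The rule above (keep only the newest row per `id`) can be sketched in plain Python as follows. This is an illustration under assumptions, not the project's import logic: rows are modelled as dictionaries that carry at least `id` and `updated_at`, and `latest_matches` is a hypothetical helper name.

```python
from __future__ import annotations

from typing import Any, Iterable


def latest_matches(rows: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only the row with the most recent `updated_at` for each `id`.

    Rows may arrive in any order and from any number of export files.
    """
    latest: dict[Any, dict[str, Any]] = {}
    for row in rows:
        current = latest.get(row["id"])
        # Replace the stored row only when this one is strictly newer.
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["id"]] = row
    return list(latest.values())
```

The same effect is what an upsert keyed on `id` achieves when the files are imported into a database in export order.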
49+
50+
51+
### Working with Parquet Files
52+
53+
Once downloaded, you can query and analyze Parquet files using various tools and libraries. Here are some popular options to give you a head start:
54+
55+
- [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html): Read data from Parquet files in Python
56+
- [DuckDB](https://duckdb.org/docs/data/parquet): SQL queries on Parquet files
57+
- [pg_parquet](https://github.com/CrunchyData/pg_parquet): PostgreSQL extension for copying Parquet data into a Postgres database
58+
59+
### API
60+
61+
For more fine-grained control, you can browse and download files directly using the S3-compatible Google Cloud Storage API:
62+
63+
**List all v2 files:**
64+
65+
```
66+
https://export.sourcify.dev/?prefix=v2/
67+
```
68+
69+
**List files for a specific table:**
70+
71+
```
72+
https://export.sourcify.dev/?prefix=v2/verified_contracts/
73+
```
74+
75+
**Download a specific file:**
76+
77+
```
78+
https://export.sourcify.dev/v2/verified_contracts/verified_contracts_0_1000000.parquet
79+
```
80+
81+
The API returns XML responses following the [Google Cloud Storage XML API specification](https://cloud.google.com/storage/docs/xml-api/get-bucket-list).
82+
83+
#### Available Tables
84+
85+
The Parquet export is available for all Sourcify Database tables: `sourcify_matches`, `verified_contracts`, `sources`, `compiled_contracts_sources`, `compiled_contracts`, `contract_deployments`, `contracts`, `code`, `compiled_contracts_signatures`, and `signatures`.
86+
87+
#### API Parameters
88+
89+
The most important parameters of the listing API are the following:
90+
91+
- **prefix**: Filter results to objects whose names begin with this prefix (e.g., `?prefix=v2/verified_contracts/`)
92+
- **marker**: Start listing after this object name (for pagination)
93+
- **max-keys**: Maximum number of objects to return in one response
94+
95+
The response from the listing API might be truncated, which is indicated by the `IsTruncated` field of the result. The `marker` parameter can be used to paginate through results by setting it to the `NextMarker` of the previous response.
96+
97+
Example with pagination:
98+
99+
```
100+
https://export.sourcify.dev/?prefix=v2/verified_contracts/&max-keys=2&marker=v2/verified_contracts/verified_contracts_1000000_2000000.parquet
101+
```
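The pagination flow can be sketched with the Python standard library. The helper names `listing_url` and `parse_listing_page` are hypothetical. The sketch assumes the response carries `IsTruncated` and `NextMarker` as described above, and it falls back to the last returned key as the marker when `NextMarker` is absent, which is the usual convention for S3-style listings rather than something this document confirms.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://export.sourcify.dev/"


def _local(tag: str) -> str:
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def listing_url(prefix: str, marker: str | None = None, max_keys: int | None = None) -> str:
    """Build a listing URL from the prefix, marker and max-keys parameters."""
    params = {"prefix": prefix}
    if marker:
        params["marker"] = marker
    if max_keys:
        params["max-keys"] = str(max_keys)
    return BASE_URL + "?" + urlencode(params, safe="/")


def parse_listing_page(xml_text: str) -> tuple[list[str], str | None]:
    """Return (keys, next_marker); next_marker is None on the last page."""
    root = ET.fromstring(xml_text)
    keys: list[str] = []
    truncated = False
    next_marker = None
    for child in root:
        name = _local(child.tag)
        if name == "Contents":
            keys.extend(f.text for f in child if _local(f.tag) == "Key" and f.text)
        elif name == "IsTruncated":
            truncated = (child.text or "").strip().lower() == "true"
        elif name == "NextMarker":
            next_marker = child.text
    if not truncated:
        return keys, None
    # Assumed fallback: continue after the last key if NextMarker is missing.
    return keys, next_marker or (keys[-1] if keys else None)
```

A caller would fetch `listing_url(prefix, marker)` repeatedly, feeding each response body to `parse_listing_page`, until the returned marker is `None`.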
102+
103+
#### Metadata
104+
105+
The listing API provides detailed metadata for each of the Parquet files:
106+
107+
```xml
108+
<ListBucketResult xmlns="http://doc.s3.amazonaws.com/2006-03-01">
109+
<Name>sourcify-parquet-export</Name>
110+
<Prefix>v2/</Prefix>
111+
<Marker/>
112+
<IsTruncated>false</IsTruncated>
113+
<Contents>
114+
<Key>v2/code/code_0_100000.parquet</Key>
115+
<Generation>1766065018286394</Generation>
116+
<MetaGeneration>1</MetaGeneration>
117+
<LastModified>2025-12-18T13:36:58.292Z</LastModified>
118+
<ETag>"ba687acd0afab85ed203a593479f0ce3"</ETag>
119+
<Size>101591414</Size>
120+
</Contents>
121+
<!-- More entries... -->
122+
</ListBucketResult>
123+
```
124+
125+
Most important fields:
126+
127+
- **Key**: The file path (download at `https://export.sourcify.dev/{Key}`)
128+
- **LastModified**: When the file was last uploaded/modified
129+
- **ETag**: MD5 hash of the file contents (use this to detect changes)
130+
- **Size**: File size in bytes
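The `ETag` and `Size` fields can be used to decide whether a local copy is current. The sketch below assumes, as the field description above states, that the `ETag` is the hex MD5 of the file contents wrapped in double quotes; `file_md5` and `is_up_to_date` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 of a file, read in chunks so large Parquet files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_up_to_date(path: str, etag: str, size: int) -> bool:
    """True if the local file matches the listed Size and ETag."""
    local = Path(path)
    # Cheap check first: a missing file or a size mismatch needs no hashing.
    if not local.is_file() or local.stat().st_size != size:
        return False
    # The listing wraps the hash in double quotes, e.g. "ba68...0ce3".
    return file_md5(local) == etag.strip('"').lower()
```

Checking the size before hashing keeps repeated sync runs fast, since only files whose size already matches are read in full.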
131+
132+
## Legacy Format (v1)
133+
134+
:::warning Deprecation Notice
135+
136+
The v1 Parquet export format is **no longer updated**. All new data is only available in the v2 format. Please migrate to v2 for access to current data.
137+
138+
:::
139+
140+
The legacy v1 format files can still be accessed via non-prefixed paths in the bucket (e.g., `https://export.sourcify.dev/verified_contracts/verified_contracts_0_1000000.parquet`).
141+
142+
The v1 format used a JSON manifest file at [https://export.sourcify.dev/manifest.json](https://export.sourcify.dev/manifest.json) listing all available Parquet files. However, this format was not append-only. Each daily export regenerated all files, requiring users to download the entire dataset again after every update. The manifest also did not include checksums or modification timestamps, making it difficult to determine what changed between exports.
143+
144+
## Export Script
145+
146+
The source code of the export script is available at [https://github.com/sourcifyeth/parquet-export](https://github.com/sourcifyeth/parquet-export).

docs/4. repository/2. file-repositories.mdx renamed to docs/4. repository/3. file-repositories.mdx

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -2,14 +2,15 @@ import TotalRepoSize from "./TotalRepoSize"
22

33
# File Repositories
44

5-
This page describes the `RepositoryV1` and `RepositoryV2`, which are file systems (deprecated).
5+
:::danger Deprecation Notice
6+
The file repositories are **deprecated** and no longer supported. The [Sourcify Database](/docs/repository/sourcify-database/) serves as the source of truth now.
67

7-
:::warning
8-
The file repositories are used by the legacy API that is deprecated. Please use APIv2 and the [Database](/docs/repository/sourcify-database/) as the main storage backend.
8+
Only the IPFS pinning service still uses the logical structure of RepositoryV2 for uploading files to IPFS.
99

10-
You can still use RepositoryV2 just to save files to be pinned on IPFS.
10+
For custom Sourcify instances, we recommend migrating to APIv2 and the [Database](/docs/repository/sourcify-database/) as the main storage backend. See the [migration guide](/docs/database-migration).
1111
:::
1212

13+
This page describes the `RepositoryV1` and `RepositoryV2`, which are file systems (deprecated).
1314

1415
## Table of Contents
1516

@@ -74,16 +75,15 @@ The files are exactly the same so their IPFS hashes will not change, and you can
7475

7576
## IPFS
7677

77-
Unfortunatelly publishing under IPNS is temporarily disabled. This is because of the difficulty of managing the whole filesystem over IPFS (with MFS etc.) and updating the IPNS regularly.
78+
The sources of all verified contracts are pinned on IPFS. The logical structure of RepositoryV2 serves as the basis for uploading to IPFS. Files can be accessed via their individual CIDs (e.g. [`QmVij3h9z536ZG5cRpUmTfdoN9KR1Xp4ix2P7to9dPHgE5`](https://ipfs.io/ipfs/QmVij3h9z536ZG5cRpUmTfdoN9KR1Xp4ix2P7to9dPHgE5)).
7879

79-
We still pin all the files on IPFS so you can access them over their individual CIDs (e.g. [`QmVij3h9z536ZG5cRpUmTfdoN9KR1Xp4ix2P7to9dPHgE5`](https://ipfs.io/ipfs/QmVij3h9z536ZG5cRpUmTfdoN9KR1Xp4ix2P7to9dPHgE5)).
80+
Unfortunately, publishing under IPNS is temporarily disabled. This is because of the difficulty of managing the whole filesystem over IPFS (with MFS etc.) and updating the IPNS regularly.
8081

81-
Look at the [Download section](#download) to learn how to download the whole repository.
8282

8383
## Download
8484

8585
:::danger No New Exports
86-
Following deprecating the filesystem based repositories, **we no longer publish new exports**. We recommend resorting to the [Parquet exports](/docs/repository/sourcify-database/#download) instead.
86+
Following deprecating the filesystem based repositories, **we no longer publish new exports**. We recommend resorting to the [Parquet exports](/docs/repository/download-dataset/) instead.
8787

8888
You can still download the existing export for a while. Double check the date of the export in the manifest file. If you need these exports please reach out to us.
8989
:::

docs/4. repository/3. signature-database.mdx renamed to docs/4. repository/4. signature-database.mdx

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,5 +6,5 @@ The data is stored in the same database as the verified contracts. The `signatur
66

77
- **Schema**: Check [docs/repository/sourcify-database/#schema](/docs/repository/sourcify-database/#schema) for the schema.
88
- **API**: Check [docs/api/](/docs/api/) for the API.
9-
- **Download**: You can download the related tables in Parquet format from [export.sourcify.dev](https://export.sourcify.dev). See [/docs/repository/sourcify-database/#download](/docs/repository/sourcify-database/#download) for more details.
9+
- **Download**: You can download the related tables in Parquet format from [export.sourcify.dev](https://export.sourcify.dev). See [Download the Dataset](/docs/repository/download-dataset/) for more details.
1010
- **Playground**: Visit [4byte.sourcify.dev](https://4byte.sourcify.dev) to search for signatures.

docs/4. repository/index.mdx

Lines changed: 7 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1,16 +1,13 @@
1-
# Contract Repository
1+
# Contract Dataset
22

3-
Sourcify stores the contracts in multiple storage backends and gives the option to choose which one to use. In short there are the following options:
3+
Sourcify stores all verified contracts in multiple storage backends and gives multiple options to access the dataset. In short, there are the following options:
44

5-
- `RepositoryV1`
6-
- `RepositoryV2`
7-
- `SourcifyDatabase`
8-
- `AllianceDatabase`
5+
- **Sourcify Database**: Sourcify's source of truth, a [postgres database](/docs/repository/sourcify-database). Accessible via the [API](/docs/api/) and the [Repo UI](https://repo.sourcify.dev/).
6+
- **Verifier Alliance Database**: Shared database with other verification services. See the [Verifier Alliance](https://verifieralliance.org/) website for more info.
7+
- **BigQuery**: For convenience, the Sourcify dataset is uploaded to [BigQuery](/docs/repository/sourcify-database/#bigquery-dataset).
8+
- **IPFS**: The sources of all verified contracts are pinned on [IPFS](/docs/repository/file-repositories/#ipfs).
99

10-
For details see [Choosing the storage backend](https://github.com/argotorg/sourcify/tree/staging/services/server#choosing-the-storage-backend).
1110

1211
## Download
1312

14-
You can download the whole contract file repository in zips or the Sourcify database in Parquet format. Follow the guides in each page:
15-
- [Download RepositoryV2](/docs/repository/file-repositories/#download)
16-
- [Download SourcifyDatabase](/docs/repository/sourcify-database/#download)
13+
You can download the Sourcify database in Parquet format. Follow this guide: [Download the Dataset](/docs/repository/download-dataset/).

src/css/custom.css

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -48,6 +48,10 @@ h4 {
4848
font-family: "VT323";
4949
}
5050

51+
h4 {
52+
font-size: 1.3rem;
53+
}
54+
5155
.navbar__logo > img {
5256
border-radius: 9999px;
5357
}
