Commit aadb781

docs: update contribute-data (#3435)
* docs: flatten API crawling sections down to contribute-data/
* fix: .env.example works by default without GCP auth
* docs: consolidate dbt guide
* docs: update GCS asset factory
* docs: update funding-data guide
* docs: Update bigquery data guide
* docs: update BigQuery data transfer service guide
* docs: simplify database replication guide
* docs: update custom dagster asset guide
* docs: Dagster getting started guide
* docs: fixup contribute-data index
* docs: build
1 parent f74b7f4 commit aadb781

20 files changed (+13054 additions, -22412 deletions)

.env.example

Lines changed: 25 additions & 23 deletions
@@ -1,47 +1,48 @@
 # .env
 ## This .env file is mostly used for Python data ops
 
-## Google Cloud setup
-# You will need to generate Google application credentials
-# Note: You can use your gcloud auth credentials
-GOOGLE_APPLICATION_CREDENTIALS=<path-to-valid-gcp-creds>
-# GCP project ID
-GOOGLE_PROJECT_ID=
-# Used for storing all BigQuery data in the dbt pipeline
-BIGQUERY_DATASET_ID=
-
 ## Dagster Setup
 # You may want to change the location of dagster home if you want it to survive resets
 DAGSTER_HOME=/tmp/dagster-home
 
-# This is used to put generated dbt profiles for dagster in a specific place
-DAGSTER_DBT_TARGET_BASE_DIR=/tmp/dagster-home/generated-dbt
-DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=1
-
-# Used when loading dlt assets into a staging area. It should be set to a GCS
-# bucket that will be used to write to for dlt data transfers into bigquery.
-DAGSTER_STAGING_BUCKET_URL=gs://some-bucket
+## sqlmesh
+SQLMESH_DUCKDB_LOCAL_PATH=/tmp/oso.duckdb
+#SQLMESH_DUCKDB_LOCAL_TRINO_PATH=/tmp/oso-trino.duckdb
 
 # Uncomment the next two vars to use gcp secrets (you'll need to have gcp
 # secrets configured). Unfortunately at this time, if you don't have access to
 # the official oso gcp account uncommenting these will likely not work. The GCP
-# secrets prefix should likely match the dagster deployment's search prefix in
-# flux
-#DAGSTER_USE_LOCAL_SECRETS=False
+# secrets prefix should likely match the dagster deployment's search prefix in flux
+DAGSTER_USE_LOCAL_SECRETS=True
 #DAGSTER_GCP_SECRETS_PREFIX=dagster
 
+## Google Cloud setup
+# You will need to generate Google application credentials.
+# You can log in via `gcloud auth application-default login`
+# Then you can enter the path to your credentials
+# e.g. /home/user/.config/gcloud/application_default_credentials.json
+GOOGLE_APPLICATION_CREDENTIALS=
+# GCP project ID
+GOOGLE_PROJECT_ID=
+# Used for storing all BigQuery data in the dbt pipeline
+BIGQUERY_DATASET_ID=
+# Used when loading dlt assets into a staging area. It should be set to a GCS
+# bucket that will be used to write to for dlt data transfers into bigquery.
+DAGSTER_STAGING_BUCKET_URL=gs://some-bucket
+
 ## Clickhouse setup
 DAGSTER__CLICKHOUSE__HOST=
 DAGSTER__CLICKHOUSE__USER=
 DAGSTER__CLICKHOUSE__PASSWORD=
 
-## sqlmesh
-SQLMESH_DUCKDB_LOCAL_PATH=/tmp/oso.duckdb
-
 ###################
 # DEPRECATED
 ###################
 
+# This is used to put generated dbt profiles for dagster in a specific place
+DAGSTER_DBT_TARGET_BASE_DIR=/tmp/dagster-home/generated-dbt
+DAGSTER_DBT_PARSE_PROJECT_ON_LOAD=0
+
 # Used for data transfer between databases
 CLOUDSTORAGE_BUCKET_NAME=
 
@@ -50,4 +51,5 @@ CLOUDSQL_REGION=
 CLOUDSQL_INSTANCE_ID=
 CLOUDSQL_DB_NAME=
 CLOUDSQL_DB_PASSWORD=
-CLOUDSQL_DB_USER=
+CLOUDSQL_DB_USER=
+
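
The updated Google Cloud block works with Application Default Credentials by default, per the commit message. A minimal sketch of how the variables above could be checked locally before running Dagster (it assumes `python-dotenv` and `google-auth` are installed; the script itself is illustrative and not part of this commit):

```python
# check_env.py -- illustrative only, not part of the OSO repo.
import os

import google.auth              # assumption: google-auth is installed
from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # read the .env file from the current working directory

# GOOGLE_APPLICATION_CREDENTIALS may stay empty: google.auth.default() then
# falls back to the Application Default Credentials created by
# `gcloud auth application-default login`.
credentials, adc_project = google.auth.default()

print("GCP project:", os.getenv("GOOGLE_PROJECT_ID") or adc_project)
print("DuckDB path:", os.getenv("SQLMESH_DUCKDB_LOCAL_PATH", "/tmp/oso.duckdb"))
print("Staging bucket:", os.getenv("DAGSTER_STAGING_BUCKET_URL", "(unset)"))
```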

apps/docs/docs/contribute-data/api-crawling/index.md

Lines changed: 0 additions & 18 deletions
This file was deleted.

apps/docs/docs/contribute-data/bigquery.md

Lines changed: 0 additions & 84 deletions
@@ -45,87 +45,3 @@ Add the `allAuthenticatedUsers` as the "BigQuery Data Viewer"
 If you have reasons to keep your dataset private,
 you can reach out to us directly on our
 [Discord](https://www.opensource.observer/discord).
-
-## Defining a dbt source
-
-For example, Google maintains a
-[public dataset](https://cloud.google.com/blog/products/data-analytics/ethereum-bigquery-public-dataset-smart-contract-analytics)
-for Ethereum mainnet.
-
-As long as the dataset is publicly available in the US region,
-we can create a dbt source in `oso/warehouse/dbt/models/`
-(see [source](https://github.com/opensource-observer/oso/blob/main/warehouse/dbt/models/ethereum_sources.yml)):
-
-```yaml
-sources:
-  - name: ethereum
-    database: bigquery-public-data
-    schema: crypto_ethereum
-    tables:
-      - name: transactions
-        identifier: transactions
-      - name: traces
-        identifier: traces
-```
-
-We can then reference these tables in a downstream model with
-the `source` macro:
-
-```sql
-select
-  block_timestamp,
-  `hash` as transaction_hash,
-  from_address,
-  receipt_contract_address
-from {{ source("ethereum", "transactions") }}
-```
-
-## Creating a playground dataset (optional)
-
-If the source table is large, we will want to
-extract a subset of the data into a playground dataset
-for testing and development.
-
-For example for GitHub event data,
-we copy just the last 14 days of data
-into a playground dataset, which is used
-when the dbt target is set to `playground`
-(see [source](https://github.com/opensource-observer/oso/blob/main/warehouse/dbt/models/github_sources.yml)):
-
-```yaml
-sources:
-  - name: github_archive
-    database: |
-      {%- if target.name in ['playground', 'dev'] -%} opensource-observer
-      {%- elif target.name == 'production' -%} githubarchive
-      {%- else -%} invalid_database
-      {%- endif -%}
-    schema: |
-      {%- if target.name in ['playground', 'dev'] -%} oso
-      {%- elif target.name == 'production' -%} day
-      {%- else -%} invalid_schema
-      {%- endif -%}
-    tables:
-      - name: events
-        identifier: |
-          {%- if target.name in ['playground', 'dev'] -%} stg_github__events
-          {%- elif target.name == 'production' -%} 20*
-          {%- else -%} invalid_table
-          {%- endif -%}
-```
-
-### Choosing a playground window size
-
-There is a fine balance between choosing a playground data set window
-that is sufficiently small for affordable testing and development,
-yet produces meaningful results to detect issues in your queries.
-
-:::warning
-Coming soon... This section is a work in progress.
-:::
-
-### Copying the playground dataset
-
-:::warning
-Coming soon... This section is a work in progress.
-:::
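
The deleted playground sections were still marked as a work in progress. For reference, a rough sketch of what copying a 14-day window into a playground dataset could look like with the `google-cloud-bigquery` client (the project, dataset, and table names here are hypothetical, and this is not the tooling the OSO pipeline actually uses):

```python
# copy_playground.py -- hypothetical sketch, not OSO's actual tooling.
from google.cloud import bigquery  # assumption: google-cloud-bigquery is installed

client = bigquery.Client()

# Hypothetical names: substitute your own project and dataset.
SOURCE = "bigquery-public-data.crypto_ethereum.transactions"
DEST = "my-project.playground.ethereum_transactions"

# Keep only the last 14 days, mirroring the playground window described above.
query = f"""
CREATE OR REPLACE TABLE `{DEST}` AS
SELECT *
FROM `{SOURCE}`
WHERE block_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 14 DAY)
"""

client.query(query).result()  # wait for the copy job to finish
print(f"Copied the last 14 days of {SOURCE} into {DEST}")
```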

apps/docs/docs/contribute-data/bq-data-transfer.md

Lines changed: 2 additions & 1 deletion
@@ -1,6 +1,7 @@
 ---
 title: BigQuery Data Transfer Service
-sidebar_position: 6
+sidebar_position: 2
+sidebar_class_name: hidden
 ---
 
 BigQuery comes with a built-in data transfer service

apps/docs/docs/contribute-data/api-crawling/crawl-api-advanced.png renamed to apps/docs/docs/contribute-data/crawl-api-advanced.png

File renamed without changes.

apps/docs/docs/contribute-data/api-crawling/crawl-api-example-defillama.png renamed to apps/docs/docs/contribute-data/crawl-api-example-defillama.png

File renamed without changes.

apps/docs/docs/contribute-data/api-crawling/crawl-api-example-opencollective.png renamed to apps/docs/docs/contribute-data/crawl-api-example-opencollective.png

File renamed without changes.

apps/docs/docs/contribute-data/api-crawling/crawl-api-graphql-pipeline.png renamed to apps/docs/docs/contribute-data/crawl-api-graphql-pipeline.png

File renamed without changes.

apps/docs/docs/contribute-data/dagster.md

Lines changed: 3 additions & 2 deletions
@@ -1,13 +1,14 @@
 ---
 title: Write a Custom Dagster Asset
-sidebar_position: 6
+sidebar_position: 7
 ---
 
 Before writing a fully custom Dagster asset,
 we recommend you first see if the previous guides on
 [BigQuery datasets](./bigquery.md),
 [database replication](./database.md),
-[API crawling](./api-crawling/index.md)
+[GraphQL API crawling](./graphql-api.md),
+or [REST API crawling](./rest-api.md)
 may be a better fit.
 This guide should only be used in the rare cases where you cannot
 use the other methods.
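
For readers landing on this guide, a minimal custom Dagster asset might look like the sketch below (plain Dagster idioms with hypothetical names; the OSO repo wires assets into its own definitions and factories, which this commit does not show):

```python
# my_custom_asset.py -- minimal sketch; asset name and URL are hypothetical.
import dagster as dg
import requests  # assumption: requests is available for the example fetch


@dg.asset(key_prefix="example_source")
def example_events() -> dg.MaterializeResult:
    """Fetch a small JSON payload and record how many rows were seen."""
    response = requests.get("https://example.com/api/events", timeout=30)
    response.raise_for_status()
    records = response.json()
    return dg.MaterializeResult(metadata={"record_count": len(records)})
```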
apps/docs/docs/contribute-data/database.md

Lines changed: 38 additions & 43 deletions
@@ -1,26 +1,26 @@
 ---
-title: Provide Access to Your Database
+title: Replicate your SQL Database
 sidebar_position: 3
 ---
 
-OSO's dagster infrastructure has support for database replication into our data
+OSO's Dagster infrastructure has support for database replication into our data
 warehouse by using Dagster's "embedded-elt" that integrates with the library
 [dlt](https://dlthub.com/).
 
-## Configure your database as a dagster asset
+## Configure your database as a Dagster asset
 
-There are many possible ways to configure a database as a dagster asset,
-however, to reduce complexity of configuration we provide a single interface for
-specifying a SQL database for replication. The SQL database _must_ be a database
-that is [supported by
-dlt](https://dlthub.com/devel/dlt-ecosystem/verified-sources/sql_database). In
-general, we replicate _all_ columns and for now custom column selection is not
+There are many possible ways to configure a database as a Dagster asset.
+To simplify things, we have built a factory function, `sql_assets`,
+to automatically replicate any SQL database.
+The SQL database _must_ be a database that is
+[supported by dlt](https://dlthub.com/devel/dlt-ecosystem/verified-sources/sql_database).
+In general, we replicate _all_ columns and for now custom column selection is not
 available in our interface.
 
-This section shows how to setup a database with two tables as a set of sql
-assets. The table named `some_incremental_database` has a chronologically
-organized or updated dataset and can therefore be loaded incrementally. The
-second table, `some_nonincremental_database`, does not have a way to be loaded
+This section shows how to replicate 2 tables in a database.
+The first table, named `some_incremental_database`, has a time column
+and can be loaded incrementally.
+The second table, `some_nonincremental_database`, does not have a way to be loaded
 incrementally and will force a full refresh upon every sync.
 
 To setup this database replication, you can add a new python file to
@@ -52,25 +52,21 @@ my_database = sql_assets(
 ```
 
 The first three lines of the file import some necessary tooling to configure a
-sql database:
-
-- The first import, `sql_assets`, is an asset factory created by the OSO team
-  that enables this "easy" configuration of sql assets.
-- The second import, `SecretReference`, is a tool used to reference a secret in
-  a secret resolver. The secret resolver can be configured differently based on
-  the environment, but on production we use this to reference a cloud based secret
-  manager.
-- The final import, `incremental`, is used to specify a column to use for
-  incremental loading. This is a `dlt` constructor that is passed to the
-  configuration.
-
-The `sql_assets`, factory takes 3 arguments:
-
-- The first argument is an asset key prefix which is used to both specify an
-  asset key prefix and also used when generating asset related names inside the
-  factory. In general, this should match the filename of the containing python
-  file unless you have a more complex set of assets to configure. This name is
-  also used as the dataset name into which this data will be loaded.
+SQL database:
+
+- `sql_assets`: an asset factory created by the OSO team
+  that enables this simple configuration of SQL assets.
+- `SecretReference`: a secret reference in the OSO secret resolver.
+  The secret resolver can be configured differently based on
+  the environment. On production, we use a cloud-based secret manager.
+- `incremental`: used to specify a column to use for incremental loading.
+  This is a `dlt` constructor that is passed to the configuration.
+
+The `sql_assets` factory takes 3 arguments:
+
+- The first argument is an asset key prefix, used to group assets generated
+  by the factory. In general, this should match the filename of the python
+  file unless you have more complex requirements.
 - The second argument must be a `SecretReference` object that will be used to
   retrieve the credentials that you will provide at a later step to the OSO
   team. The `SecretReference` object has two required keyword arguments:
@@ -81,11 +77,10 @@ The `sql_assets`, factory takes 3 arguments:
   - `key` - This is an arbitrary name for the secret.
 
 - The third argument is a list of dictionaries that define options for tables
-  that should be replicated into the data warehouse. The most important options
-  here are:
+  that should be replicated into OSO.
 
-  - `table` - The table name
-  - `destination_table_name` - The table name to use in the data warehouse
+  - `table` - The source table name
+  - `destination_table_name` - The destination table name to use in the OSO data lake
   - `incremental` - An `incremental` object that defines time/date based column
     to use for incrementally loading a database.
 
@@ -95,11 +90,11 @@ The `sql_assets`, factory takes 3 arguments:
 
 ## Enabling access to your database
 
-Before the OSO infrastructure can begin to synchronize your database to the data
-warehouse, it will need to be provided access to the database. At this time
-there is no automated process for this. Once you're ready to get your database
-integrated, you will want to contact the OSO team on our
-[Discord](https://www.opensource.observer/discord). Be prepared to provide
-credentials (we will work out a secure method of transmission) and also ensure
-that you have access to update any firewall settings that may be required for us
+For the asset to run in OSO production, we will need access to
+your secrets (e.g. password or connection string).
+At this time there is no automated process for this.
+You can contact the OSO team on our
+[Discord](https://www.opensource.observer/discord).
+Be prepared to provide credentials via a secure method of transmission.
+Also remember to update any firewall settings that may be required for us
 to access your database server.
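
The argument list above is easier to follow next to a concrete file. Below is a sketch of what such a configuration might look like, based only on the arguments documented in this diff (the import paths and the `group_name` keyword are assumptions, not something this commit confirms):

```python
# my_database.py -- sketch only; place it wherever the guide above says
# new asset files go.
from dlt.sources import incremental

# Assumed OSO-internal import paths; check the repo for the real modules.
from oso_dagster.factories import sql_assets
from oso_dagster.utils import SecretReference

my_database = sql_assets(
    # 1) asset key prefix: should match this file's name ("my_database")
    "my_database",
    # 2) where the OSO secret resolver should find your credentials;
    #    `group_name` is an assumed keyword, `key` is documented above
    SecretReference(group_name="my_database", key="connection_string"),
    # 3) one dict per table to replicate
    [
        {
            "table": "some_incremental_database",
            "destination_table_name": "some_incremental_database",
            # time/date column used for incremental loads
            "incremental": incremental("updated_at"),
        },
        {
            # no usable time column, so this table is fully refreshed each sync
            "table": "some_nonincremental_database",
            "destination_table_name": "some_nonincremental_database",
        },
    ],
)
```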
