Skip to content

Commit 52bba79

Browse files
authored
feat: graphql factory dynamic args (#4527)
* add: graphql factory `example` asset * fix: use `python` based `pyright` * fix: graphql factory return `dlt.resource` for greater customization * fix: regenerate pnpm `lock` file * fix: update date parameters and add transform function in expenses asset * fix: enhance GraphQL resource configuration with dependency support and improve field expansion logic * fix: `qol` improvements for `graphql` factory * add: example `dependency` asset usage * add: `giveth` crawler with `graphql` factory * feat: rewrite docs for new `graphql` factory config * fix: remove `unused` file * fix: cleanup `qf_rounds` example asset * fix: remove `todo` comment * add: `graphql` factory leaf data shape info * fix: revert `pyright` uninstall * fix: update macro registration to unpack function and aliases correctly * Revert "fix: update macro registration to unpack function and aliases correctly" This reverts commit 9daf0e8. * Revert "Revert "fix: update macro registration to unpack function and aliases correctly"" This reverts commit 7588551. * fix: revert to `pythonpath` approach * fix: rename asset file to `giveth_qf` * fix: update `typing` to require a `resource` in `deps` * add: `global_config` as new required argument * fix: address pull request `review` feedback * feat: `redis` backed memoize utility decorator
1 parent fda2ff4 commit 52bba79

File tree

10 files changed

+671
-306
lines changed

10 files changed

+671
-306
lines changed
-80.6 KB
Loading
127 KB
Loading

apps/docs/docs/contribute-data/graphql-api.md

Lines changed: 208 additions & 112 deletions
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@ sidebar_position: 4
66
This guide will explain how to use the
77
[`graphql_factory`](https://github.com/opensource-observer/oso/blob/main/warehouse/oso_dagster/factories/graphql.py)
88
factory function to automatically introspect, build queries, and scrape GraphQL
9-
APIs, including support for pagination.
9+
APIs, including support for pagination and dependencies.
1010

1111
## Defining Your GraphQL Resource
1212

@@ -15,11 +15,8 @@ the
1515
[Open Collective API](https://docs.opencollective.com/help/contributing/development/api).
1616
The API has a `transactions` query that returns a list of transactions.
1717

18-
Currently, it has hundreds of nested fields, making it **cumbersome** to write
19-
queries manually (we have done this in the past and it is not fun). The GraphQL
20-
Resource Factory will introspect the schema,
21-
generate the query, extract the relevant data, and
22-
return a clean asset, with minimal effort.
18+
The GraphQL factory will introspect the schema, generate queries, extract the
19+
relevant data, and return clean assets with minimal configuration.
2320

2421
### 1. Create the Configuration
2522

@@ -28,28 +25,24 @@ resource. For the Open Collective transactions example, we set the endpoint URL,
2825
define query parameters, and specify a transformation function to extract the
2926
data we need.
3027

31-
We also set a `max_depth` parameter to limit the depth of the introspection
32-
query. This means the generated query will only explore fields up to a certain
33-
depth, preventing it from becoming too large, but capturing all the necessary
34-
data for our asset.
35-
3628
```python
37-
from ..factories.graphql import GraphQLResourceConfig
38-
from ..factories.pagination import PaginationConfig, PaginationType
29+
from dagster import AssetExecutionContext
30+
from ..factories.graphql import (
31+
GraphQLResourceConfig,
32+
PaginationConfig,
33+
PaginationType,
34+
graphql_factory,
35+
)
36+
from ..factories.dlt import dlt_factory
37+
from ..config import DagsterConfig
3938

39+
# Configuration for the main transactions query
4040
config = GraphQLResourceConfig(
4141
name="transactions",
4242
endpoint="https://api.opencollective.com/graphql/v2",
43-
max_depth=3, # Limit the introspection depth
44-
pagination=PaginationConfig(
45-
type=PaginationType.OFFSET,
46-
page_size=100,
47-
max_pages=5,
48-
rate_limit_seconds=0.5,
49-
offset_field="offset",
50-
limit_field="limit",
51-
total_count_path="totalCount",
52-
),
43+
target_type="Query",
44+
target_query="transactions",
45+
max_depth=2, # Limit the introspection depth
5346
parameters={
5447
"type": {
5548
"type": "TransactionType!",
@@ -64,114 +57,159 @@ config = GraphQLResourceConfig(
6457
"value": "2024-12-31T23:59:59Z",
6558
},
6659
},
67-
transform_fn=lambda result: result["transactions"]["nodes"], # Optional transformation function
68-
target_query="transactions", # The query to execute
69-
target_type="TransactionCollection", # The type containing the data
60+
pagination=PaginationConfig(
61+
type=PaginationType.OFFSET,
62+
page_size=100,
63+
max_pages=5,
64+
rate_limit_seconds=0.5,
65+
offset_field="offset",
66+
limit_field="limit",
67+
total_count_path="totalCount",
68+
),
69+
exclude=["loggedInAccount", "me"], # Exclude unnecessary fields
70+
transform_fn=lambda result: result["transactions"]["nodes"],
7071
)
7172
```
7273

7374
:::tip
74-
For the full `GraphQLResourceConfig` spec, see the [`source`](https://github.com/opensource-observer/oso/blob/05fe8b9192a08f6446225a89f4455c6b3723c5de/warehouse/oso_dagster/factories/graphql.py#L99)
75+
For the full `GraphQLResourceConfig` spec, see the [`source`](https://github.com/opensource-observer/oso/blob/main/warehouse/oso_dagster/factories/graphql.py#L151)
7576
:::
7677

7778
In this configuration, we define the following fields:
7879

7980
- **name**: A unique identifier for the dagster asset.
8081
- **endpoint**: The URL of the GraphQL API.
82+
- **target_type**: The GraphQL type containing the target query (usually
83+
"Query").
84+
- **target_query**: The name of the query to execute.
8185
- **max_depth**: The maximum depth of the introspection query. This will
8286
generate a query that explores all fields recursively up to this depth.
83-
- **pagination**: A configuration object that defines how to handle pagination.
84-
It includes the pagination type, page size, maximum number of pages to
85-
fetch, rate limit in seconds, and the fields used for offset and limit.
8687
- **parameters**: A dictionary of query parameters. The keys are the parameter
8788
names, and the values are dictionaries with the parameter type and value.
89+
- **pagination**: A configuration object that defines how to handle pagination.
90+
It includes the pagination type, page size, maximum number of pages to fetch,
91+
rate limit in seconds, and the fields used for offset and limit.
92+
- **exclude**: A list of field names to exclude from the GraphQL schema
93+
expansion.
8894
- **transform_fn**: A function that processes the raw GraphQL response and
8995
returns the desired data.
90-
- **target_query**: The name of the query to execute.
91-
- **target_type**: The name of the GraphQL type that contains the data of
92-
interest.
93-
94-
The factory will create the following query automatically, recursively
95-
introspecting all the fields up to the specified depth:
96-
97-
```graphql
98-
query (
99-
$type: TransactionType!
100-
$dateFrom: DateTime!
101-
$dateTo: DateTime!
102-
$offset: Int
103-
$limit: Int!
104-
) {
105-
transactions(
106-
type: $type
107-
dateFrom: $dateFrom
108-
dateTo: $dateTo
109-
offset: $offset
110-
limit: $limit
111-
) {
112-
offset
113-
limit
114-
totalCount
115-
nodes {
116-
id
117-
legacyId
118-
uuid
119-
group
120-
type
121-
kind
122-
refundKind
123-
amount {
124-
value
125-
currency
126-
valueInCents
127-
}
128-
oppositeTransaction {
129-
id
130-
legacyId
131-
uuid
132-
# ... other generated fields ...
133-
merchantId
134-
invoiceTemplate
135-
}
136-
merchantId
137-
balanceInHostCurrency {
138-
value
139-
currency
140-
valueInCents
141-
}
142-
invoiceTemplate
143-
}
144-
kinds
145-
paymentMethodTypes
146-
}
147-
}
148-
```
14996

150-
### 2. Build the Factory
97+
### 2. Build the Asset with Dependencies
15198

152-
:::tip
153-
The GraphQL factory function takes a mandatory `config`
154-
argument. The other arguments are directly passed to the underlying
155-
`dlt_factory` function, allowing you to customize the behavior of the asset.
156-
157-
For the full reference of the allowed arguments, check out the Dagster
158-
[`asset`](https://docs.dagster.io/api/python-api/assets) documentation.
159-
:::
160-
161-
The `graphql_factory` function is the used to convert your configuration into a
162-
callable Dagster asset. It takes the configuration object and returns a factory
163-
function that our infrastructure will use to automatically create the asset.
99+
The GraphQL factory now returns a `dlt.resource` directly and supports
100+
dependency chaining. You can create assets that depend on the results of other
101+
GraphQL queries:
164102

165103
```python
166-
from ..factories.graphql import graphql_factory
104+
def fetch_account_details(context: AssetExecutionContext, global_config: DagsterConfig, transaction_data):
105+
"""
106+
Dependency function that fetches account details for each transaction.
107+
This function receives individual transaction items and yields account data.
108+
"""
109+
account_config = GraphQLResourceConfig(
110+
name=f"account_{transaction_data['fromAccount']['id']}",
111+
endpoint="https://api.opencollective.com/graphql/v2",
112+
target_type="Query",
113+
target_query="account",
114+
parameters={
115+
"id": {
116+
"type": "String",
117+
"value": transaction_data["fromAccount"]["id"],
118+
},
119+
},
120+
transform_fn=lambda result: result["account"],
121+
max_depth=2,
122+
exclude=["parentAccount", "stats"],
123+
)
167124

168-
# ... config definition ...
125+
# The dependency yields data from the nested GraphQL query
126+
yield from graphql_factory(account_config, global_config, context, max_table_nesting=0)
169127

170-
open_collective_transactions = graphql_factory(
171-
config,
128+
@dlt_factory(
172129
key_prefix="open_collective",
173130
)
131+
def transactions_with_accounts(context: AssetExecutionContext, global_config: DagsterConfig):
132+
"""
133+
Main asset that fetches transactions and their associated account details.
134+
"""
135+
config = GraphQLResourceConfig(
136+
name="transactions_with_accounts",
137+
endpoint="https://api.opencollective.com/graphql/v2",
138+
target_type="Query",
139+
target_query="transactions",
140+
parameters={
141+
"type": {"type": "TransactionType!", "value": "CREDIT"},
142+
"dateFrom": {"type": "DateTime!", "value": "2024-01-01T00:00:00Z"},
143+
"dateTo": {"type": "DateTime!", "value": "2024-12-31T23:59:59Z"},
144+
},
145+
pagination=PaginationConfig(
146+
type=PaginationType.OFFSET,
147+
page_size=50,
148+
max_pages=2,
149+
rate_limit_seconds=1.0,
150+
),
151+
transform_fn=lambda result: result["transactions"]["nodes"],
152+
max_depth=2,
153+
exclude=["loggedInAccount", "me"],
154+
deps=[fetch_account_details], # Dependencies to execute for each item
155+
deps_rate_limit_seconds=1.0, # Rate limit between dependency calls
156+
)
157+
158+
# Return the configured resource
159+
yield graphql_factory(config, global_config, context, max_table_nesting=0)
160+
```
161+
162+
:::tip
163+
The GraphQL factory function now takes a mandatory `config` argument,
164+
`global_config` (DagsterConfig), and `AssetExecutionContext`. The `global_config`
165+
is required for Redis caching functionality that stores GraphQL introspection
166+
results for 24 hours, preventing overwhelming of external services. Additional
167+
arguments are passed to the underlying `dlt.resource` function, allowing you to
168+
customize the behavior of the asset.
169+
170+
For the full reference of the allowed arguments, check out the DLT
171+
[`resource`](https://dlthub.com/docs/general-usage/resource) documentation.
172+
:::
173+
174+
### 3. Dependency System
175+
176+
The new dependency system allows you to:
177+
178+
- **Chain GraphQL queries**: Use results from one query to feed parameters into
179+
another
180+
- **Transform data flow**: The main query results become intermediate data that
181+
feeds dependencies
182+
- **Return consistent shape**: Only the dependency results are yielded to the
183+
final dataset
184+
- **Rate limit dependencies**: Control the rate of dependent API calls
185+
186+
#### Tree-like Dependency Structure
187+
188+
Dependencies can be viewed as a **tree structure** where:
189+
190+
- The **root** is your main GraphQL query
191+
- **Branches** are dependency functions that process each item from the parent
192+
- **Leaves** are the final data shapes that get yielded to your dataset
193+
174194
```
195+
Main Query (transactions)
196+
├── Item 1 → fetch_account_details() → Account Data (leaf)
197+
├── Item 2 → fetch_account_details() → Account Data (leaf)
198+
└── Item 3 → fetch_account_details() → Account Data (leaf)
199+
```
200+
201+
#### Data Shape Merging
202+
203+
When `deps` is provided, **DLT automatically handles different data shapes**:
204+
205+
1. The main query executes and fetches data
206+
2. Each item from the main query is passed to dependency functions
207+
3. Dependency functions can execute additional GraphQL queries with different schemas
208+
4. **DLT merges different data shapes** from dependencies automatically
209+
5. **Only leaf nodes** (final dependency results) are included in the final output
210+
6. The main query data serves as intermediate processing data and is discarded
211+
212+
This means you can have dependencies that return completely different data structures (e.g., transactions → accounts → users), and DLT will intelligently combine them into a coherent dataset, only preserving the final leaf data from your dependency tree.
175213

176214
---
177215

@@ -183,8 +221,8 @@ our [quickstart guide](./setup/index.md).
183221
:::
184222

185223
After having your Dagster instance running, follow the
186-
[Dagster setup guide](./setup/index.md) to materialize the assets.
187-
Our example assets are located under `assets/open_collective/transactions`.
224+
[Dagster setup guide](./setup/index.md) to materialize the assets. Our example
225+
assets are located under `assets/open_collective/transactions`.
188226

189227
![Dagster Open Collective Asset List](crawl-api-graphql-pipeline.png)
190228

@@ -197,12 +235,70 @@ result.
197235

198236
---
199237

238+
## Advanced Features
239+
240+
### Field Exclusion
241+
242+
Use the `exclude` parameter to skip unnecessary fields and reduce query
243+
complexity:
244+
245+
```python
246+
config = GraphQLResourceConfig(
247+
# ... other config
248+
exclude=["loggedInAccount", "me", "stats.totalDonations"],
249+
)
250+
```
251+
252+
### Custom Stop Conditions
253+
254+
Define custom conditions to stop pagination:
255+
256+
```python
257+
def stop_when_old_data(result, page_count):
258+
# Stop if we encounter transactions older than 1 year
259+
if result.get("transactions", {}).get("nodes"):
260+
last_transaction = result["transactions"]["nodes"][-1]
261+
transaction_date = last_transaction.get("createdAt")
262+
# Add your date comparison logic here
263+
return False # Continue pagination
264+
return True # Stop pagination
265+
266+
pagination_config = PaginationConfig(
267+
type=PaginationType.OFFSET,
268+
stop_condition=stop_when_old_data,
269+
# ... other pagination settings
270+
)
271+
```
272+
273+
### Cursor-Based Pagination
274+
275+
For APIs that use cursor-based pagination:
276+
277+
```python
278+
pagination_config = PaginationConfig(
279+
type=PaginationType.CURSOR,
280+
page_size=50,
281+
cursor_field="after",
282+
page_size_field="first",
283+
next_cursor_path="pageInfo.endCursor",
284+
has_next_path="pageInfo.hasNextPage",
285+
)
286+
```
287+
288+
---
289+
200290
## Conclusion
201291

202-
The GraphQL Resource Factory is a powerful tool for creating reusable assets
203-
that fetch data from GraphQL APIs. By defining a configuration object and
204-
building a factory function, you can quickly create assets that crawl complex
205-
APIs effortlessly.
292+
The GraphQL factory provides a powerful way to create reusable assets that fetch
293+
data from GraphQL APIs with minimal configuration. The new version offers
294+
enhanced capabilities:
295+
296+
- **Direct DLT integration**: Returns `dlt.resource` objects directly
297+
- **Dependency chaining**: Chain multiple GraphQL queries together
298+
- **Field exclusion**: Skip unnecessary fields to optimize queries
299+
- **Enhanced error handling**: Better logging and error reporting
300+
- **Flexible pagination**: Support for offset, cursor, and relay-style
301+
pagination
206302

207303
This allows you to focus on the data you need and the transformations you want
208304
to apply, rather than the mechanics of constructing queries and managing API

0 commit comments

Comments
 (0)