Commit 9e72cf7

update dataset and transformation pages
1 parent b06d715 commit 9e72cf7

4 files changed: +152 -99 lines changed


docs/website/docs/general-usage/dataset-access/dataset.md

Lines changed: 13 additions & 20 deletions
@@ -16,9 +16,9 @@ Here's a full example of how to retrieve data from a pipeline and load it into a
 
 ## Getting started
 
-Assuming you have a `Pipeline` object (let's call it `pipeline`), you can obtain a `ReadableDataset` and access your tables as `ReadableRelation` objects.
+Assuming you have a `Pipeline` object (let's call it `pipeline`), you can obtain a `Dataset`, which contains the credentials and schema of your destination dataset. You can construct a query and execute it on the dataset to get a `Relation`, which you can use to retrieve data from the `Dataset`.
 
-**Note:** The `ReadableDataset` and `ReadableRelation` objects are **lazy-loading**. They will only query and retrieve data when you perform an action that requires it, such as fetching data into a DataFrame or iterating over the data. This means that simply creating these objects does not load data into memory, making your code more efficient.
+**Note:** The `Dataset` and `Relation` objects are **lazy-loading**. They will only query and retrieve data when you perform an action that requires it, such as fetching data into a DataFrame or iterating over the data. This means that simply creating these objects does not load data into memory, making your code more efficient.
 
 
 ### Access the dataset
@@ -27,13 +27,17 @@ Assuming you have a `Pipeline` object (let's call it `pipeline`), you can obtain
 
 ### Access tables as dataset
 
-You can access tables in your dataset using either attribute access or item access.
+The simplest way of getting a Relation from a Dataset is to get a full table relation:
 
 <!--@@@DLT_SNIPPET ./dataset_snippets/dataset_snippets.py::accessing_tables-->
 
+### Creating relations with sql query strings
+
+<!--@@@DLT_SNIPPET ./dataset_snippets/dataset_snippets.py::custom_sql-->
+
 ## Reading data
 
-Once you have a `ReadableRelation`, you can read data in various formats and sizes.
+Once you have a `Relation`, you can read data in various formats and sizes.
 
 ### Fetch the entire table
 
@@ -55,7 +59,7 @@ Loading full tables into memory without limiting or iterating over them can cons
 
 ## Lazy loading behavior
 
-The `ReadableDataset` and `ReadableRelation` objects are **lazy-loading**. This means that they do not immediately fetch data when you create them. Data is only retrieved when you perform an action that requires it, such as calling `.df()`, `.arrow()`, or iterating over the data. This approach optimizes performance and reduces unnecessary data loading.
+The `Dataset` and `Relation` objects are **lazy-loading**. This means that they do not immediately fetch data when you create them. Data is only retrieved when you perform an action that requires it, such as calling `.df()`, `.arrow()`, or iterating over the data. This approach optimizes performance and reduces unnecessary data loading.
 
 ## Iterating over data in chunks
 
@@ -73,7 +77,7 @@ To handle large datasets efficiently, you can process data in smaller chunks.
 
 <!--@@@DLT_SNIPPET ./dataset_snippets/dataset_snippets.py::iterating_fetch_chunks-->
 
-The methods available on the ReadableRelation correspond to the methods available on the cursor returned by the SQL client. Please refer to the [SQL client](./sql-client.md#supported-methods-on-the-cursor) guide for more information.
+The methods available on the Relation correspond to the methods available on the cursor returned by the SQL client. Please refer to the [SQL client](./sql-client.md#supported-methods-on-the-cursor) guide for more information.
 
 ## Connection Handling
 
@@ -171,20 +175,9 @@ Note: `delta` tables are by default on autorefresh which is implemented by delta
 
 ## Advanced usage
 
-### Using custom SQL queries to create `ReadableRelations`
-
-You can use custom SQL queries directly on the dataset to create a `ReadableRelation`:
-
-<!--@@@DLT_SNIPPET ./dataset_snippets/dataset_snippets.py::custom_sql-->
-
-:::note
-When using custom SQL queries with `dataset()`, methods like `limit` and `select` won't work. Include any filtering or column selection directly in your SQL query.
-:::
-
-
-### Loading a `ReadableRelation` into a pipeline table
+### Loading a `Relation` into a pipeline table
 
-Since the `iter_arrow` and `iter_df` methods are generators that iterate over the full `ReadableRelation` in chunks, you can use them as a resource for another (or even the same) `dlt` pipeline:
+Since the `iter_arrow` and `iter_df` methods are generators that iterate over the full `Relation` in chunks, you can use them as a resource for another (or even the same) `dlt` pipeline:
 
 <!--@@@DLT_SNIPPET ./dataset_snippets/dataset_snippets.py::loading_to_pipeline-->
 
@@ -198,7 +191,7 @@ Visit the [Native Ibis integration](./ibis-backend.md) guide to learn more.
 
 - **Memory usage:** Loading full tables into memory without iterating or limiting can consume significant memory, potentially leading to crashes if the dataset is large. Always consider using limits or chunked iteration.
 
-- **Lazy evaluation:** `ReadableDataset` and `ReadableRelation` objects delay data retrieval until necessary. This design improves performance and resource utilization.
+- **Lazy evaluation:** `Dataset` and `Relation` objects delay data retrieval until necessary. This design improves performance and resource utilization.
 
 - **Custom SQL queries:** When executing custom SQL queries, remember that additional methods like `limit()` or `select()` won't modify the query. Include all necessary clauses directly in your SQL statement.
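
Illustration only, not part of the commit: the lazy-loading and chunked-iteration behavior described in the documentation changes above can be sketched with the Python standard library. `LazyRelation`, `iter_chunks`, and the sample `customers` rows are hypothetical stand-ins chosen for this sketch; they are not dlt's `Relation` implementation. The sketch only shows the general idea that building the object runs no query and that chunked reads map onto the cursor's `fetchmany()`.

```python
import sqlite3
from typing import Any, Iterator, List, Tuple


class LazyRelation:
    """Hypothetical stand-in for a lazily evaluated relation (not dlt's class)."""

    def __init__(self, conn: sqlite3.Connection, query: str) -> None:
        # Only the connection and query text are stored; nothing is executed here.
        self.conn = conn
        self.query = query

    def fetchall(self) -> List[Tuple[Any, ...]]:
        # The query runs only when data is actually requested.
        return self.conn.execute(self.query).fetchall()

    def iter_chunks(self, chunk_size: int) -> Iterator[List[Tuple[Any, ...]]]:
        # Stream the result in fixed-size chunks via the cursor's fetchmany().
        cursor = self.conn.execute(self.query)
        while True:
            rows = cursor.fetchmany(chunk_size)
            if not rows:
                break
            yield rows


# Made-up sample data: seven customers in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(i, f"customer_{i}") for i in range(1, 8)],
)

# Creating the relation does not touch the database.
relation = LazyRelation(conn, "SELECT id, name FROM customers ORDER BY id")

# Data is read only here, three rows at a time.
chunk_sizes = [len(chunk) for chunk in relation.iter_chunks(chunk_size=3)]
all_rows = relation.fetchall()
```

With seven rows and a chunk size of three, the last chunk is shorter than the others, which is the usual shape of chunked reads.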

docs/website/docs/general-usage/dataset-access/dataset_snippets/dataset_snippets.py

Lines changed: 15 additions & 6 deletions
@@ -47,7 +47,7 @@ def quick_start_example_snippet(pipeline: dlt.Pipeline) -> None:
     customers_relation = dataset.customers  # Or dataset["customers"]
 
     # Step 3: Fetch the entire table as a Pandas DataFrame
-    df = customers_relation.df()
+    df = customers_relation.df()  # or customers_relation.df(chunk_size=50)
 
     # Alternatively, fetch as a PyArrow Table
     arrow_table = customers_relation.arrow()
@@ -172,8 +172,10 @@ def filter_snippet(default_dataset: dlt.Dataset) -> None:
     # @@@DLT_SNIPPET_START filter
     # Filter by 'id'
     filtered = customers_relation.where("id", "in", [3, 1, 7]).fetchall()
+
     # Filter with raw SQL string
     filtered = customers_relation.where("id = 1").fetchall()
+
     # Filter with sqlglot expression
     import sqlglot.expressions as sge
 
@@ -188,10 +190,13 @@ def filter_snippet(default_dataset: dlt.Dataset) -> None:
 def aggregate_snippet(default_dataset: dlt.Dataset) -> None:
     customers_relation = default_dataset.customers
     # @@@DLT_SNIPPET_START aggregate
+
     # Get max 'id'
     max_id = customers_relation.select("id").max().scalar()
+
     # Get min 'id'
     min_id = customers_relation.select("id").min().scalar()
+
     # @@@DLT_SNIPPET_END aggregate
 
 
@@ -297,11 +302,15 @@ def iterating_with_limit_and_select_snippet(dataset: dlt.Dataset) -> None:
 
 def custom_sql_snippet(dataset: dlt.Dataset) -> None:
     # @@@DLT_SNIPPET_START custom_sql
-    # Join 'customers' and 'purchases' tables
-    custom_relation = dataset(
-        "SELECT * FROM customers JOIN purchases ON customers.id = purchases.customer_id"
-    )
-    arrow_table = custom_relation.arrow()
+    # Join 'customers' and 'purchases' tables and filter by quantity
+    query = """
+    SELECT *
+    FROM customers
+    JOIN purchases
+    ON customers.id = purchases.customer_id
+    WHERE purchases.quantity > 1
+    """
+    joined_relation = dataset(query)
     # @@@DLT_SNIPPET_END custom_sql
 
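
Illustration only, not part of the commit: to see what the new `custom_sql` query selects, the sketch below runs an equivalent join against an in-memory SQLite database. The table contents are made up for the example, and the query lists explicit columns and adds an `ORDER BY` (neither is in the committed snippet) so that the result is deterministic. It uses plain `sqlite3` and does not call dlt.

```python
import sqlite3

# Made-up sample data: two customers and three purchases.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (customer_id INTEGER, quantity INTEGER);
    INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO purchases VALUES (1, 1), (1, 3), (2, 2);
    """
)

# Same join and filter as the committed query; the explicit column list
# and ORDER BY are additions for this sketch only.
query = """
SELECT customers.id, customers.name, purchases.quantity
FROM customers
JOIN purchases
  ON customers.id = purchases.customer_id
WHERE purchases.quantity > 1
ORDER BY customers.id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 'alice', 3), (2, 'bob', 2)]
```

The purchase with `quantity = 1` is filtered out by the `WHERE` clause, so only two of the three joined rows remain.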
