---
sidebar_position: 4.7
author: Mohan Talla

page_last_update: February 5th, 2025
description: Transferring CSV, JSON, and Parquet data from Azure Blob Storage to Teradata Vantage with dagster-teradata
keywords: [data warehouses, teradata, vantage, transfer, cloud data platform, object storage, business intelligence, enterprise analytics, dagster, dagster-teradata, microsoft azure blob storage]
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import ClearscapeDocsNote from '../_partials/vantage_clearscape_analytics.mdx'

# Data Transfer from Azure Blob to Teradata Vantage Using dagster-teradata

## Overview

This document provides instructions and guidance for transferring data in CSV, JSON, and Parquet formats from Microsoft Azure Blob Storage to Teradata Vantage using **dagster-teradata**. It outlines the setup, configuration, and execution steps required to establish a seamless data transfer pipeline between these platforms.

:::note
Use [the Windows Subsystem for Linux (WSL)](https://learn.microsoft.com/en-us/windows/wsl/install) on Windows to try this quickstart example.
:::

## Prerequisites

* Access to a Teradata Vantage instance.

<ClearscapeDocsNote />

* Python **3.9** or higher; Python **3.12** is recommended.
* pip

## Setting Up a Virtual Environment

A virtual environment is recommended to isolate project dependencies and avoid conflicts with system-wide Python packages. Here’s how to set it up:

<InstallTabs/>
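If the tabbed instructions above do not render, the equivalent manual steps on Linux/macOS look roughly like this (the environment name `dagster-env` is just an example):

```shell
# Create an isolated environment for this quickstart (directory name is just an example)
python3 -m venv dagster-env

# Activate it (bash/zsh; on Windows PowerShell use dagster-env\Scripts\Activate.ps1)
source dagster-env/bin/activate

# All subsequent pip installs now go into dagster-env
```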

## Install dagster and dagster-teradata

With your virtual environment active, the next step is to install dagster and the Teradata provider package (dagster-teradata) to interact with Teradata Vantage.

1. Install the Required Packages:

```bash
pip install dagster dagster-webserver dagster-teradata[azure]
```

2. Verify the Installation:
<br/>
<br/>
To confirm that Dagster is correctly installed, run:
```bash
dagster --version
```
If installed correctly, it should show the version of Dagster.

## Initialize a Dagster Project

Now that you have the necessary packages installed, the next step is to create a new Dagster project.

### Scaffold a New Dagster Project

Run the following command:

```bash
dagster project scaffold --name dagster-teradata-azure
```

This command creates a new project named dagster-teradata-azure and automatically generates the following directory structure:

```bash
dagster-teradata-azure
│   pyproject.toml
│   README.md
│   setup.cfg
│   setup.py
│
├───dagster_teradata_azure
│       assets.py
│       definitions.py
│       __init__.py
│
└───dagster_teradata_azure_tests
        test_assets.py
        __init__.py
```

Refer [here](https://docs.dagster.io/guides/build/projects/dagster-project-file-reference) to learn more about this directory structure.

You need to modify the `definitions.py` file inside the `dagster_teradata_azure` package directory.

### Step 1: Open `definitions.py` in the `dagster-teradata-azure/dagster_teradata_azure` Directory

Locate and open the file where Dagster job definitions are configured.
This file manages resources, jobs, and assets needed for the Dagster project.

### Step 2: Implement Azure to Teradata Transfer in Dagster

```python
import os

from dagster import job, op, Definitions, DagsterError
from dagster_azure.adls2 import ADLS2Resource, ADLS2SASToken
from dagster_teradata import TeradataResource

# Azure resource: leave storage_account and the SAS token empty for a public
# container; fill both in to access a private container.
azure_resource = ADLS2Resource(
    storage_account="",
    credential=ADLS2SASToken(token=""),
)

# Teradata resource: connection details are read from environment variables.
td_resource = TeradataResource(
    host=os.getenv("TERADATA_HOST"),
    user=os.getenv("TERADATA_USER"),
    password=os.getenv("TERADATA_PASSWORD"),
    database=os.getenv("TERADATA_DATABASE"),
)

@op(required_resource_keys={"teradata"})
def drop_existing_table(context):
    # Drop the target table so the transfer starts from a clean slate.
    context.resources.teradata.drop_table("people")
    return "Tables Dropped"

@op(required_resource_keys={"teradata", "azure"})
def ingest_azure_to_teradata(context, status):
    if status == "Tables Dropped":
        # Load CSV data from the public Azure Blob Storage path into the
        # "people" table; the final True marks the bucket as public.
        context.resources.teradata.azure_blob_to_teradata(
            azure_resource,
            "/az/akiaxox5jikeotfww4ul.blob.core.windows.net/td-usgs/CSVDATA/09380000/2018/06/",
            "people",
            True,
        )
    else:
        raise DagsterError("Tables not dropped")

@job(resource_defs={"teradata": td_resource, "azure": azure_resource})
def example_job():
    ingest_azure_to_teradata(drop_existing_table())

defs = Definitions(
    jobs=[example_job],
)
```

##### Explanation of Code

1. **Resource Setup**:
   - The code sets up two resources: one for **Azure Data Lake Storage** (ADLS2) and one for **Teradata**.
   - **Azure Blob Storage**:
     - For a **public bucket**, the `storage_account` and `credential` (SAS token) are left empty.
     - For a **private bucket**, the `storage_account` (Azure Storage account name) and a valid SAS `credential` are required for access.
   - **Teradata resource**: `td_resource` is configured using credentials pulled from environment variables (`TERADATA_HOST`, `TERADATA_USER`, `TERADATA_PASSWORD`, `TERADATA_DATABASE`).

2. **Operations**:
   - **`drop_existing_table`**: This operation drops the "people" table in Teradata using the Teradata resource.
   - **`ingest_azure_to_teradata`**: This operation checks whether the "people" table was successfully dropped. If so, it loads data from Azure Blob Storage into Teradata using the `azure_blob_to_teradata` method, which fetches data from the specified Azure Blob Storage path.

3. **Job Execution**:
   - The **`example_job`** runs the operations in sequence. First, it drops the table, and if successful, it transfers data from the Azure Blob Storage (either public or private) to Teradata.

This setup allows for dynamic handling of both **public** and **private Azure Blob Storage** configurations while transferring data into Teradata.
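The handoff between the two ops is plain Python under the hood: the first op's return value becomes the second op's `status` input, and the check guards the load. A minimal sketch of that pattern without Dagster (the function bodies here are illustrative stand-ins):

```python
# Stand-ins for the two ops above; no Dagster or Teradata involved.

def drop_existing_table():
    # The real op calls teradata.drop_table("people") here.
    return "Tables Dropped"

def ingest_azure_to_teradata(status):
    if status == "Tables Dropped":
        # The real op calls azure_blob_to_teradata(...) here.
        return "Data Loaded"
    raise RuntimeError("Tables not dropped")

# Wiring one function's output into the other mirrors example_job().
result = ingest_azure_to_teradata(drop_existing_table())
print(result)  # -> Data Loaded
```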

## Running the Pipeline

After setting up the project, you can now run your Dagster pipeline:

1. **Start the Dagster Dev Server:** In your terminal, navigate to the root directory of your project and run:
```bash
dagster dev
```
After executing the command, the Dagster logs will be displayed directly in the terminal. Any errors encountered during startup will also be logged here. Once you see a message similar to:
```bash
2025-02-04 09:15:46 +0530 - dagster-webserver - INFO - Serving dagster-webserver on http://127.0.0.1:3000 in process 32564
```
it indicates that the Dagster web server is running successfully. At this point, you can proceed to the next step.
<br/>
<br/>
2. **Access the Dagster UI:** Open a web browser and navigate to http://127.0.0.1:3000. This will open the Dagster UI where you can manage and monitor your pipelines.
<br/>
<br/>
![dagster-teradata-azure1.png](../images/dagster/dagster-teradata-azure1.png)

In the Dagster UI, you will see the following:

- The job **`example_job`** is displayed.
- In the middle, you can view the **lineage** of each `@op`, showing its dependencies and how each operation is related to others.

![dagster-teradata-azure2.png](../images/dagster/dagster-teradata-azure2.png)

Go to the **"Launchpad"** and provide the configuration for the **TeradataResource** as follows:

```yaml
resources:
  teradata:
    config:
      host: <host>
      user: <user>
      password: <password>
      database: <database>
```

Replace `<host>`, `<user>`, `<password>`, and `<database>` with the actual hostname and credentials of your Teradata Vantage instance.
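To avoid typing credentials into the Launchpad, Dagster's run config can also read values from environment variables via the `env:` form, assuming the resource's config fields accept Dagster's `StringSource` (an assumption here, using the same variables exported earlier):

```yaml
resources:
  teradata:
    config:
      host:
        env: TERADATA_HOST
      user:
        env: TERADATA_USER
      password:
        env: TERADATA_PASSWORD
      database:
        env: TERADATA_DATABASE
```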

Once the configuration is done, click on **"Launch Run"** to start the process.

![dagster-teradata-azure3.png](../images/dagster/dagster-teradata-azure3.png)

The Dagster UI allows you to visualize the pipeline's progress, view logs, and inspect the status of each step.

## Arguments Supported by `azure_blob_to_teradata`

- **azure (ADLS2Resource)**:
  The `ADLS2Resource` object used to interact with Azure Blob Storage.

- **blob_source_key (str)**:
  The URI specifying the location of the Azure Blob object. The format is:
  `/az/YOUR-STORAGE-ACCOUNT.blob.core.windows.net/YOUR-CONTAINER/YOUR-BLOB-LOCATION`
  For more details, refer to the Teradata documentation:
  [Teradata Documentation - Native Object Store](https://docs.teradata.com/search/documents?query=native+object+store&sort=last_update&virtual-field=title_only&content-lang=en-US)

- **teradata_table (str)**:
  The name of the Teradata table where the data will be loaded.

- **public_bucket (bool, optional)**:
  Indicates whether the Azure Blob container is public. If `True`, the objects in the container can be accessed without authentication.
  Defaults to `False`.

- **teradata_authorization_name (str, optional)**:
  The name of the Teradata Authorization Database Object used to control access to the Azure Blob object store. This is required for secure access to private containers.
  Defaults to an empty string.
  For more details, refer to the documentation:
  [Teradata Vantage Native Object Store - Setting Up Access](https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-Native-Object-Store-Getting-Started-Guide-17.20/Setting-Up-Access/Controlling-Foreign-Table-Access-with-an-AUTHORIZATION-Object)
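The `blob_source_key` format above is easy to get wrong by hand. A small helper that assembles it from its parts (a hypothetical convenience, not part of dagster-teradata) might look like:

```python
def build_blob_source_key(storage_account: str, container: str, blob_path: str) -> str:
    """Assemble the /az/... URI that azure_blob_to_teradata expects.

    Hypothetical helper; not part of the dagster-teradata API.
    """
    return (
        f"/az/{storage_account}.blob.core.windows.net/"
        f"{container}/{blob_path.strip('/')}/"
    )

# Reproduces the path used in the definitions.py example above.
key = build_blob_source_key(
    "akiaxox5jikeotfww4ul", "td-usgs", "CSVDATA/09380000/2018/06"
)
print(key)
# -> /az/akiaxox5jikeotfww4ul.blob.core.windows.net/td-usgs/CSVDATA/09380000/2018/06/
```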

## Transfer data from Private Blob Storage Container to Teradata instance

To successfully transfer data from a Private Blob Storage Container to a Teradata instance, the following prerequisites are necessary.

* An Azure account. You can start with a [free account](https://azure.microsoft.com/free/).
* Create an [Azure storage account](https://docs.microsoft.com/en-us/azure/storage/common/storage-quickstart-create-account?tabs=azure-portal)
* Create a [blob container](https://learn.microsoft.com/en-us/azure/storage/blobs/blob-containers-portal) under the Azure storage account
* [Upload](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-portal) CSV/JSON/Parquet format files to the blob container
* Create a Teradata Authorization object with the Azure Blob Storage Account and the Account Secret Key

```sql
CREATE AUTHORIZATION azure_authorization USER 'azuretestquickstart' PASSWORD 'AZURE_BLOB_ACCOUNT_SECRET_KEY'
```

:::note
Replace `AZURE_BLOB_ACCOUNT_SECRET_KEY` with the Azure storage account `azuretestquickstart` [access key](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-keys-manage?toc=%2Fazure%2Fstorage%2Fblobs%2Ftoc.json&bc=%2Fazure%2Fstorage%2Fblobs%2Fbreadcrumb%2Ftoc.json&tabs=azure-portal)
:::
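With the authorization object in place, the transfer call in `definitions.py` would pass the authorization name instead of marking the bucket public. Sketched here as a plain dict of keyword arguments (names taken from the argument list above; in the real op they would be passed to `context.resources.teradata.azure_blob_to_teradata`):

```python
# Parameters for a private-container transfer, per the argument list in this
# guide. Illustrative only: the placeholder blob_source_key must be replaced
# with a real /az/... URI.
private_transfer_kwargs = {
    "blob_source_key": "/az/YOUR-STORAGE-ACCOUNT.blob.core.windows.net/YOUR-CONTAINER/YOUR-BLOB-LOCATION",
    "teradata_table": "people",
    "public_bucket": False,  # private container, so authentication is required
    "teradata_authorization_name": "azure_authorization",  # object created above
}
```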

## Summary

This guide showed how to use dagster-teradata to transfer CSV, JSON, and Parquet data from Microsoft Azure Blob Storage to Teradata Vantage, enabling streamlined data operations between these platforms.

## Further reading

* [Teradata Authorization](https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Data-Definition-Language-Syntax-and-Examples/Authorization-Statements-for-External-Routines/CREATE-AUTHORIZATION-and-REPLACE-AUTHORIZATION)
