Commit ebdc0f0

data loader updates
1 parent 257374f commit ebdc0f0

8 files changed: +137 -128 lines changed

README.md

Lines changed: 2 additions & 2 deletions
@@ -23,9 +23,9 @@ Transform data and create rich visualizations iteratively with AI 🪄. Try Data
 
 ## News 🔥🔥🔥
 
-- [05-13-2025] Data Formulator 0.2.1: External Data Loader
+- [05-13-2025] Data Formulator 0.2.3: External Data Loader
 - We introduced external data loader class to make import data easier. [Readme](https://github.com/microsoft/data-formulator/tree/main/py-src/data_formulator/data_loader) and [Demo](https://github.com/microsoft/data-formulator/pull/155)
-- Example data loaders from MySQL and Azure Data Explorer (Kusto) are provided.
+- Current data loaders: MySQL, Azure Data Explorer (Kusto), Azure Blob and Amazon S3 (json, parquet, csv).
 - Call for action [link](https://github.com/microsoft/data-formulator/issues/156):
 - Users: let us know which data source you'd like to load data from.
 - Developers: let's build more data loaders.

py-src/data_formulator/data_loader/azure_blob_data_loader.py

Lines changed: 28 additions & 22 deletions
@@ -23,33 +23,39 @@ def list_params() -> List[Dict[str, Any]]:
 
     @staticmethod
     def auth_instructions() -> str:
-        return """**Authentication Options (choose one)**
+        return """Authentication Options (choose one)
 
-**Option 1 - Connection String (Simplest)**
-- Get connection string from Azure Portal > Storage Account > Access keys
-- Use `connection_string` parameter with full connection string
-- `account_name` can be omitted when using connection string
+Option 1 - Connection String (Simplest)
+- Get connection string from Azure Portal > Storage Account > Access keys
+- Use `connection_string` parameter with full connection string
+- `account_name` can be omitted when using connection string
 
-**Option 2 - Account Key**
-- Get account key from Azure Portal > Storage Account > Access keys
-- Use `account_name` + `account_key` parameters
-- Provides full access to storage account
+Option 2 - Account Key
+- Get account key from Azure Portal > Storage Account > Access keys
+- Use `account_name` + `account_key` parameters
+- Provides full access to storage account
 
-**Option 3 - SAS Token (Recommended for limited access)**
-- Generate SAS token from Azure Portal > Storage Account > Shared access signature
-- Use `account_name` + `sas_token` parameters
-- Can be time-limited and permission-scoped
+Option 3 - SAS Token (Recommended for limited access)
+- Generate SAS token from Azure Portal > Storage Account > Shared access signature
+- Use `account_name` + `sas_token` parameters
+- Can be time-limited and permission-scoped
 
-**Option 4 - Credential Chain (Most Secure)**
-- Use `account_name` + `container_name` only (no explicit credentials)
-- Requires Azure CLI login (`az login`) or Managed Identity
-- Default chain: `cli;managed_identity;env`
-- Customize with `credential_chain` parameter
+Option 4 - Credential Chain (Most Secure)
+- Use `account_name` + `container_name` only (no explicit credentials)
+- Requires Azure CLI login (`az login` in terminal) or Managed Identity
+- Default chain: `cli;managed_identity;env`
+- Customize with `credential_chain` parameter
 
-**Additional Options**
-- `endpoint`: Custom endpoint (default: `blob.core.windows.net`)
-- For Azure Government: `blob.core.usgovcloudapi.net`
-- For Azure China: `blob.core.chinacloudapi.cn`"""
+Additional Options
+- `endpoint`: Custom endpoint (default: `blob.core.windows.net`)
+- For Azure Government: `blob.core.usgovcloudapi.net`
+- For Azure China: `blob.core.chinacloudapi.cn`
+
+Supported File Formats:
+- CSV files (.csv)
+- Parquet files (.parquet)
+- JSON files (.json, .jsonl)
+"""
 
     def __init__(self, params: Dict[str, Any], duck_db_conn: duckdb.DuckDBPyConnection):
         self.params = params
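
The four Azure Blob options in this file's instructions differ only in which parameters are filled in. The sketch below shows one way that decision could be expressed. `pick_blob_auth_option` is a hypothetical helper, not code from this commit, and the precedence (connection string first, then account key, then SAS token, then credential chain) is an assumption that simply follows the order in which the options are listed.

```python
from typing import Any, Dict

# Default chain quoted from the instructions text.
DEFAULT_CREDENTIAL_CHAIN = "cli;managed_identity;env"


def _filled(params: Dict[str, Any], key: str) -> bool:
    """True when the parameter is a non-blank string."""
    value = params.get(key)
    return isinstance(value, str) and value.strip() != ""


def pick_blob_auth_option(params: Dict[str, Any]) -> str:
    """Map a parameter dict to one of the four documented options.

    Hypothetical helper for illustration; the precedence order is an assumption.
    """
    if _filled(params, "connection_string"):
        return "connection_string"  # Option 1: account_name may be omitted
    if not _filled(params, "account_name"):
        raise ValueError("account_name is required unless connection_string is set")
    if _filled(params, "account_key"):
        return "account_key"  # Option 2
    if _filled(params, "sas_token"):
        return "sas_token"  # Option 3
    return "credential_chain"  # Option 4: no explicit credentials


def effective_credential_chain(params: Dict[str, Any]) -> str:
    """Return the custom chain if given, otherwise the documented default."""
    if _filled(params, "credential_chain"):
        return params["credential_chain"]
    return DEFAULT_CREDENTIAL_CHAIN
```

With only `account_name` and `container_name` set, the helper falls through to the credential chain, which matches Option 4 as described in the instructions.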

py-src/data_formulator/data_loader/kusto_data_loader.py

Lines changed: 24 additions & 32 deletions
@@ -26,38 +26,30 @@ def list_params() -> bool:
 
     @staticmethod
     def auth_instructions() -> str:
-        return """
-Azure Kusto Authentication Instructions:
-
-This data loader supports two authentication methods:
-
-**Method 1: Azure CLI Authentication (Recommended for development)**
-1. Install Azure CLI: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
-2. Run `az login` in your terminal to authenticate
-3. Ensure you have access to the specified Kusto cluster and database
-4. Leave client_id, client_secret, and tenant_id parameters empty
-
-**Method 2: Application Key Authentication (Recommended for production)**
-1. Register an Azure AD application in your tenant
-2. Generate a client secret for the application
-3. Grant the application appropriate permissions to your Kusto cluster:
-- Go to your Kusto cluster in Azure Portal
-- Navigate to Permissions > Add
-- Add your application as a user with appropriate role (e.g., "AllDatabasesViewer" for read access)
-4. Provide the following parameters:
-- client_id: Application (client) ID from your Azure AD app registration
-- client_secret: Client secret value you generated
-- tenant_id: Directory (tenant) ID from your Azure AD
-
-**Required Parameters:**
-- kusto_cluster: Your Kusto cluster URI (e.g., "https://mycluster.region.kusto.windows.net")
-- kusto_database: Name of the database you want to access
-
-**Troubleshooting:**
-- If authentication fails, ensure you have the correct permissions on the Kusto cluster
-- For CLI auth, make sure you're logged in with `az account show`
-- For app key auth, verify your client_id, client_secret, and tenant_id are correct
-"""
+        return """Azure Kusto Authentication Instructions
+
+Method 1: Azure CLI Authentication
+1. Install Azure CLI: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
+2. Run `az login` in your terminal to authenticate
+3. Ensure you have access to the specified Kusto cluster and database
+4. Leave client_id, client_secret, and tenant_id parameters empty
+
+Method 2: Application Key Authentication
+1. Register an Azure AD application in your tenant
+2. Generate a client secret for the application
+3. Grant the application appropriate permissions to your Kusto cluster:
+- Go to your Kusto cluster in Azure Portal
+- Navigate to Permissions > Add
+- Add your application as a user with appropriate role (e.g., "AllDatabasesViewer" for read access)
+4. Provide the following parameters:
+- client_id: Application (client) ID from your Azure AD app registration
+- client_secret: Client secret value you generated
+- tenant_id: Directory (tenant) ID from your Azure AD
+
+Required Parameters:
+- kusto_cluster: Your Kusto cluster URI (e.g., "https://mycluster.region.kusto.windows.net")
+- kusto_database: Name of the database you want to access
+"""
 
     def __init__(self, params: Dict[str, Any], duck_db_conn: duckdb.DuckDBPyConnection):
 
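
The Kusto instructions imply a simple rule: leave `client_id`, `client_secret`, and `tenant_id` empty for Azure CLI authentication, or supply all three for application key authentication. A minimal sketch of that rule follows. `kusto_auth_method` is a hypothetical helper, and rejecting a partially filled set of application fields is an assumption; the instructions do not say how the loader treats that case.

```python
from typing import Any, Dict

# Parameter names as they appear in the instructions text.
REQUIRED_FIELDS = ("kusto_cluster", "kusto_database")
APP_KEY_FIELDS = ("client_id", "client_secret", "tenant_id")


def kusto_auth_method(params: Dict[str, Any]) -> str:
    """Return "azure_cli" or "app_key" for a parameter dict.

    Hypothetical helper for illustration. Raising on a partial set of
    application fields is an assumption, not documented loader behavior.
    """
    for field in REQUIRED_FIELDS:
        if not params.get(field):
            raise ValueError(f"{field} is required")

    provided = [field for field in APP_KEY_FIELDS if params.get(field)]
    if not provided:
        return "azure_cli"  # Method 1: relies on a prior `az login`
    if len(provided) == len(APP_KEY_FIELDS):
        return "app_key"  # Method 2: Azure AD application credentials

    missing = [field for field in APP_KEY_FIELDS if field not in provided]
    raise ValueError("incomplete application credentials, missing: " + ", ".join(missing))
```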

py-src/data_formulator/data_loader/mysql_data_loader.py

Lines changed: 5 additions & 11 deletions
@@ -23,32 +23,26 @@ def auth_instructions() -> str:
         return """
 MySQL Connection Instructions:
 
-1. **Local MySQL Setup:**
+1. Local MySQL Setup:
 - Ensure MySQL server is running on your machine
 - Default connection: host='localhost', user='root'
 - If you haven't set a root password, leave password field empty
 
-2. **Remote MySQL Connection:**
+2. Remote MySQL Connection:
 - Obtain host address, username, and password from your database administrator
 - Ensure the MySQL server allows remote connections
 - Check that your IP is whitelisted in MySQL's user permissions
 
-3. **Common Connection Parameters:**
+3. Common Connection Parameters:
 - user: Your MySQL username (default: 'root')
 - password: Your MySQL password (leave empty if no password set)
 - host: MySQL server address (default: 'localhost')
 - database: Target database name to connect to
 
-4. **Troubleshooting:**
+4. Troubleshooting:
 - Verify MySQL service is running: `brew services list` (macOS) or `sudo systemctl status mysql` (Linux)
 - Test connection: `mysql -u [username] -p -h [host] [database]`
-- Common issues: Wrong credentials, server not running, firewall blocking connection
-
-5. **Security Notes:**
-- Use dedicated database users with limited privileges for applications
-- Avoid using root user for application connections
-- Consider using SSL connections for remote databases
-"""
+"""
 
     def __init__(self, params: Dict[str, Any], duck_db_conn: duckdb.DuckDBPyConnection):
         self.params = params
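
The troubleshooting step `mysql -u [username] -p -h [host] [database]` can be assembled from the same connection parameters the loader accepts. The sketch below does that with the documented defaults (`root`, `localhost`). `mysql_test_command` is a hypothetical helper for illustration only; it builds the argument list and does not run anything.

```python
import shlex
from typing import Any, Dict, List


def mysql_test_command(params: Dict[str, Any]) -> List[str]:
    """Build the argv for the connection test shown in the instructions.

    Hypothetical helper. `-p` is passed without a value so the mysql
    client prompts for the password instead of exposing it in argv.
    """
    user = params.get("user") or "root"  # documented default
    host = params.get("host") or "localhost"  # documented default
    argv = ["mysql", "-u", user, "-p", "-h", host]
    database = params.get("database")
    if database:
        argv.append(database)
    return argv


if __name__ == "__main__":
    # Print a shell-quoted command line for copy and paste.
    print(shlex.join(mysql_test_command({"database": "sales"})))
```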

py-src/data_formulator/data_loader/s3_data_loader.py

Lines changed: 24 additions & 51 deletions
@@ -22,65 +22,38 @@ def list_params() -> List[Dict[str, Any]]:
     @staticmethod
     def auth_instructions() -> str:
         return """
-To connect to Amazon S3, you'll need the following AWS credentials and configuration:
-
-**Required Parameters:**
+**Required AWS Credentials:**
 - **AWS Access Key ID**: Your AWS access key identifier
 - **AWS Secret Access Key**: Your AWS secret access key
-- **Region Name**: The AWS region where your S3 bucket is located (e.g., 'us-east-1', 'us-west-2')
-- **Bucket**: The name of your S3 bucket
-
-**Optional Parameters:**
-- **AWS Session Token**: Required only if using temporary credentials (e.g., from AWS STS or IAM roles)
-
-**How to Get AWS Credentials:**
-
-1. **AWS IAM User (Recommended for programmatic access):**
-- Go to AWS Console → IAM → Users
-- Create a new user or select existing user
-- Go to "Security credentials" tab
-- Click "Create access key"
-- Choose "Application running outside AWS"
-- Save both the Access Key ID and Secret Access Key securely
-
-2. **Required S3 Permissions:**
-Your IAM user/role needs these permissions for the target bucket:
-```json
-{
-    "Version": "2012-10-17",
-    "Statement": [
-        {
-            "Effect": "Allow",
-            "Action": [
-                "s3:GetObject",
-                "s3:ListBucket"
-            ],
-            "Resource": [
-                "arn:aws:s3:::your-bucket-name",
-                "arn:aws:s3:::your-bucket-name/*"
-            ]
-        }
-    ]
-}
-```
-
-3. **Finding Your Region:**
-- Go to S3 Console → Select your bucket → Properties
-- Look for "AWS Region" in the bucket overview
-
-**Security Best Practices:**
-- Never share your secret access key
-- Use IAM roles when possible instead of long-term access keys
-- Consider using temporary credentials with session tokens for enhanced security
-- Regularly rotate your access keys
-- Use the principle of least privilege for S3 permissions
+- **Region Name**: AWS region (e.g., 'us-east-1', 'us-west-2')
+- **Bucket**: S3 bucket name
+- **AWS Session Token**: Optional, for temporary credentials only
+
+**Getting Credentials:**
+1. AWS Console → IAM → Users → Select user → Security credentials → Create access key
+2. Choose "Application running outside AWS"
+
+**Required S3 Permissions:**
+```json
+{
+    "Version": "2012-10-17",
+    "Statement": [{
+        "Effect": "Allow",
+        "Action": ["s3:GetObject", "s3:ListBucket"],
+        "Resource": [
+            "arn:aws:s3:::your-bucket-name",
+            "arn:aws:s3:::your-bucket-name/*"
+        ]
+    }]
+}
+```
 
 **Supported File Formats:**
 - CSV files (.csv)
 - Parquet files (.parquet)
 - JSON files (.json, .jsonl)
 
-The connector will automatically detect file types and load them appropriately using DuckDB's S3 integration.
+**Security:** Never share secret keys, rotate regularly, use least privilege permissions.
 """
 
     def __init__(self, params: Dict[str, Any], duck_db_conn: duckdb.DuckDBPyConnection):
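
The policy in the S3 instructions uses the placeholder `your-bucket-name`. As a small sketch, the same document can be generated for a concrete bucket. `s3_read_policy` is a hypothetical helper, not part of this commit. The two resource entries are both needed because `s3:ListBucket` applies to the bucket ARN while `s3:GetObject` applies to the object ARNs under it.

```python
import json
from typing import Any, Dict


def s3_read_policy(bucket: str) -> Dict[str, Any]:
    """Build the read-only policy from the instructions for one bucket.

    Hypothetical helper for illustration only.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",    # the bucket itself (s3:ListBucket)
                f"arn:aws:s3:::{bucket}/*",  # objects in the bucket (s3:GetObject)
            ],
        }],
    }


if __name__ == "__main__":
    # Print a policy ready to paste into the IAM console.
    print(json.dumps(s3_read_policy("your-bucket-name"), indent=2))
```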

py-src/data_formulator/tables_routes.py

Lines changed: 4 additions & 1 deletion
@@ -728,7 +728,10 @@ def data_loader_list_data_loaders():
     return jsonify({
         "status": "success",
         "data_loaders": {
-            name: data_loader.list_params()
+            name: {
+                "params": data_loader.list_params(),
+                "auth_instructions": data_loader.auth_instructions()
+            }
             for name, data_loader in DATA_LOADERS.items()
         }
     })
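
This hunk changes the response shape: each entry under `data_loaders` was previously the parameter list itself and is now an object with `params` and `auth_instructions`. The sketch below shows how a consumer might read the new shape while tolerating the old one. The payload is made up for illustration (the loader name and the keys inside each parameter entry are assumptions), and `loader_params` and `loader_instructions` are hypothetical helpers.

```python
from typing import Any, Dict, List

# Illustrative payload in the new shape; values are invented for the example.
NEW_RESPONSE: Dict[str, Any] = {
    "status": "success",
    "data_loaders": {
        "mysql": {
            "params": [{"name": "host", "default": "localhost"}],
            "auth_instructions": "MySQL Connection Instructions: ...",
        }
    },
}


def loader_params(payload: Dict[str, Any], name: str) -> List[Dict[str, Any]]:
    """Return the parameter list for a loader from either response shape."""
    entry = payload["data_loaders"][name]
    # Old shape: the entry is the list. New shape: the list sits under "params".
    return entry if isinstance(entry, list) else entry["params"]


def loader_instructions(payload: Dict[str, Any], name: str) -> str:
    """Return the auth instructions, or an empty string for the old shape."""
    entry = payload["data_loaders"][name]
    return "" if isinstance(entry, list) else entry.get("auth_instructions", "")
```

Any caller that indexes `data_loaders[name]` directly as a list would need an adjustment along these lines after this change.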

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "data_formulator"
-version = "0.2.1.2"
+version = "0.2.1.3"
 
 requires-python = ">=3.9"
 authors = [
