
Commit 0d3e6c0

[deploy] Merge pull request #155 from microsoft/dev

External data loader class

2 parents 6433640 + 16c2ea9, commit 0d3e6c0

27 files changed: +1602 additions, −504 deletions

.env.template

Lines changed: 1 addition & 13 deletions

```diff
@@ -5,16 +5,4 @@
 DISABLE_DISPLAY_KEYS=false # if true, the display keys will not be shown in the frontend
 EXEC_PYTHON_IN_SUBPROCESS=false # if true, the python code will be executed in a subprocess to avoid crashing the main app, but it will increase the time of response
 
-LOCAL_DB_DIR= # the directory to store the local database, if not provided, the app will use the temp directory
-
-# External database connection settings
-# check https://duckdb.org/docs/stable/extensions/mysql.html
-# and https://duckdb.org/docs/stable/extensions/postgres.html
-USE_EXTERNAL_DB=false # if true, the app will use an external database instead of the one in the app
-DB_NAME=mysql_db # the name to refer to this database connection
-DB_TYPE=mysql # mysql or postgresql
-DB_HOST=localhost
-DB_PORT=0
-DB_DATABASE=mysql
-DB_USER=root
-DB_PASSWORD=
+LOCAL_DB_DIR= # the directory to store the local database, if not provided, the app will use the temp directory
```
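The fallback described in the `LOCAL_DB_DIR` comment can be sketched as follows (an illustrative sketch only; `resolve_local_db_dir` is a hypothetical name, not a function in the app):

```python
import os
import tempfile

def resolve_local_db_dir() -> str:
    # Sketch of the documented LOCAL_DB_DIR behavior: if the variable is
    # unset or empty, fall back to the system temp directory.
    value = os.environ.get("LOCAL_DB_DIR", "").strip()
    return value if value else tempfile.gettempdir()
```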

README.md

Lines changed: 9 additions & 0 deletions

```diff
@@ -8,6 +8,7 @@
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 
 [![YouTube](https://img.shields.io/badge/YouTube-white?logo=youtube&logoColor=%23FF0000)](https://youtu.be/3ndlwt0Wi3c) 
 [![build](https://github.com/microsoft/data-formulator/actions/workflows/python-build.yml/badge.svg)](https://github.com/microsoft/data-formulator/actions/workflows/python-build.yml)
+[![Discord](https://img.shields.io/badge/discord-chat-green?logo=discord)](https://discord.gg/mYCZMQKYZb)
 
 </div>
 
@@ -22,6 +23,14 @@ Transform data and create rich visualizations iteratively with AI 🪄. Try Data
 
 ## News 🔥🔥🔥
 
+- [05-13-2025] Data Formulator 0.2.1: External Data Loader
+  - We introduced an external data loader class to make importing data easier. [Readme](https://github.com/microsoft/data-formulator/tree/main/py-src/data_formulator/data_loader) and [Demo](https://github.com/microsoft/data-formulator/pull/155)
+  - Example data loaders for MySQL and Azure Data Explorer (Kusto) are provided.
+  - Call to action [link](https://github.com/microsoft/data-formulator/issues/156):
+    - Users: let us know which data source you'd like to load data from.
+    - Developers: let's build more data loaders.
+  - Discord channel for discussions: join us! [![Discord](https://img.shields.io/badge/discord-chat-green?logo=discord)](https://discord.gg/mYCZMQKYZb)
+
 - [04-23-2025] Data Formulator 0.2: working with *large* data 📦📦📦
   - Explore large data by:
     1. Upload large data file to the local database (powered by [DuckDB](https://github.com/duckdb/duckdb)).
```

package.json

Lines changed: 3 additions & 3 deletions

```diff
@@ -4,11 +4,11 @@
   "version": "0.1.0",
   "private": true,
   "dependencies": {
-    "@emotion/react": "^11.9.0",
-    "@emotion/styled": "^11.8.1",
+    "@emotion/react": "^11.14.0",
+    "@emotion/styled": "^11.14.0",
     "@fontsource/roboto": "^4.5.5",
     "@mui/icons-material": "^5.14.0",
-    "@mui/material": "^5.6.0",
+    "@mui/material": "^7.0.2",
     "@reduxjs/toolkit": "^1.8.6",
     "@types/dompurify": "^3.0.5",
     "@types/validator": "^13.12.2",
```

py-src/data_formulator/agent_routes.py

Lines changed: 23 additions & 2 deletions

```diff
@@ -29,7 +29,7 @@
 from data_formulator.agents.agent_data_load import DataLoadAgent
 from data_formulator.agents.agent_data_clean import DataCleanAgent
 from data_formulator.agents.agent_code_explanation import CodeExplanationAgent
-
+from data_formulator.agents.agent_query_completion import QueryCompletionAgent
 from data_formulator.agents.client_utils import Client
 
 from data_formulator.db_manager import db_manager
@@ -437,4 +437,25 @@ def request_code_expl():
         expl = code_expl_agent.run(input_tables, code)
     else:
         expl = ""
-    return expl
+    return expl
+
+@agent_bp.route('/query-completion', methods=['POST'])
+def query_completion():
+    if request.is_json:
+        logger.info("# request data: ")
+        content = request.get_json()
+
+        client = get_client(content['model'])
+
+        data_source_metadata = content["data_source_metadata"]
+        query = content["query"]
+
+
+        query_completion_agent = QueryCompletionAgent(client=client)
+        reasoning, query = query_completion_agent.run(data_source_metadata, query)
+        response = flask.jsonify({ "token": "", "status": "ok", "reasoning": reasoning, "query": query })
+    else:
+        response = flask.jsonify({ "token": "", "status": "error", "reasoning": "unable to complete query", "query": "" })
+
+    response.headers.add('Access-Control-Allow-Origin', '*')
+    return response
```
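A client of the new endpoint sends a JSON body with the three fields the route reads (`model`, `data_source_metadata`, `query`) and receives a `{token, status, reasoning, query}` object back. A hedged sketch of the payload shape (the `model` value is a placeholder; the real value is whatever `get_client` expects):

```python
import json

# Hypothetical request body for POST /query-completion; field names match
# the keys the route reads above.
payload = {
    "model": "placeholder-model",
    "data_source_metadata": {
        "data_source": "sales_db",
        "tables": {"orders": ["id", "total", "created_at"]},
    },
    "query": "total sales per month",
}
body = json.dumps(payload)

# On success the route responds with exactly these four fields:
ok_response_keys = {"token", "status", "reasoning", "query"}
```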

py-src/data_formulator/agents/agent_py_data_rec.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -165,7 +165,7 @@ def process_gpt_response(self, input_tables, messages, response):
         if result['status'] == 'ok':
             result_df = result['content']
             result['content'] = {
-                'rows': result_df.to_dict(orient='records'),
+                'rows': json.loads(result_df.to_json(orient='records')),
            }
         else:
             logger.info(result['content'])
```

py-src/data_formulator/agents/agent_py_data_transform.py

Lines changed: 1 addition & 3 deletions

```diff
@@ -221,13 +221,11 @@ def process_gpt_response(self, input_tables, messages, response):
         result = py_sandbox.run_transform_in_sandbox2020(code_str, [pd.DataFrame.from_records(t['rows']) for t in input_tables], self.exec_python_in_subprocess)
         result['code'] = code_str
 
-        print(f"result: {result}")
-
         if result['status'] == 'ok':
             # parse the content
             result_df = result['content']
             result['content'] = {
-                'rows': result_df.to_dict(orient='records'),
+                'rows': json.loads(result_df.to_json(orient='records')),
             }
         else:
             logger.info(result['content'])
```
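The switch from `to_dict` to a `to_json` round-trip (here and in `agent_py_data_rec.py`) matters because `to_dict(orient='records')` keeps `NaN` and pandas `Timestamp` objects that the standard JSON encoder cannot serialize, while `to_json` emits `null` and epoch timestamps. A small demonstration, assuming pandas' default `to_json` settings:

```python
import json
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan],
    "t": pd.to_datetime(["2025-05-13", "2025-05-14"]),
})

# to_dict keeps NaN and pandas Timestamp objects as-is
rows_dict = df.to_dict(orient="records")

# to_json + json.loads yields JSON-safe values: NaN -> None,
# timestamps -> epoch milliseconds (pandas' default date_format)
rows_json = json.loads(df.to_json(orient="records"))
```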
py-src/data_formulator/agents/agent_query_completion.py

Lines changed: 80 additions & 0 deletions

````python
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

import pandas as pd
import json

from data_formulator.agents.agent_utils import extract_code_from_gpt_response, extract_json_objects
import re
import logging


logger = logging.getLogger(__name__)


SYSTEM_PROMPT = '''You are a data scientist to help with data queries.
The user will provide you with a description of the data source and tables available in the [DATA SOURCE] section and a query in the [USER INPUTS] section.
You will need to help the user complete the query and provide reasoning for the query you generated in the [OUTPUT] section.

Input format:
* The data source description is a json object with the following fields:
    * `data_source`: the name of the data source
    * `tables`: a list of tables in the data source, which maps the table name to the list of columns available in the table.
* The user input is a natural language description of the query or a partial query you need to complete.

Steps:
* Based on data source description and user input, you should first decide on what language should be used to query the data.
* Then, describe the logic for the query you generated in a json object in a block ```json``` with the following fields:
    * `language`: the language of the query you generated
    * `tables`: the names of the tables you will use in the query
    * `logic`: the reasoning behind why you chose the tables and the logic for the query you generated
* Finally, generate the complete query in the language specified in a code block ```{language}```.

Output format:
* The output should be in the following format, no other text should be included:

[REASONING]
```json
{
    "language": {language},
    "tables": {tables},
    "logic": {logic}
}
```

[QUERY]
```{language}
{query}
```
'''

class QueryCompletionAgent(object):

    def __init__(self, client):
        self.client = client

    def run(self, data_source_metadata, query):

        user_query = f"[DATA SOURCE]\n\n{json.dumps(data_source_metadata, indent=2)}\n\n[USER INPUTS]\n\n{query}\n\n[REASONING]\n"

        logger.info(user_query)

        messages = [{"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": user_query}]

        ###### the part that calls open_ai
        response = self.client.get_completion(messages=messages)
        response_content = '[REASONING]\n' + response.choices[0].message.content

        logger.info(f"=== query completion output ===>\n{response_content}\n")

        reasoning = extract_json_objects(response_content.split("[REASONING]")[1].split("[QUERY]")[0].strip())[0]
        output_query = response_content.split("[QUERY]")[1].strip()

        # Extract the query by removing the language markers
        language_pattern = r"```(\w+)\s+(.*?)```"
        match = re.search(language_pattern, output_query, re.DOTALL)
        if match:
            output_query = match.group(2).strip()

        return reasoning, output_query
````

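The query-extraction step at the end of `run` can be checked in isolation: the model's [QUERY] section wraps the query in a fenced block with a language tag, and the regex strips the fence. A stand-alone example of that extraction:

```python
import re

# Stand-alone check of the fence-stripping regex used in QueryCompletionAgent.run
response_tail = "```sql\nSELECT month, SUM(total) AS total FROM orders GROUP BY month\n```"

language_pattern = r"```(\w+)\s+(.*?)```"
match = re.search(language_pattern, response_tail, re.DOTALL)
language = match.group(1)
output_query = match.group(2).strip()
```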
py-src/data_formulator/app.py

Lines changed: 1 addition & 0 deletions

```diff
@@ -37,6 +37,7 @@
 from data_formulator.tables_routes import tables_bp
 from data_formulator.agent_routes import agent_bp
 
+
 app = Flask(__name__, static_url_path='', static_folder=os.path.join(APP_ROOT, "dist"))
 app.secret_key = secrets.token_hex(16) # Generate a random secret key for sessions
 
```
py-src/data_formulator/data_loader/README.md

Lines changed: 36 additions & 0 deletions

## Data Loader Module

This module provides a framework for loading data from various external sources into DuckDB. It follows an abstract base class pattern to ensure consistent implementation across different data sources.

### Building a New Data Loader

The abstract class `ExternalDataLoader` defines the data loader interface. Each concrete implementation (e.g., `KustoDataLoader`, `MySQLDataLoader`) handles specific data source connections and data ingestion.

To create a new data loader:

1. Create a new class that inherits from `ExternalDataLoader`.
2. Implement the required abstract methods:
    - `list_params()`: Define required connection parameters
    - `__init__()`: Initialize connection to data source
    - `list_tables()`: List available tables/views
    - `ingest_data()`: Load data from source
    - `view_query_sample()`: Preview query results
    - `ingest_data_from_query()`: Load data from custom query
3. Register the new class in `__init__.py` so that the front-end can automatically discover the new data loader.

The UI automatically provides a query completion option to help users generate queries for the given data loader (from natural language or partial queries).

### Example Implementations

- `KustoDataLoader`: Azure Data Explorer (Kusto) integration
- `MySQLDataLoader`: MySQL database integration

### Testing

Ensure your implementation:
- Handles connection errors gracefully
- Properly sanitizes table names
- Respects size limits for data ingestion
- Returns a consistent metadata format

Launch the front-end and test the data loader.
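The steps above can be sketched as follows. This is an illustrative toy, not the real interface: the method names come from the list above, but the actual signatures live in `ExternalDataLoader`, so a minimal stand-in base class is defined here to keep the sketch self-contained.

```python
from abc import ABC, abstractmethod

class ExternalDataLoaderSketch(ABC):
    """Stand-in for ExternalDataLoader; method names follow the README list,
    signatures are illustrative assumptions."""

    @staticmethod
    @abstractmethod
    def list_params(): ...

    @abstractmethod
    def list_tables(self): ...

    @abstractmethod
    def ingest_data(self, table_name): ...

    @abstractmethod
    def view_query_sample(self, query): ...

    @abstractmethod
    def ingest_data_from_query(self, query, table_name): ...

class InMemoryDataLoader(ExternalDataLoaderSketch):
    """Toy loader over a dict of row lists, standing in for a real source."""

    @staticmethod
    def list_params():
        # Define required connection parameters
        return [{"name": "source", "type": "string", "required": True}]

    def __init__(self, tables):
        # Initialize "connection": here just a {table_name: [row_dict, ...]} map
        self.tables = tables

    def list_tables(self):
        # List available tables with their column names
        return [{"name": name, "columns": sorted({k for row in rows for k in row})}
                for name, rows in self.tables.items()]

    def ingest_data(self, table_name):
        # Load all rows from the source table
        return self.tables[table_name]

    def view_query_sample(self, query):
        # Preview results; the toy "query" is just a table name
        return self.ingest_data(query)[:10]

    def ingest_data_from_query(self, query, table_name):
        # Materialize query results under a new table name
        self.tables[table_name] = self.ingest_data(query)
        return self.tables[table_name]
```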
py-src/data_formulator/data_loader/__init__.py

Lines changed: 10 additions & 0 deletions

```python
from data_formulator.data_loader.external_data_loader import ExternalDataLoader
from data_formulator.data_loader.mysql_data_loader import MySQLDataLoader
from data_formulator.data_loader.kusto_data_loader import KustoDataLoader

DATA_LOADERS = {
    "mysql": MySQLDataLoader,
    "kusto": KustoDataLoader
}

__all__ = ["ExternalDataLoader", "MySQLDataLoader", "KustoDataLoader", "DATA_LOADERS"]
```
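The `DATA_LOADERS` registry maps a data-source type string to its loader class, which lets route code instantiate loaders generically. A hedged sketch of such a lookup (`make_loader` is hypothetical, not a function in the codebase):

```python
def make_loader(source_type, params, registry):
    # Hypothetical helper: look up the loader class registered for this
    # data-source type and instantiate it with the given parameters.
    loader_cls = registry.get(source_type)
    if loader_cls is None:
        raise ValueError(f"unknown data source type: {source_type}")
    return loader_cls(**params)
```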
