.env.template
Lines changed: 2 additions & 1 deletion
@@ -6,7 +6,8 @@ DISABLE_DISPLAY_KEYS=false # if true, the display keys will not be shown in the
 EXEC_PYTHON_IN_SUBPROCESS=false # if true, the python code will be executed in a subprocess to avoid crashing the main app, but it will increase the time of response
 
 # External database connection settings
-# check https://duckdb.org/docs/stable/extensions/mysql.html and https://duckdb.org/docs/stable/extensions/postgres.html
py-src/data_formulator/agents/agent_sql_data_transform.py
Lines changed: 44 additions & 16 deletions
@@ -9,13 +9,13 @@
 import pandas as pd
 
 import logging
-
+import re
 # Replace/update the logger configuration
 logger = logging.getLogger(__name__)
 
 SYSTEM_PROMPT = '''You are a data scientist to help user to transform data that will be used for visualization.
 The user will provide you information about what data would be needed, and your job is to create a sql query based on the input data summary, transformation instruction and expected fields.
-The users' instruction includes "expected fields" that the user want for visualization, and natural language instructions "goal" that describe what data is needed.
+The users' instruction includes "visualization_fields" that the user want for visualization, and natural language instructions "goal" that describe what data is needed.
 
 **Important:**
 - NEVER make assumptions or judgments about a person's gender, biological sex, sexuality, religion, race, nationality, ethnicity, political stance, socioeconomic status, mental health, invisible disabilities, medical conditions, personality type, social impressions, emotional state, and cognitive state.
@@ -24,15 +24,22 @@
 
 Concretely, you should first refine users' goal and then create a sql query in the [OUTPUT] section based off the [CONTEXT] and [GOAL]:
 
-1. First, refine users' [GOAL]. The main objective in this step is to check if "visualization_fields" provided by the user are sufficient to achieve their "goal". Concretely:
-(1) based on the user's "goal", elaborate the goal into a "detailed_instruction".
+1. First, refine users' [GOAL]. The main objective in this step is to decide data transformation based on the user's goal.
+Concretely:
+(1) based on the user's "goal" and provided "visualization_fields", elaborate the goal into a "detailed_instruction".
+- first elaborate which fields the user wants to visualize based on "visualization_fields";
+- then, elaborate the goal into a "detailed_instruction" contextualized with the provided "visualization_fields".
+* note: try to distinguish whether the user wants to filter the data with some conditions, or they want to aggregate data based on some fields.
+* e.g., filter data to show all items from top 20 categories based on their average values, is different from showing the top 20 categories with their average values
 (2) determine "output_fields", the desired fields that the output data should have to achieve the user's goal, it's a good idea to include intermediate fields here.
-(2) now, determine whether the user has provided sufficient fields in "visualization_fields" that are needed to achieve their goal:
-- if the user's "visualization_fields" are sufficient, simply copy it.
+- note: when the user asks for filtering the data, include all fields that are needed to filter the data in "output_fields" (as well as other fields the user asked for or necessary in computation).
+(3) now, determine whether the user has provided sufficient fields in "visualization_fields" that are needed to achieve their goal:
+- if the user's "visualization_fields" are sufficient, simply copy it from user input.
 - if the user didn't provide sufficient fields in "visualization_fields", add missing fields in "visualization_fields" (ordered them based on whether the field will be used in x,y axes or legends);
 - "visualization_fields" should only include fields that will be visualized (do not include other intermediate fields from "output_fields")
 - when adding new fields to "visualization_fields", be efficient and add only a minimal number of fields that are needed to achieve the user's goal. generally, the total number of fields in "visualization_fields" should be no more than 3 for x,y,legend.
-
+- if the user's goal is to filter the data, include all fields that are needed to filter the data in "output_fields" (as well as other fields the user asked for or necessary in computation).
+- all existing fields user provided in "visualization_fields" should be included in "visualization_fields" list.
 Prepare the result in the following json format:
 
 ```
@@ -52,6 +59,10 @@
 3. The [OUTPUT] must only contain two items:
 - a json object (wrapped in ```json```) representing the refined goal (including "detailed_instruction", "output_fields", "visualization_fields" and "reason")
 - a sql query block (wrapped in ```sql```) representing the transformation code, do not add any extra text explanation.
+
+some notes:
+- in DuckDB, you escape a single quote within a string by doubling it ('') rather than using a backslash (\').
+- in DuckDB, you need to use proper date functions to perform date operations.
 '''
 
 EXAMPLE = '''
@@ -104,6 +115,15 @@
 ```
 '''
 
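The DuckDB note added to SYSTEM_PROMPT in the previous hunk says that a single quote inside a string literal is escaped by doubling it. A minimal sketch of that rule follows. `quote_sql_literal` is a hypothetical helper and not part of this PR, and Python's built-in `sqlite3` stands in for DuckDB only because it ships with Python and follows the same standard SQL quoting rule.

```python
import sqlite3


def quote_sql_literal(value: str) -> str:
    """Hypothetical helper: wrap a string as a SQL literal, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


literal = quote_sql_literal("O'Brien")

# sqlite3 is only a stand-in engine here (an assumption for illustration);
# it applies the same doubled-quote rule that the prompt note describes for DuckDB.
conn = sqlite3.connect(":memory:")
round_trip = conn.execute(f"SELECT {literal}").fetchone()[0]
conn.close()

print(literal)     # 'O''Brien'
print(round_trip)  # O'Brien
```

In application code, bound parameters are usually the safer choice. The doubling rule matters here because the model writes the SQL text itself.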
+def sanitize_table_name(table_name: str) -> str:
+    """Sanitize table name to be used in SQL queries"""
+    # Replace spaces with underscores
+    sanitized_name = table_name.replace(" ", "_")
+    sanitized_name = sanitized_name.replace("-", "_")
+    # Allow alphanumeric, underscore, dot, dash, and dollar sign
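The hunk is cut off after this comment, so the rest of `sanitize_table_name` is not shown. The sketch below is one plausible way to finish this kind of sanitization, not the PR's actual code. The regular expression, the leading-digit handling, and the name `sanitize_table_name_sketch` are all assumptions. Note that the comment lists the dash as allowed, although dashes have already been replaced by that point.

```python
import re


def sanitize_table_name_sketch(table_name: str) -> str:
    """Hypothetical completion; the real function body is truncated in the diff."""
    # Shown in the diff: spaces and dashes become underscores.
    name = table_name.replace(" ", "_").replace("-", "_")
    # Assumption: drop every character outside letters, digits, '_', '.', '$'.
    name = re.sub(r"[^A-Za-z0-9_.$]", "", name)
    # Assumption: prefix names that are empty or start with a digit.
    if not name or name[0].isdigit():
        name = f"table_{name}"
    return name


print(sanitize_table_name_sketch("sales data-2024"))  # sales_data_2024
print(sanitize_table_name_sketch("2024 report!"))     # table_2024_report
```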
-table_metadata_list = db.execute("SELECT database_name, schema_name, table_name, schema_name==current_schema() as is_current_schema FROM duckdb_tables() WHERE internal=False").fetchall()
+table_metadata_list = db.execute("""
+    SELECT database_name, schema_name, table_name, schema_name==current_schema() as is_current_schema, 'table' as object_type
+    FROM duckdb_tables()
+    WHERE internal=False
+    UNION ALL
+    SELECT database_name, schema_name, view_name as table_name, schema_name==current_schema() as is_current_schema, 'view' as object_type
+    FROM duckdb_views()
+    WHERE view_name NOT LIKE 'duckdb_%' AND view_name NOT LIKE 'sqlite_%' AND view_name NOT LIKE 'pragma_%'
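The added query tags each row with an `object_type` and concatenates tables and views with `UNION ALL`, so callers get one list covering both. The sketch below shows the same pattern in a form that runs without DuckDB installed. It queries SQLite's `sqlite_master` catalog rather than `duckdb_tables()` and `duckdb_views()`, and the table and view names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("CREATE VIEW big_sales AS SELECT * FROM sales WHERE amount > 100")

# Stand-in for the DuckDB catalog functions: sqlite_master lists both kinds of
# objects, so each branch filters on type and labels its rows with a constant.
rows = conn.execute("""
    SELECT name AS table_name, 'table' AS object_type
    FROM sqlite_master
    WHERE type = 'table'
    UNION ALL
    SELECT name AS table_name, 'view' AS object_type
    FROM sqlite_master
    WHERE type = 'view' AND name NOT LIKE 'sqlite_%'
    ORDER BY table_name
""").fetchall()
conn.close()

print(rows)  # [('big_sales', 'view'), ('sales', 'table')]
```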