LLM-Powered Automatic Data Transformation Inference
Core Architecture
Raw Data → Smart Analyzer → Auto-Generated dbt Models → DuckDB → MXCP Tools
Phase 1: Data Discovery & Analysis
```python
class SmartDataPreprocessor:
    def analyze_data_source(self, file_path):
        # 1. READ AND SAMPLE DATA
        sample_data = read_sample(file_path, max_rows=100)
        headers = extract_headers(sample_data)

        # 2. LLM HEADER ANALYSIS
        llm_analysis = llm_call(f"""
        Analyze these column headers and sample data:
        Headers: {headers}
        Sample rows: {sample_data[:5]}

        Tasks:
        1. Group related columns (e.g., customer/Customer/customer_name)
        2. Identify data types for each column
        3. Detect data quality issues (nulls, formatting inconsistencies)
        4. Suggest cleaning transformations needed

        Return JSON format:
        {{
            "column_groups": [
                {{
                    "concept": "customer_identity",
                    "columns": ["customer", "Customer", "customer_name"],
                    "best_column": "customer_name",
                    "data_type": "string",
                    "cleaning_needed": ["trim", "title_case"]
                }}
            ],
            "transformations": [...],
            "primary_metrics": ["amount", "sales_total"],
            "dimensions": ["region", "product", "date"]
        }}
        """)
        return parse_json(llm_analysis)
```
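`llm_call` and `parse_json` are left undefined above. A minimal sketch of `parse_json`, assuming the model may wrap its JSON in markdown fences or surrounding prose (a common LLM failure mode); the regex fallback logic is my assumption, not part of the original pipeline:

```python
import json
import re

def parse_json(llm_reply: str) -> dict:
    """Extract the first JSON object from an LLM reply, tolerating
    markdown code fences and leading/trailing prose."""
    # Prefer the content of a ```json ... ``` fence if present
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", llm_reply, re.DOTALL)
    payload = fenced.group(1) if fenced else llm_reply
    # Fall back to the outermost braces to skip any surrounding prose
    start, end = payload.find("{"), payload.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in LLM reply")
    return json.loads(payload[start:end + 1])

reply = '```json\n{"primary_metrics": ["amount"], "dimensions": ["region"]}\n```'
print(parse_json(reply)["primary_metrics"])  # ['amount']
```

In practice you would also want to retry the `llm_call` when parsing fails, since malformed JSON is the most frequent error in this kind of structured-output step.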
Phase 2: Auto-Generate dbt Models
```python
    # (methods of SmartDataPreprocessor, continued)
    def generate_dbt_model(self, analysis_result, table_name):
        # 3. BUILD SQL TRANSFORMATION
        sql_parts = []
        for group in analysis_result["column_groups"]:
            column_name = group["concept"]
            best_col = group["best_column"]
            data_type = group["data_type"]
            cleaning = group["cleaning_needed"]

            # Generate SQL based on analysis
            sql_part = self.build_sql_transformation(
                column=best_col,
                target_name=column_name,
                data_type=data_type,
                cleaning_rules=cleaning,
            )
            sql_parts.append(sql_part)

        # 4. GENERATE COMPLETE dbt MODEL
        # (join outside the f-string: backslashes are not allowed in
        # f-string expressions before Python 3.12)
        select_list = ",\n    ".join(sql_parts)
        dbt_model = f"""
        {{{{ config(materialized='table') }}}}

        SELECT
            {select_list}
        FROM {{{{ source('raw', '{table_name}') }}}}
        WHERE {self.generate_quality_filters(analysis_result)}
        """
        return dbt_model

    def build_sql_transformation(self, column, target_name, data_type, cleaning_rules):
        # Smart SQL generation based on LLM analysis
        sql = column
        if "trim" in cleaning_rules:
            sql = f"TRIM({sql})"
        if "upper_case" in cleaning_rules:
            sql = f"UPPER({sql})"
        if "remove_currency" in cleaning_rules:
            sql = f"CAST(REPLACE({sql}, '$', '') AS DECIMAL)"
        if data_type == "date":
            sql = f"{sql}::date"
        return f"{sql} AS {target_name}"
```
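The model template references `generate_quality_filters`, which is not defined above. A hypothetical sketch, assuming a simple NOT NULL policy on each group's chosen column (the real rules would presumably come from the LLM's detected quality issues):

```python
def generate_quality_filters(analysis_result: dict) -> str:
    """Hypothetical helper: build the WHERE clause for the generated dbt
    model by requiring each group's best_column to be non-null."""
    filters = [
        f"{group['best_column']} IS NOT NULL"
        for group in analysis_result["column_groups"]
    ]
    return " AND ".join(filters) or "1=1"  # no groups -> keep all rows

analysis = {"column_groups": [{"best_column": "customer_name"},
                              {"best_column": "amount"}]}
print(generate_quality_filters(analysis))
# customer_name IS NOT NULL AND amount IS NOT NULL
```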
Phase 3: Auto-Generate MCP Tools
```python
    # (method of SmartDataPreprocessor, continued)
    def generate_mcp_tools(self, analysis_result, clean_table_name):
        # 5. CREATE SMART TOOLS BASED ON DATA STRUCTURE
        tools = []
        metrics = analysis_result["primary_metrics"]
        dimensions = analysis_result["dimensions"]

        # Auto-generate common analytical tools
        for metric in metrics:
            for dimension in dimensions:
                tool = {
                    "name": f"get_{metric}_by_{dimension}",
                    "description": f"Analyze {metric} broken down by {dimension}",
                    "parameters": [
                        {"name": dimension, "type": "string",
                         "description": f"Filter by {dimension}"}
                    ],
                    "source": {
                        "code": f"""
                        SELECT
                            {dimension},
                            SUM({metric}) AS total_{metric},
                            AVG({metric}) AS avg_{metric},
                            COUNT(*) AS record_count
                        FROM {clean_table_name}
                        WHERE {dimension} = ${dimension}
                        GROUP BY {dimension}
                        ORDER BY total_{metric} DESC
                        """
                    },
                }
                tools.append(tool)
        return tools
```
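One thing worth noting about the nested loops: they generate one tool per (metric, dimension) pair, so the endpoint count grows multiplicatively with the data's width. A quick standalone check of the naming scheme (mirroring the dicts built above, outside the class):

```python
def tool_names(analysis_result):
    # One tool per (metric, dimension) pair, as in the nested loops above
    return [
        f"get_{metric}_by_{dimension}"
        for metric in analysis_result["primary_metrics"]
        for dimension in analysis_result["dimensions"]
    ]

analysis = {
    "primary_metrics": ["amount", "sales_total"],
    "dimensions": ["region", "product"],
}
names = tool_names(analysis)
print(names)
# ['get_amount_by_region', 'get_amount_by_product',
#  'get_sales_total_by_region', 'get_sales_total_by_product']
```

With, say, 10 metrics and 10 dimensions this already yields 100 tools, so a real pipeline would likely want the LLM to prune the matrix to the pairs that make analytical sense.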
Smart Data Preprocessing for MXCP
Had worked on a pipeline where I did something like the above; let me know if it makes sense or poses challenges! Sorry for the AI-generated explainer above!
TL;DR: iterate over column groups to generate dbt models using SLMs/LLMs!