
Commit 2780cde

Merge branch 'main' into features/adapter

2 parents eb47e6a + c38e50f

File tree

13 files changed (+1096, -112 lines)

.gitignore

Lines changed: 2 additions & 0 deletions

```diff
@@ -208,3 +208,5 @@ __marimo__/
 notes.txt
 
 testing_base
+
+settings.json
```

README.md

Lines changed: 76 additions & 104 deletions
# Data-Tools

[![Release](https://img.shields.io/github/release/Intugle/data-tools)](https://github.com/Intugle/data-tools/releases/tag/v0.1.0)
[![Made with Python](https://img.shields.io/badge/Made_with-Python-blue?logo=python&logoColor=white)](https://www.python.org/)
![contributions - welcome](https://img.shields.io/badge/contributions-welcome-blue)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Open Issues](https://img.shields.io/github/issues-raw/Intugle/data-tools)](https://github.com/Intugle/data-tools/issues)
[![GitHub star chart](https://img.shields.io/github/stars/Intugle/data-tools?style=social)](https://github.com/Intugle/data-tools/stargazers)

*Automated Data Profiling, Link Prediction, and Semantic Layer Generation*
## Overview

Intugle's Data-Tools is a GenAI-powered Python library that simplifies and accelerates the journey from raw data to insights. It empowers data and business teams to build an intelligent semantic layer over their data, enabling self-serve analytics and natural language queries. By automating data profiling, link prediction, and SQL generation, Data-Tools helps you build data products faster and more efficiently than traditional methods.
## Who is this for?

This tool is designed for both **data teams** and **business teams**.

* **Data teams** can use it to automate data profiling, schema discovery, and documentation, significantly accelerating their workflow.
* **Business teams** can use it to gain a better understanding of their data and to perform self-service analytics without needing to write complex SQL queries.
## Features

* **Automated Data Profiling:** Generate detailed statistics for each column in your dataset, including distinct count, uniqueness, completeness, and more.
* **Datatype Identification:** Automatically identify the data type of each column (e.g., integer, string, datetime).
* **Key Identification:** Identify potential primary keys in your tables.
* **LLM-Powered Link Prediction:** Use GenAI to automatically discover relationships (foreign keys) between tables.
* **Business Glossary Generation:** Generate a business glossary for each column, with support for industry-specific domains.
* **Semantic Layer Generation:** Create YAML files that define your semantic layer, including models (tables) and their relationships.
* **SQL Generation:** Generate SQL queries from the semantic layer, allowing you to query your data using business-friendly terms.
## Getting Started

### Installation

```bash
pip install data-tools
```
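To confirm the installation, here is a minimal import check; it assumes only that the package is importable as `data_tools`, which matches the imports used in the examples below:

```python
# Quick sanity check that the package is importable
import data_tools

print(data_tools.__name__)
```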
### Configuration

Before running the project, you need to configure an LLM. This is used for tasks like generating business glossaries and predicting links between tables.

You can configure the LLM by setting the following environment variables:

* `LLM_PROVIDER`: The LLM provider and model to use (e.g., `openai:gpt-3.5-turbo`).
* `OPENAI_API_KEY`: Your API key for the LLM provider.

Here's an example of how to set these variables in your environment:

```bash
export LLM_PROVIDER="openai:gpt-3.5-turbo"
export OPENAI_API_KEY="your-openai-api-key"
```
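If you are working in a notebook or script, you can set the same variables from Python via `os.environ` before using any LLM-backed features; a minimal sketch equivalent to the shell exports above:

```python
import os

# Equivalent to the shell exports above; set these before invoking
# glossary generation or link prediction
os.environ["LLM_PROVIDER"] = "openai:gpt-3.5-turbo"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
```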

## Quickstart

For a detailed, hands-on introduction to the project, please see the [`quickstart.ipynb`](notebooks/quickstart.ipynb) notebook. It walks you through the entire process of profiling your data, predicting links, generating a semantic layer, and querying your data.

## Usage

The core workflow of the project involves the following steps:

1. **Load your data:** Load your data into a DataSet object.
2. **Run the analysis pipeline:** Use the `run()` method to profile your data and generate a business glossary.
3. **Predict links:** Use the `LinkPredictor` to discover relationships between your tables.
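The snippet below assumes `datasets` has already been prepared; a minimal sketch of that preparation, using a dictionary of named pandas dataframes as in earlier versions of these docs (the exact loading API may differ):

```python
import pandas as pd

# Hypothetical input tables, keyed by table name
datasets = {
    "customers": pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]}),
    "orders": pd.DataFrame({"order_id": [101, 102], "customer_id": [1, 3]}),
}
```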
```python
from data_tools import LinkPredictor

# Initialize the predictor with the named datasets prepared above
predictor = LinkPredictor(datasets)

# Run the prediction and visualize the discovered links
results = predictor.predict()
results.show_graph()
```
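If you want the raw predictions rather than the graph view, earlier versions of this README dumped the result object as a pydantic model; assuming that interface still holds, the equivalent is:

```python
# Assumes the results object is a pydantic model, as in earlier versions
print(results.model_dump_json(indent=2))
```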
4. **Generate SQL:** Use the `SqlGenerator` to generate SQL queries from the semantic layer.
```python
from data_tools import SqlGenerator

# Create a SqlGenerator
sql_generator = SqlGenerator()

# Create an ETL model: each entry in "fields" maps a semantic-layer
# column id ("table.column") to an output alias, and "filter" restricts
# rows by column values
etl_model = {
    "name": "test_etl",
    "fields": [
        {"id": "patients.first", "name": "first_name"},
        {"id": "patients.last", "name": "last_name"},
        {"id": "allergies.start", "name": "start_date"},
    ],
    "filter": {
        "selections": [{"id": "claims.departmentid", "values": ["3", "20"]}],
    },
}

# Generate the query
sql_query = sql_generator.generate_query(etl_model)
print(sql_query)
```
For detailed code examples and a complete walkthrough, please refer to the [`quickstart.ipynb`](notebooks/quickstart.ipynb) notebook.

130-
To run the test suite, first install the testing dependencies and then execute `pytest` via `uv`.
108+
## Contributing
131109

132-
```bash
133-
# Install testing library
134-
uv pip install pytest
135-
136-
# Run tests
137-
uv run pytest
138-
```
110+
Contributions are welcome! Please see the [`CONTRIBUTING.md`](CONTRIBUTING.md) file for guidelines.
## License

This project is licensed under the MIT License. See the [`LICENSE`](LICENSE) file for details.
