# Data-Tools

[Release v0.1.0](https://github.com/Intugle/data-tools/releases/tag/v0.1.0)
[Python](https://www.python.org/)

[License: MIT](https://opensource.org/licenses/MIT)
[Issues](https://github.com/Intugle/data-tools/issues)
[Stars](https://github.com/Intugle/data-tools/stargazers)

*Automated Data Profiling, Link Prediction, and Semantic Layer Generation*

## Overview

Intugle's Data-Tools is a GenAI-powered Python library that simplifies and accelerates the journey from raw data to insights. It empowers data and business teams to build an intelligent semantic layer over their data, enabling self-serve analytics and natural language queries. By automating data profiling, link prediction, and SQL generation, Data-Tools helps you build data products faster and more efficiently than traditional methods.

## Who is this for?

This tool is designed for both **data teams** and **business teams**.

* **Data teams** can use it to automate data profiling, schema discovery, and documentation, significantly accelerating their workflow.
* **Business teams** can use it to gain a better understanding of their data and to perform self-service analytics without needing to write complex SQL queries.

## Features

* **Automated Data Profiling:** Generate detailed statistics for each column in your dataset, including distinct count, uniqueness, completeness, and more (a rough illustration follows this list).
* **Datatype Identification:** Automatically identify the data type of each column (e.g., integer, string, datetime).
* **Key Identification:** Identify potential primary keys in your tables.
* **LLM-Powered Link Prediction:** Use GenAI to automatically discover relationships (foreign keys) between tables.
* **Business Glossary Generation:** Generate a business glossary for each column, with support for industry-specific domains.
* **Semantic Layer Generation:** Create YAML files that define your semantic layer, including models (tables) and their relationships.
* **SQL Generation:** Generate SQL queries from the semantic layer, allowing you to query your data using business-friendly terms.
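
To give a sense of the column-level statistics the profiler produces, here is a rough hand-rolled pandas equivalent. This is illustrative only; the toy DataFrame below is not part of the library, which computes these statistics (and more) for you:

```python
import pandas as pd

# A small toy table to profile
df = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "country": ["US", "US", "DE", "DE"],
})

# Rough equivalents of the metrics listed above
profile = {
    column: {
        "distinct_count": df[column].nunique(),
        "uniqueness": df[column].nunique() / len(df),
        "completeness": df[column].notna().mean(),
    }
    for column in df.columns
}
print(profile)
```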

## Getting Started

### Installation

```bash
pip install data-tools
```

### Configuration

Before running the project, you need to configure an LLM. This is used for tasks like generating business glossaries and predicting links between tables.

You can configure the LLM by setting the following environment variables:

* `LLM_PROVIDER`: The LLM provider and model to use (e.g., `openai:gpt-3.5-turbo`).
* `OPENAI_API_KEY`: Your API key for the LLM provider.

Here's an example of how to set these variables in your environment:

```bash
export LLM_PROVIDER="openai:gpt-3.5-turbo"
export OPENAI_API_KEY="your-openai-api-key"
```
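
If you prefer to keep configuration in code (for example in a notebook), you can set the same variables from Python before using the library. This is only an alternative to the shell exports above; the variable names are the ones documented in this section:

```python
import os

# Set the LLM configuration for the current process before using data-tools
os.environ["LLM_PROVIDER"] = "openai:gpt-3.5-turbo"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
```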

## Quickstart

For a detailed, hands-on introduction to the project, please see the [`quickstart.ipynb`](notebooks/quickstart.ipynb) notebook. It will walk you through the entire process of profiling your data, predicting links, generating a semantic layer, and querying your data.

## Usage

The core workflow of the project involves the following steps:

1. **Load your data:** Load your data into a `DataSet` object.
2. **Run the analysis pipeline:** Use the `run()` method to profile your data and generate a business glossary.
3. **Predict links:** Use the `LinkPredictor` to discover relationships between your tables. For example:

   ```python
   import pandas as pd

   from data_tools import LinkPredictor

   # Example input: a dictionary of named dataframes to analyze
   datasets = {
       "customers": pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]}),
       "orders": pd.DataFrame({"order_id": [101, 102], "customer_id": [1, 3]}),
   }

   # Initialize the predictor
   predictor = LinkPredictor(datasets)

   # Run the prediction
   results = predictor.predict()
   results.show_graph()
   ```
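
   In practice you would usually build `datasets` from your actual tables rather than from inline DataFrames. A minimal sketch using plain pandas (the file names here are illustrative, not part of the library):

   ```python
   import pandas as pd

   # Load each table from disk and key it by a descriptive name
   datasets = {
       "customers": pd.read_csv("customers.csv"),
       "orders": pd.read_csv("orders.csv"),
   }
   ```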

4. **Generate SQL:** Use the `SqlGenerator` to generate SQL queries from the semantic layer. The ETL model below selects fields by their `table.column` ids, gives each an output name, and applies a filter:

   ```python
   from data_tools import SqlGenerator

   # Create a SqlGenerator
   sql_generator = SqlGenerator()

   # Create an ETL model
   etl_model = {
       "name": "test_etl",
       "fields": [
           {"id": "patients.first", "name": "first_name"},
           {"id": "patients.last", "name": "last_name"},
           {"id": "allergies.start", "name": "start_date"},
       ],
       "filter": {
           "selections": [{"id": "claims.departmentid", "values": ["3", "20"]}],
       },
   }

   # Generate the query
   sql_query = sql_generator.generate_query(etl_model)
   print(sql_query)
   ```

For detailed code examples and a complete walkthrough, please refer to the [`quickstart.ipynb`](notebooks/quickstart.ipynb) notebook.

## Contributing

Contributions are welcome! Please see the [`CONTRIBUTING.md`](CONTRIBUTING.md) file for guidelines.

## License

This project is licensed under the MIT License. See the [`LICENSE`](LICENSE) file for details.