2 changes: 1 addition & 1 deletion .python-version
@@ -1 +1 @@
3.10
3.12
7 changes: 7 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,7 @@
{
"python.testing.pytestArgs": [
"tests"
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true
}
149 changes: 66 additions & 83 deletions README.md
@@ -1,114 +1,97 @@
# Data Analysis Pipeline
# Data Tools: Automated Data Understanding and Integration

A flexible and extensible Python library for building data analysis pipelines. This tool allows you to define a series of analysis steps that are executed in sequence, with each step building upon the results of the previous ones. The pipeline is designed to be dataframe-agnostic, with built-in support for pandas DataFrames and an easy-to-use plugin system for adding support for other libraries like Spark.
Data Tools is a Python library designed to automate the complex process of understanding and connecting siloed datasets. In modern data environments, tables are often disconnected and poorly documented. This library tackles that challenge head-on by providing two primary capabilities:

## Key Features
1. **Automated Link Prediction**: Its flagship feature, the `LinkPredictor`, analyzes multiple datasets, identifies primary keys, and automatically predicts relationships (foreign key links) between them.
2. **In-Depth Data Profiling**: A flexible, multi-step analysis pipeline that creates a rich profile for each dataset, including data types, column statistics, and a generated business glossary.

- **Analysis Pipeline**: Chain together multiple analysis steps to create a sophisticated data processing workflow.
- **Automatic Type Detection**: The underlying dataframe type is automatically detected, so you can use the same pipeline for different data sources.
- **Extensible Plugin Architecture**: Easily add support for new dataframe libraries (e.g., Spark, Dask) by creating a simple plugin.
- **Clear Dependency Management**: Each analysis step can depend on the results of previous steps, ensuring a robust and predictable workflow.
- **Separation of Concerns**: The pipeline architecture cleanly separates the analysis logic from the data access layer, making the code easier to maintain and test.
By combining these features, Data Tools helps you move from a collection of separate tables to a fully documented and interconnected data model, ready for integration or analysis.

## Installation
## Core Features

To get started, clone the repository and install the necessary dependencies using `uv`.
- **Automated Link Prediction**: Intelligently discovers potential foreign key relationships across multiple datasets.
- **In-Depth Data Profiling**: A multi-step pipeline that identifies keys, profiles columns, and determines data types.
- **Business Glossary Generation**: Automatically generates business-friendly descriptions and tags for columns and tables based on a provided domain.
- **Extensible Pipeline Architecture**: Easily add custom analysis steps to the pipeline.
- **DataFrame Agnostic**: Uses a factory pattern to seamlessly handle different dataframe types (e.g., pandas).

```bash
git clone <your-repository-url>
cd data-tools
uv pip install -e .
```
## Usage Examples

## Quick Start
### Example 1: Automated Link Prediction (Primary Use Case)

Using the library is straightforward. Simply define your pipeline, create your dataframe, and run the analysis.
This is the most direct way to use the library. Provide a dictionary of named dataframes, and the `LinkPredictor` will automatically run all prerequisite analysis steps and predict the links.

```python
import pandas as pd
from data_tools.analysis.pipeline import Pipeline
from data_tools.analysis.steps import TableProfiler
from data_tools.link_predictor.predictor import LinkPredictor

# 1. Define your pipeline
pipeline = Pipeline([
    TableProfiler(),
    # Add other analysis steps here
])
# 1. Prepare your collection of dataframes
customers_df = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
orders_df = pd.DataFrame({"order_id": [101, 102], "customer_id": [1, 3]})

# 2. Create your dataframe
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [30, 25, 35],
    'city': ['New York', 'Los Angeles', 'Chicago'],
    'joined_date': pd.to_datetime(['2023-01-15', '2022-05-20', '2023-03-10'])
datasets = {
"customers": customers_df,
"orders": orders_df
}
my_df = pd.DataFrame(data)

# 3. Run the pipeline
analysis_results = pipeline.run(my_df)

# 4. Access the results
profile = analysis_results.results["table_profile"]
# 2. Initialize the predictor
# This automatically runs profiling, key identification, etc., for you.
link_predictor = LinkPredictor(datasets)

# Print the results
print(profile.model_dump_json(indent=2))
```
# 3. Predict the links
# (This example assumes you have implemented the logic in _predict_for_pair)
prediction_results = link_predictor.predict()

### Expected Output

```json
{
"count": 3,
"columns": [
"name",
"age",
"city",
"joined_date"
],
"dtypes": {
"name": "string",
"age": "integer",
"city": "string",
"joined_date": "date & time"
}
}
# 4. Review the results
print(prediction_results.model_dump_json(indent=2))
```
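The comment in step 3 above refers to `_predict_for_pair`, whose implementation is left to you. As a rough, hypothetical illustration of the kind of heuristic such logic might apply (the method's actual signature and return type are not documented here, so the standalone function below and its parameters are assumptions), a simple value-overlap check between a candidate foreign-key column and a key column could look like this:

```python
import pandas as pd

def looks_like_foreign_key(key_col: pd.Series, candidate_col: pd.Series, threshold: float = 0.9) -> bool:
    """Hypothetical heuristic: treat candidate_col as a foreign key to key_col
    when at least `threshold` of its non-null values appear among the key values."""
    candidate_values = candidate_col.dropna()
    if candidate_values.empty:
        return False
    overlap = candidate_values.isin(key_col.dropna().unique()).mean()
    return overlap >= threshold

# Using the dataframes from the example above:
# orders.customer_id -> customers.id should be flagged as a likely link.
customers_df = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
orders_df = pd.DataFrame({"order_id": [101, 102], "customer_id": [1, 3]})
print(looks_like_foreign_key(customers_df["id"], orders_df["customer_id"]))  # True
```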

## How It Works
### Example 2: In-Depth Analysis with the Pipeline

The library is built around a central `Pipeline` class that executes a series of `AnalysisStep` objects on a `DataSet`.
If you want to perform a deep analysis on a single dataset, you can use the pipeline directly. This gives you fine-grained control over the analysis steps.

1. **`DataSet`**: A container for the raw dataframe and all the analysis results. This object is passed from one pipeline step to the next.
2. **`AnalysisStep`**: An abstract base class that defines the interface for all analysis steps. Each step implements an `analyze` method that takes a `DataSet` as input and adds its results to the `DataSet`.
3. **`DataFrameFactory`**: A factory that automatically detects the dataframe type and creates a wrapper object that provides a consistent API for accessing the data.
4. **`Pipeline`**: The orchestrator that takes a list of analysis steps and a dataframe, creates a `DataSet`, and runs the steps in sequence.
```python
import pandas as pd
from data_tools.analysis.pipeline import Pipeline
from data_tools.analysis.steps import (
    TableProfiler,
    KeyIdentifier,
    BusinessGlossaryGenerator
)

## Extending the Tool
# 1. Define your analysis pipeline
pipeline = Pipeline([
    TableProfiler(),
    KeyIdentifier(),
    BusinessGlossaryGenerator(domain="e-commerce")  # Provide context for the glossary
])

### Adding a New Analysis Step
# 2. Prepare your dataframe
products_df = pd.DataFrame({
"product_id": [10, 20, 30],
"name": ["Laptop", "Mouse", "Keyboard"],
"unit_price": [1200, 25, 75]
})

To add a new analysis step, create a new class that inherits from `AnalysisStep` and implement the `analyze` method.
# 3. Run the pipeline
# The pipeline creates and returns a DataSet object
product_dataset = pipeline.run(df=products_df, name="products")

```python
from data_tools.analysis.steps import AnalysisStep
from data_tools.analysis.models import DataSet

class MyCustomProfiler(AnalysisStep):
    def analyze(self, dataset: DataSet) -> None:
        # Your analysis logic here, producing my_results
        # ...
        dataset.results["my_custom_profile"] = my_results
# 4. Access the rich analysis results
print(f"Identified Key: {product_dataset.results['key'].column_name}")
print("\n--- Table Glossary ---")
print(product_dataset.results['table_glossary'])
```

### Adding a New DataFrame Type
## Available Analysis Steps

To add support for a new dataframe library (e.g., Spark), follow these steps (a combined sketch appears after the list):
You can construct a custom `Pipeline` using any combination of the following steps (a combined example appears after the list):

1. **Create a New Module**: Add a file like `src/data_tools/dataframes/types/spark.py`.
2. **Implement the `DataFrame` Interface**: Create a class (e.g., `SparkDF`) that inherits from `DataFrame` and implements the `profile()` method with logic specific to Spark DataFrames.
3. **Create a Checker Function**: Write a function that returns `True` if the input object is a Spark DataFrame.
4. **Create a Register Function**: In the same file, create a function to register your new components with the factory.
5. **Activate the Plugin**: Add the path to your new module in the `DEFAULT_PLUGINS` list in `src/data_tools/dataframes/factory.py`.
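Put together, a plugin module following the five steps above might look roughly like the sketch below. This is illustrative only: the import path of the `DataFrame` base class, the shape of the `profile()` return value, and the factory's registration API are assumptions, not the library's documented interface.

```python
# src/data_tools/dataframes/types/spark.py (hypothetical sketch)
from pyspark.sql import DataFrame as NativeSparkDF

from data_tools.dataframes.base import DataFrame  # assumed location of the base class


class SparkDF(DataFrame):
    """Wrapper exposing a consistent API over a Spark DataFrame."""

    def __init__(self, df: NativeSparkDF):
        self.df = df

    def profile(self) -> dict:
        # Spark-specific profiling logic
        return {
            "count": self.df.count(),
            "columns": self.df.columns,
            "dtypes": dict(self.df.dtypes),
        }


def is_spark_dataframe(obj) -> bool:
    # Checker function: True only for Spark DataFrames
    return isinstance(obj, NativeSparkDF)


def register(factory) -> None:
    # Register the wrapper and its checker with the factory (assumed method name)
    factory.register(checker=is_spark_dataframe, wrapper=SparkDF)
```

The final step is still to list this module in `DEFAULT_PLUGINS` in `src/data_tools/dataframes/factory.py` so the factory loads it.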
- `TableProfiler`: Gathers basic table-level statistics (row count, column names).
- `ColumnProfiler`: Runs detailed profiling for each column (null counts, distinct counts, samples).
- `DataTypeIdentifierL1` & `DataTypeIdentifierL2`: Determine the logical data type for each column through two successive levels of analysis.
- `KeyIdentifier`: Analyzes the profiled data to predict the primary key of the dataset.
- `BusinessGlossaryGenerator(domain: str)`: Generates business-friendly descriptions and tags for all columns and the table itself, using the provided `domain` for context.
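For example, a full-depth pipeline combining all of the steps above might be assembled as follows. This is a sketch: it assumes the two type-identification steps are exposed as `DataTypeIdentifierL1` and `DataTypeIdentifierL2` and that each step takes no constructor arguments apart from the glossary generator's `domain`.

```python
from data_tools.analysis.pipeline import Pipeline
from data_tools.analysis.steps import (
    TableProfiler,
    ColumnProfiler,
    DataTypeIdentifierL1,
    DataTypeIdentifierL2,
    KeyIdentifier,
    BusinessGlossaryGenerator,
)

# Order matters: later steps build on the results produced by earlier ones.
full_pipeline = Pipeline([
    TableProfiler(),
    ColumnProfiler(),
    DataTypeIdentifierL1(),
    DataTypeIdentifierL2(),
    KeyIdentifier(),
    BusinessGlossaryGenerator(domain="e-commerce"),
])

# Run it the same way as in Example 2:
# product_dataset = full_pipeline.run(df=products_df, name="products")
```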

## Running Tests

@@ -124,4 +107,4 @@ uv run pytest

## License

This project is licensed under the terms of the LICENSE file.
130 changes: 130 additions & 0 deletions notebooks/sql_generator.ipynb

Large diffs are not rendered by default.
