2 changes: 1 addition & 1 deletion .python-version
@@ -1 +1 @@
3.10
3.12
7 changes: 7 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,7 @@
{
"python.testing.pytestArgs": [
"tests"
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true
}
149 changes: 66 additions & 83 deletions README.md
@@ -1,114 +1,97 @@
# Data Analysis Pipeline
# Data Tools: Automated Data Understanding and Integration

A flexible and extensible Python library for building data analysis pipelines. This tool allows you to define a series of analysis steps that are executed in sequence, with each step building upon the results of the previous ones. The pipeline is designed to be dataframe-agnostic, with built-in support for pandas DataFrames and an easy-to-use plugin system for adding support for other libraries like Spark.
Data Tools is a Python library designed to automate the complex process of understanding and connecting siloed datasets. In modern data environments, tables are often disconnected and poorly documented. This library tackles that challenge head-on by providing two primary capabilities:

## Key Features
1. **Automated Link Prediction**: Its flagship feature, the `LinkPredictor`, analyzes multiple datasets, identifies primary keys, and automatically predicts relationships (foreign key links) between them.
2. **In-Depth Data Profiling**: A flexible, multi-step analysis pipeline that creates a rich profile for each dataset, including data types, column statistics, and a generated business glossary.

- **Analysis Pipeline**: Chain together multiple analysis steps to create a sophisticated data processing workflow.
- **Automatic Type Detection**: The underlying dataframe type is automatically detected, so you can use the same pipeline for different data sources.
- **Extensible Plugin Architecture**: Easily add support for new dataframe libraries (e.g., Spark, Dask) by creating a simple plugin.
- **Clear Dependency Management**: Each analysis step can depend on the results of previous steps, ensuring a robust and predictable workflow.
- **Separation of Concerns**: The pipeline architecture cleanly separates the analysis logic from the data access layer, making the code easier to maintain and test.
By combining these features, Data Tools helps you move from a collection of separate tables to a fully documented and interconnected data model, ready for integration or analysis.

## Installation
## Core Features

To get started, clone the repository and install the necessary dependencies using `uv`.
- **Automated Link Prediction**: Intelligently discovers potential foreign key relationships across multiple datasets.
- **In-Depth Data Profiling**: A multi-step pipeline that identifies keys, profiles columns, and determines data types.
- **Business Glossary Generation**: Automatically generates business-friendly descriptions and tags for columns and tables based on a provided domain.
- **Extensible Pipeline Architecture**: Easily add custom analysis steps to the pipeline.
- **DataFrame Agnostic**: Uses a factory pattern to seamlessly handle different dataframe types (e.g., pandas).

```bash
git clone <your-repository-url>
cd data-tools
uv pip install -e .
```
## Usage Examples

## Quick Start
### Example 1: Automated Link Prediction (Primary Use Case)

Using the library is straightforward. Simply define your pipeline, create your dataframe, and run the analysis.
This is the most direct way to use the library. Provide a dictionary of named dataframes, and the `LinkPredictor` will automatically run all prerequisite analysis steps and predict the links.

```python
import pandas as pd
from data_tools.analysis.pipeline import Pipeline
from data_tools.analysis.steps import TableProfiler
from data_tools.link_predictor.predictor import LinkPredictor

# 1. Define your pipeline
pipeline = Pipeline([
    TableProfiler(),
    # Add other analysis steps here
])
# 1. Prepare your collection of dataframes
customers_df = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
orders_df = pd.DataFrame({"order_id": [101, 102], "customer_id": [1, 3]})

# 2. Create your dataframe
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [30, 25, 35],
    'city': ['New York', 'Los Angeles', 'Chicago'],
    'joined_date': pd.to_datetime(['2023-01-15', '2022-05-20', '2023-03-10'])
datasets = {
"customers": customers_df,
"orders": orders_df
}
my_df = pd.DataFrame(data)

# 3. Run the pipeline
analysis_results = pipeline.run(my_df)

# 4. Access the results
profile = analysis_results.results["table_profile"]
# 2. Initialize the predictor
# This automatically runs profiling, key identification, etc., for you.
link_predictor = LinkPredictor(datasets)

# Print the results
print(profile.model_dump_json(indent=2))
```
# 3. Predict the links
# (This example assumes you have implemented the logic in _predict_for_pair)
prediction_results = link_predictor.predict()

### Expected Output

```json
{
"count": 3,
"columns": [
"name",
"age",
"city",
"joined_date"
],
"dtypes": {
"name": "string",
"age": "integer",
"city": "string",
"joined_date": "date & time"
}
}
# 4. Review the results
print(prediction_results.model_dump_json(indent=2))
```
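The comment in step 3 above refers to `_predict_for_pair`, whose implementation is left to you. As a rough, hypothetical illustration of the kind of heuristic such logic might apply (the method's actual signature and return type are not documented here, so the standalone function below and its parameters are assumptions), a simple value-overlap check between a candidate foreign-key column and a key column could look like this:

```python
import pandas as pd

def looks_like_foreign_key(key_col: pd.Series, candidate_col: pd.Series, threshold: float = 0.9) -> bool:
    """Hypothetical heuristic: treat candidate_col as a foreign key to key_col
    when at least `threshold` of its non-null values appear among the key values."""
    candidate_values = candidate_col.dropna()
    if candidate_values.empty:
        return False
    overlap = candidate_values.isin(key_col.dropna().unique()).mean()
    return overlap >= threshold

# Using the dataframes from the example above:
# orders.customer_id -> customers.id should be flagged as a likely link.
customers_df = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
orders_df = pd.DataFrame({"order_id": [101, 102], "customer_id": [1, 3]})
print(looks_like_foreign_key(customers_df["id"], orders_df["customer_id"]))  # True
```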

## How It Works
### Example 2: In-Depth Analysis with the Pipeline

The library is built around a central `Pipeline` class that executes a series of `AnalysisStep` objects on a `DataSet`.
If you want to perform a deep analysis on a single dataset, you can use the pipeline directly. This gives you fine-grained control over the analysis steps.

1. **`DataSet`**: A container for the raw dataframe and all the analysis results. This object is passed from one pipeline step to the next.
2. **`AnalysisStep`**: An abstract base class that defines the interface for all analysis steps. Each step implements an `analyze` method that takes a `DataSet` as input and adds its results to the `DataSet`.
3. **`DataFrameFactory`**: A factory that automatically detects the dataframe type and creates a wrapper object that provides a consistent API for accessing the data.
4. **`Pipeline`**: The orchestrator that takes a list of analysis steps and a dataframe, creates a `DataSet`, and runs the steps in sequence.
```python
import pandas as pd
from data_tools.analysis.pipeline import Pipeline
from data_tools.analysis.steps import (
    TableProfiler,
    KeyIdentifier,
    BusinessGlossaryGenerator
)

## Extending the Tool
# 1. Define your analysis pipeline
pipeline = Pipeline([
    TableProfiler(),
    KeyIdentifier(),
    BusinessGlossaryGenerator(domain="e-commerce")  # Provide context for the glossary
])

### Adding a New Analysis Step
# 2. Prepare your dataframe
products_df = pd.DataFrame({
"product_id": [10, 20, 30],
"name": ["Laptop", "Mouse", "Keyboard"],
"unit_price": [1200, 25, 75]
})

To add a new analysis step, create a new class that inherits from `AnalysisStep` and implement the `analyze` method.
# 3. Run the pipeline
# The pipeline creates and returns a DataSet object
product_dataset = pipeline.run(df=products_df, name="products")

```python
from data_tools.analysis.steps import AnalysisStep
from data_tools.analysis.models import DataSet

class MyCustomProfiler(AnalysisStep):
    def analyze(self, dataset: DataSet) -> None:
        # Your analysis logic here, producing my_results
        # ...
        dataset.results["my_custom_profile"] = my_results
# 4. Access the rich analysis results
print(f"Identified Key: {product_dataset.results['key'].column_name}")
print("\n--- Table Glossary ---")
print(product_dataset.results['table_glossary'])
```

### Adding a New DataFrame Type
## Available Analysis Steps

To add support for a new dataframe library (e.g., Spark), follow these steps (a combined sketch appears after the list):
You can construct a custom `Pipeline` using any combination of the following steps (a combined example appears after the list):

1. **Create a New Module**: Add a file like `src/data_tools/dataframes/types/spark.py`.
2. **Implement the `DataFrame` Interface**: Create a class (e.g., `SparkDF`) that inherits from `DataFrame` and implements the `profile()` method with logic specific to Spark DataFrames.
3. **Create a Checker Function**: Write a function that returns `True` if the input object is a Spark DataFrame.
4. **Create a Register Function**: In the same file, create a function to register your new components with the factory.
5. **Activate the Plugin**: Add the path to your new module in the `DEFAULT_PLUGINS` list in `src/data_tools/dataframes/factory.py`.
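Put together, a plugin module following the five steps above might look roughly like the sketch below. This is illustrative only: the import path of the `DataFrame` base class, the shape of the `profile()` return value, and the factory's registration API are assumptions, not the library's documented interface.

```python
# src/data_tools/dataframes/types/spark.py (hypothetical sketch)
from pyspark.sql import DataFrame as NativeSparkDF

from data_tools.dataframes.base import DataFrame  # assumed location of the base class


class SparkDF(DataFrame):
    """Wrapper exposing a consistent API over a Spark DataFrame."""

    def __init__(self, df: NativeSparkDF):
        self.df = df

    def profile(self) -> dict:
        # Spark-specific profiling logic
        return {
            "count": self.df.count(),
            "columns": self.df.columns,
            "dtypes": dict(self.df.dtypes),
        }


def is_spark_dataframe(obj) -> bool:
    # Checker function: True only for Spark DataFrames
    return isinstance(obj, NativeSparkDF)


def register(factory) -> None:
    # Register the wrapper and its checker with the factory (assumed method name)
    factory.register(checker=is_spark_dataframe, wrapper=SparkDF)
```

The final step is still to list this module in `DEFAULT_PLUGINS` in `src/data_tools/dataframes/factory.py` so the factory loads it.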
- `TableProfiler`: Gathers basic table-level statistics (row count, column names).
- `ColumnProfiler`: Runs detailed profiling for each column (null counts, distinct counts, samples).
- `DataTypeIdentifierL1` & `DataTypeIdentifierL2`: Determine the logical data type for each column through two successive levels of analysis.
- `KeyIdentifier`: Analyzes the profiled data to predict the primary key of the dataset.
- `BusinessGlossaryGenerator(domain: str)`: Generates business-friendly descriptions and tags for all columns and the table itself, using the provided `domain` for context.
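For example, a full-depth pipeline combining all of the steps above might be assembled as follows. This is a sketch: it assumes the two type-identification steps are exposed as `DataTypeIdentifierL1` and `DataTypeIdentifierL2` and that each step takes no constructor arguments apart from the glossary generator's `domain`.

```python
from data_tools.analysis.pipeline import Pipeline
from data_tools.analysis.steps import (
    TableProfiler,
    ColumnProfiler,
    DataTypeIdentifierL1,
    DataTypeIdentifierL2,
    KeyIdentifier,
    BusinessGlossaryGenerator,
)

# Order matters: later steps build on the results produced by earlier ones.
full_pipeline = Pipeline([
    TableProfiler(),
    ColumnProfiler(),
    DataTypeIdentifierL1(),
    DataTypeIdentifierL2(),
    KeyIdentifier(),
    BusinessGlossaryGenerator(domain="e-commerce"),
])

# Run it the same way as in Example 2:
# product_dataset = full_pipeline.run(df=products_df, name="products")
```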

## Running Tests

@@ -124,4 +107,4 @@ uv run pytest

## License

This project is licensed under the terms of the LICENSE file.
130 changes: 130 additions & 0 deletions notebooks/sql_generator.ipynb

Large diffs are not rendered by default.
