
Commit 988cbeb

Feature/multilp reltype (#124)
* composite key identification added
* added test cases
* moved distinct count to connector level
* lint
* storing distinct count
* added composite key cache to store distinct_count
* added langfuse observability
* implement multi lp
* tests
* refactored predicted links
* test cases pass
* added asserts for accuracy and intersect ratios
* added date for lp, connectors implementation, rate limiter
* add basketball dataset
* mcp server get schema refactored
* dataproduct refactor
* added uniqueness ratio
* Link type identification added to data product
* fix relationship type order mismatch bug
* rc1
* databricks refactor
* composite key graph visualization
* sqlserver optimized queries
* snowflake fixes
* graph update
* added legend. upgraded version
* added docs
* updated composite edge colour in streamlit
* added example notebook
* removed unnecessary print statements
* linting
* updated cardinality test case
1 parent 6715468 commit 988cbeb

62 files changed, +52020 -1385 lines changed

README.md

Lines changed: 3 additions & 1 deletion
@@ -110,11 +110,13 @@ For a detailed, hands-on introduction to the project, please see our quickstart
 | **Native Databricks with AI/BI Genie [ Tech Manufacturing ]** | [`quickstart_native_databricks.ipynb`](notebooks/quickstart_native_databricks.ipynb) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_native_databricks.ipynb) |
 | **Streamlit App** | [`quickstart_streamlit.ipynb`](notebooks/quickstart_streamlit.ipynb) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_streamlit.ipynb) |
 | **Conceptual Search** | [`quickstart_conceptual_search.ipynb`](notebooks/quickstart_conceptual_search.ipynb) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_conceptual_search.ipynb) |
+| **Composite Relationships Prediction** | [`quickstart_basketball_composite_links.ipynb`](notebooks/quickstart_basketball_composite_links.ipynb) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_basketball_composite_links.ipynb) |
+
 These datasets will take you through the following steps:
 
 * **Generate Semantic Model** → The unified layer that transforms fragmented datasets, creating the foundation for connected intelligence.
 * **1.1 Profile and classify data** → Analyze your data sources to understand their structure, data types, and other characteristics.
-* **1.2 Discover links & relationships among data** → Reveal meaningful connections (PK & FK) across fragmented tables.
+* **1.2 Discover links & relationships among data** → Reveal meaningful connections (PK & FK), including composite keys, across fragmented tables.
 * **1.3 Generate a business glossary** → Create business-friendly terms and use them to query data with context.
 * **1.4 Enable semantic search** → Intelligent search that understands meaning, not just keywords—making data more accessible across both technical and business users.
 * **1.5 Visualize semantic model**→ Get access to enriched metadata of the semantic layer in the form of YAML files and visualize in the form of graph
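
Step 1.2 now mentions composite keys, and the commit message refers to a "distinct count" and a "uniqueness ratio". The sketch below is one plausible reading of those terms, not the library's implementation: it treats the uniqueness ratio as distinct value tuples divided by row count, and reports the smallest column combinations whose ratio is 1.0. The function names and the sample rows are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def uniqueness_ratio(rows: Sequence[Row], columns: Sequence[str]) -> float:
    """Distinct value tuples over `columns`, divided by the row count (assumed definition)."""
    if not rows:
        return 0.0
    distinct = {tuple(row[c] for c in columns) for row in rows}
    return len(distinct) / len(rows)


def candidate_keys(rows: Sequence[Row], max_width: int = 2) -> List[Tuple[str, ...]]:
    """Return minimal column combinations whose value tuples are all unique."""
    if not rows:
        return []
    names = list(rows[0])
    found: List[Tuple[str, ...]] = []
    for width in range(1, max_width + 1):
        for combo in combinations(names, width):
            # Skip supersets of a key that was already found.
            if any(set(key) <= set(combo) for key in found):
                continue
            if uniqueness_ratio(rows, combo) == 1.0:
                found.append(combo)
    return found


# Hypothetical rows: neither column is unique alone, but the pair is.
rows = [
    {"game_id": 1, "player_id": 10, "points": 12},
    {"game_id": 1, "player_id": 11, "points": 12},
    {"game_id": 2, "player_id": 10, "points": 12},
]
print(candidate_keys(rows))
```

A brute-force search like this grows combinatorially with the number of columns, which is presumably why the commit caches distinct counts and pushes them to the connector level.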

docsite/docs/core-concepts/semantic-intelligence/dataset.md

Lines changed: 5 additions & 2 deletions
@@ -39,7 +39,10 @@ The library organizes metadata using Pydantic models, but you can access it thro
 - **Table-Level Metadata**: Accessed via `dataset.source.table`.
 - `.name: str`
 - `.description: str`
-- `.key: Optional[str]`
+- `.key: Optional[PrimaryKey]`
+- **`PrimaryKey`**: Defines the primary key for a table.
+- `.columns: List[str]` (A list of column names that make up the primary key.)
+- `.distinct_count: Optional[int]` (The number of distinct values in the primary key column(s).)
 - **Column-Level Metadata**: Accessed via the `dataset.columns` dictionary, where keys are column names.
 - `[column_name].description: Optional[str]`
 - `[column_name].type: Optional[str]` (for example, 'integer', 'date')
@@ -68,7 +71,7 @@ print(f"Schema: {customers_dataset.source.schema}")
 # Access table-level metadata
 print(f"Table Name: {customers_dataset.source.table.name}")
 print(f"Table Description: {customers_dataset.source.table.description}")
-print(f"Primary Key: {customers_dataset.source.table.key}")
+print(f"Primary Key Columns: {customers_dataset.source.table.key.columns}")
 
 # Access column-level metadata using the 'columns' dictionary
 email_column = customers_dataset.columns['email']
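
Because `.key` is documented as `Optional[PrimaryKey]`, the updated example line `table.key.columns` would raise `AttributeError` for a table where no key was identified. A minimal sketch of a guarded access follows. The dataclasses are stand-ins that only mirror the fields documented in this diff (the library itself uses Pydantic models), and the table names are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PrimaryKey:
    """Stand-in mirroring the documented fields; not the library's class."""
    columns: List[str]
    distinct_count: Optional[int] = None


@dataclass
class Table:
    """Stand-in for the object behind `dataset.source.table`."""
    name: str
    key: Optional[PrimaryKey] = None


def describe_key(table: Table) -> str:
    """Render the primary key, tolerating a missing key or a missing count."""
    if table.key is None:
        return f"{table.name}: no primary key identified"
    cols = ", ".join(table.key.columns)
    if table.key.distinct_count is None:
        return f"{table.name}: ({cols})"
    return f"{table.name}: ({cols}), {table.key.distinct_count} distinct values"


# Hypothetical tables: single-column key, composite key, and no key.
print(describe_key(Table("customers", PrimaryKey(["customer_id"], 1000))))
print(describe_key(Table("box_scores", PrimaryKey(["game_id", "player_id"]))))
print(describe_key(Table("staging_events")))
```

The same check applies to any caller that previously treated `.key` as a plain string.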

docsite/docs/core-concepts/semantic-intelligence/link-prediction.md

Lines changed: 4 additions & 4 deletions
@@ -5,7 +5,7 @@ title: Link Prediction
 
 # Link prediction
 
-Link prediction is one of the most powerful features of the Intugle Data Tools library. It's the process of automatically discovering meaningful relationships and potential join keys between different, isolated datasets. This turns a collection of separate tables into a connected semantic graph, which is the foundation for building unified data products.
+Link prediction is one of the most powerful features of the Intugle Data Tools library. It's the process of automatically discovering meaningful relationships and potential join keys between different, isolated datasets. This turns a collection of separate tables into a connected semantic graph, which is the foundation for building unified data products. The library now supports the prediction of **composite key** relationships, where multiple columns together form a link between tables.
 
 ## The LinkPredictor class
 
@@ -28,7 +28,7 @@ links_list = predictor_instance.links
 To use the `LinkPredictor` manually, you must give it a list of fully profiled `DataSet` objects.
 
 ```python
-from intugle.analysis.models import DataSet,
+from intugle.analysis.models import DataSet
 from intugle.link_predictor.predictor import LinkPredictor
 
 
@@ -52,7 +52,7 @@ predictor.predict(save=True)
 # The discovered links are stored as a list of PredictedLink objects in the `links` attribute
 links_list = predictor.links
 for link in links_list:
-    print(f"Found link from {link.from_dataset}.{link.from_column} to {link.to_dataset}.{link.to_column}")
+    print(f"Found link from {link.from_dataset}.{link.from_columns} to {link.to_dataset}.{link.to_columns}")
 ```
 
 ### Caching mechanism
@@ -74,7 +74,7 @@ A utility function that converts the `links` list into a Pandas DataFrame. This
 links_df = predictor.get_links_df()
 
 # Display the DataFrame
-# columns: from_dataset, from_column, to_dataset, to_column
+# columns: from_dataset, from_columns, to_dataset, to_columns
 print(links_df)
 ```
 
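
The rename from `from_column`/`to_column` to `from_columns`/`to_columns` suggests that each side of a link is now a list of column names. The sketch below shows how a consumer might turn such a link into a join condition. `PredictedLink` here is a stand-in dataclass carrying only the four fields visible in this diff, `join_condition` is a hypothetical helper, and pairing the columns by position is an assumption the source does not confirm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PredictedLink:
    """Stand-in with the four fields shown in the diff; not the library's class."""
    from_dataset: str
    from_columns: List[str]
    to_dataset: str
    to_columns: List[str]


def join_condition(link: PredictedLink) -> str:
    """Build an SQL-style ON clause, pairing columns by position (an assumption)."""
    if len(link.from_columns) != len(link.to_columns):
        raise ValueError("from_columns and to_columns must have the same length")
    pairs = zip(link.from_columns, link.to_columns)
    return " AND ".join(
        f"{link.from_dataset}.{src} = {link.to_dataset}.{dst}" for src, dst in pairs
    )


# Hypothetical composite link between two made-up tables.
link = PredictedLink(
    from_dataset="box_scores",
    from_columns=["game_id", "team_id"],
    to_dataset="team_games",
    to_columns=["game_id", "team_id"],
)
print(join_condition(link))
```

A single-column link is simply the one-element case, so code written against the list form handles both.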

docsite/docs/examples.md

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ For a detailed, hands-on introduction to the project, please see our quickstart
 | **Native Snowflake with Cortex Analyst [ Tech Manufacturing ]** | [`quickstart_native_snowflake.ipynb`](https://github.com/Intugle/data-tools/blob/main/notebooks/quickstart_native_snowflake.ipynb) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_native_snowflake.ipynb) |
 | **Native Databricks with AI/BI Genie [ Tech Manufacturing ]** | [`quickstart_native_databricks.ipynb`](https://github.com/Intugle/data-tools/blob/main/notebooks/quickstart_native_databricks.ipynb) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_native_databricks.ipynb) |
 | **Streamlit App** | [`quickstart_streamlit.ipynb`](https://github.com/Intugle/data-tools/blob/main/notebooks/quickstart_streamlit.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_streamlit.ipynb) |
+| **Composite Relationships Prediction** | [`quickstart_basketball_composite_links.ipynb`](https://github.com/Intugle/data-tools/blob/main/notebooks/quickstart_basketball_composite_links.ipynb) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_basketball_composite_links.ipynb) |
 
 These datasets will take you through the following steps:
 