
Commit 8ab0b25

Authored by narotsit-intugle, juhel-phanju-intugle, and JaskaranIntugle

Features/final touches (#16)

* feat: DuckDB compatibility: added a DuckDB adapter, made changes in the SQL generator for query compatibility, and added details in the source
* feat: updated model versions, compatible with scikit-learn==1.7.1 and xgboost==3.0.4
* feat: added a fallback for np.float128, as it is not supported on all systems
* added the Knowledge Builder module
* added DataProductBuilder, updated documentation
* DataProductBuilder typo fix
* added macOS instructions
* updated README to correct the API key
* added dev1
* added httpfs manually for DuckDB on macOS
* updated version to 0.1.2dev2
* added SSL certificates for nltk downloads for macOS users
* added retry to the output parser, removed the MCP documentation
* updated documentation
* added virtual environment docs
* included all tables

Co-authored-by: juhel-phanju-intugle <juhel@intugle.ai>
Co-authored-by: JaskaranIntugle <jaskaran@intugle.ai>

1 parent eee45f5 commit 8ab0b25
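One of the bullets above mentions a fallback for `np.float128`, which is not available on every platform. A minimal sketch of that kind of platform guard (illustrative only, not Intugle's actual implementation) could look like:

```python
import numpy as np

# np.float128 is not exposed on every platform (e.g. most Windows builds),
# while np.longdouble always exists, so alias whichever is available.
# Illustrative sketch only; not Intugle's actual code.
LongFloat = getattr(np, "float128", np.longdouble)

x = np.array([1.0, 2.0, 3.0], dtype=LongFloat)
print(x.dtype.kind)  # "f": a floating-point dtype on every platform
```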

File tree

35 files changed: +2082 −1317 lines

.gitignore (1 addition, 0 deletions)

```diff
@@ -211,3 +211,4 @@ testing_base
 models

 settings.json
+archived/
```

README.md (67 additions, 62 deletions)
````diff
@@ -10,45 +10,63 @@
 [![Open Issues](https://img.shields.io/github/issues-raw/Intugle/data-tools)](https://github.com/Intugle/data-tools/issues)
 [![GitHub star chart](https://img.shields.io/github/stars/Intugle/data-tools?style=social)](https://github.com/Intugle/data-tools/stargazers)

-*Automated Data Profiling, Link Prediction, and Semantic Layer Generation*
+*Transform Fragmented Data into Connected Semantic Layer*

 ## Overview

-Intugle provides a set of GenAI-powered Python tools to simplify and accelerate the journey from raw data to insights. This library empowers data and business teams to build an intelligent semantic layer over their data, enabling self-serve analytics and natural language queries. By automating data profiling, link prediction, and SQL generation, Intugle helps you build data products faster and more efficiently than traditional methods.
+Intugle’s GenAI-powered open-source Python library builds an intelligent semantic layer over your existing data systems. At its core, it discovers meaningful links and relationships across data assets — enriching them with profiles, classifications, and business glossaries. With this connected knowledge layer, you can enable semantic search and auto-generate queries to create unified data products, making data integration and exploration faster, more accurate, and far less manual.

 ## Who is this for?

-This tool is designed for both **data teams** and **business teams**.
-
-* **Data teams** can use it to automate data profiling, schema discovery, and documentation, significantly accelerating their workflow.
-* **Business teams** can use it to gain a better understanding of their data and to perform self-service analytics without needing to write complex SQL queries.
+* **Data Engineers & Architects** often spend weeks manually profiling, classifying, and stitching together fragmented data assets. With Intugle, they can automate this process end-to-end, uncovering meaningful links and relationships to instantly generate a connected semantic layer.
+* **Data Analysts & Scientists** spend endless hours on data readiness and preparation before they can even start the real analysis. Intugle accelerates this by providing contextual intelligence, automatically generating SQL and reusable data products enriched with relationships and business meaning.
+* **Business Analysts & Decision Makers** are slowed down by constant dependence on technical teams for answers. Intugle removes this bottleneck by enabling natural language queries and semantic search, giving them trusted insights on demand.

 ## Features

-* **Automated Data Profiling:** Generate detailed statistics for each column in your dataset, including distinct count, uniqueness, completeness, and more.
-* **Datatype Identification:** Automatically identify the data type of each column (e.g., integer, string, datetime).
-* **Key Identification:** Identify potential primary keys in your tables.
-* **LLM-Powered Link Prediction:** Use GenAI to automatically discover relationships (foreign keys) between tables.
-* **Business Glossary Generation:** Generate a business glossary for each column, with support for industry-specific domains.
-* **Semantic Layer Generation:** Create YAML files that define your semantic layer, including models (tables) and their relationships.
-* **SQL Generation:** Generate SQL queries from the semantic layer, allowing you to query your data using business-friendly terms.
+* **Semantic Intelligence:** Transform raw, fragmented datasets into an intelligent semantic graph that captures entities, relationships, and context — the foundation for connected intelligence.
+* **Business Glossary & Semantic Search:** Auto-generate a business glossary and enable search that understands meaning, not just keywords — making data more accessible across technical and business users.
+* **Smart SQL & Data Products:** Instantly generate SQL and reusable data products enriched with context, eliminating manual pipelines and accelerating data-to-insight.

 ## Getting Started

 ### Installation

+Before installing, it is recommended to create a virtual environment:
+
+```bash
+python -m venv .venv
+source .venv/bin/activate
+```
+
+Then, install the package:
+
 ```bash
 pip install intugle
 ```

+#### macOS
+
+For macOS users, you may need to install the `libomp` library:
+
+```bash
+brew install libomp
+```
+
+If you installed Python using the official installer from python.org, you may also need to install SSL certificates by running the following command in your terminal. Please replace `3.XX` with your specific Python version. This step is not necessary if you installed Python using Homebrew.
+
+```bash
+/Applications/Python\ 3.XX/Install\ Certificates.command
+```
+
 ### Configuration

 Before running the project, you need to configure an LLM. This is used for tasks like generating business glossaries and predicting links between tables.

 You can configure the LLM by setting the following environment variables:

 * `LLM_PROVIDER`: The LLM provider and model to use (e.g., `openai:gpt-3.5-turbo`) following LangChain's [conventions](https://python.langchain.com/docs/integrations/chat/)
-* `OPENAI_API_KEY`: Your API key for the LLM provider.
+* `API_KEY`: Your API key for the LLM provider. The exact name of the variable may vary from provider to provider.

 Here's an example of how to set these variables in your environment:

````
````diff
@@ -59,66 +77,53 @@ export OPENAI_API_KEY="your-openai-api-key"

 ## Quickstart

-For a detailed, hands-on introduction to the project, please see the [`quickstart.ipynb`](quickstart.ipynb) notebook. It will walk you through the entire process of profiling your data, predicting links, generating a semantic layer, and querying your data.
-
-## Usage
-
-The core workflow of the project involves the following steps:
-
-1. **Load your data:** Load your data into a DataSet object.
-2. **Run the analysis pipeline:** Use the `run()` method to profile your data and generate a business glossary.
-3. **Predict links:** Use the `LinkPredictor` to discover relationships between your tables.
-
-```python
-from intugle import LinkPredictor
+For a detailed, hands-on introduction to the project, please see the [`quickstart.ipynb`](quickstart.ipynb) notebook. It will walk you through the entire process of building a semantic layer, including:

-# Initialize the predictor
-predictor = LinkPredictor(datasets)
+* **Building a Knowledge Base:** Use the `KnowledgeBuilder` to automatically profile your data, generate a business glossary, and predict links between tables.
+* **Accessing Enriched Metadata:** Learn how to access the profiling results and business glossary for each dataset.
+* **Visualizing Relationships:** Visualize the predicted links between your tables.
+* **Generating Data Products:** Use the semantic layer to generate data products and retrieve data.
+* **Serving the Semantic Layer:** Learn how to start the MCP server to interact with your semantic layer using natural language.

-# Run the prediction
-results = predictor.predict()
-results.show_graph()
-```
-
-5. **Generate SQL:** Use the `SqlGenerator` to generate SQL queries from the semantic layer.
-
-```python
-from intugle import SqlGenerator
+## Usage

-# Create a SqlGenerator
-sql_generator = SqlGenerator()
+The core workflow of the project involves using the `KnowledgeBuilder` to build a semantic layer, and then using the `DataProductBuilder` to generate data products from that layer.

-# Create an ETL model
-etl = {
-    "name": "test_etl",
-    "fields": [
-        {"id": "patients.first", "name": "first_name"},
-        {"id": "patients.last", "name": "last_name"},
-        {"id": "allergies.start", "name": "start_date"},
-    ],
-    "filter": {
-        "selections": [{"id": "claims.departmentid", "values": ["3", "20"]}],
-    },
-}
+```python
+from intugle import KnowledgeBuilder, DataProductBuilder

-# Generate the query
-sql_query = sql_generator.generate_query(etl_model)
-print(sql_query)
-```
+# Define your datasets
+datasets = {
+    "allergies": {"path": "path/to/allergies.csv", "type": "csv"},
+    "patients": {"path": "path/to/patients.csv", "type": "csv"},
+    # ... add other datasets
+}

-For detailed code examples and a complete walkthrough, please refer to the [`quickstart.ipynb`](quickstart.ipynb) notebook.
+# Build the knowledge base
+kb = KnowledgeBuilder(datasets, domain="Healthcare")
+kb.build()

-### MCP Server
+# Create a DataProductBuilder
+dp_builder = DataProductBuilder()

-This tool also includes an MCP server that exposes your semantic layer as a set of tools that can be used by an LLM client. This enables you to interact with your semantic layer using natural language to generate SQL queries, discover data, and more.
+# Define an ETL model
+etl = {
+    "name": "patient_allergies",
+    "fields": [
+        {"id": "patients.first", "name": "first_name"},
+        {"id": "patients.last", "name": "last_name"},
+        {"id": "allergies.description", "name": "allergy"},
+    ],
+}

-To start the MCP server, run the following command:
+# Generate the data product
+data_product = dp_builder.build(etl)

-```bash
-intugle-mcp
+# View the data product as a DataFrame
+print(data_product.to_df())
 ```

-You can then connect to the server from any MCP client, such as Claude Desktop or Gemini CLI, at `http://localhost:8000/semantic_layer/mcp`.
+For detailed code examples and a complete walkthrough, please refer to the [`quickstart.ipynb`](quickstart.ipynb) notebook.

 ## Contributing

````
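The commit message also mentions adding a retry to the output parser. A generic sketch of that pattern, with hypothetical names rather than Intugle's actual API, might look like:

```python
import time

def parse_with_retry(parse, raw, attempts=3, delay=0.0):
    """Call a parser that may raise, retrying a few times before giving up.

    `parse` and `raw` are hypothetical stand-ins, not Intugle's real API.
    """
    last_exc = None
    for _ in range(attempts):
        try:
            return parse(raw)
        except ValueError as exc:
            last_exc = exc
            time.sleep(delay)  # optional back-off between attempts
    raise last_exc

# Example: a parser that fails on its first call, then succeeds.
calls = []
def flaky_parse(raw):
    calls.append(raw)
    if len(calls) < 2:
        raise ValueError("malformed output")
    return raw.strip()

print(parse_with_retry(flaky_parse, "  ok  "))  # prints: ok
```

In an LLM pipeline, the retry typically re-invokes the model when its output fails validation, so a bounded attempt count keeps a persistently bad response from looping forever.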

Comments (0)