27 commits
2fcbabd
feat: Duckdb compatible
juhel-phanju-intugle Aug 28, 2025
b84fcae
feat: updated model version compatible to scikit-learn==1.7.1, xgboos…
juhel-phanju-intugle Sep 2, 2025
49a06c2
feat: added fallback for np.float128 as it's not supported by all sys…
juhel-phanju-intugle Sep 6, 2025
02bcd31
Merge branch 'features/duckdb' into features/model-duckdb-merged
JaskaranIntugle Sep 7, 2025
75b48fd
added Knowledge Builder Module
JaskaranIntugle Sep 7, 2025
1481b39
added DataProductBuilder, updated documentation
JaskaranIntugle Sep 7, 2025
aa85d3e
DataProductBuilder typo fix
JaskaranIntugle Sep 7, 2025
464ece7
added macos instructions
JaskaranIntugle Sep 7, 2025
a0c4a99
updated readme to correct API KEY
JaskaranIntugle Sep 7, 2025
d29c6cc
added dev1
JaskaranIntugle Sep 7, 2025
362e8aa
added httpfs manually for duckdb mac
JaskaranIntugle Sep 7, 2025
f9098bd
updated version to 0.1.2dev2
JaskaranIntugle Sep 7, 2025
0766c30
added SSL certs for nltk downloads for mac users
JaskaranIntugle Sep 7, 2025
4894a3d
added warning when entering graph recursion
JaskaranIntugle Sep 9, 2025
09038ff
added " delimiter in dp_builder
JaskaranIntugle Sep 9, 2025
cb0e42d
semantic search
juhel-phanju-intugle Sep 9, 2025
05cecd6
knowledge builder can be resumed if failed for dataset pipeline, not …
JaskaranIntugle Sep 9, 2025
05b13bb
added key to yamls
JaskaranIntugle Sep 9, 2025
244da43
Made LLM_PROVIDER optional to decouple from downstream
JaskaranIntugle Sep 9, 2025
13f36d1
updated quickstart content
JaskaranIntugle Sep 9, 2025
889ea82
added configs, added syncing
JaskaranIntugle Sep 9, 2025
ef58204
added semantic search to knowledge_builder
JaskaranIntugle Sep 9, 2025
f5cf96c
updated tests, asyncio loop handling, sort in search
JaskaranIntugle Sep 10, 2025
ddc4ada
Merge branch 'features/semantic-search' into features/merged-semantic…
JaskaranIntugle Sep 10, 2025
22e0ff8
removed semantic search md
JaskaranIntugle Sep 10, 2025
bcb6fdc
updated quickstart
JaskaranIntugle Sep 10, 2025
23b2e59
incremented version
JaskaranIntugle Sep 10, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -209,6 +209,7 @@ notes.txt

testing_base
models
models_bak

settings.json
archived/
71 changes: 68 additions & 3 deletions README.md
@@ -4,7 +4,7 @@
</p>

[![Release](https://img.shields.io/github/release/Intugle/data-tools)](https://github.com/Intugle/data-tools/releases/tag/v0.1.0)
[![Made with Python](https://img.shields.io/badge/Made_with-Python-blue?logo=python&logoColor=white)](https://www.python.org/)
[![Made with Python](https://img.shields.io/badge/Made_with-Python-blue?logo=python&logoColor=white)](https://www.python.org/)
![contributions - welcome](https://img.shields.io/badge/contributions-welcome-blue)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Open Issues](https://img.shields.io/github/issues-raw/Intugle/data-tools)](https://github.com/Intugle/data-tools/issues)
@@ -85,7 +85,7 @@ For a detailed, hands-on introduction to the project, please see the [`quickstar
* **Accessing Enriched Metadata:** Learn how to access the profiling results and business glossary for each dataset.
* **Visualizing Relationships:** Visualize the predicted links between your tables.
* **Generating Data Products:** Use the semantic layer to generate data products and retrieve data.
* **Serving the Semantic Layer:** Learn how to start the MCP server to interact with your semantic layer using natural language.
* **Searching the Knowledge Base:** Use semantic search to find relevant columns in your datasets using natural language.

## Usage

@@ -147,7 +147,72 @@ data_product = dp_builder.build(etl)
print(data_product.to_df())
```

For detailed code examples and a complete walkthrough, please refer to our quickstart notebooks.
For detailed code examples and a complete walkthrough, please see the [`quickstart.ipynb`](quickstart.ipynb) notebook.

### Semantic Search

The semantic search feature allows you to search for columns in your datasets using natural language. It is built on top of the [Qdrant](https://qdrant.tech/) vector database.

#### Prerequisites

To use the semantic search feature, you need to have a running Qdrant instance. You can start one using the following Docker command:

```bash
docker run -p 6333:6333 -p 6334:6334 \
-v qdrant_storage:/qdrant/storage:z \
--name qdrant qdrant/qdrant
```

You also need to configure the Qdrant URL and API key (if using authorization) in your environment variables:

```bash
export QDRANT_URL="http://localhost:6333"
export QDRANT_API_KEY="your-qdrant-api-key" # if authorization is used
```
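
If you are working in a notebook, the same settings can be applied from Python instead of the shell — a minimal sketch, assuming the library reads these values from the process environment as the exports above suggest:

```python
import os

# Mirror the shell exports above; adjust the URL and key to your deployment.
os.environ["QDRANT_URL"] = "http://localhost:6333"
os.environ["QDRANT_API_KEY"] = "your-qdrant-api-key"  # only if authorization is enabled
```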

Currently, the semantic search feature supports only OpenAI embedding models (including Azure OpenAI deployments), so you need an OpenAI API key available in your environment. The default model is `text-embedding-ada-002`; you can switch models by setting the `EMBEDDING_MODEL_NAME` environment variable.

**For OpenAI models:**

```bash
export OPENAI_API_KEY="your-openai-api-key"
export EMBEDDING_MODEL_NAME="openai:ada"
```

**For Azure OpenAI models:**

```bash
export AZURE_OPENAI_API_KEY="your-azure-openai-api-key"
export AZURE_OPENAI_ENDPOINT="your-azure-openai-endpoint"
export OPENAI_API_VERSION="your-openai-api-version"
export EMBEDDING_MODEL_NAME="azure_openai:ada"
```

#### Usage

Once you have built the knowledge base, you can use the `search` method to perform a semantic search. The search function returns a pandas DataFrame containing the search results, including the column's profiling metrics, category, table name, and table glossary.

```python
from intugle import KnowledgeBuilder

# Define your datasets
datasets = {
"allergies": {"path": "path/to/allergies.csv", "type": "csv"},
"patients": {"path": "path/to/patients.csv", "type": "csv"},
"claims": {"path": "path/to/claims.csv", "type": "csv"},
# ... add other datasets
}

# Build the knowledge base
kb = KnowledgeBuilder(datasets, domain="Healthcare")
kb.build()
# Perform a semantic search
search_results = kb.search("patient allergies")

# View the search results
print(search_results)
```
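
Because the results come back as a plain pandas DataFrame, you can inspect and trim them with ordinary pandas operations — the exact columns depend on your datasets and build:

```python
# See which fields the search returned (profiling metrics, category, table name, glossary, ...).
print(search_results.columns.tolist())

# Keep only the top few matches for a quick look.
print(search_results.head(5))
```
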
For detailed code examples and a complete walkthrough, please see the [`quickstart.ipynb`](quickstart.ipynb) notebook.

## Community
