Commit 5ee1164

Merge pull request #61 from Intugle/docs/connector-implentation

DOCS: How to add a connector

2 parents 2950740 + ee5f0d4

File tree

6 files changed (+272, -7 lines)
Lines changed: 203 additions & 0 deletions
---
sidebar_position: 4
---

# Implementing a Connector

:::tip Pro Tip: Use an AI Coding Assistant
The fastest way to implement a new adapter is to use an AI coding assistant like the **Gemini CLI**, **Cursor**, or **Claude**.

1. **Provide Context:** Give the assistant the code for an existing, similar adapter (e.g., `SnowflakeAdapter` or `DatabricksAdapter`).
2. **State Your Goal:** Ask it to replicate the structure and logic for your new data source. For example: *"Using the Snowflake adapter as a reference, create a new adapter for MyConnector."*
3. **Iterate:** The assistant can generate the boilerplate code for the models, the adapter class, and the registration functions, allowing you to focus on the specific implementation details for your database driver.
:::

`intugle` is designed to be extensible, allowing you to connect to any data source by creating a custom adapter. This guide walks you through the process of building your own connector.

If you build a connector that could benefit the community, we strongly encourage you to [open a pull request and contribute it](https://github.com/Intugle/data-tools/blob/main/CONTRIBUTING.md) to the `intugle` project!

## Overview

An adapter is a Python class that inherits from `intugle.adapters.adapter.Adapter` and implements a set of methods for interacting with a specific data source. It handles everything from connecting to the database to profiling data and executing queries.

The core steps to create a new connector are:

1. **Create the Scaffolding:** Set up the necessary directory and files.
2. **Define Configuration Models:** Create Pydantic models for your connector's configuration.
3. **Implement the Adapter Class:** Write the logic to interact with your data source.
4. **Register the Adapter:** Make your new adapter discoverable by the `intugle` factory.

## Step 1: Create the Scaffolding

First, create a new directory for your connector within the `src/intugle/adapters/types/` directory. For a connector named `myconnector`, you would create:

```
src/intugle/adapters/types/myconnector/
├── __init__.py
├── models.py
└── myconnector.py
```

- `__init__.py`: Can be an empty file.
- `models.py`: Will contain the Pydantic configuration models.
- `myconnector.py`: Will contain the main adapter class logic.

## Step 2: Define Configuration Models

In `src/intugle/adapters/types/myconnector/models.py`, you need to define two Pydantic models:

1. **Connection Config:** Defines the parameters needed to connect to your data source (e.g., host, user, password). This is the format that is read from `profiles.yml`.
2. **Data Config:** Defines how to identify a specific table or asset from that source. This is the format used to pass datasets into the `SemanticModel`.

**Example `models.py`:**

```python
from typing import Optional

from intugle.common.schema import SchemaBase

class MyConnectorConnectionConfig(SchemaBase):
    host: str
    port: int
    user: str
    password: str
    schema: Optional[str] = None

class MyConnectorConfig(SchemaBase):
    identifier: str
    type: str = "myconnector"
```
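
Since the connection config is read from `profiles.yml` (the adapter in Step 3 looks it up under the `myconnector` key), the corresponding entry would look roughly like the sketch below. The exact nesting of `profiles.yml` is an assumption here; check an existing adapter's documentation for the precise layout:

```yaml
# Hypothetical profiles.yml entry matching MyConnectorConnectionConfig
myconnector:
  host: db.example.com
  port: 5432
  user: analyst
  password: "${MYCONNECTOR_PASSWORD}"
  schema: analytics
```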

Finally, open `src/intugle/adapters/models.py` and add your new `MyConnectorConfig` to the `DataSetData` type hint:

```python
# src/intugle/adapters/models.py

# ... other imports
from intugle.adapters.types.myconnector.models import MyConnectorConfig

DataSetData = pd.DataFrame | DuckdbConfig | ... | MyConnectorConfig
```
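
Both config models behave like ordinary Pydantic models, which is what makes the `model_validate` calls in the adapter and the `can_handle` check work. The following is a self-contained sketch using plain `pydantic.BaseModel` as a stand-in for intugle's `SchemaBase`:

```python
from pydantic import BaseModel, ValidationError


# Stand-in for the data config above; SchemaBase is replaced by BaseModel here
class MyConnectorConfig(BaseModel):
    identifier: str
    type: str = "myconnector"


# A raw dict (e.g. parsed from a project file) validates into a typed config
config = MyConnectorConfig.model_validate({"identifier": "analytics.orders"})
print(config.type)  # defaults to "myconnector"

# Invalid payloads raise ValidationError, which can_handle-style checks rely on
try:
    MyConnectorConfig.model_validate({"type": "myconnector"})  # missing identifier
except ValidationError:
    print("rejected")
```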

## Step 3: Implement the Adapter Class

In `src/intugle/adapters/types/myconnector/myconnector.py`, create your adapter class. It must inherit from `Adapter` and implement its abstract methods.

This is a simplified skeleton. You can look at the `DatabricksAdapter` or `SnowflakeAdapter` for a more complete example.

**Example `myconnector.py`:**

```python
from typing import Any, Optional

import pandas as pd

from intugle.adapters.adapter import Adapter
from intugle.adapters.factory import AdapterFactory
from intugle.adapters.models import ColumnProfile, DataSetData, ProfilingOutput
from intugle.core import settings

from .models import MyConnectorConfig, MyConnectorConnectionConfig

# Import your database driver
# import myconnector_driver


class MyConnectorAdapter(Adapter):
    def __init__(self):
        # Initialize your connection here
        connection_params = settings.PROFILES.get("myconnector", {})
        config = MyConnectorConnectionConfig.model_validate(connection_params)
        # self.connection = myconnector_driver.connect(**config.model_dump())

    # --- Must be implemented ---

    def profile(self, data: Any, table_name: str) -> ProfilingOutput:
        # Return table-level metadata: row count, column names, and dtypes
        raise NotImplementedError()

    def column_profile(self, data: Any, table_name: str, column_name: str, total_count: int) -> Optional[ColumnProfile]:
        # Return column-level statistics: null count, distinct count, samples, etc.
        raise NotImplementedError()

    def execute(self, query: str):
        # Execute a query and return the raw results
        raise NotImplementedError()

    def to_df_from_query(self, query: str) -> pd.DataFrame:
        # Execute a query and return the result as a pandas DataFrame
        raise NotImplementedError()

    def create_table_from_query(self, table_name: str, query: str) -> str:
        # Materialize a query as a new table or view
        raise NotImplementedError()

    def create_new_config_from_etl(self, etl_name: str) -> DataSetData:
        # Return a new MyConnectorConfig for a materialized table
        return MyConnectorConfig(identifier=etl_name)

    def intersect_count(self, table1: "DataSet", column1_name: str, table2: "DataSet", column2_name: str) -> int:
        # Calculate the count of intersecting values between two columns
        raise NotImplementedError()

    # --- Other required methods ---

    def load(self, data: Any, table_name: str):
        # For database adapters, this is often a no-op
        pass

    def to_df(self, data: DataSetData, table_name: str):
        # Read an entire table into a pandas DataFrame
        config = MyConnectorConfig.model_validate(data)
        return self.to_df_from_query(f"SELECT * FROM {config.identifier}")

    def get_details(self, data: DataSetData):
        config = MyConnectorConfig.model_validate(data)
        return config.model_dump()
```
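
To make the abstract methods more concrete: `intersect_count` can usually be pushed down to the source as a single query. The sketch below builds one plausible query shape and exercises it against an in-memory SQLite database; the exact dialect and identifier quoting are assumptions to adapt for your driver:

```python
import sqlite3


def build_intersect_count_query(table1: str, column1: str, table2: str, column2: str) -> str:
    # Count distinct values that appear in both columns. INTERSECT is widely
    # supported, but verify the syntax against your database's dialect.
    return (
        f"SELECT COUNT(*) FROM ("
        f"SELECT DISTINCT {column1} AS v FROM {table1} "
        f"INTERSECT "
        f"SELECT DISTINCT {column2} AS v FROM {table2})"
    )


# Exercise the generated SQL against an in-memory SQLite database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer_id INTEGER)")
con.execute("CREATE TABLE customers (id INTEGER)")
con.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (2,), (5,)])
con.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,), (3,)])

query = build_intersect_count_query("orders", "customer_id", "customers", "id")
count = con.execute(query).fetchone()[0]
print(count)  # values 1 and 2 appear in both columns -> 2
```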

## Step 4: Register the Adapter

To make `intugle` aware of your new adapter, you must register it with the factory.

1. **Add registration functions to `myconnector.py`:** At the bottom of your adapter file, add two functions: one to check if the adapter can handle a given data config, and one to register it with the factory.

   ```python
   # In src/intugle/adapters/types/myconnector/myconnector.py

   def can_handle_myconnector(df: Any) -> bool:
       try:
           MyConnectorConfig.model_validate(df)
           return True
       except Exception:
           return False

   def register(factory: AdapterFactory):
       # Check if the required driver is installed
       # if MYCONNECTOR_DRIVER_AVAILABLE:
       factory.register("myconnector", can_handle_myconnector, MyConnectorAdapter)
   ```

2. **Add the adapter to the default plugins list:** Open `src/intugle/adapters/factory.py` and add the path to your new adapter module.

   ```python
   # In src/intugle/adapters/factory.py

   DEFAULT_PLUGINS = [
       "intugle.adapters.types.pandas.pandas",
       # ... other adapters
       "intugle.adapters.types.myconnector.myconnector",
   ]
   ```
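
The dispatch that makes this registration useful can be pictured as a first-match scan over the registered `can_handle` predicates. The following stdlib-only sketch is not intugle's actual `AdapterFactory` implementation; it only illustrates why the predicate and the adapter class are registered together:

```python
from typing import Any, Callable


class ToyAdapterFactory:
    """Illustrative stand-in for intugle's AdapterFactory."""

    def __init__(self):
        self._registry: list[tuple[str, Callable[[Any], bool], type]] = []

    def register(self, name: str, can_handle: Callable[[Any], bool], adapter_cls: type) -> None:
        self._registry.append((name, can_handle, adapter_cls))

    def resolve(self, data: Any) -> str:
        # Return the name of the first adapter whose predicate accepts the data
        for name, can_handle, _adapter_cls in self._registry:
            if can_handle(data):
                return name
        raise ValueError("No registered adapter can handle this data config")


def can_handle_myconnector(data: Any) -> bool:
    # Simplified predicate; the real one validates against MyConnectorConfig
    return isinstance(data, dict) and data.get("type") == "myconnector"


factory = ToyAdapterFactory()
factory.register("myconnector", can_handle_myconnector, object)
print(factory.resolve({"identifier": "orders", "type": "myconnector"}))  # myconnector
```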

## Step 5: Add Optional Dependencies

If your adapter requires a specific driver library (like `databricks-sql-connector` for Databricks), you should add it as an optional dependency.

1. Open the `pyproject.toml` file at the root of the project.
2. Add a new extra under the `[project.optional-dependencies]` section.

```toml
# In pyproject.toml

[project.optional-dependencies]
# ... other dependencies
myconnector = ["myconnector-driver-library>=1.0.0"]
```

This allows users to install the necessary libraries by running `pip install "intugle[myconnector]"`.
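
Once installed and registered, the data config from Step 2 is what you pass for each dataset. The `SemanticModel` call below is left commented out because its exact signature is an assumption; the dict shape is the part defined by `MyConnectorConfig`:

```python
# Each dataset is described by a MyConnectorConfig-shaped dict
datasets = {
    "orders": {"identifier": "analytics.orders", "type": "myconnector"},
    "customers": {"identifier": "analytics.customers", "type": "myconnector"},
}

# Hypothetical usage; verify the SemanticModel API against the intugle docs
# from intugle import SemanticModel
# sm = SemanticModel(datasets)

print(sorted(datasets))  # ['customers', 'orders']
```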

That's it! You have now implemented and registered a custom connector.

docsite/docs/core-concepts/semantic-intelligence/semantic-search.md

Lines changed: 24 additions & 0 deletions

@@ -71,6 +71,30 @@ export AZURE_OPENAI_ENDPOINT="your-azure-openai-endpoint"
 export OPENAI_API_VERSION="your-openai-api-version"
 ```
 
+#### Using a Custom Embeddings Instance
+
+If you need to use a pre-initialized embeddings model, you can directly inject the model instance.
+
+The custom model must be an instance of `langchain_core.embeddings.embeddings.Embeddings`.
+
+You can set the custom instance by modifying the `intugle.core.settings` module **before** you import and use the `SemanticModel`.
+
+**Example:**
+```python
+# main.py
+from intugle.core import settings
+
+# This must be an object that inherits from Embeddings
+my_embeddings_instance = ...
+
+# Set the custom instance in the settings
+settings.CUSTOM_EMBEDDINGS_INSTANCE = my_embeddings_instance
+
+# Now, any intugle modules imported after this point will use your custom model
+# from intugle import SemanticModel
+# ...
+```
+
 ## Usage with SemanticModel
 
 The simplest way to use semantic search is through the `SemanticModel` after the semantic model has been built.

docsite/docs/getting-started.md

Lines changed: 24 additions & 0 deletions

@@ -55,4 +55,28 @@ Here's an example of how to set these variables in your environment:
 ```bash
 export LLM_PROVIDER="openai:gpt-3.5-turbo"
 export OPENAI_API_KEY="your-openai-api-key"
+```
+
+### Using a Custom LLM Instance
+
+For environments where you need to use a pre-initialized language model, you can directly inject the model instance.
+
+The custom LLM must be an instance of `langchain_core.language_models.chat_models.BaseChatModel`.
+
+You can set the custom instance by modifying the `intugle.core.settings` module **before** you import and use any `intugle` classes.
+
+**Example:**
+```python
+# main.py
+from intugle.core import settings
+
+# This must be an object that inherits from BaseChatModel
+my_llm_instance = ...
+
+# Set the custom instance in the settings
+settings.CUSTOM_LLM_INSTANCE = my_llm_instance
+
+# Now, any intugle modules imported after this point will use your custom LLM
+
+# ... rest of your code
 ```

src/intugle/core/llms/chat.py

Lines changed: 12 additions & 5 deletions

@@ -1,6 +1,6 @@
 import logging
 
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Optional
 
 from langchain.chat_models import init_chat_model
 from langchain.output_parsers import (
@@ -30,7 +30,7 @@ class ChatModelLLM:
 
     def __init__(
         self,
-        model_name: str,
+        model_name: Optional[str] = None,
         response_schemas: list[ResponseSchema] = None,
         output_parser=StructuredOutputParser,
         prompt_template=ChatPromptTemplate,
@@ -39,9 +39,14 @@ def __init__(
         *args,
         **kwargs,
    ):
-        self.model: BaseChatModel = init_chat_model(
-            model_name, max_retries=self.MAX_RETRIES, rate_limiter=self._get_rate_limiter(), **config
-        )  # llm model
+        if settings.CUSTOM_LLM_INSTANCE:
+            self.model: "BaseChatModel" = settings.CUSTOM_LLM_INSTANCE
+        elif model_name:
+            self.model: "BaseChatModel" = init_chat_model(
+                model_name, max_retries=self.MAX_RETRIES, rate_limiter=self._get_rate_limiter(), **config
+            )
+        else:
+            raise ValueError("Either 'settings.CUSTOM_LLM_INSTANCE' must be set or 'LLM_PROVIDER' must be provided.")
 
         self.parser: StructuredOutputParser = output_parser  # the output parser
 
@@ -135,6 +140,8 @@ def invoke(self, *args, **kwargs):
 
     @classmethod
     def get_llm(cls, model_name: str, llm_config: dict = {}):
+        if settings.CUSTOM_LLM_INSTANCE:
+            return settings.CUSTOM_LLM_INSTANCE
         return init_chat_model(
             model_name, max_retries=cls.MAX_RETRIES, rate_limiter=cls._get_rate_limiter(), **llm_config
         )

src/intugle/core/llms/embeddings.py

Lines changed: 6 additions & 1 deletion

@@ -8,6 +8,8 @@
 
 from langchain.embeddings.base import init_embeddings
 
+from intugle.core import settings
+
 
 class EmbeddingsType(str, Enum):
     DENSE = "dense"
@@ -30,7 +30,10 @@ def __init__(
         embeddings_size: Optional[int] = None,
     ):
         self.model_name = model_name
-        self.model = init_embeddings(model_name, **config)
+        if settings.CUSTOM_EMBEDDINGS_INSTANCE:
+            self.model = settings.CUSTOM_EMBEDDINGS_INSTANCE
+        else:
+            self.model = init_embeddings(model_name, **config)
         self._embed_func: Dict[EmbeddingsType, Callable] = {
             EmbeddingsType.DENSE: self.dense,
             EmbeddingsType.SPARSE: self.sparse,

src/intugle/core/settings.py

Lines changed: 3 additions & 1 deletion

@@ -4,7 +4,7 @@
 
 from functools import lru_cache
 from pathlib import Path
-from typing import Optional
+from typing import Any, Optional
 
 from dotenv import load_dotenv
 from pydantic_settings import BaseSettings, SettingsConfigDict
@@ -66,6 +66,8 @@ class Settings(BaseSettings):
     MAX_RETRIES: int = 5
     SLEEP_TIME: int = 25
     ENABLE_RATE_LIMITER: bool = False
+    CUSTOM_LLM_INSTANCE: Optional[Any] = None
+    CUSTOM_EMBEDDINGS_INSTANCE: Optional[Any] = None
 
     # LP
     HALLUCINATIONS_MAX_RETRY: int = 2
