Commit 5ee1164

Merge pull request #61 from Intugle/docs/connector-implentation

DOCS: How to add a connector

2 parents 2950740 + ee5f0d4

File tree

6 files changed (+272, -7 lines)
Lines changed: 203 additions & 0 deletions
---
sidebar_position: 4
---

# Implementing a Connector

:::tip Pro Tip: Use an AI Coding Assistant
The fastest way to implement a new adapter is to use an AI coding assistant like the **Gemini CLI**, **Cursor**, or **Claude**.

1. **Provide Context:** Give the assistant the code for an existing, similar adapter (e.g., `SnowflakeAdapter` or `DatabricksAdapter`).
2. **State Your Goal:** Ask it to replicate the structure and logic for your new data source. For example: *"Using the Snowflake adapter as a reference, create a new adapter for MyConnector."*
3. **Iterate:** The assistant can generate the boilerplate code for the models, the adapter class, and the registration functions, allowing you to focus on the specific implementation details for your database driver.
:::

`intugle` is designed to be extensible, allowing you to connect to any data source by creating a custom adapter. This guide walks you through the process of building your own connector.

If you build a connector that could benefit the community, we strongly encourage you to [open a pull request and contribute it](https://github.com/Intugle/data-tools/blob/main/CONTRIBUTING.md) to the `intugle` project!

## Overview

An adapter is a Python class that inherits from `intugle.adapters.adapter.Adapter` and implements a set of methods for interacting with a specific data source. It handles everything from connecting to the database to profiling data and executing queries.

The core steps to create a new connector are:

1. **Create the Scaffolding:** Set up the necessary directory and files.
2. **Define Configuration Models:** Create Pydantic models for your connector's configuration.
3. **Implement the Adapter Class:** Write the logic to interact with your data source.
4. **Register the Adapter:** Make your new adapter discoverable by the `intugle` factory.

## Step 1: Create the Scaffolding

First, create a new directory for your connector within the `src/intugle/adapters/types/` directory. For a connector named `myconnector`, you would create:

```
src/intugle/adapters/types/myconnector/
├── __init__.py
├── models.py
└── myconnector.py
```

- `__init__.py`: Can be an empty file.
- `models.py`: Will contain the Pydantic configuration models.
- `myconnector.py`: Will contain the main adapter class logic.

## Step 2: Define Configuration Models

In `src/intugle/adapters/types/myconnector/models.py`, you need to define two Pydantic models:

1. **Connection Config:** Defines the parameters needed to connect to your data source (e.g., host, user, password). This is the format that is read from `profiles.yml`.
2. **Data Config:** Defines how to identify a specific table or asset from that source. This is the format used to pass datasets into the `SemanticModel`.

**Example `models.py`:**

```python
from typing import Optional

from intugle.common.schema import SchemaBase

class MyConnectorConnectionConfig(SchemaBase):
    host: str
    port: int
    user: str
    password: str
    schema: Optional[str] = None

class MyConnectorConfig(SchemaBase):
    identifier: str
    type: str = "myconnector"
```
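
Since the connection config is read from `profiles.yml` (the adapter in Step 3 looks it up under the `myconnector` key), the corresponding entry would look roughly like the sketch below. The exact nesting of `profiles.yml` is an assumption here; check an existing adapter's documentation for the precise layout:

```yaml
# Hypothetical profiles.yml entry matching MyConnectorConnectionConfig
myconnector:
  host: db.example.com
  port: 5432
  user: analyst
  password: "${MYCONNECTOR_PASSWORD}"
  schema: analytics
```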

Finally, open `src/intugle/adapters/models.py` and add your new `MyConnectorConfig` to the `DataSetData` type hint:

```python
# src/intugle/adapters/models.py

# ... other imports
from intugle.adapters.types.myconnector.models import MyConnectorConfig

DataSetData = pd.DataFrame | DuckdbConfig | ... | MyConnectorConfig
```
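
Both config models behave like ordinary Pydantic models, which is what makes the `model_validate` calls in the adapter and the `can_handle` check work. The following is a self-contained sketch using plain `pydantic.BaseModel` as a stand-in for intugle's `SchemaBase`:

```python
from pydantic import BaseModel, ValidationError


# Stand-in for the data config above; SchemaBase is replaced by BaseModel here
class MyConnectorConfig(BaseModel):
    identifier: str
    type: str = "myconnector"


# A raw dict (e.g. parsed from a project file) validates into a typed config
config = MyConnectorConfig.model_validate({"identifier": "analytics.orders"})
print(config.type)  # defaults to "myconnector"

# Invalid payloads raise ValidationError, which can_handle-style checks rely on
try:
    MyConnectorConfig.model_validate({"type": "myconnector"})  # missing identifier
except ValidationError:
    print("rejected")
```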

## Step 3: Implement the Adapter Class

In `src/intugle/adapters/types/myconnector/myconnector.py`, create your adapter class. It must inherit from `Adapter` and implement its abstract methods.

This is a simplified skeleton. You can look at the `DatabricksAdapter` or `SnowflakeAdapter` for a more complete example.

**Example `myconnector.py`:**

```python
from typing import Any, Optional

import pandas as pd

from intugle.adapters.adapter import Adapter
from intugle.adapters.factory import AdapterFactory
from intugle.adapters.models import ColumnProfile, DataSetData, ProfilingOutput
from intugle.core import settings

from .models import MyConnectorConfig, MyConnectorConnectionConfig

# Import your database driver
# import myconnector_driver


class MyConnectorAdapter(Adapter):
    def __init__(self):
        # Initialize your connection here
        connection_params = settings.PROFILES.get("myconnector", {})
        config = MyConnectorConnectionConfig.model_validate(connection_params)
        # self.connection = myconnector_driver.connect(**config.model_dump())

    # --- Must be implemented ---

    def profile(self, data: Any, table_name: str) -> ProfilingOutput:
        # Return table-level metadata: row count, column names, and dtypes
        raise NotImplementedError()

    def column_profile(self, data: Any, table_name: str, column_name: str, total_count: int) -> Optional[ColumnProfile]:
        # Return column-level statistics: null count, distinct count, samples, etc.
        raise NotImplementedError()

    def execute(self, query: str):
        # Execute a query and return the raw results
        raise NotImplementedError()

    def to_df_from_query(self, query: str) -> pd.DataFrame:
        # Execute a query and return the result as a pandas DataFrame
        raise NotImplementedError()

    def create_table_from_query(self, table_name: str, query: str) -> str:
        # Materialize a query as a new table or view
        raise NotImplementedError()

    def create_new_config_from_etl(self, etl_name: str) -> DataSetData:
        # Return a new MyConnectorConfig for a materialized table
        return MyConnectorConfig(identifier=etl_name)

    def intersect_count(self, table1: "DataSet", column1_name: str, table2: "DataSet", column2_name: str) -> int:
        # Calculate the count of intersecting values between two columns
        raise NotImplementedError()

    # --- Other required methods ---

    def load(self, data: Any, table_name: str):
        # For database adapters, this is often a no-op
        pass

    def to_df(self, data: DataSetData, table_name: str):
        # Read an entire table into a pandas DataFrame
        config = MyConnectorConfig.model_validate(data)
        return self.to_df_from_query(f"SELECT * FROM {config.identifier}")

    def get_details(self, data: DataSetData):
        config = MyConnectorConfig.model_validate(data)
        return config.model_dump()
```
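
To make the abstract methods more concrete: `intersect_count` can usually be pushed down to the source as a single query. The sketch below builds one plausible query shape and exercises it against an in-memory SQLite database; the exact dialect and identifier quoting are assumptions to adapt for your driver:

```python
import sqlite3


def build_intersect_count_query(table1: str, column1: str, table2: str, column2: str) -> str:
    # Count distinct values that appear in both columns. INTERSECT is widely
    # supported, but verify the syntax against your database's dialect.
    return (
        f"SELECT COUNT(*) FROM ("
        f"SELECT DISTINCT {column1} AS v FROM {table1} "
        f"INTERSECT "
        f"SELECT DISTINCT {column2} AS v FROM {table2})"
    )


# Exercise the generated SQL against an in-memory SQLite database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer_id INTEGER)")
con.execute("CREATE TABLE customers (id INTEGER)")
con.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (2,), (5,)])
con.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,), (3,)])

query = build_intersect_count_query("orders", "customer_id", "customers", "id")
count = con.execute(query).fetchone()[0]
print(count)  # values 1 and 2 appear in both columns -> 2
```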

## Step 4: Register the Adapter

To make `intugle` aware of your new adapter, you must register it with the factory.

1. **Add registration functions to `myconnector.py`:** At the bottom of your adapter file, add two functions: one to check if the adapter can handle a given data config, and one to register it with the factory.

   ```python
   # In src/intugle/adapters/types/myconnector/myconnector.py

   def can_handle_myconnector(df: Any) -> bool:
       try:
           MyConnectorConfig.model_validate(df)
           return True
       except Exception:
           return False

   def register(factory: AdapterFactory):
       # Check if the required driver is installed
       # if MYCONNECTOR_DRIVER_AVAILABLE:
       factory.register("myconnector", can_handle_myconnector, MyConnectorAdapter)
   ```

2. **Add the adapter to the default plugins list:** Open `src/intugle/adapters/factory.py` and add the path to your new adapter module.

   ```python
   # In src/intugle/adapters/factory.py

   DEFAULT_PLUGINS = [
       "intugle.adapters.types.pandas.pandas",
       # ... other adapters
       "intugle.adapters.types.myconnector.myconnector",
   ]
   ```
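
The dispatch that makes this registration useful can be pictured as a first-match scan over the registered `can_handle` predicates. The following stdlib-only sketch is not intugle's actual `AdapterFactory` implementation; it only illustrates why the predicate and the adapter class are registered together:

```python
from typing import Any, Callable


class ToyAdapterFactory:
    """Illustrative stand-in for intugle's AdapterFactory."""

    def __init__(self):
        self._registry: list[tuple[str, Callable[[Any], bool], type]] = []

    def register(self, name: str, can_handle: Callable[[Any], bool], adapter_cls: type) -> None:
        self._registry.append((name, can_handle, adapter_cls))

    def resolve(self, data: Any) -> str:
        # Return the name of the first adapter whose predicate accepts the data
        for name, can_handle, _adapter_cls in self._registry:
            if can_handle(data):
                return name
        raise ValueError("No registered adapter can handle this data config")


def can_handle_myconnector(data: Any) -> bool:
    # Simplified predicate; the real one validates against MyConnectorConfig
    return isinstance(data, dict) and data.get("type") == "myconnector"


factory = ToyAdapterFactory()
factory.register("myconnector", can_handle_myconnector, object)
print(factory.resolve({"identifier": "orders", "type": "myconnector"}))  # myconnector
```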

## Step 5: Add Optional Dependencies

If your adapter requires a specific driver library (like `databricks-sql-connector` for Databricks), you should add it as an optional dependency.

1. Open the `pyproject.toml` file at the root of the project.
2. Add a new extra under the `[project.optional-dependencies]` section.

```toml
# In pyproject.toml

[project.optional-dependencies]
# ... other dependencies
myconnector = ["myconnector-driver-library>=1.0.0"]
```

This allows users to install the necessary libraries by running `pip install "intugle[myconnector]"`.
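
Once installed and registered, the data config from Step 2 is what you pass for each dataset. The `SemanticModel` call below is left commented out because its exact signature is an assumption; the dict shape is the part defined by `MyConnectorConfig`:

```python
# Each dataset is described by a MyConnectorConfig-shaped dict
datasets = {
    "orders": {"identifier": "analytics.orders", "type": "myconnector"},
    "customers": {"identifier": "analytics.customers", "type": "myconnector"},
}

# Hypothetical usage; verify the SemanticModel API against the intugle docs
# from intugle import SemanticModel
# sm = SemanticModel(datasets)

print(sorted(datasets))  # ['customers', 'orders']
```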

That's it! You have now implemented and registered a custom connector.

docsite/docs/core-concepts/semantic-intelligence/semantic-search.md

Lines changed: 24 additions & 0 deletions

@@ -71,6 +71,30 @@ export AZURE_OPENAI_ENDPOINT="your-azure-openai-endpoint"
 export OPENAI_API_VERSION="your-openai-api-version"
 ```
 
+#### Using a Custom Embeddings Instance
+
+If you need to use a pre-initialized embeddings model, you can directly inject the model instance.
+
+The custom model must be an instance of `langchain_core.embeddings.embeddings.Embeddings`.
+
+You can set the custom instance by modifying the `intugle.core.settings` module **before** you import and use the `SemanticModel`.
+
+**Example:**
+```python
+# main.py
+from intugle.core import settings
+
+# This must be an object that inherits from Embeddings
+my_embeddings_instance = ...
+
+# Set the custom instance in the settings
+settings.CUSTOM_EMBEDDINGS_INSTANCE = my_embeddings_instance
+
+# Now, any intugle modules imported after this point will use your custom model
+# from intugle import SemanticModel
+# ...
+```
+
 ## Usage with SemanticModel
 
 The simplest way to use semantic search is through the `SemanticModel` after the semantic model has been built.

docsite/docs/getting-started.md

Lines changed: 24 additions & 0 deletions

@@ -55,4 +55,28 @@ Here's an example of how to set these variables in your environment:
 ```bash
 export LLM_PROVIDER="openai:gpt-3.5-turbo"
 export OPENAI_API_KEY="your-openai-api-key"
+```
+
+### Using a Custom LLM Instance
+
+For environments where you need to use a pre-initialized language model, you can directly inject the model instance.
+
+The custom LLM must be an instance of `langchain_core.language_models.chat_models.BaseChatModel`.
+
+You can set the custom instance by modifying the `intugle.core.settings` module **before** you import and use any `intugle` classes.
+
+**Example:**
+```python
+# main.py
+from intugle.core import settings
+
+# This must be an object that inherits from BaseChatModel
+my_llm_instance = ...
+
+# Set the custom instance in the settings
+settings.CUSTOM_LLM_INSTANCE = my_llm_instance
+
+# Now, any intugle modules imported after this point will use your custom LLM
+
+# ... rest of your code
 ```

src/intugle/core/llms/chat.py

Lines changed: 12 additions & 5 deletions

@@ -1,6 +1,6 @@
 import logging
 
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Optional
 
 from langchain.chat_models import init_chat_model
 from langchain.output_parsers import (
@@ -30,7 +30,7 @@ class ChatModelLLM:
 
     def __init__(
         self,
-        model_name: str,
+        model_name: Optional[str] = None,
         response_schemas: list[ResponseSchema] = None,
         output_parser=StructuredOutputParser,
         prompt_template=ChatPromptTemplate,
@@ -39,9 +39,14 @@ def __init__(
         *args,
         **kwargs,
    ):
-        self.model: BaseChatModel = init_chat_model(
-            model_name, max_retries=self.MAX_RETRIES, rate_limiter=self._get_rate_limiter(), **config
-        )  # llm model
+        if settings.CUSTOM_LLM_INSTANCE:
+            self.model: "BaseChatModel" = settings.CUSTOM_LLM_INSTANCE
+        elif model_name:
+            self.model: "BaseChatModel" = init_chat_model(
+                model_name, max_retries=self.MAX_RETRIES, rate_limiter=self._get_rate_limiter(), **config
+            )
+        else:
+            raise ValueError("Either 'settings.CUSTOM_LLM_INSTANCE' must be set or 'LLM_PROVIDER' must be provided.")
 
         self.parser: StructuredOutputParser = output_parser  # the output parser
 
@@ -135,6 +140,8 @@ def invoke(self, *args, **kwargs):
 
     @classmethod
     def get_llm(cls, model_name: str, llm_config: dict = {}):
+        if settings.CUSTOM_LLM_INSTANCE:
+            return settings.CUSTOM_LLM_INSTANCE
         return init_chat_model(
             model_name, max_retries=cls.MAX_RETRIES, rate_limiter=cls._get_rate_limiter(), **llm_config
         )

src/intugle/core/llms/embeddings.py

Lines changed: 6 additions & 1 deletion

@@ -8,6 +8,8 @@
 
 from langchain.embeddings.base import init_embeddings
 
+from intugle.core import settings
+
 
 class EmbeddingsType(str, Enum):
     DENSE = "dense"
@@ -30,7 +30,10 @@ def __init__(
         embeddings_size: Optional[int] = None,
     ):
         self.model_name = model_name
-        self.model = init_embeddings(model_name, **config)
+        if settings.CUSTOM_EMBEDDINGS_INSTANCE:
+            self.model = settings.CUSTOM_EMBEDDINGS_INSTANCE
+        else:
+            self.model = init_embeddings(model_name, **config)
         self._embed_func: Dict[EmbeddingsType, Callable] = {
             EmbeddingsType.DENSE: self.dense,
             EmbeddingsType.SPARSE: self.sparse,

src/intugle/core/settings.py

Lines changed: 3 additions & 1 deletion

@@ -4,7 +4,7 @@
 
 from functools import lru_cache
 from pathlib import Path
-from typing import Optional
+from typing import Any, Optional
 
 from dotenv import load_dotenv
 from pydantic_settings import BaseSettings, SettingsConfigDict
@@ -66,6 +66,8 @@ class Settings(BaseSettings):
     MAX_RETRIES: int = 5
     SLEEP_TIME: int = 25
     ENABLE_RATE_LIMITER: bool = False
+    CUSTOM_LLM_INSTANCE: Optional[Any] = None
+    CUSTOM_EMBEDDINGS_INSTANCE: Optional[Any] = None
 
     # LP
     HALLUCINATIONS_MAX_RETRY: int = 2
