
Commit bfc1468

Merge pull request #111 from Intugle/feature/postgres-connector
Feature/postgres connector
2 parents: 2c16c6e + e490981 (commit bfc1468)

33 files changed: +951 -237 lines

docsite/docs/connectors/implementing-a-connector.md

Lines changed: 63 additions & 15 deletions
@@ -7,8 +7,8 @@ sidebar_position: 4
 :::tip Pro Tip: Use an AI Coding Assistant
 The fastest way to implement a new adapter is to use an AI coding assistant like the **Gemini CLI**, **Cursor**, or **Claude**.
 
-1. **Provide Context:** Give the assistant the code for an existing, similar adapter (e.g., `SnowflakeAdapter` or `DatabricksAdapter`).
-2. **State Your Goal:** Ask it to replicate the structure and logic for your new data source. For example: *"Using the Snowflake adapter as a reference, create a new adapter for MyConnector."*
+1. **Provide Context:** Give the assistant the code for an existing, similar adapter (e.g., `PostgresAdapter` or `DatabricksAdapter`).
+2. **State Your Goal:** Ask it to replicate the structure and logic for your new data source. For example: *"Using the Postgres adapter as a reference, create a new adapter for Redshift."*
 3. **Iterate:** The assistant can generate the boilerplate code for the models, the adapter class, and the registration functions, allowing you to focus on the specific implementation details for your database driver.
 :::
 
@@ -25,6 +25,7 @@ The core steps to create a new connector are:
 2. **Define Configuration Models:** Create Pydantic models for your connector's configuration.
 3. **Implement the Adapter Class:** Write the logic to interact with your data source.
 4. **Register the Adapter:** Make your new adapter discoverable by the `intugle` factory.
+5. **Add Optional Dependencies:** Declare the necessary driver libraries.
 
 ## Step 1: Create the Scaffolding
 
@@ -45,8 +46,8 @@ src/intugle/adapters/types/myconnector/
 
 In `src/intugle/adapters/types/myconnector/models.py`, you need to define two Pydantic models:
 
-1. **Connection Config:** Defines the parameters needed to connect to your data source (e.g., host, user, password). This will be the format that will be picked up from the profiles.yml
-2. **Data Config:** Defines how to identify a specific table or asset from that source. This will be the format that will be used to pass the datasets into the SemanticModel
+1. **Connection Config:** Defines the parameters needed to connect to your data source (e.g., host, user, password). This is the structure that will be read from `profiles.yml`.
+2. **Data Config:** Defines how to identify a specific table or asset from that source. This is the structure used when passing datasets into the `SemanticModel`.
 
 **Example `models.py`:**
 ```python
@@ -58,37 +59,40 @@ class MyConnectorConnectionConfig(SchemaBase):
     port: int
     user: str
     password: str
+    database: str
     schema: Optional[str] = None
 
 class MyConnectorConfig(SchemaBase):
     identifier: str
     type: str = "myconnector"
 ```
 
-Finally, open `src/intugle/adapters/models.py` and add your new `MyConnectorConfig` to the `DataSetData` type hint:
+Finally, open `src/intugle/adapters/models.py` and add your new `MyConnectorConfig` to the `DataSetData` type hint. This is for static type checking and improves developer experience.
 
 ```python
 # src/intugle/adapters/models.py
 
 # ... other imports
 from intugle.adapters.types.myconnector.models import MyConnectorConfig
 
-DataSetData = pd.DataFrame | DuckdbConfig | ... | MyConnectorConfig
+if TYPE_CHECKING:
+    # ... other configs
+    DataSetData = pd.DataFrame | ... | MyConnectorConfig
 ```
 
 ## Step 3: Implement the Adapter Class
 
 In `src/intugle/adapters/types/myconnector/myconnector.py`, create your adapter class. It must inherit from `Adapter` and implement its abstract methods.
 
-This is a simplified skeleton. You can look at the `DatabricksAdapter` or `SnowflakeAdapter` for a more complete example.
+This is a simplified skeleton. Refer to the `PostgresAdapter` or `DatabricksAdapter` for a complete example.
 
 **Example `myconnector.py`:**
 ```python
 from typing import Any, Optional
 import pandas as pd
 from intugle.adapters.adapter import Adapter
 from intugle.adapters.factory import AdapterFactory
-from intugle.adapters.models import ColumnProfile, ProfilingOutput
+from intugle.adapters.models import ColumnProfile, ProfilingOutput, DataSetData
 from .models import MyConnectorConfig, MyConnectorConnectionConfig
 from intugle.core import settings
 
@@ -101,15 +105,30 @@ class MyConnectorAdapter(Adapter):
         connection_params = settings.PROFILES.get("myconnector", {})
         config = MyConnectorConnectionConfig.model_validate(connection_params)
         # self.connection = myconnector_driver.connect(**config.model_dump())
+        self._database = config.database
+        self._schema = config.schema
         pass
 
-    # --- Must be implemented ---
+    # --- Properties ---
+    @property
+    def database(self) -> Optional[str]:
+        return self._database
+
+    @property
+    def schema(self) -> Optional[str]:
+        return self._schema
+
+    @property
+    def source_name(self) -> str:
+        return "my_connector_source"
+
+    # --- Abstract Method Implementations ---
 
     def profile(self, data: Any, table_name: str) -> ProfilingOutput:
         # Return table-level metadata: row count, column names, and dtypes
         raise NotImplementedError()
 
-    def column_profile(self, data: Any, table_name: str, column_name: str, total_count: int) -> Optional[ColumnProfile]:
+    def column_profile(self, data: Any, table_name: str, column_name: str, total_count: int, **kwargs) -> Optional[ColumnProfile]:
        # Return column-level statistics: null count, distinct count, samples, etc.
        raise NotImplementedError()
 
@@ -121,7 +140,7 @@ class MyConnectorAdapter(Adapter):
         # Execute a query and return the result as a pandas DataFrame
         raise NotImplementedError()
 
-    def create_table_from_query(self, table_name: str, query: str) -> str:
+    def create_table_from_query(self, table_name: str, query: str, materialize: str = "view", **kwargs) -> str:
         # Materialize a query as a new table or view
         raise NotImplementedError()
 
@@ -132,8 +151,6 @@ class MyConnectorAdapter(Adapter):
     def intersect_count(self, table1: "DataSet", column1_name: str, table2: "DataSet", column2_name: str) -> int:
         # Calculate the count of intersecting values between two columns
         raise NotImplementedError()
-
-    # --- Other required methods ---
 
     def load(self, data: Any, table_name: str):
         # For database adapters, this is often a no-op
@@ -168,7 +185,7 @@ To make `intugle` aware of your new adapter, you must register it with the facto
 def register(factory: AdapterFactory):
     # Check if the required driver is installed
     # if MYCONNECTOR_DRIVER_AVAILABLE:
-    factory.register("myconnector", can_handle_myconnector, MyConnectorAdapter)
+    factory.register("myconnector", can_handle_myconnector, MyConnectorAdapter, MyConnectorConfig)
 ```
 
 2. **Add the adapter to the default plugins list:** Open `src/intugle/adapters/factory.py` and add the path to your new adapter module.
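The commented-out `MYCONNECTOR_DRIVER_AVAILABLE` check in the hunk above is typically backed by a guarded import, so that `intugle` still loads when the optional driver is missing. A minimal sketch of that pattern, assuming a hypothetical `myconnector_driver` package (the flag and module names are illustrative, not part of this commit):

```python
# Sketch only: gate registration on the optional driver actually being importable.
# "myconnector_driver" is a placeholder name, not a real dependency of this repo.
try:
    import myconnector_driver  # noqa: F401

    MYCONNECTOR_DRIVER_AVAILABLE = True
except ImportError:
    MYCONNECTOR_DRIVER_AVAILABLE = False


def register(factory):
    # The names below (can_handle_myconnector, MyConnectorAdapter, MyConnectorConfig)
    # come from the documentation skeleton above.
    if MYCONNECTOR_DRIVER_AVAILABLE:
        factory.register("myconnector", can_handle_myconnector, MyConnectorAdapter, MyConnectorConfig)
```

Pairing a guard like this with the extra declared in Step 5 keeps the core package importable even when the driver is not installed.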
@@ -185,7 +202,7 @@ To make `intugle` aware of your new adapter, you must register it with the facto
 
 ## Step 5: Add Optional Dependencies
 
-If your adapter requires a specific driver library (like `databricks-sql-connector` for Databricks), you should add it as an optional dependency.
+If your adapter requires a specific driver library (like `asyncpg` for Postgres), you should add it as an optional dependency.
 
 1. Open the `pyproject.toml` file at the root of the project.
 2. Add a new extra under the `[project.optional-dependencies]` section.
@@ -200,4 +217,35 @@ If your adapter requires a specific driver library (like `databricks-sql-connect
 
 This allows users to install the necessary libraries by running `pip install "intugle[myconnector]"`.
 
+## Best Practices and Considerations
+
+When implementing your adapter, keep the following points in mind to ensure it is robust, secure, and efficient.
+
+### Handling Database Objects
+Your adapter should be able to interact with different types of database objects, not just tables.
+- **Tables, Views, and Materialized Views:** Ensure your `profile` method can read and `create_table_from_query` method can handle creating these different object types. The `materialize` parameter can be used to control this behavior. For example, the Postgres adapter supports `table`, `view`, and `materialized_view`.
+- **Identifier Quoting:** Always wrap table and column identifiers in quotes (e.g., `"` for Postgres and Snowflake) to handle special characters, spaces, and case-sensitivity correctly.
+
+### Secure Query Execution
+- **Parameterized Queries:** To prevent SQL injection vulnerabilities, always use parameterized queries when user-provided values are part of a SQL statement. Most database drivers provide a safe way to pass parameters (e.g., using `?` or `$1` placeholders) instead of formatting them directly into the query string.
+
+**Do this:**
+```python
+# Example with asyncpg
+await connection.fetch("SELECT * FROM users WHERE id = $1", user_id)
+```
+
+**Avoid this:**
+```python
+# Unsafe - vulnerable to SQL injection
+await connection.fetch(f"SELECT * FROM users WHERE id = {user_id}")
+```
+
+### Stability and Error Handling
+- **Network Errors and Timeouts:** Implement timeouts for both establishing connections and executing queries. This prevents your application from hanging indefinitely if the database is unresponsive. Your chosen database driver should provide options for setting these timeouts.
+- **Graceful Error Handling:** Wrap database calls in `try...except` blocks to catch potential exceptions (e.g., connection errors, permission denied) and provide clear, informative error messages to the user.
+
+### Atomicity
+- **Transactions:** For operations that involve multiple SQL statements (like dropping and then recreating a table), wrap them in a transaction. This ensures that the entire operation is atomic—it either completes successfully, or all changes are rolled back if an error occurs, preventing the database from being left in an inconsistent state.
+
 That's it! You have now implemented and registered a custom connector.
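The timeout, error-handling, and transaction bullets added above are prose-only; here is a rough sketch of how they might look together with `asyncpg` (the connection details, table names, and overall structure are illustrative assumptions, not code from this PR):

```python
import asyncio

import asyncpg


async def rebuild_top_users(user_id: int) -> None:
    # Connection timeout so an unreachable host fails fast instead of hanging.
    conn = await asyncpg.connect(
        host="localhost", port=5432, user="intugle",
        password="secret", database="analytics", timeout=10,
    )
    try:
        # Parameterized query ($1) for untrusted input, plus a per-query timeout.
        rows = await conn.fetch(
            'SELECT * FROM "users" WHERE id = $1', user_id, timeout=30
        )
        print(f"fetched {len(rows)} row(s)")

        # Drop-and-recreate wrapped in a transaction: either both statements
        # succeed or both roll back, so the schema is never left half-updated.
        async with conn.transaction():
            await conn.execute('DROP VIEW IF EXISTS "top_users"')
            await conn.execute(
                'CREATE VIEW "top_users" AS SELECT id, name FROM "users" LIMIT 10'
            )
    except (asyncpg.PostgresError, OSError) as exc:
        # Graceful error handling: surface a clear message instead of a bare traceback.
        print(f"Database operation failed: {exc}")
    finally:
        await conn.close()


if __name__ == "__main__":
    asyncio.run(rebuild_top_users(42))
```

Quoting `"users"` and `"top_users"` also illustrates the identifier-quoting point from the same section.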
Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
+---
+sidebar_position: 3
+---
+
+# Postgres
+
+`intugle` integrates with PostgreSQL, allowing you to read data from your tables, views, and materialized views, and deploy your `SemanticModel` by setting constraints and comments directly in your PostgreSQL database.
+
+## Installation
+
+To use `intugle` with PostgreSQL, you must install the optional dependencies:
+
+```bash
+pip install "intugle[postgres]"
+```
+
+This installs the `asyncpg` and `sqlglot` libraries.
+
+## Configuration
+
+To connect to your PostgreSQL database, you must provide connection credentials in a `profiles.yml` file at the root of your project. The adapter looks for a top-level `postgres:` key.
+
+**Example `profiles.yml`:**
+
+```yaml
+postgres:
+  host: <your_postgres_host>
+  port: 5432 # Default PostgreSQL port
+  user: <your_username>
+  password: <your_password>
+  database: <your_database_name>
+  schema: <your_schema_name>
+```
+
+## Usage
+
+### Reading Data from PostgreSQL
+
+To include a PostgreSQL table, view, or materialized view in your `SemanticModel`, define it in your input dictionary with `type: "postgres"` and use the `identifier` key to specify the object name.
+
+:::caution Important
+The dictionary key for your dataset (e.g., `"CUSTOMERS"`) must exactly match the table, view, or materialized view name specified in the `identifier`.
+:::
+
+```python
+from intugle import SemanticModel
+
+datasets = {
+    "CUSTOMERS": {
+        "identifier": "CUSTOMERS", # Must match the key above
+        "type": "postgres"
+    },
+    "ORDERS_VIEW": {
+        "identifier": "ORDERS_VIEW", # Can be a view
+        "type": "postgres"
+    },
+    "PRODUCT_MV": {
+        "identifier": "PRODUCT_MV", # Can be a materialized view
+        "type": "postgres"
+    }
+}
+
+# Initialize the semantic model
+sm = SemanticModel(datasets, domain="E-commerce")
+
+# Build the model as usual
+sm.build()
+```
+
+### Materializing Data Products
+
+When you use the `DataProduct` class with a PostgreSQL connection, the resulting data product can be materialized as a new **table**, **view**, or **materialized view** directly within your target schema.
+
+```python
+from intugle import DataProduct
+
+etl_model = {
+    "name": "top_customers",
+    "fields": [
+        {"id": "CUSTOMERS.customer_id", "name": "customer_id"},
+        {"id": "CUSTOMERS.name", "name": "customer_name"},
+    ]
+}
+
+dp = DataProduct()
+
+# Materialize as a view (default)
+dp.build(etl_model, materialize="view")
+
+# Materialize as a table
+dp.build(etl_model, materialize="table")
+
+# Materialize as a materialized view
+dp.build(etl_model, materialize="materialized_view")
+```
+
+### Deploying the Semantic Model
+
+Once your semantic model is built, you can deploy it to PostgreSQL using the `deploy()` method. This process syncs your model's intelligence to your physical tables by:
+1. **Syncing Metadata:** It updates the comments on your physical PostgreSQL tables and columns with the business glossaries from your `intugle` model.
+2. **Setting Constraints:** It sets `PRIMARY KEY` and `FOREIGN KEY` constraints on your tables based on the relationships discovered in the model.
+
+```python
+# Deploy the model to PostgreSQL
+sm.deploy(target="postgres")
+
+# You can also control which parts of the deployment to run
+sm.deploy(
+    target="postgres",
+    sync_glossary=True,
+    set_primary_keys=True,
+    set_foreign_keys=True
+)
+```
+
+:::info Required Permissions
+To successfully deploy a semantic model, the PostgreSQL user you are using must have the following privileges:
+* `USAGE` on the target schema.
+* `CREATE TABLE`, `CREATE VIEW`, `CREATE MATERIALIZED VIEW` on the target schema.
+* `COMMENT` privilege on tables and columns.
+* `ALTER TABLE` to add primary and foreign key constraints.
+:::
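Because `deploy()` writes ordinary PostgreSQL objects (comments and key constraints), you can verify the result with standard catalog queries. A quick sketch using `asyncpg` with the same credentials as `profiles.yml` (the connection values and table name are placeholders, not part of the documented API):

```python
import asyncio

import asyncpg


async def show_deployed_metadata(schema: str = "public") -> None:
    conn = await asyncpg.connect(
        host="localhost", port=5432, user="intugle",
        password="secret", database="analytics",
    )
    try:
        # PRIMARY KEY / FOREIGN KEY constraints set by deploy() appear here.
        rows = await conn.fetch(
            """
            SELECT table_name, constraint_type, constraint_name
            FROM information_schema.table_constraints
            WHERE table_schema = $1
              AND constraint_type IN ('PRIMARY KEY', 'FOREIGN KEY')
            ORDER BY table_name
            """,
            schema,
        )
        for row in rows:
            print(row["table_name"], row["constraint_type"], row["constraint_name"])

        # Table comments written during glossary sync are readable via obj_description().
        comment = await conn.fetchval(
            "SELECT obj_description(to_regclass($1), 'pg_class')",
            f'{schema}."CUSTOMERS"',  # keep the quotes if the table name is case-sensitive
        )
        print("CUSTOMERS comment:", comment)
    finally:
        await conn.close()


if __name__ == "__main__":
    asyncio.run(show_deployed_metadata("public"))
```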

docsite/docs/core-concepts/semantic-intelligence/dataset.md

Lines changed: 15 additions & 4 deletions
@@ -25,13 +25,18 @@ Currently, the library supports `csv`, `parquet`, and `excel` files. More integr
 
 ### Centralized metadata
 
-All analysis results for a data source are stored within the `dataset.source_table_model` attribute. This attribute is a structured Pydantic model that makes accessing metadata predictable and easy. For more convenient access to column-level data, the `DataSet` also provides a `columns` dictionary.
+All analysis results for a data source are stored within the `dataset.source` attribute. This attribute is a structured Pydantic model that contains the `Source` object, which in turn holds the `SourceTables` model. This makes accessing metadata predictable and easy. For more convenient access to column-level data, the `DataSet` also provides a `columns` dictionary.
 
 #### Metadata structure and access
 
 The library organizes metadata using Pydantic models, but you can access it through the `DataSet`'s attributes.
 
-- **Table-Level Metadata**: Accessed via `dataset.source_table_model`.
+- **Source-Level Metadata**: Accessed via `dataset.source`.
+  - `.name: str`
+  - `.description: str`
+  - `.schema: str`
+  - `.database: str`
+- **Table-Level Metadata**: Accessed via `dataset.source.table`.
   - `.name: str`
   - `.description: str`
   - `.key: Optional[str]`
@@ -55,9 +60,15 @@ The library organizes metadata using Pydantic models, but you can access it thro
 # Assuming 'sm' is a built SemanticModel instance
 customers_dataset = sm.datasets['customers']
 
+# Access source-level metadata
+print(f"Source Name: {customers_dataset.source.name}")
+print(f"Database: {customers_dataset.source.database}")
+print(f"Schema: {customers_dataset.source.schema}")
+
 # Access table-level metadata
-print(f"Table Name: {customers_dataset.source_table_model.name}")
-print(f"Primary Key: {customers_dataset.source_table_model.key}")
+print(f"Table Name: {customers_dataset.source.table.name}")
+print(f"Table Description: {customers_dataset.source.table.description}")
+print(f"Primary Key: {customers_dataset.source.table.key}")
 
 # Access column-level metadata using the 'columns' dictionary
 email_column = customers_dataset.columns['email']
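Since every `DataSet` exposes the same `source` / `source.table` structure described above, the metadata can also be walked in a loop. A small sketch, assuming `sm` is a built `SemanticModel` as in the example:

```python
# Print source- and table-level metadata for every dataset in the model.
for name, dataset in sm.datasets.items():
    source = dataset.source
    print(f"{name}: database={source.database}, schema={source.schema}")
    print(f"  table={source.table.name}, key={source.table.key}")
    print(f"  description={source.table.description}")
```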

docsite/docs/core-concepts/semantic-intelligence/semantic-model.md

Lines changed: 1 addition & 1 deletion
@@ -131,7 +131,7 @@ customers_dataset = sm.datasets['customers']
 link_predictor = sm.link_predictor
 
 # Now you can explore rich metadata or results
-print(f"Primary Key for customers: {customers_dataset.source_table_model.description}")
+print(f"Primary Key for customers: {customers_dataset.source.table.description}")
 print("Discovered Links:")
 print(link_predictor.get_links_df())
 ```

pyproject.toml

Lines changed: 7 additions & 1 deletion
@@ -56,13 +56,18 @@ dependencies = [
 
 [project.optional-dependencies]
 snowflake = [
-    "snowflake-snowpark-python[pandas]>=1.12.0"
+    "snowflake-snowpark-python[pandas]>=1.12.0",
+    "sqlglot>=27.20.0",
 ]
 databricks = [
     "databricks-sql-connector>=4.1.3",
     "pyspark>=3.5.0",
     "sqlglot>=27.20.0",
 ]
+postgres = [
+    "asyncpg>=0.30.0",
+    "sqlglot>=27.20.0",
+]
 
 
 [project.urls]
@@ -84,6 +89,7 @@ test = [
 ]
 lint = ["ruff"]
 dev = [
+    "asyncpg>=0.30.0",
     "databricks-sql-connector>=4.1.3",
     "ipykernel>=6.30.1",
     "pysonar>=1.2.0.2419",
