Commit 79bc750

fix(oracle): automatic CLOB hydration for msgspec integration (#140)
Adds automatic CLOB-to-string conversion in Oracle adapter to enable seamless msgspec/Pydantic integration. CLOB handles are now read automatically before schema conversion, eliminating the need for `DBMS_LOB.SUBSTR` workarounds.
1 parent cdc295e commit 79bc750

4 files changed: +936 -11 lines changed


AGENTS.md

Lines changed: 224 additions & 0 deletions
@@ -619,6 +619,230 @@ async def test_numpy_vector_roundtrip(oracle_session):
- Test against actual databases using the docker infrastructure
- The SQL builder API is experimental and will change significantly

## LOB (Large Object) Hydration Pattern

### Overview

Some database drivers (Oracle, PostgreSQL with large objects) return handle objects for large data types that must be explicitly read before use. SQLSpec provides automatic hydration to ensure typed schemas receive concrete Python values.

### When to Use LOB Hydration

Use LOB hydration helpers when:

- Database driver returns handle objects (LOB, AsyncLOB) instead of concrete values
- Typed schemas (msgspec, Pydantic) expect concrete types (str, bytes, dict)
- Users would otherwise need manual workarounds (`DBMS_LOB.SUBSTR`)

### Implementation Pattern

**Step 1: Create Hydration Helpers**

Add helpers in the adapter's `driver.py` to read LOB handles:

```python
from typing import Any

# `_type_converter` is assumed to be the adapter's existing type converter,
# defined elsewhere in the module.


def _coerce_sync_row_values(row: "tuple[Any, ...]") -> "list[Any]":
    """Coerce LOB handles to concrete values for synchronous execution.

    Processes each value in the row, reading LOB objects and applying
    type detection for JSON values stored in CLOBs.

    Args:
        row: Tuple of column values from database fetch.

    Returns:
        List of coerced values with LOBs read to strings/bytes.
    """
    coerced_values: list[Any] = []
    for value in row:
        if hasattr(value, "read"):  # Duck-typing for LOB detection
            try:
                processed_value = value.read()
            except Exception:
                # Fall back to the original handle if it cannot be read
                coerced_values.append(value)
                continue
            if isinstance(processed_value, str):
                processed_value = _type_converter.convert_if_detected(processed_value)
            coerced_values.append(processed_value)
        else:
            coerced_values.append(value)
    return coerced_values


async def _coerce_async_row_values(row: "tuple[Any, ...]") -> "list[Any]":
    """Coerce LOB handles to concrete values for asynchronous execution.

    Processes each value in the row, reading LOB objects asynchronously
    and applying type detection for JSON values stored in CLOBs.

    Args:
        row: Tuple of column values from database fetch.

    Returns:
        List of coerced values with LOBs read to strings/bytes.
    """
    coerced_values: list[Any] = []
    for value in row:
        if hasattr(value, "read"):
            try:
                processed_value = await _type_converter.process_lob(value)
            except Exception:
                # Fall back to the original handle if it cannot be read
                coerced_values.append(value)
                continue
            if isinstance(processed_value, str):
                processed_value = _type_converter.convert_if_detected(processed_value)
            coerced_values.append(processed_value)
        else:
            coerced_values.append(value)
    return coerced_values
```

**Step 2: Integrate into Execution Path**

Call hydration helpers before dict construction in `_execute_statement`:

```python
# Sync driver: a synchronous cursor is iterated with a plain `for` loop
for row in cursor:
    coerced = _coerce_sync_row_values(row)
    rows.append(dict(zip(columns, coerced)))

# Async driver
async for row in cursor:
    coerced = await _coerce_async_row_values(row)
    rows.append(dict(zip(columns, coerced)))
```
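
The two steps above can be tried without a database. The following is a minimal sketch, not the adapter's actual code: `FakeLob`, `hydrate_row`, and `fetch_dicts` are hypothetical names introduced for illustration, and the stand-in handle only mimics the `read()` method that the duck-typing check relies on.

```python
from typing import Any, Iterable


class FakeLob:
    """Hypothetical stand-in for a driver LOB handle; only `read()` matters here."""

    def __init__(self, payload: Any) -> None:
        self._payload = payload

    def read(self) -> Any:
        return self._payload


def hydrate_row(row: "tuple[Any, ...]") -> "list[Any]":
    """Read every value that exposes `read()`; pass other values through."""
    return [value.read() if hasattr(value, "read") else value for value in row]


def fetch_dicts(columns: "list[str]", cursor_rows: "Iterable[tuple[Any, ...]]") -> "list[dict[str, Any]]":
    """Hydrate each row before zipping it with the column names."""
    return [dict(zip(columns, hydrate_row(row))) for row in cursor_rows]


rows = fetch_dicts(
    ["id", "content"],
    [(1, FakeLob("x" * 5000)), (2, None)],  # second row has a NULL CLOB
)
```

Hydrating before the `dict(zip(...))` step means any later schema conversion only ever sees plain Python values, which is the point of placing the call in the fetch loop.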

### Key Design Principles

**Duck-Typing for LOB Detection**:

- Use `hasattr(value, "read")` to detect LOB handles
- This is appropriate duck-typing, NOT defensive programming
- Avoids importing driver-specific types

**Error Handling**:

- Catch exceptions during LOB reading
- Fall back to original value on error
- Prevents breaking queries with unexpected handle types

**Type Detection After Reading**:

- Apply `convert_if_detected()` to string results
- Enables JSON detection for JSON-in-CLOB scenarios
- Preserves binary data (bytes) without conversion

**Separation of Concerns**:

- Hydration happens at result-fetching layer
- Type conversion handled by existing type converter
- Schema conversion remains unchanged
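
A small sketch can make the error-handling and type-detection principles concrete. It is an illustration under stated assumptions: `detect_json` is a hypothetical stand-in for `convert_if_detected()` (whose real detection rules are not shown here), and `FakeLob` and `BrokenHandle` are invented test doubles.

```python
import json
from typing import Any


class FakeLob:
    """Hypothetical LOB stand-in that returns its payload from `read()`."""

    def __init__(self, payload: Any) -> None:
        self._payload = payload

    def read(self) -> Any:
        return self._payload


class BrokenHandle:
    """Hypothetical object that has `read()` but fails when it is called."""

    def read(self) -> Any:
        raise RuntimeError("handle is no longer readable")


def detect_json(text: str) -> Any:
    """Assumed stand-in for `convert_if_detected()`: parse JSON objects/arrays only."""
    stripped = text.strip()
    if stripped[:1] in ("{", "["):
        try:
            return json.loads(stripped)
        except ValueError:
            return text
    return text


def hydrate_value(value: Any) -> Any:
    if not hasattr(value, "read"):
        return value  # regular column value
    try:
        data = value.read()
    except Exception:
        return value  # fall back to the original handle on error
    # Only strings go through detection; bytes are returned untouched.
    return detect_json(data) if isinstance(data, str) else data
```

With these definitions, a JSON string read from a CLOB-like handle comes back as a `dict`, a bytes payload stays `bytes`, and an unreadable handle is returned unchanged rather than raising.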

### Testing Requirements

**Integration Tests** - Test with real database and typed schemas:

```python
import msgspec

class Article(msgspec.Struct):
    id: int
    content: str  # CLOB column

async def test_clob_msgspec_hydration(session):
    large_text = "x" * 5000  # >4KB to ensure CLOB
    await session.execute(
        "INSERT INTO articles (id, content) VALUES (:1, :2)",
        (1, large_text)
    )

    result = await session.execute(
        "SELECT id, content FROM articles WHERE id = :1",
        (1,)
    )

    article = result.get_first(schema_type=Article)
    assert isinstance(article.content, str)
    assert article.content == large_text
```

**Test Coverage Areas**:

1. Basic CLOB/text LOB hydration to string
2. BLOB/binary LOB hydration to bytes
3. JSON detection in CLOB content
4. Mixed CLOB and regular columns
5. Multiple LOB columns in one row
6. NULL/empty LOB handling
7. Both sync and async drivers

### Performance Considerations

**Memory Usage**:

- LOBs are fully materialized into memory
- Document limitations for very large LOBs (>100MB)
- Consider pagination for multi-GB LOBs

**Sync vs Async**:

- Sync uses `.read()` directly
- Async uses `await` for LOB reading
- Both approaches have equivalent performance
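
The sync and async read paths differ only in whether the result of `read()` has to be awaited. The sketch below illustrates that idea with invented stand-ins (`FakeSyncLob`, `FakeAsyncLob`) and a hypothetical `read_lob` helper; it is not the implementation of `process_lob`, which is not shown in this guide.

```python
import asyncio
import inspect
from typing import Any


class FakeSyncLob:
    """Hypothetical handle whose `read()` returns the payload directly."""

    def __init__(self, payload: Any) -> None:
        self._payload = payload

    def read(self) -> Any:
        return self._payload


class FakeAsyncLob:
    """Hypothetical handle whose `read()` is a coroutine."""

    def __init__(self, payload: Any) -> None:
        self._payload = payload

    async def read(self) -> Any:
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return self._payload


async def read_lob(value: Any) -> Any:
    """Call `read()` and await the result only when it is awaitable."""
    result = value.read()
    if inspect.isawaitable(result):
        result = await result
    return result


async def hydrate_row_async(row: "tuple[Any, ...]") -> "list[Any]":
    hydrated = []
    for value in row:
        hydrated.append(await read_lob(value) if hasattr(value, "read") else value)
    return hydrated


if __name__ == "__main__":
    print(asyncio.run(hydrate_row_async((1, FakeAsyncLob("text"), FakeSyncLob(b"\x00")))))
```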

### Examples from Existing Adapters

**Oracle CLOB Hydration** (`oracledb/driver.py`):

- Automatically reads CLOB handles to strings
- Preserves BLOB as bytes
- Enables JSON detection for JSON-in-CLOB
- No configuration required - always enabled
- Eliminates need for `DBMS_LOB.SUBSTR` workaround

### Documentation Requirements

When implementing LOB hydration:

1. **Update adapter guide** - Document new behavior and before/after comparison
2. **Add examples** - Show typed schema usage without manual workarounds
3. **Note performance** - Mention memory considerations for large LOBs
4. **Show JSON detection** - Demonstrate automatic JSON parsing in LOBs

Example documentation structure:

```markdown
## CLOB/BLOB Handling

### Automatic CLOB Hydration

CLOB values are automatically read and converted to Python strings:

[Example with msgspec]

### JSON Detection in CLOBs

[Example showing JSON parsing]

### BLOB Handling (Binary Data)

BLOB columns remain as bytes:

[Example with bytes]

### Before and After

**Before (manual workaround):**
[SQL with DBMS_LOB.SUBSTR]

**After (automatic):**
[Clean SQL without workarounds]

### Performance Considerations
- Memory usage notes
- When to use pagination
```

## driver_features Pattern

### Overview

docs/guides/adapters/oracle.md

Lines changed: 116 additions & 0 deletions
@@ -102,6 +102,122 @@ If `use_in_memory=True` but In-Memory is not available/licensed, table creation

**Recommendation:** Use `use_in_memory=False` (default) unless you have confirmed licensing and configuration.

## CLOB/BLOB Handling

Oracle's `oracledb` driver returns LOB (Large Object) handles for CLOB and BLOB columns, which must be read before use. SQLSpec **automatically reads CLOB columns into strings** to provide seamless integration with typed schemas like msgspec and Pydantic.

### Automatic CLOB Hydration

CLOB values are automatically read and converted to Python strings:

```python
from sqlspec.adapters.oracledb import OracleAsyncConfig
import msgspec

class Article(msgspec.Struct):
    id: int
    title: str
    content: str  # CLOB column automatically becomes string

config = OracleAsyncConfig(pool_config={"dsn": "oracle://..."})

async with config.provide_session() as session:
    # Insert large text content
    large_text = "x" * 5000  # >4KB content
    await session.execute(
        "INSERT INTO articles (id, title, content) VALUES (:1, :2, :3)",
        (1, "My Article", large_text)
    )

    # Query returns string, not LOB handle
    result = await session.execute(
        "SELECT id, title, content FROM articles WHERE id = :1",
        (1,)
    )

    # Works seamlessly with msgspec
    article = result.get_first(schema_type=Article)
    assert isinstance(article.content, str)
    assert len(article.content) == 5000
```

### JSON Detection in CLOBs

When CLOB content contains JSON, it is automatically detected and parsed into Python dictionaries:

```python
import json

class Document(msgspec.Struct):
    id: int
    metadata: dict  # JSON stored in CLOB

# Store JSON in CLOB
metadata = {"key": "value", "nested": {"data": "example"}}
await session.execute(
    "INSERT INTO documents (id, metadata) VALUES (:1, :2)",
    (1, json.dumps(metadata))
)

# Retrieved as parsed dict, not string
result = await session.execute(
    "SELECT id, metadata FROM documents WHERE id = :1",
    (1,)
)
doc = result.get_first(schema_type=Document)
assert isinstance(doc.metadata, dict)
assert doc.metadata["key"] == "value"
```

### BLOB Handling (Binary Data)

BLOB columns remain as bytes and are not converted to strings:

```python
class FileRecord(msgspec.Struct):
    id: int
    data: bytes  # BLOB column remains bytes

binary_data = b"\x00\x01\x02\x03" * 2000
await session.execute(
    "INSERT INTO files (id, data) VALUES (:1, :2)",
    (1, binary_data)
)

result = await session.execute(
    "SELECT id, data FROM files WHERE id = :1",
    (1,)
)
file_record = result.get_first(schema_type=FileRecord)
assert isinstance(file_record.data, bytes)
```

### Before and After

**Before (manual workaround required):**

```python
# Had to use DBMS_LOB.SUBSTR, truncating to 4000 chars
result = await session.execute(
    "SELECT id, DBMS_LOB.SUBSTR(content, 4000) as content FROM articles"
)
```

**After (automatic, no truncation):**

```python
# CLOB automatically read to full string
result = await session.execute(
    "SELECT id, content FROM articles"
)
```

### Performance Considerations

- **Memory usage:** Every CLOB is read fully into memory, which becomes significant for large values (>100MB). For multi-GB CLOBs, consider database-side processing or pagination.
- **Sync vs Async:** Both sync and async drivers perform automatic CLOB hydration with equivalent performance.
- **Multiple CLOBs:** All CLOB columns in a result row are hydrated automatically.

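
One possible reading of "pagination" for very large CLOBs is to fetch the value in fixed-size windows with `DBMS_LOB.SUBSTR(lob, amount, offset)`, whose offsets are 1-based. The sketch below only computes the windows; `lob_windows` and `PAGE_SQL` are hypothetical names, the total length is assumed to come from a separate query (for example `DBMS_LOB.GETLENGTH`), and the 4000-character window mirrors the limit mentioned in the workaround above.

```python
from typing import Iterator

# Hypothetical paging query against the `articles` table used in the examples above.
PAGE_SQL = (
    "SELECT DBMS_LOB.SUBSTR(content, :amount, :offset) AS chunk "
    "FROM articles WHERE id = :id"
)


def lob_windows(total_chars: int, chunk_chars: int = 4000) -> "Iterator[tuple[int, int]]":
    """Yield (offset, amount) pairs covering `total_chars`, with 1-based offsets."""
    if chunk_chars <= 0:
        raise ValueError("chunk_chars must be positive")
    offset = 1
    while offset <= total_chars:
        amount = min(chunk_chars, total_chars - offset + 1)
        yield offset, amount
        offset += amount


windows = list(lob_windows(10_000, 4000))
```

Each `(offset, amount)` pair would be bound to `PAGE_SQL` in turn, so only one window is held in memory at a time instead of the whole CLOB.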
## Column Name Normalization

Oracle returns unquoted identifiers in uppercase (for example `ID`, `PRODUCT_NAME`). When those rows feed into schema libraries that expect snake_case fields, the uppercase keys can trigger validation errors. SQLSpec resolves this automatically through the `enable_lowercase_column_names` driver feature, which is **enabled by default**.
