Commit c17a00d

Merge pull request #63 from linkml/claude/add-dremio-adapter-6AXgW
Add Dremio adapter for data lakehouse connectivity
2 parents a2f92a1 + 72b49e6 commit c17a00d

12 files changed (+3239 -0 lines changed)

docs/how-to/Use-Dremio-REST.md

Lines changed: 308 additions & 0 deletions
@@ -0,0 +1,308 @@
# Using Dremio REST API with LinkML-Store

## Overview

LinkML-Store provides a `dremio-rest` adapter for connecting to [Dremio](https://www.dremio.com/) data lakehouse instances using the REST API v3. This is useful when the Arrow Flight SQL port (32010) is not accessible, such as when Dremio is behind a firewall or Cloudflare Access.

## When to Use This Adapter

Use `dremio-rest://` when:
- Dremio is behind Cloudflare Access or a similar proxy
- The Arrow Flight port (32010) is blocked
- You need to connect via HTTPS (port 443)

Use `dremio://` (Flight SQL) when:
- You have direct network access to Dremio
- Port 32010 is accessible
- You need maximum query performance

## Installation

The Dremio REST adapter is included in the base linkml-store installation:

```bash
pip install linkml-store
```

## Connection String Format

```
dremio-rest://[username:password@]host[:port][?params]
```

### Examples

**Basic connection:**
```python
handle = "dremio-rest://lakehouse.example.com"
```

**With credentials in URL:**
```python
handle = "dremio-rest://user:pass@lakehouse.example.com"
```

**With default schema:**
```python
handle = "dremio-rest://lakehouse.example.com?schema=gold.tables"
```

**Disable SSL verification (for testing):**
```python
handle = "dremio-rest://localhost:9047?verify_ssl=false"
```

### Query Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `schema` | None | Default schema for unqualified table names |
| `verify_ssl` | `true` | Whether to verify SSL certificates |
| `username_env` | `DREMIO_USER` | Environment variable for username |
| `password_env` | `DREMIO_PASSWORD` | Environment variable for password |
| `cf_token_env` | `CF_AUTHORIZATION` | Environment variable for Cloudflare Access token |

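To make the pieces of a handle concrete, the sketch below splits a `dremio-rest://` URL with Python's standard `urllib.parse` and applies the defaults from the Query Parameters table. It only illustrates how the documented components fit together and is not the adapter's own parser; the function name `parse_dremio_rest_handle` and the returned dictionary layout are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def parse_dremio_rest_handle(handle: str) -> dict:
    """Split a dremio-rest:// handle into its documented parts (illustrative only)."""
    parts = urlsplit(handle)
    if parts.scheme != "dremio-rest":
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    # parse_qs returns a list per key; keep the first value of each parameter.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "host": parts.hostname,
        "port": parts.port,  # None when the handle omits the port
        "username": parts.username,
        "password": parts.password,
        "schema": params.get("schema"),
        # The defaults below mirror the Query Parameters table.
        "verify_ssl": params.get("verify_ssl", "true").lower() != "false",
        "username_env": params.get("username_env", "DREMIO_USER"),
        "password_env": params.get("password_env", "DREMIO_PASSWORD"),
        "cf_token_env": params.get("cf_token_env", "CF_AUTHORIZATION"),
    }


info = parse_dremio_rest_handle("dremio-rest://user:pass@localhost:9047?verify_ssl=false")
print(info["host"], info["port"], info["verify_ssl"])  # localhost 9047 False
```

How the real adapter treats edge cases (for example, quoted schema names in the query string) may differ from this sketch.
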
## Environment Variables

The adapter reads credentials from environment variables by default:

| Variable | Description |
|----------|-------------|
| `DREMIO_USER` | Dremio username |
| `DREMIO_PASSWORD` | Dremio password |
| `CF_AUTHORIZATION` | Cloudflare Access token (if behind Cloudflare) |

### Getting the Cloudflare Access Token

If your Dremio instance is behind Cloudflare Access:

1. Open Dremio in your browser
2. Open Developer Tools (F12) → Application/Storage → Cookies
3. Copy the value of the `CF_Authorization` cookie
4. Set it as an environment variable:
   ```bash
   export CF_AUTHORIZATION="your_token_here"
   ```

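Before attaching a database, it can help to confirm that the variables listed above are actually set. The following is a minimal sketch under that assumption; `missing_dremio_env` is a hypothetical helper and is not part of linkml-store.

```python
import os
from typing import List, Mapping


def missing_dremio_env(environ: Mapping[str, str] = os.environ,
                       behind_cloudflare: bool = False) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    required = ["DREMIO_USER", "DREMIO_PASSWORD"]
    if behind_cloudflare:
        # The Cloudflare token is only needed when Dremio sits behind Cloudflare Access.
        required.append("CF_AUTHORIZATION")
    return [name for name in required if not environ.get(name)]


missing = missing_dremio_env(behind_cloudflare=True)
if missing:
    print("Set these variables before connecting:", ", ".join(missing))
```

Passing a plain dictionary as `environ` makes the check easy to exercise without touching the real environment.
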
## Python API Usage

### Basic Example

```python
from linkml_store import Client

# Create client and attach database
client = Client()
db = client.attach_database("dremio-rest://lakehouse.example.com", alias="dremio")

# Query a table using its full path
study = db.get_collection('"gold-db-2 postgresql".gold.study')

# Find all rows (with limit)
result = study.find({}, limit=10)
print(f"Found {len(result.rows)} rows")

# Find with filter
result = study.find({"is_public": "Yes"}, limit=10)
for row in result.rows:
    print(row["gold_id"], row["study_name"])
```

### Using MongoDB-Style Query Operators

```python
# Greater than
result = study.find({"year": {"$gt": 2020}})

# IN operator
result = study.find({"ecosystem": {"$in": ["Environmental", "Host-associated"]}})

# LIKE (case-sensitive)
result = study.find({"study_name": {"$like": "%methane%"}})

# ILIKE (case-insensitive)
result = study.find({"study_name": {"$ilike": "%Methane%"}})

# Combined conditions (AND)
result = study.find({
    "is_public": "Yes",
    "metagenomic": "Yes",
    "ecosystem": {"$in": ["Environmental"]}
}, limit=20)
```

### Supported Operators

| Operator | SQL Equivalent | Example |
|----------|---------------|---------|
| `$gt` | `>` | `{"age": {"$gt": 30}}` |
| `$gte` | `>=` | `{"age": {"$gte": 30}}` |
| `$lt` | `<` | `{"age": {"$lt": 30}}` |
| `$lte` | `<=` | `{"age": {"$lte": 30}}` |
| `$ne` | `!=` or `IS NOT NULL` | `{"status": {"$ne": "deleted"}}` |
| `$in` | `IN (...)` | `{"status": {"$in": ["a", "b"]}}` |
| `$nin` | `NOT IN (...)` | `{"status": {"$nin": ["deleted"]}}` |
| `$like` | `LIKE` | `{"name": {"$like": "%test%"}}` |
| `$ilike` | `LOWER() LIKE LOWER()` | `{"name": {"$ilike": "%Test%"}}` |
| `$regex` | `REGEXP_LIKE` | `{"name": {"$regex": "^test.*"}}` |

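The operator mapping in the table can be illustrated with a small translator from a MongoDB-style filter to a SQL `WHERE` expression. This is a simplified sketch, not the adapter's implementation: `to_where_clause` and `_sql_literal` are hypothetical names, literals are inlined with naive quoting, and the rule that `{"$ne": None}` becomes `IS NOT NULL` is an assumption about when the second form in the table applies.

```python
from typing import Any, Dict

# Comparison operators and their SQL equivalents, taken from the table above.
_COMPARISONS = {"$gt": ">", "$gte": ">=", "$lt": "<", "$lte": "<=", "$ne": "!="}


def _sql_literal(value: Any) -> str:
    """Render a Python value as a SQL literal (naive quoting, for illustration)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def to_where_clause(query: Dict[str, Any]) -> str:
    """Translate a MongoDB-style filter into a SQL WHERE expression."""
    clauses = []
    for column, condition in query.items():
        if not isinstance(condition, dict):
            # A bare value means equality.
            clauses.append(f"{column} = {_sql_literal(condition)}")
            continue
        for op, operand in condition.items():
            if op == "$ne" and operand is None:
                # Assumption: a null operand maps to IS NOT NULL.
                clauses.append(f"{column} IS NOT NULL")
            elif op in _COMPARISONS:
                clauses.append(f"{column} {_COMPARISONS[op]} {_sql_literal(operand)}")
            elif op in ("$in", "$nin"):
                items = ", ".join(_sql_literal(v) for v in operand)
                keyword = "IN" if op == "$in" else "NOT IN"
                clauses.append(f"{column} {keyword} ({items})")
            elif op == "$like":
                clauses.append(f"{column} LIKE {_sql_literal(operand)}")
            elif op == "$ilike":
                clauses.append(f"LOWER({column}) LIKE LOWER({_sql_literal(operand)})")
            elif op == "$regex":
                clauses.append(f"REGEXP_LIKE({column}, {_sql_literal(operand)})")
            else:
                raise ValueError(f"unsupported operator: {op}")
    # Multiple conditions are combined with AND.
    return " AND ".join(clauses)


print(to_where_clause({"is_public": "Yes", "year": {"$gt": 2020}}))
# is_public = 'Yes' AND year > 2020
```

Production code would normally escape identifiers and values more carefully than this sketch does.
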
### Using with Environment Variables

```python
from dotenv import load_dotenv
from linkml_store import Client

# Load credentials from a .env file (requires the python-dotenv package)
load_dotenv()

client = Client()
db = client.attach_database("dremio-rest://lakehouse.jgi.lbl.gov", alias="jgi")

# Credentials are automatically read from DREMIO_USER, DREMIO_PASSWORD
collection = db.get_collection('"gold-db-2 postgresql".gold.study')
result = collection.find({"is_public": "Yes"}, limit=5)
```

## Command Line Usage

### Basic Query

```bash
# Set credentials
export DREMIO_USER=myuser
export DREMIO_PASSWORD=mypass

# Query with limit
linkml-store -d 'dremio-rest://lakehouse.example.com' \
  -c '"schema".table' \
  query -l 10

# Query with filter
linkml-store -d 'dremio-rest://lakehouse.example.com' \
  -c '"gold-db-2 postgresql".gold.study' \
  query -w 'is_public: Yes' -l 10

# Output as table
linkml-store -d 'dremio-rest://lakehouse.example.com' \
  -c '"gold-db-2 postgresql".gold.study' \
  query -w 'is_public: Yes' -l 10 -O table
```

### Case-Insensitive Search

```bash
# Using $ilike for case-insensitive search
linkml-store -d 'dremio-rest://lakehouse.jgi.lbl.gov' \
  -c '"gold-db-2 postgresql".gold.study' \
  query -w 'study_name: {$ilike: "%methane%"}' -l 10
```

### Using with dotenv

Create a wrapper script to load environment variables:

```bash
#!/bin/bash
# dremio-query.sh
set -a; source ~/.dremio.env; set +a
linkml-store -d 'dremio-rest://lakehouse.jgi.lbl.gov' "$@"
```

Then use it:

```bash
./dremio-query.sh -c '"gold-db-2 postgresql".gold.study' query -l 5
```

## Table Naming

Dremio uses a hierarchical namespace for tables. Fully qualified table names may include:

- **Source**: The data source name (e.g., `"gold-db-2 postgresql"`)
- **Schema/Space**: The schema or space name (e.g., `gold`)
- **Table**: The table name (e.g., `study`)

Full path example: `"gold-db-2 postgresql".gold.study`

When specifying table names:
- Use double quotes around names with special characters or spaces
- The full path can be specified directly in `get_collection()`
- Or set a default schema in the connection string

```python
# Full path
collection = db.get_collection('"gold-db-2 postgresql".gold.study')

# With default schema
db = client.attach_database(
    'dremio-rest://lakehouse.example.com?schema="gold-db-2 postgresql".gold',
    alias="dremio"
)
collection = db.get_collection('study')  # Uses default schema
```

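The quoting rule described above can be sketched as a small helper that builds a dotted path from its components and quotes only the parts that need it. `quote_dremio_path` is a hypothetical function, not part of linkml-store; it treats anything other than a plain identifier as needing quotes, and doubling embedded double quotes is an assumption based on common SQL practice. It does not handle reserved words.

```python
import re
from typing import Sequence

# Names made only of letters, digits, and underscores are left unquoted.
_PLAIN_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def quote_dremio_path(parts: Sequence[str]) -> str:
    """Join path components with dots, quoting those with special characters."""
    quoted = []
    for part in parts:
        if _PLAIN_IDENTIFIER.match(part):
            quoted.append(part)
        else:
            # Double any embedded double quotes, then wrap the name in quotes.
            quoted.append('"' + part.replace('"', '""') + '"')
    return ".".join(quoted)


print(quote_dremio_path(["gold-db-2 postgresql", "gold", "study"]))
# "gold-db-2 postgresql".gold.study
```

The resulting string has the same shape as the full path passed to `get_collection()` in the examples above.
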
## Performance Considerations

1. **Use specific table paths**: Specify exact paths rather than relying on table discovery
2. **Add LIMIT**: Always pass the `limit` parameter to avoid fetching too many rows
3. **Filter on server**: Use WHERE clauses to filter data on the server side
4. **Avoid `search`**: The semantic search command loads all data locally; use `query` with `$like`/`$ilike` instead

## Comparison: REST vs Flight SQL

| Feature | `dremio-rest://` | `dremio://` |
|---------|------------------|-------------|
| Protocol | HTTPS REST API | Arrow Flight SQL |
| Port | 443 (default) | 32010 |
| Works behind proxy | Yes | No |
| Performance | Good | Better |
| Pagination | Automatic | Native |

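One way to apply this comparison in code is to probe the Flight SQL port and fall back to the REST scheme when it is unreachable. The sketch below uses only the standard `socket` module; `flight_port_reachable` and `choose_handle` are hypothetical helpers, and a successful TCP connection is only a rough signal that Flight SQL will work, not a guarantee.

```python
import socket
from typing import Callable


def flight_port_reachable(host: str, port: int = 32010, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the Arrow Flight port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def choose_handle(host: str,
                  reachable: Callable[[str], bool] = flight_port_reachable) -> str:
    """Prefer dremio:// when the Flight port answers, else use dremio-rest://."""
    scheme = "dremio" if reachable(host) else "dremio-rest"
    return f"{scheme}://{host}"
```

Calling `choose_handle("lakehouse.example.com")` performs a real network probe; passing a custom `reachable` callable avoids that in tests.
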
## Troubleshooting

### Authentication Errors

```
ConnectionError: Dremio authentication failed: 401
```

- Check that `DREMIO_USER` and `DREMIO_PASSWORD` are set correctly
- If behind Cloudflare, ensure `CF_AUTHORIZATION` is set and not expired

### SSL Certificate Errors

```
SSLError: certificate verify failed
```

For testing only, disable SSL verification:
```python
db = client.attach_database("dremio-rest://localhost?verify_ssl=false", alias="test")
```

### Slow Startup

Slow startup can be caused by the adapter scanning for all tables. The adapter skips this scan by default. If you need to list all tables, trigger discovery explicitly:

```python
db.discover_collections()  # Explicitly scan for tables
print(db.list_collection_names())
```

### Query Syntax Errors

Ensure table names with special characters are properly quoted:
```python
# Correct
collection = db.get_collection('"gold-db-2 postgresql".gold.study')

# Wrong - missing quotes
collection = db.get_collection('gold-db-2 postgresql.gold.study')
```

## Resources

- [Dremio REST API Documentation](https://docs.dremio.com/current/reference/api/)
- [Dremio SQL Reference](https://docs.dremio.com/current/reference/sql/)
- [LinkML-Store Documentation](https://linkml.io/linkml-store/)

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ mongodb = ["pymongo"]
 neo4j = ["neo4j", "py2neo", "networkx"]
 h5py = ["h5py"]
 pyarrow = ["pyarrow"]
+dremio = ["pyarrow"]
 pyreadr = ["pyreadr"]
 validation = ["linkml>=1.8.0"]
 map = ["linkml_map>=0.3.9", "ucumvert>=0.2.0"]

src/linkml_store/api/client.py

Lines changed: 2 additions & 0 deletions
@@ -20,6 +20,8 @@
     "chromadb": "linkml_store.api.stores.chromadb.chromadb_database.ChromaDBDatabase",
     "neo4j": "linkml_store.api.stores.neo4j.neo4j_database.Neo4jDatabase",
     "file": "linkml_store.api.stores.filesystem.filesystem_database.FileSystemDatabase",
+    "dremio": "linkml_store.api.stores.dremio.dremio_database.DremioDatabase",
+    "dremio-rest": "linkml_store.api.stores.dremio_rest.dremio_rest_database.DremioRestDatabase",
     "ibis": "linkml_store.api.stores.ibis.ibis_database.IbisDatabase",
     # Ibis backend-specific schemes
     "ibis+duckdb": "linkml_store.api.stores.ibis.ibis_database.IbisDatabase",
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
"""Dremio database adapter for linkml-store.

This module provides a Dremio adapter that uses Arrow Flight SQL for high-performance
data access to Dremio data lakehouse.
"""

from linkml_store.api.stores.dremio.dremio_collection import DremioCollection
from linkml_store.api.stores.dremio.dremio_database import DremioDatabase

__all__ = ["DremioDatabase", "DremioCollection"]
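
The `client.py` hunk in this commit registers the `dremio` and `dremio-rest` schemes as dotted class paths. A mapping of that shape is typically resolved by importing the module lazily and looking up the class by name. The sketch below shows that general pattern only; it is not taken from linkml-store, the names `SCHEME_TO_CLASS` and `resolve_class` are hypothetical, and standard-library classes stand in for the adapter classes so that the example runs without any extra packages.

```python
import importlib

# Hypothetical registry in the same shape as the mapping in the diff;
# standard-library classes stand in for the adapter classes.
SCHEME_TO_CLASS = {
    "ordered": "collections.OrderedDict",
    "fraction": "fractions.Fraction",
}


def resolve_class(handle: str):
    """Look up the class registered for a handle's scheme and import it lazily."""
    scheme = handle.split(":", 1)[0]
    try:
        dotted_path = SCHEME_TO_CLASS[scheme]
    except KeyError:
        raise ValueError(f"no adapter registered for scheme {scheme!r}") from None
    # Split "package.module.ClassName" into module path and class name.
    module_name, class_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


print(resolve_class("ordered://example").__name__)  # OrderedDict
```

Keeping the registry as strings means an optional dependency such as `pyarrow` would only be imported when its scheme is actually used.
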
