| 1 | +# Using Dremio REST API with LinkML-Store |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +LinkML-Store provides a `dremio-rest` adapter for connecting to [Dremio](https://www.dremio.com/) data lakehouse instances using the REST API v3. This is useful when the Arrow Flight SQL port (32010) is not accessible, such as when Dremio is behind a firewall or Cloudflare Access. |
| 6 | + |
| 7 | +## When to Use This Adapter |
| 8 | + |
| 9 | +Use `dremio-rest://` when: |
| 10 | +- Dremio is behind Cloudflare Access or a similar proxy |
| 11 | +- The Arrow Flight port (32010) is blocked |
| 12 | +- You need to connect via HTTPS (port 443) |
| 13 | + |
| 14 | +Use `dremio://` (Flight SQL) when: |
| 15 | +- You have direct network access to Dremio |
| 16 | +- Port 32010 is accessible |
| 17 | +- You need maximum query performance |
| 18 | + |
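One way to apply the criteria above is to probe the Flight port before building the handle. The following is a minimal sketch and not part of LinkML-Store: `flight_port_open` and `pick_scheme` are hypothetical helper names, and a successful TCP connect only suggests, rather than guarantees, that Flight SQL is usable.

```python
import socket


def flight_port_open(host: str, port: int = 32010, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered, timed out, or unresolvable host.
        return False


def pick_scheme(host: str, flight_port: int = 32010) -> str:
    """Prefer Flight SQL when its port is reachable, otherwise fall back to REST."""
    return "dremio" if flight_port_open(host, flight_port) else "dremio-rest"
```

The result can be used to build the handle, for example `f"{pick_scheme(host)}://{host}"`.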
| 19 | +## Installation |
| 20 | + |
| 21 | +The Dremio REST adapter is included in the base linkml-store installation: |
| 22 | + |
| 23 | +```bash |
| 24 | +pip install linkml-store |
| 25 | +``` |
| 26 | + |
| 27 | +## Connection String Format |
| 28 | + |
| 29 | +``` |
| 30 | +dremio-rest://[username:password@]host[:port][?params] |
| 31 | +``` |
| 32 | + |
| 33 | +### Examples |
| 34 | + |
| 35 | +**Basic connection:** |
| 36 | +```python |
| 37 | +handle = "dremio-rest://lakehouse.example.com" |
| 38 | +``` |
| 39 | + |
| 40 | +**With credentials in URL:** |
| 41 | +```python |
| 42 | +handle = "dremio-rest://user:pass@lakehouse.example.com" |
| 43 | +``` |
| 44 | + |
| 45 | +**With default schema:** |
| 46 | +```python |
| 47 | +handle = "dremio-rest://lakehouse.example.com?schema=gold.tables" |
| 48 | +``` |
| 49 | + |
| 50 | +**Disable SSL verification (for testing):** |
| 51 | +```python |
| 52 | +handle = "dremio-rest://localhost:9047?verify_ssl=false" |
| 53 | +``` |
| 54 | + |
| 55 | +### Query Parameters |
| 56 | + |
| 57 | +| Parameter | Default | Description | |
| 58 | +|-----------|---------|-------------| |
| 59 | +| `schema` | None | Default schema for unqualified table names | |
| 60 | +| `verify_ssl` | `true` | Whether to verify SSL certificates | |
| 61 | +| `username_env` | `DREMIO_USER` | Environment variable for username | |
| 62 | +| `password_env` | `DREMIO_PASSWORD` | Environment variable for password | |
| 63 | +| `cf_token_env` | `CF_AUTHORIZATION` | Environment variable for Cloudflare Access token | |
| 64 | + |
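To make the handle format and the parameter defaults concrete, here is a minimal sketch that splits a handle with the standard library. It is an illustration only, not the adapter's own parser: `parse_handle` is a hypothetical name, and the fallback to port 443 is an assumption based on the HTTPS port mentioned earlier.

```python
from urllib.parse import parse_qs, urlsplit


def parse_handle(handle: str) -> dict:
    """Split a dremio-rest:// handle into the components documented above."""
    parts = urlsplit(handle)
    if parts.scheme != "dremio-rest":
        raise ValueError(f"Not a dremio-rest handle: {handle}")
    # parse_qs returns a list per key; keep the last value given for each parameter.
    params = {key: values[-1] for key, values in parse_qs(parts.query).items()}
    return {
        "host": parts.hostname,
        "port": parts.port or 443,  # assumed default (HTTPS)
        "username": parts.username,
        "password": parts.password,
        "schema": params.get("schema"),
        "verify_ssl": params.get("verify_ssl", "true").lower() != "false",
        "username_env": params.get("username_env", "DREMIO_USER"),
        "password_env": params.get("password_env", "DREMIO_PASSWORD"),
        "cf_token_env": params.get("cf_token_env", "CF_AUTHORIZATION"),
    }
```

For instance, `parse_handle("dremio-rest://localhost:9047?verify_ssl=false")` yields a dictionary whose `port` is `9047` and whose `verify_ssl` is `False`.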
| 65 | +## Environment Variables |
| 66 | + |
| 67 | +The adapter reads credentials from environment variables by default: |
| 68 | + |
| 69 | +| Variable | Description | |
| 70 | +|----------|-------------| |
| 71 | +| `DREMIO_USER` | Dremio username | |
| 72 | +| `DREMIO_PASSWORD` | Dremio password | |
| 73 | +| `CF_AUTHORIZATION` | Cloudflare Access token (if behind Cloudflare) | |
| 74 | + |
| 75 | +### Getting the Cloudflare Access Token |
| 76 | + |
| 77 | +If your Dremio instance is behind Cloudflare Access: |
| 78 | + |
| 79 | +1. Open Dremio in your browser |
| 80 | +2. Open Developer Tools (F12) → Application/Storage → Cookies |
| 81 | +3. Copy the value of the `CF_Authorization` cookie |
| 82 | +4. Set it as an environment variable: |
| 83 | + ```bash |
| 84 | + export CF_AUTHORIZATION="your_token_here" |
| 85 | + ``` |
| 86 | + |
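Before attaching a database, it can help to confirm which of these variables are actually set. The sketch below mirrors the documented variable names and the `*_env` overrides; `resolve_credentials` is a hypothetical helper, not a LinkML-Store function, and how the adapter itself prioritises URL credentials over environment variables is not covered here.

```python
import os
from typing import Mapping, Optional


def resolve_credentials(
    username_env: str = "DREMIO_USER",
    password_env: str = "DREMIO_PASSWORD",
    cf_token_env: str = "CF_AUTHORIZATION",
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Look up credentials by variable name; unset variables come back as None."""
    env = os.environ if env is None else env
    return {
        "username": env.get(username_env),
        "password": env.get(password_env),
        "cf_token": env.get(cf_token_env),  # only needed behind Cloudflare Access
    }


if __name__ == "__main__":
    creds = resolve_credentials()
    unset = [name for name, value in creds.items() if value is None]
    print("Unset credentials:", ", ".join(unset) or "none")
```

Passing a plain dictionary as `env` makes the lookup easy to exercise without touching the real environment.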
| 87 | +## Python API Usage |
| 88 | + |
| 89 | +### Basic Example |
| 90 | + |
| 91 | +```python |
| 92 | +from linkml_store import Client |
| 93 | + |
| 94 | +# Create client and attach database |
| 95 | +client = Client() |
| 96 | +db = client.attach_database("dremio-rest://lakehouse.example.com", alias="dremio") |
| 97 | + |
| 98 | +# Query a table using its full path |
| 99 | +study = db.get_collection('"gold-db-2 postgresql".gold.study') |
| 100 | + |
| 101 | +# Find all rows (with limit) |
| 102 | +result = study.find({}, limit=10) |
| 103 | +print(f"Found {len(result.rows)} rows") |
| 104 | + |
| 105 | +# Find with filter |
| 106 | +result = study.find({"is_public": "Yes"}, limit=10) |
| 107 | +for row in result.rows: |
| 108 | + print(row["gold_id"], row["study_name"]) |
| 109 | +``` |
| 110 | + |
| 111 | +### Using MongoDB-Style Query Operators |
| 112 | + |
| 113 | +```python |
| 114 | +# Greater than |
| 115 | +result = study.find({"year": {"$gt": 2020}}) |
| 116 | + |
| 117 | +# IN operator |
| 118 | +result = study.find({"ecosystem": {"$in": ["Environmental", "Host-associated"]}}) |
| 119 | + |
| 120 | +# LIKE (case-sensitive) |
| 121 | +result = study.find({"study_name": {"$like": "%methane%"}}) |
| 122 | + |
| 123 | +# ILIKE (case-insensitive) |
| 124 | +result = study.find({"study_name": {"$ilike": "%Methane%"}}) |
| 125 | + |
| 126 | +# Combined conditions (AND) |
| 127 | +result = study.find({ |
| 128 | + "is_public": "Yes", |
| 129 | + "metagenomic": "Yes", |
| 130 | + "ecosystem": {"$in": ["Environmental"]} |
| 131 | +}, limit=20) |
| 132 | +``` |
| 133 | + |
| 134 | +### Supported Operators |
| 135 | + |
| 136 | +| Operator | SQL Equivalent | Example | |
| 137 | +|----------|---------------|---------| |
| 138 | +| `$gt` | `>` | `{"age": {"$gt": 30}}` | |
| 139 | +| `$gte` | `>=` | `{"age": {"$gte": 30}}` | |
| 140 | +| `$lt` | `<` | `{"age": {"$lt": 30}}` | |
| 141 | +| `$lte` | `<=` | `{"age": {"$lte": 30}}` | |
| 142 | +| `$ne` | `!=`, or `IS NOT NULL` for a null value | `{"status": {"$ne": "deleted"}}` | |
| 143 | +| `$in` | `IN (...)` | `{"status": {"$in": ["a", "b"]}}` | |
| 144 | +| `$nin` | `NOT IN (...)` | `{"status": {"$nin": ["deleted"]}}` | |
| 145 | +| `$like` | `LIKE` | `{"name": {"$like": "%test%"}}` | |
| 146 | +| `$ilike` | `LOWER() LIKE LOWER()` | `{"name": {"$ilike": "%Test%"}}` | |
| 147 | +| `$regex` | `REGEXP_LIKE` | `{"name": {"$regex": "^test.*"}}` | |
| 148 | + |
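To show how the operator table maps onto SQL, here is a minimal sketch that renders a filter dictionary as a `WHERE` clause body. It is not the adapter's actual implementation: `sql_literal`, `condition`, and `build_where` are hypothetical names, column names are emitted unquoted, and real code should validate or quote identifiers.

```python
from typing import Any

# Operators that map directly onto a SQL infix operator.
_INFIX = {"$gt": ">", "$gte": ">=", "$lt": "<", "$lte": "<=", "$ne": "!=", "$like": "LIKE"}


def sql_literal(value: Any) -> str:
    """Render a Python value as a SQL literal (strings are single-quoted)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def condition(column: str, op: str, value: Any) -> str:
    """Translate one (column, operator, value) triple into a SQL condition."""
    if op == "$ne" and value is None:
        return f"{column} IS NOT NULL"
    if op in _INFIX:
        return f"{column} {_INFIX[op]} {sql_literal(value)}"
    if op in ("$in", "$nin"):
        items = ", ".join(sql_literal(v) for v in value)
        keyword = "IN" if op == "$in" else "NOT IN"
        return f"{column} {keyword} ({items})"
    if op == "$ilike":
        return f"LOWER({column}) LIKE LOWER({sql_literal(value)})"
    if op == "$regex":
        return f"REGEXP_LIKE({column}, {sql_literal(value)})"
    raise ValueError(f"Unsupported operator: {op}")


def build_where(where: dict) -> str:
    """Join all conditions with AND, as in the combined-conditions example."""
    parts = []
    for column, spec in where.items():
        if isinstance(spec, dict):
            parts.extend(condition(column, op, val) for op, val in spec.items())
        elif spec is None:
            parts.append(f"{column} IS NULL")
        else:
            parts.append(f"{column} = {sql_literal(spec)}")
    return " AND ".join(parts)
```

Plain values become equality tests, while nested dictionaries are expanded one operator at a time, so a single column can carry several conditions such as `{"year": {"$gte": 2015, "$lt": 2020}}`.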
| 149 | +### Using with Environment Variables |
| 150 | + |
| 151 | +```python |
| 152 | +import os |
| 153 | +from dotenv import load_dotenv |
| 154 | +from linkml_store import Client |
| 155 | + |
| 156 | +# Load credentials from .env file |
| 157 | +load_dotenv() |
| 158 | + |
| 159 | +client = Client() |
| 160 | +db = client.attach_database("dremio-rest://lakehouse.jgi.lbl.gov", alias="jgi") |
| 161 | + |
| 162 | +# Credentials are read automatically from DREMIO_USER and DREMIO_PASSWORD |
| 163 | +collection = db.get_collection('"gold-db-2 postgresql".gold.study') |
| 164 | +result = collection.find({"is_public": "Yes"}, limit=5) |
| 165 | +``` |
| 166 | + |
| 167 | +## Command Line Usage |
| 168 | + |
| 169 | +### Basic Query |
| 170 | + |
| 171 | +```bash |
| 172 | +# Set credentials |
| 173 | +export DREMIO_USER=myuser |
| 174 | +export DREMIO_PASSWORD=mypass |
| 175 | + |
| 176 | +# Query with limit |
| 177 | +linkml-store -d 'dremio-rest://lakehouse.example.com' \ |
| 178 | + -c '"schema".table' \ |
| 179 | + query -l 10 |
| 180 | + |
| 181 | +# Query with filter |
| 182 | +linkml-store -d 'dremio-rest://lakehouse.example.com' \ |
| 183 | + -c '"gold-db-2 postgresql".gold.study' \ |
| 184 | + query -w 'is_public: Yes' -l 10 |
| 185 | + |
| 186 | +# Output as table |
| 187 | +linkml-store -d 'dremio-rest://lakehouse.example.com' \ |
| 188 | + -c '"gold-db-2 postgresql".gold.study' \ |
| 189 | + query -w 'is_public: Yes' -l 10 -O table |
| 190 | +``` |
| 191 | + |
| 192 | +### Case-Insensitive Search |
| 193 | + |
| 194 | +```bash |
| 195 | +# Using $ilike for case-insensitive search |
| 196 | +linkml-store -d 'dremio-rest://lakehouse.jgi.lbl.gov' \ |
| 197 | + -c '"gold-db-2 postgresql".gold.study' \ |
| 198 | + query -w 'study_name: {$ilike: "%methane%"}' -l 10 |
| 199 | +``` |
| 200 | + |
| 201 | +### Using with dotenv |
| 202 | + |
| 203 | +Create a wrapper script to load environment variables: |
| 204 | + |
| 205 | +```bash |
| 206 | +#!/bin/bash |
| 207 | +# dremio-query.sh |
| 208 | +set -a; source ~/.dremio.env; set +a |
| 209 | +linkml-store -d 'dremio-rest://lakehouse.jgi.lbl.gov' "$@" |
| 210 | +``` |
| 211 | + |
| 212 | +Then use it: |
| 213 | + |
| 214 | +```bash |
| 215 | +./dremio-query.sh -c '"gold-db-2 postgresql".gold.study' query -l 5 |
| 216 | +``` |
| 217 | + |
| 218 | +## Table Naming |
| 219 | + |
| 220 | +Dremio uses a hierarchical namespace for tables. Fully qualified table names may include: |
| 221 | + |
| 222 | +- **Source**: The data source name (e.g., `"gold-db-2 postgresql"`) |
| 223 | +- **Schema/Space**: The schema or space name (e.g., `gold`) |
| 224 | +- **Table**: The table name (e.g., `study`) |
| 225 | + |
| 226 | +Full path example: `"gold-db-2 postgresql".gold.study` |
| 227 | + |
| 228 | +When specifying table names: |
| 229 | +- Use double quotes around names with special characters or spaces |
| 230 | +- The full path can be specified directly in `get_collection()` |
| 231 | +- Or set a default schema in the connection string |
| 232 | + |
| 233 | +```python |
| 234 | +# Full path |
| 235 | +collection = db.get_collection('"gold-db-2 postgresql".gold.study') |
| 236 | + |
| 237 | +# With default schema |
| 238 | +db = client.attach_database( |
| 239 | + 'dremio-rest://lakehouse.example.com?schema="gold-db-2 postgresql".gold', |
| 240 | + alias="dremio" |
| 241 | +) |
| 242 | +collection = db.get_collection('study') # Uses default schema |
| 243 | +``` |
| 244 | + |
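The quoting rules above can be illustrated with a small sketch that splits a path into its components and reassembles it. These helpers are hypothetical and not part of LinkML-Store; the sketch assumes component names never contain a double quote themselves.

```python
import re
from typing import List

_SIMPLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def split_table_path(path: str) -> List[str]:
    """Split a dotted path on dots that are outside double quotes."""
    parts, current, in_quotes = [], [], False
    for ch in path:
        if ch == '"':
            in_quotes = not in_quotes  # quotes delimit a component and are dropped
        elif ch == "." and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if in_quotes:
        raise ValueError(f"Unbalanced quotes in table path: {path}")
    parts.append("".join(current))
    return parts


def quote_table_path(parts: List[str]) -> str:
    """Join components with dots, quoting any that are not simple identifiers."""
    quoted = []
    for part in parts:
        if '"' in part:
            raise ValueError(f"Embedded quotes are not handled in this sketch: {part}")
        quoted.append(part if _SIMPLE_NAME.fullmatch(part) else f'"{part}"')
    return ".".join(quoted)
```

Building the path from its components with `quote_table_path` avoids hand-written quoting mistakes when a source name contains spaces or hyphens.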
| 245 | +## Performance Considerations |
| 246 | + |
| 247 | +1. **Use specific table paths**: Don't rely on table discovery; specify exact paths |
| 248 | +2. **Add LIMIT**: Always pass the `limit` parameter to avoid fetching more rows than you need |
| 249 | +3. **Filter on server**: Use WHERE clauses to filter data on the server side |
| 250 | +4. **Avoid `search`**: The semantic search command loads all data locally; use `query` with `$like`/`$ilike` instead |
| 251 | + |
| 252 | +## Comparison: REST vs Flight SQL |
| 253 | + |
| 254 | +| Feature | `dremio-rest://` | `dremio://` | |
| 255 | +|---------|------------------|-------------| |
| 256 | +| Protocol | HTTPS REST API | Arrow Flight SQL | |
| 257 | +| Port | 443 (default) | 32010 | |
| 258 | +| Works behind proxy | Yes | No | |
| 259 | +| Performance | Good | Better | |
| 260 | +| Pagination | Automatic | Native | |
| 261 | + |
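The "Automatic" pagination entry can be pictured as a loop that requests fixed-size pages until the result set or the caller's limit is exhausted. This is a sketch of the general technique under stated assumptions, not the adapter's code: `fetch_page` is a hypothetical callable standing in for one HTTP request, and the page size of 500 is an assumed value.

```python
from typing import Callable, Dict, Iterator, List, Optional

Row = Dict[str, object]


def iter_rows(
    fetch_page: Callable[[int, int], List[Row]],
    page_size: int = 500,
    limit: Optional[int] = None,
) -> Iterator[Row]:
    """Yield rows page by page; fetch_page(offset, count) returns up to count rows."""
    offset = 0
    while limit is None or offset < limit:
        # Never ask for more rows than the caller still wants.
        want = page_size if limit is None else min(page_size, limit - offset)
        page = fetch_page(offset, want)
        yield from page
        offset += len(page)
        if len(page) < want:
            break  # a short (or empty) page means the result set is exhausted
```

Because the loop stops as soon as `limit` rows have been yielded, a small `limit` keeps the number of round trips low, which is one reason the `limit` parameter matters more over REST than over Flight SQL.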
| 262 | +## Troubleshooting |
| 263 | + |
| 264 | +### Authentication Errors |
| 265 | + |
| 266 | +``` |
| 267 | +ConnectionError: Dremio authentication failed: 401 |
| 268 | +``` |
| 269 | + |
| 270 | +- Check that `DREMIO_USER` and `DREMIO_PASSWORD` are set correctly |
| 271 | +- If behind Cloudflare, ensure `CF_AUTHORIZATION` is set and not expired |
| 272 | + |
| 273 | +### SSL Certificate Errors |
| 274 | + |
| 275 | +``` |
| 276 | +SSLError: certificate verify failed |
| 277 | +``` |
| 278 | + |
| 279 | +For testing only, disable SSL verification: |
| 280 | +```python |
| 281 | +db = client.attach_database("dremio-rest://localhost?verify_ssl=false", alias="test") |
| 282 | +``` |
| 283 | + |
| 284 | +### Slow Startup |
| 285 | + |
| 286 | +Slow startup is usually caused by the adapter scanning for all tables. The adapter now skips this scan by default, so if you need to list all tables, trigger discovery explicitly: |
| 287 | + |
| 288 | +```python |
| 289 | +db.discover_collections() # Explicitly scan for tables |
| 290 | +print(db.list_collection_names()) |
| 291 | +``` |
| 292 | + |
| 293 | +### Query Syntax Errors |
| 294 | + |
| 295 | +Ensure table names with special characters are properly quoted: |
| 296 | +```python |
| 297 | +# Correct |
| 298 | +collection = db.get_collection('"gold-db-2 postgresql".gold.study') |
| 299 | + |
| 300 | +# Wrong - missing quotes |
| 301 | +collection = db.get_collection('gold-db-2 postgresql.gold.study') |
| 302 | +``` |
| 303 | + |
| 304 | +## Resources |
| 305 | + |
| 306 | +- [Dremio REST API Documentation](https://docs.dremio.com/current/reference/api/) |
| 307 | +- [Dremio SQL Reference](https://docs.dremio.com/current/reference/sql/) |
| 308 | +- [LinkML-Store Documentation](https://linkml.io/linkml-store/) |