Commit c78ab51

Merge pull request #1304 from datajoint/claude/modern-fetch-api
DataJoint 2.0: Modern Fetch and Insert APIs
2 parents 83b380f + d1dafdc commit c78ab51

36 files changed: +2351 -632 lines

Lines changed: 302 additions & 0 deletions
@@ -0,0 +1,302 @@
# DataJoint 2.0 Fetch API Specification

## Overview

DataJoint 2.0 replaces the multi-purpose `fetch()` method with a set of explicit, composable output methods. Naming the output format in the method makes the API easier to discover and the intent clearer at the call site, and iteration now streams rows lazily.

## Design Principles

1. **Explicit over implicit**: Each output format has its own method
2. **Composable**: Use existing `.proj()` for column selection
3. **Lazy iteration**: Single cursor streaming instead of fetch-all-keys
4. **Modern formats**: First-class support for polars and Arrow

---

## New API Reference

### Output Methods

| Method | Returns | Description |
|--------|---------|-------------|
| `to_dicts()` | `list[dict]` | All rows as list of dictionaries |
| `to_pandas()` | `DataFrame` | pandas DataFrame with primary key as index |
| `to_polars()` | `polars.DataFrame` | polars DataFrame (requires `datajoint[polars]`) |
| `to_arrow()` | `pyarrow.Table` | PyArrow Table (requires `datajoint[arrow]`) |
| `to_arrays()` | `np.ndarray` | numpy structured array (recarray) |
| `to_arrays('a', 'b')` | `tuple[array, array]` | Tuple of arrays for specific columns |
| `keys()` | `list[dict]` | Primary key values only |
| `fetch1()` | `dict` | Single row as dict (raises if not exactly 1) |
| `fetch1('a', 'b')` | `tuple` | Single row attribute values |

### Common Parameters

All output methods accept these optional parameters:

```python
table.to_dicts(
    order_by=None,      # str or list: column(s) to sort by, e.g. "KEY", "name DESC"
    limit=None,         # int: maximum rows to return
    offset=None,        # int: rows to skip
    squeeze=False,      # bool: remove singleton dimensions from arrays
    download_path="."   # str: path for downloading external data
)
```
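
The `order_by`, `limit`, and `offset` parameters correspond to SQL's `ORDER BY`, `LIMIT`, and `OFFSET` clauses. The following is a minimal sketch of how such options might be composed into a query, using the standard library's `sqlite3` as a stand-in database. The helper `build_query`, the table, and its columns are hypothetical, and the expansion of `"KEY"` into the primary key columns is an assumption made for illustration; none of this is DataJoint's implementation.

```python
import sqlite3

def build_query(table, primary_key, order_by=None, limit=None, offset=None):
    """Compose a SELECT with optional ORDER BY / LIMIT / OFFSET clauses.

    Identifiers are interpolated directly, so only trusted input is assumed.
    """
    sql = f"SELECT * FROM {table}"
    if order_by is not None:
        terms = [order_by] if isinstance(order_by, str) else list(order_by)
        expanded = []
        for term in terms:
            # Assumption: "KEY" stands for all primary key columns, ascending.
            if term == "KEY":
                expanded.extend(primary_key)
            else:
                expanded.append(term)
        sql += " ORDER BY " + ", ".join(expanded)
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    if offset is not None:
        if limit is None:
            # SQLite requires LIMIT before OFFSET; -1 means "no limit".
            sql += " LIMIT -1"
        sql += f" OFFSET {int(offset)}"
    return sql

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (exp_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?)",
    [(1, "delta"), (2, "alpha"), (3, "charlie"), (4, "bravo")],
)

# Sort by name descending, skip the first row, return the next two.
sql = build_query("experiment", ["exp_id"], order_by="name DESC", limit=2, offset=1)
names = [row[1] for row in conn.execute(sql)]
```

Here `names` holds the second and third names in descending order, since the offset is applied after sorting and before the limit.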

### Iteration

```python
# Lazy streaming - yields one dict per row from database cursor
for row in table:
    process(row)  # row is a dict
```
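
To illustrate what "one dict per row from a database cursor" can look like, here is a minimal sketch built on the standard library's `sqlite3` rather than on DataJoint. The generator `iter_rows` and the `session` table are hypothetical stand-ins, not the library's code.

```python
import sqlite3

def iter_rows(conn, sql):
    """Yield one dict per row, pulling rows from a single cursor on demand."""
    cursor = conn.execute(sql)
    columns = [description[0] for description in cursor.description]
    for values in cursor:
        yield dict(zip(columns, values))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE session (session_id INTEGER PRIMARY KEY, duration REAL)")
conn.executemany(
    "INSERT INTO session VALUES (?, ?)", [(i, i * 1.5) for i in range(1, 6)]
)

stream = iter_rows(conn, "SELECT * FROM session ORDER BY session_id")
first = next(stream)  # the query runs only now, when the first row is requested
# The remaining rows are consumed one at a time from the same cursor.
long_sessions = [row["session_id"] for row in stream if row["duration"] > 4.0]
```

Because `iter_rows` is a generator, nothing is sent to the database until the first row is requested, and a caller that stops early never pulls the remaining rows.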

---

## Migration Guide

### Basic Fetch Operations

| Old Pattern (1.x) | New Pattern (2.0) |
|-------------------|-------------------|
| `table.fetch()` | `table.to_arrays()` or `table.to_dicts()` |
| `table.fetch(format="array")` | `table.to_arrays()` |
| `table.fetch(format="frame")` | `table.to_pandas()` |
| `table.fetch(as_dict=True)` | `table.to_dicts()` |

### Attribute Fetching

| Old Pattern (1.x) | New Pattern (2.0) |
|-------------------|-------------------|
| `table.fetch('a')` | `table.to_arrays('a')` |
| `a, b = table.fetch('a', 'b')` | `a, b = table.to_arrays('a', 'b')` |
| `table.fetch('a', 'b', as_dict=True)` | `table.proj('a', 'b').to_dicts()` |

### Primary Key Fetching

| Old Pattern (1.x) | New Pattern (2.0) |
|-------------------|-------------------|
| `table.fetch('KEY')` | `table.keys()` |
| `table.fetch(dj.key)` | `table.keys()` |
| `keys, a = table.fetch('KEY', 'a')` | See note below |

For mixed KEY + attribute fetch:

```python
# Old: keys, a = table.fetch('KEY', 'a')
# New: combine keys() with to_arrays(), passing the same order_by
# to both calls so that the rows line up
keys = table.keys(order_by='KEY')
a = table.to_arrays('a', order_by='KEY')
# Or use to_dicts(), which includes all columns
```
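
If `keys()` and `to_arrays('a')` run as separate queries, their rows are only guaranteed to correspond when both use the same ordering. Deriving both from a single `to_dicts()` result avoids the issue. The sketch below shows one way to do that in plain Python; `split_keys` is a hypothetical helper, the rows are a stand-in for `table.to_dicts()`, and the caller is assumed to know the primary key column names.

```python
def split_keys(rows, primary_key, *attrs):
    """Split row dicts into (keys, one list of values per requested attribute).

    rows: list of dicts, shaped like the output of to_dicts().
    primary_key: names of the primary key columns (assumed known to the caller).
    """
    keys = [{name: row[name] for name in primary_key} for row in rows]
    columns = tuple([row[attr] for row in rows] for attr in attrs)
    return (keys, *columns)

# Stand-in for table.to_dicts(); the column names are illustrative only.
rows = [
    {"subject_id": 1, "session": 1, "a": 0.5},
    {"subject_id": 1, "session": 2, "a": 0.7},
]
keys, a = split_keys(rows, ["subject_id", "session"], "a")
```

Since `keys` and `a` come from the same list, element `i` of each always refers to the same row.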

### Ordering, Limiting, Offset

| Old Pattern (1.x) | New Pattern (2.0) |
|-------------------|-------------------|
| `table.fetch(order_by='name')` | `table.to_arrays(order_by='name')` |
| `table.fetch(limit=10)` | `table.to_arrays(limit=10)` |
| `table.fetch(order_by='KEY', limit=10, offset=5)` | `table.to_arrays(order_by='KEY', limit=10, offset=5)` |

### Single Row Fetch (fetch1)

| Old Pattern (1.x) | New Pattern (2.0) |
|-------------------|-------------------|
| `table.fetch1()` | `table.fetch1()` (unchanged) |
| `a, b = table.fetch1('a', 'b')` | `a, b = table.fetch1('a', 'b')` (unchanged) |
| `table.fetch1('KEY')` | `table.fetch1()`, then keep only the primary key columns |

### Configuration

| Old Pattern (1.x) | New Pattern (2.0) |
|-------------------|-------------------|
| `dj.config['fetch_format'] = 'frame'` | Use `.to_pandas()` explicitly |
| `with dj.config.override(fetch_format='frame'):` | Use `.to_pandas()` in the block |

### Iteration

| Old Pattern (1.x) | New Pattern (2.0) |
|-------------------|-------------------|
| `for row in table:` | `for row in table:` (same syntax, now lazy!) |
| `list(table)` | `table.to_dicts()` |

### Column Selection with proj()

Use `.proj()` for column selection, then apply an output method:

```python
# Select specific columns
table.proj('col1', 'col2').to_pandas()
table.proj('col1', 'col2').to_dicts()

# Computed columns
table.proj(total='price * quantity').to_pandas()
```

---

## Removed Features

### Removed Methods and Parameters

- `fetch()` method - use explicit output methods
- `fetch('KEY')` - use `keys()`
- `dj.key` class - use `keys()` method
- `format=` parameter - use explicit methods
- `as_dict=` parameter - use `to_dicts()`
- `config['fetch_format']` setting - use explicit methods

### Removed Imports

```python
# Old (removed)
from datajoint import key
result = table.fetch(key)  # same object as dj.key

# New
result = table.keys()
```

---

## Examples

### Example 1: Basic Data Retrieval

```python
# Get all data as DataFrame
df = Experiment().to_pandas()

# Get all data as list of dicts
rows = Experiment().to_dicts()

# Get all data as numpy array
arr = Experiment().to_arrays()
```

### Example 2: Filtered and Sorted Query

```python
# Get recent experiments, sorted by date
recent = (Experiment() & 'date > "2024-01-01"').to_pandas(
    order_by='date DESC',
    limit=100
)
```
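
### Example 3: Specific Columns

```python
# Fetch specific columns as arrays
names, dates = Experiment().to_arrays('name', 'date')

# Or with primary key included
# (include_key is not listed under Common Parameters above, and this
# document does not say how the key is returned, so check the shape of
# the result before relying on this two-name unpacking)
names, dates = Experiment().to_arrays('name', 'date', include_key=True)
```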

### Example 4: Primary Keys for Iteration

```python
# Get keys for restriction
keys = Experiment().keys()
for key in keys:
    process(Session() & key)
```

### Example 5: Single Row

```python
# Get one row as dict
row = (Experiment() & key).fetch1()

# Get specific attributes
name, date = (Experiment() & key).fetch1('name', 'date')
```

### Example 6: Lazy Iteration

```python
# Stream rows efficiently (single database cursor)
for row in Experiment():
    if should_process(row):
        process(row)
    if done:
        break  # Early termination - no wasted fetches
```

### Example 7: Modern DataFrame Libraries

```python
# Polars (fast, modern)
import polars as pl
df = Experiment().to_polars()
result = df.filter(pl.col('value') > 100).group_by('category').agg(pl.mean('value'))

# PyArrow (columnar interop)
table = Experiment().to_arrow()
# Arrow tables convert to polars (pl.from_arrow) or pandas (table.to_pandas());
# the conversion avoids copying where the column types allow it
```

---

## Performance Considerations

### Lazy Iteration

Iteration in 2.0 issues a single query instead of the N+1 queries used in 1.x:

```python
# Old (1.x): N+1 queries
#   1. fetch("KEY") gets ALL keys
#   2. fetch1() for EACH key

# New (2.0): Single query
#   Streams rows from one cursor
for row in table:
    ...
```
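
The difference in query count can be made concrete with a small experiment. The sketch below uses the standard library's `sqlite3` and a hypothetical `CountingConnection` wrapper to count statements for each pattern; it models the two access patterns described above and does not run any DataJoint code.

```python
import sqlite3

class CountingConnection:
    """Wrap a connection and count how many statements are executed."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def execute(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params)

raw = sqlite3.connect(":memory:")
raw.execute("CREATE TABLE trial (trial_id INTEGER PRIMARY KEY, score REAL)")
raw.executemany("INSERT INTO trial VALUES (?, ?)", [(1, 0.1), (2, 0.2), (3, 0.3)])

# Pattern described for 1.x: fetch all keys, then run one query per key.
db = CountingConnection(raw)
keys = [k for (k,) in db.execute("SELECT trial_id FROM trial ORDER BY trial_id")]
old_rows = [
    db.execute("SELECT * FROM trial WHERE trial_id = ?", (k,)).fetchone()
    for k in keys
]
old_query_count = db.queries

# Pattern described for 2.0: iterate over the rows of a single cursor.
db = CountingConnection(raw)
new_rows = [row for row in db.execute("SELECT * FROM trial ORDER BY trial_id")]
new_query_count = db.queries
```

Both patterns return the same rows, but for three rows the first issues four statements and the second issues one. The gap grows linearly with the number of rows.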

### Memory Efficiency

- `to_dicts()`: Returns full list in memory
- `for row in table:`: Streams one row at a time
- `to_arrays(limit=N)`: Fetches only N rows
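
When a full list is too large to hold in memory but per-row processing is too fine-grained, a row stream can be grouped into fixed-size batches. The following sketch uses only `itertools.islice`; `in_batches` is a hypothetical helper, and the generator expression stands in for `for row in table:`.

```python
from itertools import islice

def in_batches(rows, size):
    """Group any row iterable into lists of at most `size` rows, lazily."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Stand-in for iterating over a table; a generator, so rows are produced on demand.
stream = ({"trial_id": i} for i in range(1, 8))
batch_sizes = [len(batch) for batch in in_batches(stream, 3)]
```

At most `size` rows are held at once, so peak memory depends on the batch size rather than on the size of the table.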

### Format Selection

| Use Case | Recommended Method |
|----------|-------------------|
| Data analysis | `to_pandas()` or `to_polars()` |
| JSON API responses | `to_dicts()` |
| Numeric computation | `to_arrays()` |
| Large datasets | `for row in table:` (streaming) |
| Interop with other tools | `to_arrow()` |
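
For the JSON use case, a list of dicts can be passed to `json.dumps` directly, but values such as dates need a fallback encoder because the standard `json` module does not serialize them. A minimal sketch follows, with illustrative rows standing in for `to_dicts()` output and a hypothetical `encode_value` helper.

```python
import datetime
import json

# Stand-in for the result of to_dicts(); the column names are illustrative only.
rows = [{"exp_id": 1, "name": "pilot", "date": datetime.date(2024, 3, 1)}]

def encode_value(value):
    """Fallback for json.dumps: render dates in ISO 8601, reject anything else."""
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.isoformat()
    raise TypeError(f"cannot serialize {type(value).__name__}")

payload = json.dumps(rows, default=encode_value)
```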

---

## Error Messages

Calling a removed method raises an error whose message names the replacements:

```python
>>> table.fetch()
AttributeError: fetch() has been removed in DataJoint 2.0.
Use to_dicts(), to_pandas(), to_arrays(), or keys() instead.
See table.fetch.__doc__ for details.
```
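
One way to produce such a message is to keep the removed method as a stub that carries a docstring and raises on every call, which also explains how `table.fetch.__doc__` can remain available. The sketch below is an assumption about how this could be done, using a toy `Table` class; it is not DataJoint's actual implementation.

```python
class Table:
    """Toy stand-in for a query expression; not DataJoint's actual class."""

    def __init__(self, rows):
        self._rows = rows

    def to_dicts(self):
        """Return all rows as a list of dictionaries."""
        return [dict(row) for row in self._rows]

    def fetch(self, *args, **kwargs):
        """Removed in 2.0. Use to_dicts(), to_pandas(), to_arrays(), or keys()."""
        raise AttributeError(
            "fetch() has been removed in DataJoint 2.0. "
            "Use to_dicts(), to_pandas(), to_arrays(), or keys() instead. "
            "See table.fetch.__doc__ for details."
        )

table = Table([{"a": 1}])
try:
    table.fetch()
except AttributeError as error:
    message = str(error)
```

Because the stub is still an attribute of the class, looking up `table.fetch` succeeds and only the call raises.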

---

## Optional Dependencies

Install optional dependencies for additional output formats:

```bash
# For polars support
pip install datajoint[polars]

# For PyArrow support
pip install datajoint[arrow]

# For both
pip install datajoint[polars,arrow]
```
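
A method that depends on an optional package typically checks for it at call time and reports which extra to install. The following sketch shows that pattern with `importlib`; the helper `require_optional` and its message wording are hypothetical, and whether DataJoint guards its imports this way is not stated in this document.

```python
import importlib
import importlib.util

def require_optional(module_name, extra):
    """Import an optional module, or raise an ImportError naming the extra to install."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is required for this output format. "
            f"Install it with: pip install datajoint[{extra}]"
        )
    return importlib.import_module(module_name)

# `json` stands in for an installed optional dependency such as polars.
json_module = require_optional("json", "example")

try:
    require_optional("surely_not_an_installed_module_xyz", "polars")
except ImportError as error:
    hint = str(error)
```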

pyproject.toml

Lines changed: 8 additions & 0 deletions
@@ -87,18 +87,24 @@ test = [
     "requests",
     "graphviz",
     "testcontainers[mysql,minio]>=4.0",
+    "polars>=0.20.0",
+    "pyarrow>=14.0.0",
 ]

 [project.optional-dependencies]
 s3 = ["s3fs>=2023.1.0"]
 gcs = ["gcsfs>=2023.1.0"]
 azure = ["adlfs>=2023.1.0"]
+polars = ["polars>=0.20.0"]
+arrow = ["pyarrow>=14.0.0"]
 test = [
     "pytest",
     "pytest-cov",
     "requests",
     "s3fs>=2023.1.0",
     "testcontainers[mysql,minio]>=4.0",
+    "polars>=0.20.0",
+    "pyarrow>=14.0.0",
 ]
 dev = [
     "pre-commit",
@@ -107,6 +113,8 @@ dev = [
     # including test
     "pytest",
     "pytest-cov",
+    "polars>=0.20.0",
+    "pyarrow>=14.0.0",
 ]

 [tool.ruff]
