DataJoint 2.0 #1311

dimitri-yatsenko · 2026-01-07T16:28:35Z

Summary

DataJoint 2.0 is a major release that modernizes the entire codebase while maintaining backward compatibility for core functionality. This release focuses on extensibility, type safety, and developer experience.

Planning: DataJoint 2.0 Plan | Milestone 2.0

Major Features

Codec System (Extensible Types)

Replaces the adapter system with a modern, composable codec architecture:

Base codecs: <blob>, <json>, <attach>, <filepath>, <object>, <hash>
Chaining: Codecs can wrap other codecs (e.g., <blob> wraps <json> for external storage)
Auto-registration: Custom codecs register via __init_subclass__
Validation: Optional validate() method for type checking before insert

from datajoint import Codec

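class MyCodec(Codec):
    python_type = MyClass  # placeholder for the user-defined class this codec handles
    dj_type = "<blob>"     # storage format

    def encode(self, value): ...  # MyClass instance -> value to store
    def decode(self, value): ...  # stored value -> MyClass instance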
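
The auto-registration and validation bullets above can be sketched with plain Python. This is a minimal illustration, not the datajoint implementation: the `Codec` base class below is a toy stand-in, and the `registry` dictionary, the `name` attribute, and `ComplexCodec` are hypothetical names chosen for the example.

```python
class Codec:
    """Toy stand-in for a codec base class (NOT datajoint.Codec)."""

    # Hypothetical registry: maps a codec name to its class.
    registry: dict[str, type["Codec"]] = {}
    name: str = ""

    def __init_subclass__(cls, **kwargs):
        # Runs once for every subclass definition, so defining a codec
        # is enough to register it; no explicit register() call needed.
        super().__init_subclass__(**kwargs)
        if cls.name:
            Codec.registry[cls.name] = cls

    def validate(self, value) -> None:
        """Optional hook: raise if the value cannot be encoded."""

    def encode(self, value):
        raise NotImplementedError

    def decode(self, stored):
        raise NotImplementedError


class ComplexCodec(Codec):
    """Hypothetical codec storing a complex number as [real, imag]."""

    name = "complex_pair"

    def validate(self, value) -> None:
        if not isinstance(value, complex):
            raise TypeError(f"expected complex, got {type(value).__name__}")

    def encode(self, value):
        return [value.real, value.imag]

    def decode(self, stored):
        return complex(stored[0], stored[1])


# Look the codec up by name, validate, then round-trip a value.
codec = Codec.registry["complex_pair"]()
codec.validate(1 + 2j)
assert codec.decode(codec.encode(1 + 2j)) == 1 + 2j
```

Using `__init_subclass__` keeps registration declarative: importing the module that defines a codec is what makes it available.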

Semantic Matching

Attribute lineage tracking ensures joins only match semantically compatible attributes:

Attributes track their origin through foreign key inheritance
Joins require matching lineage (not just matching names)
Prevents accidental matches on generic names like id or name
Set semantic_check=False to restore the legacy permissive behavior (matching on names alone)

# These join on subject_id because both inherit from Subject
Session * Recording  # ✓ Works - same lineage

# These fail because 'id' has different origins
TableA * TableB  # ✗ Fails - different lineage for 'id'
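
The lineage rule can be modeled in a few lines. The sketch below is a toy model under stated assumptions, not DataJoint's implementation: each table is reduced to a dictionary from attribute name to a lineage string, and `join_attributes`, `LineageError`, and the `"table.attribute"` lineage format are hypothetical.

```python
class LineageError(Exception):
    """Raised when a shared attribute name has different origins."""


def join_attributes(left, right, semantic_check=True):
    """Return the attribute names two tables would be joined on.

    `left` and `right` map attribute name -> lineage (the attribute's
    origin, written here as "table.attribute" purely for illustration).
    """
    shared = [name for name in left if name in right]
    if semantic_check:
        for name in shared:
            if left[name] != right[name]:
                raise LineageError(
                    f"'{name}' has different lineage: "
                    f"{left[name]} vs {right[name]}"
                )
    return shared


# Both tables inherit subject_id from Subject, so the join is allowed.
session = {"subject_id": "subject.subject_id",
           "session_date": "session.session_date"}
recording = {"subject_id": "subject.subject_id",
             "recording_id": "recording.recording_id"}
print(join_attributes(session, recording))

# A generic 'id' declared independently in two tables does not match.
table_a = {"id": "table_a.id"}
table_b = {"id": "table_b.id"}
try:
    join_attributes(table_a, table_b)
except LineageError as err:
    print("rejected:", err)

# With the check disabled, matching falls back to names alone.
print(join_attributes(table_a, table_b, semantic_check=False))
```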

Primary Key Rules

Rigorous primary key propagation through all operators:

Join: Result PK based on functional dependencies (A→B, B→A, both, neither)
Aggregation: Groups by left operand's primary key
Projection: Preserves PK attributes, drops secondary
Universal set: dj.U('attr') creates ad-hoc grouping entities
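
One way to read the four join cases is sketched below. The PR text only names the cases, so the details here are assumptions: "A→B" is taken to mean that every primary key attribute of B appears among A's attributes, the left operand is preferred when both directions hold, and the union of both keys is used when neither holds. `join_primary_key` is a hypothetical helper, not a datajoint function.

```python
def join_primary_key(pk_a, attrs_a, pk_b, attrs_b):
    """Primary key of A * B under the assumptions stated above."""
    a_determines_b = set(pk_b) <= set(attrs_a)  # A -> B
    b_determines_a = set(pk_a) <= set(attrs_b)  # B -> A
    if a_determines_b:
        # Covers both "A -> B only" and "both"; the left key is kept.
        return list(pk_a)
    if b_determines_a:
        return list(pk_b)
    # Neither determines the other: combine the keys, preserving order.
    return list(pk_a) + [name for name in pk_b if name not in pk_a]


subject_pk = ["subject_id"]
subject_attrs = ["subject_id", "species"]
session_pk = ["subject_id", "session_id"]
session_attrs = ["subject_id", "session_id", "session_date"]

# Each session identifies one subject, so the session key survives
# regardless of operand order.
print(join_primary_key(session_pk, session_attrs, subject_pk, subject_attrs))
print(join_primary_key(subject_pk, subject_attrs, session_pk, session_attrs))

# Unrelated tables: the result is keyed by both primary keys.
print(join_primary_key(["a"], ["a", "x"], ["b"], ["b", "y"]))
```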

AutoPopulate 2.0 (Jobs System)

Per-table job management with enhanced tracking:

Hidden metadata: ~~_job_timestamp and ~~_job_duration columns
Per-table jobs: Each computed table has its own ~~table_name job table
Schema.jobs: List all job tables in a schema
Progress tracking: table.progress() returns (remaining, total)
Priority scheduling: Jobs ordered by priority, then timestamp
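
The ordering and progress bullets can be illustrated without a database. The sketch below is a toy model: the `Job` dataclass, `order_jobs`, and `progress` are hypothetical, and it assumes that a lower priority number runs first, which the description does not state.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Job:
    key: int                  # stand-in for the primary key of the pending row
    priority: int             # ASSUMPTION: lower value = more urgent
    scheduled_time: datetime


def order_jobs(jobs):
    """Order jobs by priority, then by timestamp."""
    return sorted(jobs, key=lambda job: (job.priority, job.scheduled_time))


def progress(all_keys, done_keys):
    """Return (remaining, total), mirroring the shape described above."""
    total = set(all_keys)
    return len(total - set(done_keys)), len(total)


t0 = datetime(2026, 1, 8, 10, 0)
jobs = [
    Job(key=1, priority=5, scheduled_time=t0),
    Job(key=2, priority=1, scheduled_time=t0 + timedelta(minutes=5)),
    Job(key=3, priority=1, scheduled_time=t0),
]
print([job.key for job in order_jobs(jobs)])  # [3, 2, 1]
print(progress({1, 2, 3, 4}, {1, 2}))         # (2, 4)
```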

Modern Fetch & Insert API

New fetch methods:

to_dicts() - List of dictionaries
to_pandas() - DataFrame with PK as index
to_arrays(*attrs) - NumPy arrays (structured or individual)
keys() - Primary keys only
fetch1() - Single row

Insert improvements:

validate() - Check rows before inserting
chunk_size - Batch large inserts
insert_dataframe() - DataFrame with index handling
Empty inserts for tables with all-default attributes (IMPR: Error specificity on empty insert for table with default values #1280)
Polars and PyArrow support
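
The idea behind chunk_size can be sketched as a generic batching helper. This is an illustration of the batching pattern only: `chunked` and `insert_in_chunks` are hypothetical names, and `insert_batch` stands in for whatever call actually writes one batch of rows.

```python
from itertools import islice


def chunked(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    iterator = iter(rows)
    while batch := list(islice(iterator, chunk_size)):
        yield batch


def insert_in_chunks(rows, insert_batch, chunk_size=1000):
    """Send rows to insert_batch one chunk at a time; return the row count."""
    inserted = 0
    for batch in chunked(rows, chunk_size):
        insert_batch(batch)
        inserted += len(batch)
    return inserted


# Collect the batches instead of writing to a database.
batches = []
count = insert_in_chunks(({"id": i} for i in range(5)), batches.append, chunk_size=2)
print(count, [len(batch) for batch in batches])  # 5 [2, 2, 1]
```

Because `chunked` consumes an iterator lazily, a large generator of rows never has to be held in memory all at once.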

Type Aliases

Core DataJoint types for portability:

Alias	MySQL Type
int8, int16, int32, int64	tinyint, smallint, int, bigint
uint8, uint16, uint32, uint64	tinyint unsigned, smallint unsigned, int unsigned, bigint unsigned
float32, float64	float, double
bool	tinyint
uuid	binary(16)
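
The table above amounts to a lookup from alias to MySQL type. A minimal sketch of that lookup follows; `TYPE_ALIASES` and `resolve_type` are hypothetical names, and passing unknown types through unchanged is an assumption made for the example.

```python
# Alias -> MySQL type, transcribed from the table above.
TYPE_ALIASES = {
    "int8": "tinyint",
    "int16": "smallint",
    "int32": "int",
    "int64": "bigint",
    "uint8": "tinyint unsigned",
    "uint16": "smallint unsigned",
    "uint32": "int unsigned",
    "uint64": "bigint unsigned",
    "float32": "float",
    "float64": "double",
    "bool": "tinyint",
    "uuid": "binary(16)",
}


def resolve_type(declared):
    """Translate a portable alias; leave any other type untouched."""
    cleaned = declared.strip()
    return TYPE_ALIASES.get(cleaned.lower(), cleaned)


print(resolve_type("uint16"))       # smallint unsigned
print(resolve_type("uuid"))         # binary(16)
print(resolve_type("varchar(32)"))  # varchar(32)
```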

Object Storage

Content-addressed and object storage types:

<hash> - Content-addressed storage with deduplication
<object> - Named object storage (Zarr, folders)
<filepath> - Reference to managed files
<attach> - File attachments (uploaded on insert)
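
Content-addressed storage with deduplication can be sketched as follows. This is an in-memory toy, not the <hash> implementation: `ContentStore` is a hypothetical class, and SHA-256 is chosen only for illustration since the description does not name the hash function.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is the hash of the bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # ASSUMPTION: SHA-256 as the content hash.
        digest = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored once.
        self._objects.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._objects[digest]

    def __len__(self):
        return len(self._objects)


store = ContentStore()
first = store.put(b"spike train")
second = store.put(b"spike train")  # same bytes, same address
other = store.put(b"calcium trace")

assert first == second
assert len(store) == 2
assert store.get(first) == b"spike train"
```

Because the address is derived from the content, a table row only needs to hold the digest, and repeated inserts of the same payload cost no extra storage.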

Virtual Schema Infrastructure (#1307)

New schema introspection API for exploring existing databases:

Schema.get_table(name) - Direct table access with auto tier prefix detection
Schema['TableName'] - Bracket notation access
for table in schema - Iterate tables in dependency order
'TableName' in schema - Check table existence
dj.virtual_schema() - Clean entry point for accessing schemas
dj.VirtualModule() - Virtual modules with custom names

CLI Improvements

The dj command-line interface for interactive exploration:

dj -s schema:alias - Load schemas as virtual modules
--host, --user, --password - Connection options
Removed the -h shorthand for --host, which conflicted with --help
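
The option layout above can be sketched with argparse. This is an illustration rather than the actual cli.py: the long form `--schemas`, the `nargs="+"` choice, and the `parse_schema_spec` helper are assumptions; only `-s schema:alias`, `--host`, `--user`, and `--password` come from the description.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="dj", description="Interactive exploration (sketch)."
    )
    # -h is left to argparse's built-in --help, so --host has no short form.
    parser.add_argument("-s", "--schemas", nargs="+", default=[],
                        metavar="schema:alias",
                        help="schemas to load as virtual modules")
    parser.add_argument("--host")
    parser.add_argument("--user")
    parser.add_argument("--password")
    return parser


def parse_schema_spec(spec):
    """Split 'schema:alias' and reject malformed input explicitly."""
    schema, sep, alias = spec.partition(":")
    if not sep or not schema or not alias:
        raise ValueError(f"expected 'schema:alias', got {spec!r}")
    return schema, alias


args = build_parser().parse_args(["-s", "lab_db:lab", "--host", "localhost"])
print(args.host)                                          # localhost
print([parse_schema_spec(spec) for spec in args.schemas])  # [('lab_db', 'lab')]
```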

Settings Modernization

Pydantic-based configuration with validation:

Type-safe settings with automatic validation
dj.config.override() context manager
Secrets directory support (.secrets/)
Environment variable overrides (DJ_HOST, etc.)
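
The override context manager and the environment variable overrides can be sketched with the standard library alone. This toy `Settings` class is hypothetical and does not use Pydantic, so it performs no type validation; the `DJ_` prefix follows the DJ_HOST example above, and passing `environ` explicitly is a choice made to keep the example deterministic.

```python
import os
from contextlib import contextmanager


class Settings:
    """Toy settings object with env overrides and temporary overrides."""

    def __init__(self, defaults, environ=None):
        environ = os.environ if environ is None else environ
        self._values = dict(defaults)
        for name in self._values:
            # DJ_HOST overrides "host", DJ_PORT overrides "port", etc.
            # Values from the environment stay strings in this sketch.
            value = environ.get(f"DJ_{name.upper()}")
            if value is not None:
                self._values[name] = value

    def __getitem__(self, name):
        return self._values[name]

    @contextmanager
    def override(self, **changes):
        unknown = set(changes) - set(self._values)
        if unknown:
            raise KeyError(f"unknown settings: {sorted(unknown)}")
        saved = dict(self._values)
        self._values.update(changes)
        try:
            yield self
        finally:
            # Restore the previous values even if the body raised.
            self._values = saved


config = Settings({"host": "localhost", "port": 3306},
                  environ={"DJ_HOST": "db.example.org"})
print(config["host"])          # db.example.org

with config.override(port=3307):
    print(config["port"])      # 3307
print(config["port"])          # 3306
```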

License Change

Changed from LGPL to Apache 2.0 license (#1235 (discussion)):

More permissive for commercial and academic use
Compatible with broader ecosystem of tools
Clearer patent grant provisions

Breaking Changes

Removed Support

Python 3.8, 3.9 (minimum 3.10)
MySQL 5.x (minimum 8.0)
Legacy fetch() with format parameter
safemode parameter (use prompt)
Adapter API (use Codec)
create_virtual_module (use dj.virtual_schema() or dj.VirtualModule())
~log table (IMPR: Deprecate and Remove the ~log Table. #1298)
otumat support (IMPR: Deprecate otumat support #1252)

API Changes

fetch() → to_dicts(), to_pandas(), to_arrays()
fetch(format='frame') → to_pandas()
fetch(as_dict=True) → to_dicts()
safemode=False → prompt=False

Semantic Changes

Joins now require lineage compatibility by default
Aggregation keeps non-matching rows by default (like LEFT JOIN)
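
The aggregation change can be illustrated with a toy count. The sketch below models only the semantics (group by the left operand's key and keep left rows that have no match); `aggregate_count` is a hypothetical helper, not part of datajoint.

```python
from collections import Counter


def aggregate_count(left_keys, right_rows, key):
    """Count right-hand rows per left key, keeping unmatched keys at 0."""
    counts = Counter(row[key] for row in right_rows)
    # Iterating over the left keys is what gives LEFT JOIN behavior:
    # every left entry appears in the result, matched or not.
    return {left_key: counts.get(left_key, 0) for left_key in left_keys}


subjects = [1, 2, 3]
sessions = [{"subject_id": 1}, {"subject_id": 1}, {"subject_id": 3}]
print(aggregate_count(subjects, sessions, "subject_id"))  # {1: 2, 2: 0, 3: 1}
```

Subject 2 has no sessions but still appears with a count of 0, which is the row an inner-join style aggregation would have dropped.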

Documentation

Developer Documentation (this repo)

Comprehensive updates in docs/:

NumPy-style docstrings for all public APIs
Architecture guides for contributors
Auto-generated API reference via mkdocstrings

User Documentation (datajoint-docs)

Full documentation site following the Diátaxis framework:

Tutorials (learning-oriented, Jupyter notebooks):

Getting Started - Installation, connection, first schema
Schema Design - Table tiers, definitions, foreign keys
Data Entry - Insert patterns, lookups, manual tables
Queries - Restriction, projection, join, aggregation, fetch
Computation - Computed tables, make(), populate patterns
Object Storage - Blobs, attachments, external storage

How-To Guides (task-oriented):

Configure object storage, Design primary keys, Model relationships
Handle computation errors, Manage large datasets, Create custom codecs
Use the CLI, Migrate from 1.x

Reference (specifications):

Table Declaration, Query Algebra, Data Manipulation
Primary Keys, Semantic Matching, Type System, Virtual Schemas
Codec API, AutoPopulate, Fetch API, Job Metadata

Project Structure

src/ layout for proper packaging (IMPR: src layout #1267)
Testcontainers for pytest-managed containers
Pre-commit hooks: ruff, mypy, unit tests (IMPR: Modernize pre-commit #1271)
GitHub Actions CI/CD
Split unit/integration tests (IMPR: split unit/integration test #1211)

Test Plan

580+ integration tests pass
80+ unit tests pass
Pre-commit hooks pass
Documentation builds successfully
Tutorials execute against test database

Closes

Related PRs

datajoint-docs PR #97 - DataJoint 2.0 Documentation
datajoint-docs PR #98 - Virtual schemas spec and CLI docs

Migration Guide

See How to Migrate from 1.x for detailed migration instructions.

🤖 Generated with Claude Code

- Use setup-pixi action instead of manual setup
- Graphviz installed automatically via pixi conda dependency
- Testcontainers manages MySQL/MinIO containers automatically
- No manual pip install needed, pixi handles dependencies

Addresses reviewer feedback on PR #1312.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Make pixi the recommended development setup (matches CI)
- Add DOCKER_HOST note for macOS Docker Desktop users
- Keep pip as alternative for users who prefer it
- Update pre-commit commands to use pixi

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Add mypy configuration to pyproject.toml
- Add mypy hook to pre-commit with type stubs
- Start with lenient settings, strict checking for content_registry
- All other modules excluded until fully typed (gradual adoption)

Addresses #1266.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Update test_top_restriction_with_keywords to verify that dj.Top properly preserves ordering in fetch results. Use secondary sort by 'id' to ensure deterministic results when there are ties. Fixes #1205 Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Add Schema.get_table() for direct table access
- Add Schema.__getitem__ for bracket notation: schema['TableName']
- Add Schema.__iter__ to iterate over all tables
- Add Schema.__contains__ for 'TableName' in schema
- Add dj.virtual_schema() as clean entry point
- Remove create_virtual_module (breaking change)
- Fix gc.py to use get_table() instead of spawn_table()
- Remove specs/ folder (moved to datajoint-docs)
- Add comprehensive tests for virtual schema infrastructure

Fixes #1307

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Add type annotations to errors.py (suggest method)
- Add type annotations to hash.py (key_hash, uuid_from_buffer)
- Enable strict mypy checking for these modules
- Now 3 modules under strict checking: content_registry, errors, hash

Increases type coverage incrementally following gradual adoption strategy.

Related #1266

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Change sync-labels from true to false in PR labeler workflow. This prevents the GitHub Actions labeler from removing manually added labels like "breaking" when they don't match the automatic labeling rules. With sync-labels: true, the action removes any labels not matched by the configuration. With sync-labels: false, it only adds labels based on patterns and preserves manually added labels. Co-Authored-By: Claude Opus 4.5 <[email protected]>

Update PyPI keywords to reflect DataJoint 2.0 positioning and modern data engineering terminology:

Added:
- data-engineering, data-pipelines, workflow-management
- data-integrity, reproducibility, declarative
- object-storage, schema-management, data-lineage
- scientific-computing, research-software
- postgresql (upcoming support)

Removed:
- Generic terms: database, automated, automation, compute, data
- Redundant terms: pipeline, workflow, scientific, science, research
- Domain-specific: bioinformatics (kept neuroscience as primary)

Updated GitHub repository topics to match (18 topics total). Focuses on searchable terms, 2.0 features, and differentiators.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

The get_table(), __getitem__, and __contains__ methods now auto-detect table tier prefixes (Manual: none, Lookup: #, Imported: _, Computed: __). This allows users to access tables by their base name without knowing the tier prefix:

- schema.get_table("experiment") finds "_experiment" (Imported)
- schema["Subject"] finds "#subject" (Lookup)
- "Experiment" in schema returns True

Added _find_table_name() helper that checks exact match first, then tries each tier prefix.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
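
The lookup order described in this commit message (exact match first, then each tier prefix) might look roughly like the sketch below. It is not the actual _find_table_name(): the function name, the set-based `existing` argument, and the CamelCase to snake_case conversion are assumptions, the last one inferred from the "Subject" finds "#subject" example.

```python
import re

# Tier prefixes from the commit message: Manual, Lookup, Imported, Computed.
TIER_PREFIXES = ("", "#", "_", "__")


def to_snake_case(name):
    """Convert CamelCase to snake_case (ASSUMED naming convention)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def find_table_name(name, existing):
    """Return the stored table name matching `name`, or None."""
    if name in existing:          # 1. exact match wins
        return name
    base = to_snake_case(name)
    for prefix in TIER_PREFIXES:  # 2. then try each tier prefix in turn
        candidate = prefix + base
        if candidate in existing:
            return candidate
    return None


tables = {"session", "#subject", "_experiment", "__spike_sorting"}
print(find_table_name("experiment", tables))    # _experiment
print(find_table_name("Subject", tables))       # #subject
print(find_table_name("SpikeSorting", tables))  # __spike_sorting
print(find_table_name("Missing", tables))       # None
```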

Replace deprecated fetch() calls with to_dicts() in test_virtual_module.py:
- test_virtual_schema_tables_are_queryable: use lab.Experiment().to_dicts()
- test_getitem_is_queryable: use table.to_dicts()

Co-Authored-By: Claude Opus 4.5 <[email protected]>

VirtualModule allows specifying both module name and schema name, while virtual_schema() uses schema name for both. The CLI needs custom module names for the -s flag, so use VirtualModule directly. Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Remove -h shorthand for --host (conflicts with argparse --help)
- Add module-level docstring with usage examples
- Improve function docstring with NumPy style
- Add explicit error handling for invalid schema format
- Improve banner message with version and usage hint
- Use modern type hints (list[str] | None)
- Fix locals() issue: explicitly include dj in REPL namespace

Co-Authored-By: Claude Opus 4.5 <[email protected]>

- Replace -h shorthand with --host (removed to avoid -h/--help conflict)
- Use separate arguments instead of concatenated form
- Use prefix variable for schema name consistency
- Fix assertion string matching

Co-Authored-By: Claude Opus 4.5 <[email protected]>

d-v-b and others added 30 commits August 29, 2025 10:09

use src layout

139258d

Merge pull request #1268 from d-v-b/impr/src-layout

049310c

use src layout

use pytest to manage docker container startup for tests

ebeab88

fix environment variable mismatch

718e219

add database.port to settings.py, and update conftest

1252add

revert python version floor increment

fecbb83

use normal healthcheck intervals

76aaf5e

revert change to healthcheck, because the nanoseconds were correct

c68f1df

update gitignore

3a34d82

add pixi lockfile

f73d7c7

update .gitattributes for pixi

de13442

lint

64fc0ac

add astroid exemption to codespell rc

b3f82d9

spruce up linting workflow

a321f92

lint with ruff

1aa30f4

update linting workflow

de4ce27

update test workflow to use src layout

59d0159

update test workflow to use src layout

85ff041

update hook invocations to use src layout

2007f33

Merge pull request #1274 from d-v-b/fix/unbreak-test-workflow

9fcb25a

update test workflow to use src layout

Merge pull request #1269 from d-v-b/feat/pytest-container-management

896e6cd

use pytest to manage docker container startup for tests

simplify devcontainer

b3b712b

update deps, and add activate script for dot

a506d40

refactor test fixtures

88ca4dc

skip multiprocessing tests on osx

a30d41b

skip c901 check

d631b8b

update pre-commit

f45e7c8

Merge pull request #1279 from d-v-b/chore/dev-env-fixes

4893e3d

Chore/dev env fixes

Merge branch 'pre/v2.0' of https://github.com/datajoint/datajoint-python

66c9ebd

into impr/modernize-pre-commit

more linting

b00a4f0

dimitri-yatsenko and others added 7 commits January 8, 2026 10:08

refactor(jobs): always use NOW(3) + INTERVAL for scheduled_time

2100487

Simplify by always computing scheduled_time with MySQL server time, removing the special case for delay=0. Co-Authored-By: Claude Opus 4.5 <[email protected]>

ci: disable locked mode for pixi install

272fcb5

Allow lock file updates in CI since the lock file may be stale. Co-Authored-By: Claude Opus 4.5 <[email protected]>

fix(pixi): add test extras to feature-specific pypi-dependencies

b8645f8

The test feature was not including s3fs and other test dependencies because the feature-specific pypi-dependencies were missing. Co-Authored-By: Claude Opus 4.5 <[email protected]>

feat: add unit tests to pre-commit hooks

f195110

Run unit tests locally before commit to catch issues before CI. Addresses feedback from @drewyangdev on #1211. Co-Authored-By: Claude Opus 4.5 <[email protected]>

dimitri-yatsenko added the breaking Not backward compatible changes label Jan 8, 2026

Merge PR #1312: DataJoint 2.0 - Jobs 2.0, CI, part_integrity, and more

46333e0

feat: dj.Top order inheritance, part_integrity parameter, and storage fixes

github-actions bot removed the breaking Not backward compatible changes label Jan 8, 2026

dimitri-yatsenko and others added 5 commits January 8, 2026 13:48

chore: update pre-commit comment to reference ruff

a6bc04b

The pre-commit config has been modernized to use ruff instead of flake8. Update the SKIP example comment accordingly. Closes #1271 Co-Authored-By: Claude Opus 4.5 <[email protected]>

dimitri-yatsenko added the breaking Not backward compatible changes label Jan 8, 2026

dimitri-yatsenko and others added 10 commits January 8, 2026 14:58

fix: update CLI to use virtual_schema instead of create_virtual_module

612511d

The create_virtual_module function was removed in 2.0. Update the CLI to use dj.virtual_schema() for loading schemas via the -s flag. Co-Authored-By: Claude Opus 4.5 <[email protected]>

style: format cli.py with ruff

3c8258b

style: format test_cli.py with ruff

ef66992

Merge pull request #1313 from datajoint/virtual-modules

c1b36f0

feat: virtual schema infrastructure and CI improvements

github-actions bot removed the breaking Not backward compatible changes label Jan 9, 2026

dimitri-yatsenko added the breaking Not backward compatible changes label Jan 9, 2026

dimitri-yatsenko self-assigned this Jan 9, 2026

dimitri-yatsenko mentioned this pull request Jan 9, 2026

File-Augmented Schema #1151

Closed
