@dimitri-yatsenko (Member) commented Jan 7, 2026

Summary

DataJoint 2.0 is a major release that modernizes the entire codebase while maintaining backward compatibility for core functionality. This release focuses on extensibility, type safety, and developer experience.

Planning: DataJoint 2.0 Plan | Milestone 2.0

Major Features

Codec System (Extensible Types)

Replaces the adapter system with a modern, composable codec architecture:

  • Base codecs: <blob>, <json>, <attach>, <filepath>, <object>, <hash>
  • Chaining: Codecs can wrap other codecs (e.g., <blob> wraps <json> for external storage)
  • Auto-registration: Custom codecs register via __init_subclass__
  • Validation: Optional validate() method for type checking before insert

from datajoint import Codec

class MyCodec(Codec):
    python_type = MyClass
    dj_type = "<blob>"  # Storage format
    
    def encode(self, value): ...
    def decode(self, value): ...
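
The auto-registration bullet above relies on a standard Python hook. The sketch below is a toy stand-in, not DataJoint's Codec class: the `registry` dictionary, the `name` attribute, and `PointCodec` are hypothetical names chosen only to show how `__init_subclass__` can register a subclass at definition time and how an optional `validate()` can guard `encode()`.

```python
class Codec:
    """Toy base class illustrating the registration pattern only."""

    registry = {}  # hypothetical layout: codec name -> codec class

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once when each subclass is defined, so no register() call is needed.
        cls.registry[cls.name] = cls

    def validate(self, value):
        """Optional hook; the default accepts every value."""

    def encode(self, value):
        raise NotImplementedError

    def decode(self, stored):
        raise NotImplementedError


class PointCodec(Codec):
    name = "point"  # hypothetical attribute used as the registry key

    def validate(self, value):
        if not (isinstance(value, tuple) and len(value) == 2):
            raise TypeError("expected an (x, y) tuple")

    def encode(self, value):
        self.validate(value)  # reject bad input before it is stored
        return f"{value[0]},{value[1]}"

    def decode(self, stored):
        x, y = stored.split(",")
        return (float(x), float(y))


codec = Codec.registry["point"]()  # looked up by name, never registered by hand
round_trip = codec.decode(codec.encode((1.5, 2.0)))
```

Defining the subclass is the only step a user takes; the lookup by name works because the class body has already been evaluated when the hook runs.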

Semantic Matching

Attribute lineage tracking ensures joins only match semantically compatible attributes:

  • Attributes track their origin through foreign key inheritance
  • Joins require matching lineage (not just matching names)
  • Prevents accidental matches on generic names like id or name
  • semantic_check=False for legacy permissive behavior

# These join on subject_id because both inherit from Subject
Session * Recording  # ✓ Works - same lineage

# These fail because 'id' has different origins
TableA * TableB  # ✗ Fails - different lineage for 'id'
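
To make the lineage rule concrete, here is a minimal sketch that does not use DataJoint at all. It assumes each attribute carries a lineage string naming where it was first defined; the `Attr` class, the `schema.table.attribute` string format, and `join_attributes()` are hypothetical and only model the rule stated above: same name plus same lineage joins, same name with different lineage fails.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    name: str
    lineage: str  # hypothetical "schema.table.attribute" origin marker


def join_attributes(left, right):
    """Return the names two headings can be joined on, or raise on a lineage clash."""
    right_by_name = {a.name: a for a in right}
    matched = []
    for a in left:
        b = right_by_name.get(a.name)
        if b is None:
            continue  # the name appears on one side only, so it is not a join attribute
        if a.lineage != b.lineage:
            raise ValueError(
                f"'{a.name}' has different lineage: {a.lineage} vs {b.lineage}"
            )
        matched.append(a.name)
    return matched


session = [Attr("subject_id", "lab.subject.subject_id"),
           Attr("session", "lab.session.session")]
recording = [Attr("subject_id", "lab.subject.subject_id"),
             Attr("rec", "lab.recording.rec")]
table_a = [Attr("id", "lab.table_a.id")]
table_b = [Attr("id", "lab.table_b.id")]

shared = join_attributes(session, recording)  # ["subject_id"]
```

Calling `join_attributes(table_a, table_b)` raises `ValueError`, which mirrors the failing `TableA * TableB` case above.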

Primary Key Rules

Rigorous primary key propagation through all operators:

  • Join: Result PK based on functional dependencies (A→B, B→A, both, neither)
  • Aggregation: Groups by left operand's primary key
  • Projection: Preserves PK attributes, drops secondary
  • Universal set: dj.U('attr') creates ad-hoc grouping entities
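
The four join cases can be sketched under one reading of "A→B": every primary key attribute of B is also an attribute of A, so each row of A matches at most one row of B. That reading, the tie-break toward the left operand when both directions hold, and the function names below are assumptions made for illustration, not a statement of the library's exact rule.

```python
def determines(a_pk, a_secondary, b_pk):
    """A -> B in this sketch: all of B's primary key attributes appear in A."""
    return set(b_pk) <= set(a_pk) | set(a_secondary)


def join_primary_key(a_pk, a_secondary, b_pk, b_secondary):
    """Pick the primary key of A * B from the four dependency cases."""
    a_to_b = determines(a_pk, a_secondary, b_pk)
    b_to_a = determines(b_pk, b_secondary, a_pk)
    if a_to_b:
        # "A -> B only" and "both": A's key still identifies every result row.
        return list(a_pk)
    if b_to_a:
        return list(b_pk)
    # Neither: the result needs both keys, kept in order without duplicates.
    return list(a_pk) + [name for name in b_pk if name not in a_pk]


# Subject(subject_id; species) joined with Session(subject_id, session; date)
subject_session_pk = join_primary_key(
    ["subject_id"], ["species"], ["subject_id", "session"], ["date"]
)
unrelated_pk = join_primary_key(["a"], [], ["b"], [])
```

In the first call only the second operand determines the first, so the result keeps the finer-grained key `["subject_id", "session"]`; in the second call neither side determines the other and the keys are combined.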

AutoPopulate 2.0 (Jobs System)

Per-table job management with enhanced tracking:

  • Hidden metadata: ~~_job_timestamp and ~~_job_duration columns
  • Per-table jobs: Each computed table has its own ~~table_name job table
  • Schema.jobs: List all job tables in a schema
  • Progress tracking: table.progress() returns (remaining, total)
  • Priority scheduling: Jobs ordered by priority, then timestamp
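
The ordering and progress bullets can be illustrated without a database. This sketch assumes that a smaller priority number means a more urgent job, which the description above does not state; the job dictionaries and the standalone `progress()` helper are hypothetical stand-ins for rows of a job table and for `table.progress()`.

```python
from datetime import datetime

jobs = [
    {"key": 1, "priority": 5, "scheduled_time": datetime(2026, 1, 8, 10, 0)},
    {"key": 2, "priority": 1, "scheduled_time": datetime(2026, 1, 8, 11, 0)},
    {"key": 3, "priority": 1, "scheduled_time": datetime(2026, 1, 8, 9, 0)},
]

# Assumption: lower number = higher priority. Ties fall back to the timestamp.
ordered = sorted(jobs, key=lambda job: (job["priority"], job["scheduled_time"]))
order = [job["key"] for job in ordered]


def progress(all_keys, completed_keys):
    """Return (remaining, total), the tuple shape described for table.progress()."""
    total = len(all_keys)
    remaining = len(set(all_keys) - set(completed_keys))
    return remaining, total


status = progress({1, 2, 3, 4}, {1, 2})
```

Sorting on a `(priority, timestamp)` tuple gives the two-level ordering in one pass, because Python compares tuples element by element.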

Modern Fetch & Insert API

New fetch methods:

  • to_dicts() - List of dictionaries
  • to_pandas() - DataFrame with PK as index
  • to_arrays(*attrs) - NumPy arrays (structured or individual)
  • keys() - Primary keys only
  • fetch1() - Single row

Insert improvements:

Type Aliases

Core DataJoint types for portability:

| Alias | MySQL Type |
|---|---|
| int8, int16, int32, int64 | tinyint, smallint, int, bigint |
| uint8, uint16, uint32, uint64 | unsigned variants |
| float32, float64 | float, double |
| bool | tinyint |
| uuid | binary(16) |
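
As a rough illustration of the alias table, the mapping can be written as a lookup dictionary. The dictionary and the `to_mysql()` helper are hypothetical, and spelling the unsigned variants as `<type> unsigned` follows ordinary MySQL syntax rather than anything stated in this PR.

```python
import uuid

# Signed integers, floats, bool, and uuid, taken from the alias table above.
CORE_TYPES = {
    "int8": "tinyint",
    "int16": "smallint",
    "int32": "int",
    "int64": "bigint",
    "float32": "float",
    "float64": "double",
    "bool": "tinyint",
    "uuid": "binary(16)",
}

# Unsigned variants reuse the signed type name with MySQL's UNSIGNED modifier.
for bits in (8, 16, 32, 64):
    CORE_TYPES[f"uint{bits}"] = CORE_TYPES[f"int{bits}"] + " unsigned"


def to_mysql(alias):
    """Translate a core type alias into its MySQL column type."""
    return CORE_TYPES[alias]


# A UUID packs into exactly 16 bytes, which is why binary(16) is wide enough.
packed = uuid.UUID("12345678-1234-5678-1234-567812345678").bytes
```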

Object Storage

Content-addressed and object storage types:

  • <hash> - Content-addressed storage with deduplication
  • <object> - Named object storage (Zarr, folders)
  • <filepath> - Reference to managed files
  • <attach> - File attachments (uploaded on insert)
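
Content-addressed storage with deduplication can be sketched in a few lines. The in-memory `HashStore` class below is hypothetical, and SHA-256 is an assumed hash function; the PR description does not say which digest or backend the `<hash>` type uses.

```python
import hashlib


class HashStore:
    """Toy content-addressed store: the key is derived from the bytes themselves."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()  # assumed hash function
        # Identical content maps to the same key, so it is stored only once.
        self._objects.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._objects[digest]

    def __len__(self):
        return len(self._objects)


store = HashStore()
first = store.put(b"spike waveform")
second = store.put(b"spike waveform")  # duplicate content, no new object
other = store.put(b"different payload")
```

Because the address is a function of the content, inserting the same payload twice returns the same key and the store holds two objects, not three.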

Virtual Schema Infrastructure (#1307)

New schema introspection API for exploring existing databases:

  • Schema.get_table(name) - Direct table access with auto tier prefix detection
  • Schema['TableName'] - Bracket notation access
  • for table in schema - Iterate tables in dependency order
  • 'TableName' in schema - Check table existence
  • dj.virtual_schema() - Clean entry point for accessing schemas
  • dj.VirtualModule() - Virtual modules with custom names
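
A later commit message in this PR lists the tier prefixes (Manual: none, Lookup: #, Imported: _, Computed: __) and says the lookup checks an exact match first, then tries each prefix. The sketch below follows that description; the CamelCase-to-snake_case step and both function names are assumptions, and this is not the library's `_find_table_name()` helper.

```python
import re

# Manual, Lookup, Imported, Computed, as listed in the commit message.
TIER_PREFIXES = ("", "#", "_", "__")


def to_snake_case(name):
    """Assumed conversion: 'SpikeSorting' -> 'spike_sorting'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def find_table_name(name, existing):
    """Resolve a class-style or base name to a stored table name, or None."""
    if name in existing:
        return name  # exact match wins
    base = to_snake_case(name)
    for prefix in TIER_PREFIXES:
        candidate = prefix + base
        if candidate in existing:
            return candidate
    return None


tables = {"#subject", "_experiment", "__spike_sorting"}
lookup = find_table_name("Subject", tables)
imported = find_table_name("experiment", tables)
computed = find_table_name("SpikeSorting", tables)
missing = find_table_name("Missing", tables)
```

Returning `None` for an unknown name is what lets a membership test such as `'TableName' in schema` be built on the same lookup.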

CLI Improvements

The dj command-line interface for interactive exploration:

  • dj -s schema:alias - Load schemas as virtual modules
  • --host, --user, --password - Connection options
  • Fixed -h conflict with --help
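
The option layout above can be approximated with argparse. This is a sketch, not the actual CLI source: `parse_schema_spec()` and `build_parser()` are hypothetical names, and only the flags named in the list are defined. Leaving `--host` without a `-h` shorthand keeps argparse's built-in `-h/--help` intact.

```python
import argparse


def parse_schema_spec(spec):
    """Split 'schema:alias' and reject anything that does not fit that shape."""
    schema, sep, alias = spec.partition(":")
    if not sep or not schema or not alias:
        raise argparse.ArgumentTypeError(f"expected schema:alias, got {spec!r}")
    return schema, alias


def build_parser():
    parser = argparse.ArgumentParser(prog="dj")
    parser.add_argument("-s", dest="schemas", nargs="+", type=parse_schema_spec,
                        default=[], metavar="schema:alias",
                        help="schemas to load as virtual modules")
    # No -h shorthand here: argparse already reserves -h for --help.
    parser.add_argument("--host")
    parser.add_argument("--user")
    parser.add_argument("--password")
    return parser


args = build_parser().parse_args(["-s", "lab_db:lab", "--host", "localhost"])
```

After parsing, `args.schemas` holds `(schema, alias)` pairs that a REPL launcher could turn into one virtual module per alias.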

Settings Modernization

Pydantic-based configuration with validation:

  • Type-safe settings with automatic validation
  • dj.config.override() context manager
  • Secrets directory support (.secrets/)
  • Environment variable overrides (DJ_HOST, etc.)
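
The override and environment-variable behavior can be sketched with the standard library alone. The `Settings` dataclass below is a hypothetical stand-in for the Pydantic-based configuration, and `DJ_PORT` is an assumed variable name by analogy with `DJ_HOST`; only the general pattern is being illustrated.

```python
import os
from contextlib import contextmanager
from dataclasses import dataclass, fields


@dataclass
class Settings:
    host: str = "localhost"
    port: int = 3306

    @classmethod
    def from_env(cls, environ=None):
        """Build settings, letting DJ_<FIELD> variables override the defaults."""
        environ = os.environ if environ is None else environ
        values = {}
        for field in fields(cls):
            raw = environ.get(f"DJ_{field.name.upper()}")
            if raw is not None:
                # Cast the string to the type of the field's default value.
                values[field.name] = type(field.default)(raw)
        return cls(**values)

    @contextmanager
    def override(self, **changes):
        """Temporarily change settings and restore them on exit, even on error."""
        saved = {name: getattr(self, name) for name in changes}  # unknown names raise
        try:
            for name, value in changes.items():
                setattr(self, name, value)
            yield self
        finally:
            for name, value in saved.items():
                setattr(self, name, value)


config = Settings.from_env({"DJ_HOST": "db.example.org", "DJ_PORT": "3307"})
with config.override(host="test-db"):
    inside = config.host
after = config.host
```

The `try/finally` is the important part of the design: a failing test inside the `with` block cannot leak its temporary settings into later code.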

License Change

Changed from LGPL to Apache 2.0 license (#1235 (discussion)):

  • More permissive for commercial and academic use
  • Compatible with broader ecosystem of tools
  • Clearer patent grant provisions

Breaking Changes

Removed Support

API Changes

  • fetch() → to_dicts(), to_pandas(), to_arrays()
  • fetch(format='frame') → to_pandas()
  • fetch(as_dict=True) → to_dicts()
  • safemode=False → prompt=False

Semantic Changes

  • Joins now require lineage compatibility by default
  • Aggregation keeps non-matching rows by default (like LEFT JOIN)
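
The aggregation change is easiest to see on a small example. The sketch below uses plain Python lists in place of tables; `aggregate_count()` and its `keep_all_rows` flag are hypothetical and only contrast LEFT JOIN style counting with the stricter behavior that drops groups with no matches.

```python
subjects = [{"subject_id": 1}, {"subject_id": 2}, {"subject_id": 3}]
sessions = [
    {"subject_id": 1, "session": 1},
    {"subject_id": 1, "session": 2},
    {"subject_id": 2, "session": 1},
]


def aggregate_count(left, right, key, keep_all_rows=True):
    """Count matching right-hand rows for each left-hand row."""
    counts = {}
    for row in right:
        counts[row[key]] = counts.get(row[key], 0) + 1
    result = []
    for row in left:
        n = counts.get(row[key], 0)
        # keep_all_rows=True keeps rows with zero matches, like a LEFT JOIN.
        if n or keep_all_rows:
            result.append({**row, "n": n})
    return result


with_zeros = aggregate_count(subjects, sessions, "subject_id")
matches_only = aggregate_count(subjects, sessions, "subject_id", keep_all_rows=False)
```

With the default, subject 3 appears with a count of 0; with `keep_all_rows=False` it disappears from the result, which is the difference code written against the stricter behavior may need to account for.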

Documentation

Developer Documentation (this repo)

Comprehensive updates in docs/:

  • NumPy-style docstrings for all public APIs
  • Architecture guides for contributors
  • Auto-generated API reference via mkdocstrings

User Documentation (datajoint-docs)

Full documentation site following the Diátaxis framework:

Tutorials (learning-oriented, Jupyter notebooks):

  1. Getting Started - Installation, connection, first schema
  2. Schema Design - Table tiers, definitions, foreign keys
  3. Data Entry - Insert patterns, lookups, manual tables
  4. Queries - Restriction, projection, join, aggregation, fetch
  5. Computation - Computed tables, make(), populate patterns
  6. Object Storage - Blobs, attachments, external storage

How-To Guides (task-oriented):

  • Configure object storage, Design primary keys, Model relationships
  • Handle computation errors, Manage large datasets, Create custom codecs
  • Use the CLI, Migrate from 1.x

Reference (specifications):

  • Table Declaration, Query Algebra, Data Manipulation
  • Primary Keys, Semantic Matching, Type System, Virtual Schemas
  • Codec API, AutoPopulate, Fetch API, Job Metadata

Project Structure

Test Plan

  • 580+ integration tests pass
  • 80+ unit tests pass
  • Pre-commit hooks pass
  • Documentation builds successfully
  • Tutorials execute against test database

Closes

Milestone 2.0 Issues

Bug Fixes

Improvements

Related PRs

Migration Guide

See How to Migrate from 1.x for detailed migration instructions.


🤖 Generated with Claude Code

d-v-b and others added 30 commits August 29, 2025 10:09
update test workflow to use src layout
use pytest to manage docker container startup for tests
dimitri-yatsenko and others added 7 commits January 8, 2026 10:08
Simplify by always computing scheduled_time with MySQL server time,
removing the special case for delay=0.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Use setup-pixi action instead of manual setup
- Graphviz installed automatically via pixi conda dependency
- Testcontainers manages MySQL/MinIO containers automatically
- No manual pip install needed, pixi handles dependencies

Addresses reviewer feedback on PR #1312.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Make pixi the recommended development setup (matches CI)
- Add DOCKER_HOST note for macOS Docker Desktop users
- Keep pip as alternative for users who prefer it
- Update pre-commit commands to use pixi

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Allow lock file updates in CI since the lock file may be stale.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The test feature was not including s3fs and other test dependencies
because the feature-specific pypi-dependencies were missing.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add mypy configuration to pyproject.toml
- Add mypy hook to pre-commit with type stubs
- Start with lenient settings, strict checking for content_registry
- All other modules excluded until fully typed (gradual adoption)

Addresses #1266.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Run unit tests locally before commit to catch issues before CI.
Addresses feedback from @drewyangdev on #1211.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@dimitri-yatsenko added the "breaking" (Not backward compatible changes) label Jan 8, 2026
feat: dj.Top order inheritance, part_integrity parameter, and storage fixes
@github-actions bot removed the "breaking" (Not backward compatible changes) label Jan 8, 2026
dimitri-yatsenko and others added 5 commits January 8, 2026 13:48
Update test_top_restriction_with_keywords to verify that dj.Top
properly preserves ordering in fetch results. Use secondary sort
by 'id' to ensure deterministic results when there are ties.

Fixes #1205

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add Schema.get_table() for direct table access
- Add Schema.__getitem__ for bracket notation: schema['TableName']
- Add Schema.__iter__ to iterate over all tables
- Add Schema.__contains__ for 'TableName' in schema
- Add dj.virtual_schema() as clean entry point
- Remove create_virtual_module (breaking change)
- Fix gc.py to use get_table() instead of spawn_table()
- Remove specs/ folder (moved to datajoint-docs)
- Add comprehensive tests for virtual schema infrastructure

Fixes #1307

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The pre-commit config has been modernized to use ruff instead of
flake8. Update the SKIP example comment accordingly.

Closes #1271

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add type annotations to errors.py (suggest method)
- Add type annotations to hash.py (key_hash, uuid_from_buffer)
- Enable strict mypy checking for these modules
- Now 3 modules under strict checking: content_registry, errors, hash

Increases type coverage incrementally following gradual adoption strategy.

Related #1266

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Change sync-labels from true to false in PR labeler workflow.
This prevents the GitHub Actions labeler from removing manually
added labels like "breaking" when they don't match the automatic
labeling rules.

With sync-labels: true, the action removes any labels not matched
by the configuration. With sync-labels: false, it only adds labels
based on patterns and preserves manually added labels.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@dimitri-yatsenko added the "breaking" (Not backward compatible changes) label Jan 8, 2026
dimitri-yatsenko and others added 10 commits January 8, 2026 14:58
Update PyPI keywords to reflect DataJoint 2.0 positioning and
modern data engineering terminology:

Added:
- data-engineering, data-pipelines, workflow-management
- data-integrity, reproducibility, declarative
- object-storage, schema-management, data-lineage
- scientific-computing, research-software
- postgresql (upcoming support)

Removed:
- Generic terms: database, automated, automation, compute, data
- Redundant terms: pipeline, workflow, scientific, science, research
- Domain-specific: bioinformatics (kept neuroscience as primary)

Updated GitHub repository topics to match (18 topics total).

Focuses on searchable terms, 2.0 features, and differentiators.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The get_table(), __getitem__, and __contains__ methods now auto-detect
table tier prefixes (Manual: none, Lookup: #, Imported: _, Computed: __).

This allows users to access tables by their base name without knowing
the tier prefix:
  - schema.get_table("experiment") finds "_experiment" (Imported)
  - schema["Subject"] finds "#subject" (Lookup)
  - "Experiment" in schema returns True

Added _find_table_name() helper that checks exact match first, then
tries each tier prefix.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Replace deprecated fetch() calls with to_dicts() in test_virtual_module.py:
- test_virtual_schema_tables_are_queryable: use lab.Experiment().to_dicts()
- test_getitem_is_queryable: use table.to_dicts()

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The create_virtual_module function was removed in 2.0. Update the CLI
to use dj.virtual_schema() for loading schemas via the -s flag.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
VirtualModule allows specifying both module name and schema name,
while virtual_schema() uses schema name for both. The CLI needs
custom module names for the -s flag, so use VirtualModule directly.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove -h shorthand for --host (conflicts with argparse --help)
- Add module-level docstring with usage examples
- Improve function docstring with NumPy style
- Add explicit error handling for invalid schema format
- Improve banner message with version and usage hint
- Use modern type hints (list[str] | None)
- Fix locals() issue: explicitly include dj in REPL namespace

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Replace -h shorthand with --host (removed to avoid -h/--help conflict)
- Use separate arguments instead of concatenated form
- Use prefix variable for schema name consistency
- Fix assertion string matching

Co-Authored-By: Claude Opus 4.5 <[email protected]>
feat: virtual schema infrastructure and CI improvements
@github-actions bot removed the "breaking" (Not backward compatible changes) label Jan 9, 2026
@dimitri-yatsenko added the "breaking" (Not backward compatible changes) label Jan 9, 2026
@dimitri-yatsenko self-assigned this Jan 9, 2026
Labels

  • breaking - Not backward compatible changes
  • documentation - Issues related to documentation
  • enhancement - Indicates new improvements

5 participants