DataJoint 2.0 #1311
Draft
dimitri-yatsenko wants to merge 226 commits into master from pre/v2.0
use src layout
update test workflow to use src layout
use pytest to manage docker container startup for tests
Chore/dev env fixes
Simplify by always computing scheduled_time with MySQL server time, removing the special case for delay=0. Co-Authored-By: Claude Opus 4.5 <[email protected]>
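A minimal sketch of the idea in this commit, assuming the scheduled time is produced by a single SQL expression evaluated on the server. The function name `scheduled_time_sql` and the exact expression are hypothetical, not taken from the DataJoint source; the point is only that a zero delay needs no separate branch, because adding a zero interval to the server clock yields the server's current time.

```python
def scheduled_time_sql(delay_seconds: float) -> tuple[str, tuple[float]]:
    """Return a (hypothetical) SQL expression and its parameters.

    The same expression is used for every delay, including 0, so the
    timestamp always comes from the MySQL server clock, never the client.
    """
    if delay_seconds < 0:
        raise ValueError("delay_seconds must be non-negative")
    # NOW(3) is MySQL's current timestamp with millisecond precision.
    return "NOW(3) + INTERVAL %s SECOND", (float(delay_seconds),)


immediate = scheduled_time_sql(0)
delayed = scheduled_time_sql(30)
```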
- Use setup-pixi action instead of manual setup
- Graphviz installed automatically via pixi conda dependency
- Testcontainers manages MySQL/MinIO containers automatically
- No manual pip install needed, pixi handles dependencies
Addresses reviewer feedback on PR #1312.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Make pixi the recommended development setup (matches CI)
- Add DOCKER_HOST note for macOS Docker Desktop users
- Keep pip as alternative for users who prefer it
- Update pre-commit commands to use pixi
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Allow lock file updates in CI since the lock file may be stale. Co-Authored-By: Claude Opus 4.5 <[email protected]>
The test feature was not including s3fs and other test dependencies because the feature-specific pypi-dependencies were missing. Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add mypy configuration to pyproject.toml
- Add mypy hook to pre-commit with type stubs
- Start with lenient settings, strict checking for content_registry
- All other modules excluded until fully typed (gradual adoption)
Addresses #1266.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Run unit tests locally before commit to catch issues before CI. Addresses feedback from @drewyangdev on #1211. Co-Authored-By: Claude Opus 4.5 <[email protected]>
feat: dj.Top order inheritance, part_integrity parameter, and storage fixes
Update test_top_restriction_with_keywords to verify that dj.Top properly preserves ordering in fetch results. Use secondary sort by 'id' to ensure deterministic results when there are ties. Fixes #1205 Co-Authored-By: Claude Opus 4.5 <[email protected]>
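The tie-breaking idea in this commit can be illustrated without a database. The sketch below is plain Python, not DataJoint code; `top_rows` and the sample rows are hypothetical names used only to show why a secondary sort key makes a limited, ordered result deterministic.

```python
# Hypothetical sample data: rows 1 and 3 tie on "score".
rows = [
    {"id": 3, "score": 5},
    {"id": 1, "score": 5},
    {"id": 2, "score": 9},
    {"id": 4, "score": 7},
]


def top_rows(rows, limit):
    """Return the `limit` highest-scoring rows, breaking ties by ascending id."""
    # Primary key: score descending. Secondary key: id ascending.
    # Without the secondary key, the order of tied rows would depend on
    # the order in which the rows happen to arrive.
    ordered = sorted(rows, key=lambda r: (-r["score"], r["id"]))
    return ordered[:limit]


result_ids = [r["id"] for r in top_rows(rows, limit=3)]
```

Reversing or shuffling `rows` before the call gives the same `result_ids`, which is the property the updated test relies on.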
- Add Schema.get_table() for direct table access
- Add Schema.__getitem__ for bracket notation: schema['TableName']
- Add Schema.__iter__ to iterate over all tables
- Add Schema.__contains__ for 'TableName' in schema
- Add dj.virtual_schema() as clean entry point
- Remove create_virtual_module (breaking change)
- Fix gc.py to use get_table() instead of spawn_table()
- Remove specs/ folder (moved to datajoint-docs)
- Add comprehensive tests for virtual schema infrastructure
Fixes #1307
Co-Authored-By: Claude Opus 4.5 <[email protected]>
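The container protocol named in this commit (`get_table`, `__getitem__`, `__iter__`, `__contains__`) can be sketched with an in-memory stand-in. `ToySchema` below is a hypothetical class written for illustration; it is not the DataJoint `Schema` implementation, it stores plain strings instead of table objects, and it iterates in insertion order rather than any dependency order.

```python
class ToySchema:
    """In-memory stand-in for a schema (hypothetical, for illustration only)."""

    def __init__(self, tables: dict):
        # dict preserves insertion order, which this toy uses for iteration.
        self._tables = dict(tables)

    def get_table(self, name: str):
        """Return the table registered under `name`, or raise KeyError."""
        try:
            return self._tables[name]
        except KeyError:
            raise KeyError(f"table {name!r} not found") from None

    def __getitem__(self, name: str):
        # Bracket notation delegates to get_table().
        return self.get_table(name)

    def __iter__(self):
        # Iterating the schema yields the table objects themselves.
        return iter(self._tables.values())

    def __contains__(self, name: str) -> bool:
        return name in self._tables


schema = ToySchema({"Subject": "subject-table", "Session": "session-table"})
```

With this shape, `schema["Subject"]`, `"Session" in schema`, and `for table in schema` all work against one underlying lookup.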
The pre-commit config has been modernized to use ruff instead of flake8. Update the SKIP example comment accordingly. Closes #1271 Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add type annotations to errors.py (suggest method)
- Add type annotations to hash.py (key_hash, uuid_from_buffer)
- Enable strict mypy checking for these modules
- Now 3 modules under strict checking: content_registry, errors, hash
Increases type coverage incrementally following gradual adoption strategy.
Related #1266
Co-Authored-By: Claude Opus 4.5 <[email protected]>
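To show what fully annotated helpers of this kind might look like under strict type checking, here is a sketch. The function names match those in the commit message, but the signatures and bodies below are assumptions made for illustration and are not copied from hash.py.

```python
import hashlib
import uuid
from collections.abc import Mapping
from typing import Any


def key_hash(mapping: Mapping[str, Any]) -> str:
    """Hash the values of `mapping` in key order (assumed behavior)."""
    hashed = hashlib.md5()
    # Sorting by key makes the hash independent of insertion order.
    for _, value in sorted(mapping.items()):
        hashed.update(str(value).encode())
    return hashed.hexdigest()


def uuid_from_buffer(buffer: bytes = b"", *, init_string: str = "") -> uuid.UUID:
    """Derive a deterministic UUID from bytes (assumed behavior)."""
    hashed = hashlib.md5(init_string.encode())
    hashed.update(buffer)
    # An MD5 digest is 16 bytes, exactly the size uuid.UUID(bytes=...) expects.
    return uuid.UUID(bytes=hashed.digest())


digest = key_hash({"subject_id": 7, "session": 2})
content_id = uuid_from_buffer(b"example payload")
```

Every parameter and return value carries an annotation, which is what allows a module to be moved from the excluded list to strict checking.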
Change sync-labels from true to false in PR labeler workflow. This prevents the GitHub Actions labeler from removing manually added labels like "breaking" when they don't match the automatic labeling rules. With sync-labels: true, the action removes any labels not matched by the configuration. With sync-labels: false, it only adds labels based on patterns and preserves manually added labels. Co-Authored-By: Claude Opus 4.5 <[email protected]>
Update PyPI keywords to reflect DataJoint 2.0 positioning and modern data engineering terminology:
Added:
- data-engineering, data-pipelines, workflow-management
- data-integrity, reproducibility, declarative
- object-storage, schema-management, data-lineage
- scientific-computing, research-software
- postgresql (upcoming support)
Removed:
- Generic terms: database, automated, automation, compute, data
- Redundant terms: pipeline, workflow, scientific, science, research
- Domain-specific: bioinformatics (kept neuroscience as primary)
Updated GitHub repository topics to match (18 topics total). Focuses on searchable terms, 2.0 features, and differentiators.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
The get_table(), __getitem__, and __contains__ methods now auto-detect
table tier prefixes (Manual: none, Lookup: #, Imported: _, Computed: __).
This allows users to access tables by their base name without knowing
the tier prefix:
- schema.get_table("experiment") finds "_experiment" (Imported)
- schema["Subject"] finds "#subject" (Lookup)
- "Experiment" in schema returns True
Added _find_table_name() helper that checks exact match first, then
tries each tier prefix.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
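The lookup order described in this commit (exact match first, then each tier prefix) can be sketched as follows. The names `find_table_name`, `to_snake_case`, and `TIER_PREFIXES` are hypothetical and the CamelCase conversion is an assumption inferred from the examples above; the actual `_find_table_name()` helper may differ.

```python
import re

# Tier prefixes from the commit message: Manual, Lookup, Imported, Computed.
TIER_PREFIXES = ("", "#", "_", "__")


def to_snake_case(name: str) -> str:
    """Convert 'SessionTrial' to 'session_trial' (assumed naming rule)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def find_table_name(name: str, existing: set):
    """Return the stored table name matching `name`, or None."""
    # 1. An exact match wins, so fully prefixed names keep working.
    if name in existing:
        return name
    # 2. Otherwise try the base name with each tier prefix in turn.
    base = to_snake_case(name)
    for prefix in TIER_PREFIXES:
        candidate = prefix + base
        if candidate in existing:
            return candidate
    return None


tables = {"session", "#subject", "_experiment", "__analysis"}
found = find_table_name("Subject", tables)
```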
Replace deprecated fetch() calls with to_dicts() in test_virtual_module.py:
- test_virtual_schema_tables_are_queryable: use lab.Experiment().to_dicts()
- test_getitem_is_queryable: use table.to_dicts()
Co-Authored-By: Claude Opus 4.5 <[email protected]>
The create_virtual_module function was removed in 2.0. Update the CLI to use dj.virtual_schema() for loading schemas via the -s flag. Co-Authored-By: Claude Opus 4.5 <[email protected]>
VirtualModule allows specifying both module name and schema name, while virtual_schema() uses schema name for both. The CLI needs custom module names for the -s flag, so use VirtualModule directly. Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove -h shorthand for --host (conflicts with argparse --help)
- Add module-level docstring with usage examples
- Improve function docstring with NumPy style
- Add explicit error handling for invalid schema format
- Improve banner message with version and usage hint
- Use modern type hints (list[str] | None)
- Fix locals() issue: explicitly include dj in REPL namespace
Co-Authored-By: Claude Opus 4.5 <[email protected]>
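The -h conflict and the schema format check in this commit can be reproduced with the standard library alone. The parser below is a sketch, not the DataJoint CLI source: `build_parser`, `parse_schema_spec`, and the long option name `--schemas` are assumptions, while `-s`, `--host`, `--user`, `--password`, and the `schema:alias` form come from this pull request.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that leaves -h reserved for argparse's built-in --help."""
    parser = argparse.ArgumentParser(prog="dj")
    # No "-h" shorthand here: argparse already registers -h/--help and
    # raises ArgumentError if another option claims the same string.
    parser.add_argument("--host")
    parser.add_argument("--user")
    parser.add_argument("--password")
    parser.add_argument("-s", "--schemas", nargs="+", default=[],
                        metavar="schema:alias")
    return parser


def parse_schema_spec(spec: str) -> tuple:
    """Split 'schema:alias' and reject malformed input explicitly."""
    schema, sep, alias = spec.partition(":")
    if not sep or not schema or not alias:
        raise ValueError(f"expected schema:alias, got {spec!r}")
    return schema, alias


args = build_parser().parse_args(
    ["--host", "db.example.org", "-s", "lab_db:lab", "ephys_db:ephys"]
)
pairs = [parse_schema_spec(spec) for spec in args.schemas]
```

Passing the host and the schema list as separate arguments, as in the `parse_args` call above, keeps each value unambiguous.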
- Replace -h shorthand with --host (removed to avoid -h/--help conflict)
- Use separate arguments instead of concatenated form
- Use prefix variable for schema name consistency
- Fix assertion string matching
Co-Authored-By: Claude Opus 4.5 <[email protected]>
feat: virtual schema infrastructure and CI improvements
Closed
Labels
- breaking: Not backward compatible changes
- documentation: Issues related to documentation
- enhancement: Indicates new improvements
Summary
DataJoint 2.0 is a major release that modernizes the entire codebase while maintaining backward compatibility for core functionality. This release focuses on extensibility, type safety, and developer experience.
Planning: DataJoint 2.0 Plan | Milestone 2.0
Major Features
Codec System (Extensible Types)
Replaces the adapter system with a modern, composable codec architecture:
- <blob>, <json>, <attach>, <filepath>, <object>, <hash>
- <blob> wraps <json> for external storage)
- __init_subclass__
- validate() method for type checking before insert
Semantic Matching
Attribute lineage tracking ensures joins only match semantically compatible attributes:
- id or name
- semantic_check=False for legacy permissive behavior
Primary Key Rules
Rigorous primary key propagation through all operators:
- dj.U('attr') creates ad-hoc grouping entities
AutoPopulate 2.0 (Jobs System)
Per-table job management with enhanced tracking:
- ~~_job_timestamp and ~~_job_duration columns
- ~~table_name job table
- table.progress() returns (remaining, total)
Modern Fetch & Insert API
New fetch methods:
- to_dicts() - List of dictionaries
- to_pandas() - DataFrame with PK as index
- to_arrays(*attrs) - NumPy arrays (structured or individual)
- keys() - Primary keys only
- fetch1() - Single row
Insert improvements:
- validate() - Check rows before inserting
- chunk_size - Batch large inserts
- insert_dataframe() - DataFrame with index handling
Type Aliases
Core DataJoint types for portability:
- int8, int16, int32, int64
- uint8, uint16, uint32, uint64
- float32, float64
- bool
- uuid
Object Storage
Content-addressed and object storage types:
- <hash> - Content-addressed storage with deduplication
- <object> - Named object storage (Zarr, folders)
- <filepath> - Reference to managed files
- <attach> - File attachments (uploaded on insert)
Virtual Schema Infrastructure (#1307)
New schema introspection API for exploring existing databases:
- Schema.get_table(name) - Direct table access with auto tier prefix detection
- Schema['TableName'] - Bracket notation access
- for table in schema - Iterate tables in dependency order
- 'TableName' in schema - Check table existence
- dj.virtual_schema() - Clean entry point for accessing schemas
- dj.VirtualModule() - Virtual modules with custom names
CLI Improvements
The dj command-line interface for interactive exploration:
- dj -s schema:alias - Load schemas as virtual modules
- --host, --user, --password - Connection options
- -h conflict with --help
Settings Modernization
Pydantic-based configuration with validation:
- dj.config.override() context manager
- .secrets/)
- DJ_HOST, etc.)
License Change
Changed from LGPL to Apache 2.0 license (#1235 (discussion)):
Breaking Changes
Removed Support
- fetch() with format parameter
- safemode parameter (use prompt)
- create_virtual_module (use dj.virtual_schema() or dj.VirtualModule())
- ~log table (IMPR: Deprecate and Remove the ~log Table. #1298)
API Changes
- fetch() → to_dicts(), to_pandas(), to_arrays()
- fetch(format='frame') → to_pandas()
- fetch(as_dict=True) → to_dicts()
- safemode=False → prompt=False
Semantic Changes
Documentation
Developer Documentation (this repo)
Comprehensive updates in docs/:
User Documentation (datajoint-docs)
Full documentation site following the Diátaxis framework:
Tutorials (learning-oriented, Jupyter notebooks):
How-To Guides (task-oriented):
Reference (specifications):
Project Structure
- src/ layout for proper packaging (IMPR: src layout #1267)
Test Plan
Closes
Milestone 2.0 Issues
- ~log Table. #1298 - Deprecate and remove ~log table
- super.delete kwargs to Part.delete #1276 - Part.delete kwargs pass-through
- src layout #1267 - src layout
- dj.Top orders the preview with order_by #1242 - dj.Top orders the preview with order_by
Bug Fixes
- pyarrow (a pandas dependency) #1202 - DataJoint import error with missing pyarrow
- ValueError in DataJoint-Python 0.14.3 when using numpy 2.2.* #1201 - ValueError with numpy 2.2
- dj.Diagram() and new release of pydot==3.0.* #1169 - Error with dj.Diagram() and pydot 3.0
Improvements
Related PRs
Migration Guide
See How to Migrate from 1.x for detailed migration instructions.
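As a rough aid for the renames listed under API Changes, the mapping can be written down as a small lookup. The helpers below are hypothetical and not part of DataJoint; they only encode the replacements stated in this pull request (fetch(format='frame') → to_pandas(), fetch(as_dict=True) → to_dicts(), safemode → prompt) so that legacy call sites can be audited.

```python
def suggest_fetch_replacement(**legacy_kwargs) -> str:
    """Name the 2.0 method that replaces a legacy fetch(...) call."""
    if legacy_kwargs.get("format") == "frame":
        return "to_pandas()"
    if legacy_kwargs.get("as_dict"):
        return "to_dicts()"
    # A plain fetch() has no single replacement; the choice depends on
    # which result shape the caller wants.
    return "to_dicts(), to_pandas(), or to_arrays()"


def rename_safemode(kwargs: dict) -> dict:
    """Return a copy of `kwargs` with 'safemode' renamed to 'prompt'."""
    updated = dict(kwargs)
    if "safemode" in updated:
        updated["prompt"] = updated.pop("safemode")
    return updated


frame_hint = suggest_fetch_replacement(format="frame")
delete_kwargs = rename_safemode({"safemode": False})
```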
🤖 Generated with Claude Code