diff --git a/docs/src/archive/citation.md b/docs/src/archive/citation.md deleted file mode 100644 index b5eb2d88b..000000000 --- a/docs/src/archive/citation.md +++ /dev/null @@ -1,7 +0,0 @@ -# Citation - -If your work uses the DataJoint for Python, please cite the following manuscript and Research Resource Identifier (RRID): - -- Yatsenko D, Reimer J, Ecker AS, Walker EY, Sinz F, Berens P, Hoenselaar A, Cotton RJ, Siapas AS, Tolias AS. DataJoint: managing big scientific data using MATLAB or Python. bioRxiv. 2015 Jan 1:031658. doi: https://doi.org/10.1101/031658 - -- DataJoint for Python - [RRID:SCR_014543](https://scicrunch.org/resolver/SCR_014543) - Version `Enter datajoint-python version you are using here` diff --git a/docs/src/archive/client/credentials.md b/docs/src/archive/client/credentials.md deleted file mode 100644 index 28e685f1f..000000000 --- a/docs/src/archive/client/credentials.md +++ /dev/null @@ -1,82 +0,0 @@ -# Credentials - -Database credentials should never be stored in config files. Use environment variables or a secrets directory instead. - -## Environment Variables (Recommended) - -Set the following environment variables: - -```bash -export DJ_HOST=db.example.com -export DJ_USER=alice -export DJ_PASS=secret -``` - -These take priority over all other configuration sources. - -## Secrets Directory - -Create a `.secrets/` directory next to your `datajoint.json`: - -``` -myproject/ -├── datajoint.json -└── .secrets/ - ├── database.user # Contains: alice - └── database.password # Contains: secret -``` - -Each file contains a single secret value (no JSON, just the raw value). - -Add `.secrets/` to your `.gitignore`: - -``` -# .gitignore -.secrets/ -``` - -## Docker / Kubernetes - -Mount secrets at `/run/secrets/datajoint/`: - -```yaml -# docker-compose.yml -services: - app: - volumes: - - ./secrets:/run/secrets/datajoint:ro -``` - -## Interactive Prompt - -If credentials are not provided via environment variables or secrets, DataJoint will prompt for them when connecting: - -```python ->>> import datajoint as dj ->>> dj.conn() -Please enter DataJoint username: alice -Please enter DataJoint password: -``` - -## Programmatic Access - -You can also set credentials in Python (useful for testing): - -```python -import datajoint as dj - -dj.config.database.user = "alice" -dj.config.database.password = "secret" -``` - -Note that `password` uses `SecretStr` internally, so it will be masked in logs and repr output. - -## Changing Database Password - -To change your database password, use your database's native tools: - -```sql -ALTER USER 'alice'@'%' IDENTIFIED BY 'new_password'; -``` - -Then update your environment variables or secrets file accordingly. diff --git a/docs/src/archive/client/install.md b/docs/src/archive/client/install.md deleted file mode 100644 index 18e6b79f4..000000000 --- a/docs/src/archive/client/install.md +++ /dev/null @@ -1,209 +0,0 @@ -# Install and Connect - -DataJoint is implemented for Python 3.10+. -You may install it from [PyPI](https://pypi.python.org/pypi/datajoint): - -```bash -pip3 install datajoint -``` - -or upgrade - -```bash -pip3 install --upgrade datajoint -``` - -## DataJoint Python Windows Install Guide - -This document outlines the steps necessary to install DataJoint on Windows for use in -connecting to a remote server hosting a DataJoint database. 
-Some limited discussion of installing MySQL is discussed in `MySQL for Windows`, but is -not covered in-depth since this is an uncommon usage scenario and not strictly required -to connect to DataJoint pipelines. - -### Quick steps - -Quick install steps for advanced users are as follows: - -- Install latest Python 3.x and ensure it is in `PATH` (3.10+ required) - ```bash - pip install datajoint - ``` - -For ERD drawing support: - -- Install Graphviz for Windows and ensure it is in `PATH` (64 bit builds currently -tested; URL below.) - ```bash - pip install pydotplus matplotlib - ``` - -Detailed instructions follow. - -### Step 1: install Python - -Python for Windows is available from: - -https://www.python.org/downloads/windows - -The latest 64 bit 3.x version (3.10 or later required) is available from the [Python site](https://www.python.org/downloads/windows/). - -From here run the installer to install Python. - -For a single-user machine, the regular installation process is sufficient - be sure to -select the `Add Python to PATH` option: - -![install-python-simple](../images/install-python-simple.png){: style="align:left"} - -For a shared machine, run the installer as administrator (right-click, run as -administrator) and select the advanced installation. -Be sure to select options as follows: - -![install-python-advanced-1](../images/install-python-advanced-1.png){: style="align:left"} -![install-python-advanced-2](../images/install-python-advanced-2.png){: style="align:left"} - -### Step 2: verify installation - -To verify the Python installation and make sure that your system is ready to install -DataJoint, open a command window by entering `cmd` into the Windows search bar: - -![install-cmd-prompt](../images/install-cmd-prompt.png){: style="align:left"} - -From here `python` and the Python package manager `pip` can be verified by running -`python -V` and `pip -V`, respectively: - -![install-verify-python](../images/install-verify-python.png){: style="align:left"} - -If you receive the error message that either `pip` or `python` is not a recognized -command, please uninstall Python and ensure that the option to add Python to the `PATH` -variable was properly configured. - -### Step 3: install DataJoint - -DataJoint (and other Python modules) can be easily installed using the `pip` Python -package manager which is installed as a part of Python and was verified in the previous -step. - -To install DataJoint simply run `pip install datajoint`: - -![install-datajoint-1](../images/install-datajoint-1.png){: style="align:left"} - -This will proceed to install DataJoint, along with several other required packages from -the PIP repository. -When finished, a summary of the activity should be presented: - -![install-datajoint-2](../images/install-datajoint-2.png){: style="align:left"} - -Note: You can find out more about the packages installed here and many other freely -available open source packages via [pypi](https://pypi.python.org/pypi), the Python -package index site. - -### (Optional) step 4: install packages for ERD support - -To draw diagrams of your DataJoint schema, the following additional steps should be -followed. - -#### Install Graphviz - -DataJoint currently utilizes [Graphviz](http://graphviz.org) to generate the ERD -visualizations. -Although a Windows version of Graphviz is available from the main site, it is an older -and out of date 32-bit version. 
-The recommended pre-release builds of the 64 bit version are available here: - -https://ci.appveyor.com/project/ellson/graphviz-pl238 - -More specifically, the build artifacts from the `Win64; Configuration: Release` are -recommended, available -[here](https://ci.appveyor.com/api/buildjobs/hlkclpfhf6gnakjq/artifacts/build%2FGraphviz-install.exe). - -This is a regular Windows installer executable, and will present a dialog when starting: - -![install-graphviz-1](../images/install-graphviz-1.png){: style="align:left"} - -It is important that an option to place Graphviz in the `PATH` be selected. - -For a personal installation: - -![install-graphviz-2a](../images/install-graphviz-2a.png){: style="align:left"} - -To install system wide: - -![install-graphviz-2b](../images/install-graphviz-2b.png){: style="align:left"} - -Once installed, Graphviz can be verified from a fresh command window as follows: - -![install-verify-graphviz](../images/install-verify-graphviz.png){: style="align:left"} - -If you receive the error message that the `dot` program is not a recognized command, -please uninstall Graphviz and ensure that the -option to add Python to the PATH variable was properly configured. - -Important: in some cases, running the `dot -c` command in a command prompt is required -to properly initialize the Graphviz installation. - -#### Install PyDotPlus - -The PyDotPlus library links the Graphviz installation to DataJoint and is easily -installed via `pip`: - -![install-pydotplus](../images/install-pydotplus.png){: style="align:left"} - -#### Install Matplotlib - -The Matplotlib library provides useful plotting utilities which are also used by -DataJoint's `Diagram` drawing facility. -The package is easily installed via `pip`: - -![install-matplotlib](../images/install-matplotlib.png){: style="align:left"} - -### (Optional) step 5: install Jupyter Notebook - -As described on the www.jupyter.org website: - -''' -The Jupyter Notebook is an open-source web application that allows -you to create and share documents that contain live code, equations, -visualizations and narrative text. -''' - -Although not a part of DataJoint, Jupyter Notebook can be a very useful tool for -building and interacting with DataJoint pipelines. -It is easily installed from `pip` as well: - -![install-jupyter-1](../images/install-jupyter-1.png){: style="align:left"} -![install-jupyter-2](../images/install-jupyter-2.png){: style="align:left"} - -Once installed, Jupyter Notebook can be started via the `jupyter notebook` command, -which should now be on your path: - -![install-verify-jupyter](../images/install-verify-jupyter.png){: style="align:left"} - -By default Jupyter Notebook will start a local private web server session from the -directory where it was started and start a web browser session connected to the session. - -![install-run-jupyter-1](../images/install-run-jupyter-1.png){: style="align:left"} -![install-run-jupyter-2](../images/install-run-jupyter-2.png){: style="align:left"} - -You now should be able to use the notebook viewer to navigate the filesystem and to -create new project folders and interactive Jupyter/Python/DataJoint notebooks. - -### Git for Windows - -The [Git](https://git-scm.com/) version control system is not a part of DataJoint but -is recommended for interacting with the broader Python/Git/GitHub sharing ecosystem. - -The Git for Windows installer is available from https://git-scm.com/download/win. 
- -![install-git-1](../images/install-git-1.png){: style="align:left"} - -The default settings should be sufficient and correct in most cases. - -### MySQL for Windows - -For hosting pipelines locally, the MySQL server package is required. - -MySQL for windows can be installed via the installers available from the -[MySQL website](https://dev.mysql.com/downloads/windows/). -Please note that although DataJoint should be fully compatible with a Windows MySQL -server installation, this mode of operation is not tested by the DataJoint team. diff --git a/docs/src/archive/client/settings.md b/docs/src/archive/client/settings.md deleted file mode 100644 index 40f4a6893..000000000 --- a/docs/src/archive/client/settings.md +++ /dev/null @@ -1,220 +0,0 @@ -# Configuration Settings - -DataJoint uses a type-checked configuration system built on [pydantic-settings](https://docs.pydantic.dev/latest/concepts/pydantic_settings/). - -## Configuration Sources - -Settings are loaded from the following sources (in priority order): - -1. **Environment variables** (`DJ_*`) -2. **Secrets directory** (`.secrets/` or `/run/secrets/datajoint/`) -3. **Project config file** (`datajoint.json`, searched recursively) -4. **Default values** - -## Project Structure - -``` -myproject/ -├── .git/ -├── datajoint.json # Project config (commit this) -├── .secrets/ # Local secrets (add to .gitignore) -│ ├── database.password -│ └── aws.secret_access_key -└── src/ - └── analysis.py # Config found via parent search -``` - -## Config File - -Create a `datajoint.json` file in your project root: - -```json -{ - "database": { - "host": "db.example.com", - "port": 3306 - }, - "stores": { - "raw": { - "protocol": "file", - "location": "/data/raw" - } - }, - "display": { - "limit": 20 - }, - "safemode": true -} -``` - -DataJoint searches for this file starting from the current directory and moving up through parent directories, stopping at the first `.git` or `.hg` directory (project boundary) or filesystem root. - -## Credentials - -**Never store credentials in config files.** Use one of these methods: - -### Environment Variables (Recommended) - -```bash -export DJ_USER=alice -export DJ_PASS=secret -export DJ_HOST=db.example.com -``` - -### Secrets Directory - -Create files in `.secrets/` next to your `datajoint.json`: - -``` -.secrets/ -├── database.password # Contains: secret -├── database.user # Contains: alice -├── aws.access_key_id -└── aws.secret_access_key -``` - -Add `.secrets/` to your `.gitignore`. - -For Docker/Kubernetes, secrets can be mounted at `/run/secrets/datajoint/`. 
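If you prefer to script this setup, the secrets files can also be created programmatically. A minimal sketch, assuming it is run once from the project root (file names and example values match the layout shown above):

```python
from pathlib import Path

# Create the secrets directory next to datajoint.json
secrets = Path(".secrets")
secrets.mkdir(exist_ok=True)

# Each file holds a single raw value (no JSON, no quotes)
(secrets / "database.user").write_text("alice")
(secrets / "database.password").write_text("secret")

# Keep the secrets out of version control
with open(".gitignore", "a") as f:
    f.write(".secrets/\n")
```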
- -## Accessing Settings - -```python -import datajoint as dj - -# Attribute access (preferred) -dj.config.database.host -dj.config.safemode - -# Dict-style access -dj.config["database.host"] -dj.config["safemode"] -``` - -## Temporary Overrides - -Use the context manager for temporary changes: - -```python -with dj.config.override(safemode=False): - # safemode is False here - table.delete() -# safemode is restored -``` - -For nested settings, use double underscores: - -```python -with dj.config.override(database__host="test.example.com"): - # database.host is temporarily changed - pass -``` - -## Available Settings - -### Database Connection - -| Setting | Environment Variable | Default | Description | -|---------|---------------------|---------|-------------| -| `database.host` | `DJ_HOST` | `localhost` | Database server hostname | -| `database.port` | `DJ_PORT` | `3306` | Database server port | -| `database.user` | `DJ_USER` | `None` | Database username | -| `database.password` | `DJ_PASS` | `None` | Database password (use env/secrets) | -| `database.reconnect` | — | `True` | Auto-reconnect on connection loss | -| `database.use_tls` | — | `None` | TLS mode: `True`, `False`, or `None` (auto) | - -### Display - -| Setting | Default | Description | -|---------|---------|-------------| -| `display.limit` | `12` | Max rows to display in previews | -| `display.width` | `14` | Column width in previews | -| `display.show_tuple_count` | `True` | Show total count in previews | - -### Other Settings - -| Setting | Default | Description | -|---------|---------|-------------| -| `safemode` | `True` | Prompt before destructive operations | -| `loglevel` | `INFO` | Logging level | -| `fetch_format` | `array` | Default fetch format (`array` or `frame`) | -| `enable_python_native_blobs` | `True` | Use Python-native blob serialization | - -## TLS Configuration - -DataJoint uses TLS by default if available. Control this with: - -```python -dj.config.database.use_tls = True # Require TLS -dj.config.database.use_tls = False # Disable TLS -dj.config.database.use_tls = None # Auto (default) -``` - -## External Storage - -Configure external stores in the `stores` section. See [External Storage](../sysadmin/external-store.md) for details. - -```json -{ - "stores": { - "raw": { - "protocol": "file", - "location": "/data/external" - } - } -} -``` - -## Object Storage - -Configure object storage for the [`object` type](../design/tables/object.md) in the `object_storage` section. This provides managed file and folder storage with fsspec backend support. 
- -### Local Filesystem - -```json -{ - "object_storage": { - "project_name": "my_project", - "protocol": "file", - "location": "/data/my_project" - } -} -``` - -### Amazon S3 - -```json -{ - "object_storage": { - "project_name": "my_project", - "protocol": "s3", - "bucket": "my-bucket", - "location": "my_project", - "endpoint": "s3.amazonaws.com" - } -} -``` - -### Object Storage Settings - -| Setting | Environment Variable | Required | Description | -|---------|---------------------|----------|-------------| -| `object_storage.project_name` | `DJ_OBJECT_STORAGE_PROJECT_NAME` | Yes | Unique project identifier | -| `object_storage.protocol` | `DJ_OBJECT_STORAGE_PROTOCOL` | Yes | Backend: `file`, `s3`, `gcs`, `azure` | -| `object_storage.location` | `DJ_OBJECT_STORAGE_LOCATION` | Yes | Base path or bucket prefix | -| `object_storage.bucket` | `DJ_OBJECT_STORAGE_BUCKET` | For cloud | Bucket name | -| `object_storage.endpoint` | `DJ_OBJECT_STORAGE_ENDPOINT` | For S3 | S3 endpoint URL | -| `object_storage.partition_pattern` | `DJ_OBJECT_STORAGE_PARTITION_PATTERN` | No | Path pattern with `{attr}` placeholders | -| `object_storage.token_length` | `DJ_OBJECT_STORAGE_TOKEN_LENGTH` | No | Random suffix length (default: 8) | -| `object_storage.access_key` | — | For cloud | Access key (use secrets) | -| `object_storage.secret_key` | — | For cloud | Secret key (use secrets) | - -### Object Storage Secrets - -Store cloud credentials in the secrets directory: - -``` -.secrets/ -├── object_storage.access_key -└── object_storage.secret_key -``` diff --git a/docs/src/archive/compute/autopopulate2.0-spec.md b/docs/src/archive/compute/autopopulate2.0-spec.md deleted file mode 100644 index 03382b06b..000000000 --- a/docs/src/archive/compute/autopopulate2.0-spec.md +++ /dev/null @@ -1,842 +0,0 @@ -# Autopopulate 2.0 Specification - -## Overview - -This specification redesigns the DataJoint job handling system to provide better visibility, control, and scalability for distributed computing workflows. The new system replaces the schema-level `~jobs` table with per-table job tables that offer richer status tracking, proper referential integrity, and dashboard-friendly monitoring. - -## Problem Statement - -### Current Jobs Table Limitations - -The existing `~jobs` table has significant limitations: - -1. **Limited status tracking**: Only supports `reserved`, `error`, and `ignore` statuses -2. **Functions as an error log**: Cannot efficiently track pending or completed jobs -3. **Poor dashboard visibility**: No way to monitor pipeline progress without querying multiple tables -4. **Key hashing obscures data**: Primary keys are stored as hashes, making debugging difficult -5. **No referential integrity**: Jobs table is independent of computed tables; orphaned jobs can accumulate - -### Key Source Limitations - -1. **Frequent manual modifications**: Subset operations require modifying `key_source` property -2. **Local visibility only**: Custom key sources are not accessible database-wide -3. **Performance bottleneck**: Multiple workers querying `key_source` simultaneously creates contention -4. **Codebase dependency**: Requires full pipeline codebase to determine pending work - -## Proposed Solution - -### Terminology - -- **Stale job**: A job (any status) whose key no longer exists in `key_source`. The upstream records have been deleted. Stale jobs are cleaned up by `refresh()` based on the `stale_timeout` parameter. - -- **Orphaned job**: A `reserved` job whose worker is no longer running. 
The process that reserved the job crashed, was terminated, or lost connection. The job remains `reserved` indefinitely. Orphaned jobs can be cleaned up by `refresh(orphan_timeout=...)` or manually deleted. - -- **Completed job**: A job with status `success`. Only exists when `keep_completed=True`. Represents historical record of successful computation. - -### Core Design Principles - -1. **Per-table jobs**: Each computed table gets its own hidden jobs table -2. **FK-only primary keys**: Auto-populated tables must have primary keys composed entirely of foreign key references. Non-FK primary key attributes are prohibited in new tables (legacy tables are supported with degraded granularity) -3. **No FK constraints on jobs**: Jobs tables omit foreign key constraints for performance; stale jobs are cleaned by `refresh()` -4. **Rich status tracking**: Extended status values for full lifecycle visibility -5. **Automatic refresh**: `populate()` automatically refreshes the jobs queue (adding new jobs, removing stale ones) -6. **Backward compatible**: When `reserve_jobs=False` (default), 1.0 behavior is preserved - -## Architecture - -### Jobs Table Structure - -Each `dj.Imported` or `dj.Computed` table `MyTable` will have an associated hidden jobs table `~~my_table` with the following structure: - -``` -# Job queue for MyTable -subject_id : int -session_id : int -... # Only FK-derived primary key attributes (NO foreign key constraints) ---- -status : enum('pending', 'reserved', 'success', 'error', 'ignore') -priority : uint8 # Lower = more urgent (0 = highest), set by refresh() -created_time=CURRENT_TIMESTAMP : timestamp # When job was added to queue -scheduled_time=CURRENT_TIMESTAMP : timestamp # Process on or after this time -reserved_time=null : timestamp # When job was reserved -completed_time=null : timestamp # When job completed -duration=null : float64 # Execution duration in seconds -error_message="" : varchar(2047) # Truncated error message -error_stack=null : # Full error traceback -user="" : varchar(255) # Database user who reserved/completed job -host="" : varchar(255) # Hostname of worker -pid=0 : uint32 # Process ID of worker -connection_id=0 : uint64 # MySQL connection ID -version="" : varchar(255) # Code version (git hash, package version, etc.) -``` - -**Important**: The jobs table primary key includes only those attributes that come through foreign keys in the target table's primary key. Additional primary key attributes (if any) are excluded. 
This means: -- If a target table has primary key `(-> Subject, -> Session, method)`, the jobs table has primary key `(subject_id, session_id)` only -- Multiple target rows may map to a single job entry when additional PK attributes exist -- Jobs tables have **no foreign key constraints** for performance (stale jobs handled by `refresh()`) - -### Access Pattern - -Jobs are accessed as a property of the computed table: - -```python -# Current pattern (schema-level) -schema.jobs - -# New pattern (per-table) -MyTable.jobs - -# Examples -FilteredImage.jobs # Access jobs table -FilteredImage.jobs & 'status="error"' # Query errors -FilteredImage.jobs.refresh() # Refresh job queue -``` - -### Status Values - -| Status | Description | -|--------|-------------| -| `pending` | Job is queued and ready to be processed | -| `reserved` | Job is currently being processed by a worker | -| `success` | Job completed successfully (optional, depends on settings) | -| `error` | Job failed with an error | -| `ignore` | Job should be skipped (manually set, not part of automatic transitions) | - -### Status Transitions - -```mermaid -stateDiagram-v2 - state "(none)" as none1 - state "(none)" as none2 - none1 --> pending : refresh() - none1 --> ignore : ignore() - pending --> reserved : reserve() - reserved --> none2 : complete() - reserved --> success : complete()* - reserved --> error : error() - success --> pending : refresh()* - error --> none2 : delete() - success --> none2 : delete() - ignore --> none2 : delete() -``` - -- `complete()` deletes the job entry (default when `jobs.keep_completed=False`) -- `complete()*` keeps the job as `success` (when `jobs.keep_completed=True`) -- `refresh()*` re-pends a `success` job if its key is in `key_source` but not in target - -**Transition methods:** -- `refresh()` — Adds new jobs as `pending`; also re-pends `success` jobs if key is in `key_source` but not in target -- `ignore()` — Marks a key as `ignore` (can be called on keys not yet in jobs table) -- `reserve()` — Marks a pending job as `reserved` before calling `make()` -- `complete()` — Marks reserved job as `success`, or deletes it (based on `jobs.keep_completed` setting) -- `error()` — Marks reserved job as `error` with message and stack trace -- `delete()` — Inherited from `delete_quick()`; use `(jobs & condition).delete()` pattern - -**Manual status control:** -- `ignore` is set manually via `jobs.ignore(key)` and is not part of automatic transitions -- Jobs with `status='ignore'` are skipped by `populate()` and `refresh()` -- To reset an ignored job, delete it and call `refresh()`: `jobs.ignored.delete(); jobs.refresh()` - -## API Design - -### JobsTable Class - -```python -class JobsTable(Table): - """Hidden table managing job queue for a computed table.""" - - @property - def definition(self) -> str: - """Dynamically generated based on parent table's primary key.""" - ... - - def refresh( - self, - *restrictions, - delay: float = 0, - priority: int = None, - stale_timeout: float = None, - orphan_timeout: float = None - ) -> dict: - """ - Refresh the jobs queue: add new jobs and clean up stale/orphaned jobs. - - Operations performed: - 1. Add new jobs: (key_source & restrictions) - target - jobs → insert as 'pending' - 2. Re-pend success jobs: if keep_completed=True and key in key_source but not in target - 3. Remove stale jobs: jobs older than stale_timeout whose keys no longer in key_source - 4. 
Remove orphaned jobs: reserved jobs older than orphan_timeout (if specified) - - Args: - restrictions: Conditions to filter key_source (for adding new jobs) - delay: Seconds from now until new jobs become available for processing. - Default: 0 (immediately available). Uses database server time. - priority: Priority for new jobs (lower = more urgent). - Default from config: jobs.default_priority (5) - stale_timeout: Seconds after which jobs are checked for staleness. - Jobs older than this are removed if key not in key_source. - Default from config: jobs.stale_timeout (3600s) - Set to 0 to skip stale cleanup. - orphan_timeout: Seconds after which reserved jobs are considered orphaned. - Reserved jobs older than this are deleted and re-added as pending. - Default: None (no orphan cleanup - must be explicit). - Typical value: 3600 (1 hour) or based on expected job duration. - - Returns: - { - 'added': int, # New pending jobs added - 'removed': int, # Stale jobs removed - 'orphaned': int, # Orphaned jobs reset to pending - 're_pended': int # Success jobs re-pended (keep_completed mode) - } - """ - ... - - def reserve(self, key: dict) -> bool: - """ - Attempt to reserve a job for processing. - - Updates status to 'reserved' if currently 'pending' and scheduled_time <= now. - No locking is used; rare conflicts are resolved by the make() transaction. - - Returns: - True if reservation successful, False if job not found or not pending. - """ - ... - - def complete(self, key: dict, duration: float = None) -> None: - """ - Mark a job as successfully completed. - - Updates status to 'success', records duration and completion time. - """ - ... - - def error(self, key: dict, error_message: str, error_stack: str = None) -> None: - """ - Mark a job as failed with error details. - - Updates status to 'error', records error message and stack trace. - """ - ... - - def ignore(self, key: dict) -> None: - """ - Mark a job to be ignored (skipped during populate). - - To reset an ignored job, delete it and call refresh(). - """ - ... - - # delete() is inherited from delete_quick() - no confirmation required - # Usage: (jobs & condition).delete() or jobs.errors.delete() - - @property - def pending(self) -> QueryExpression: - """Return query for pending jobs.""" - return self & 'status="pending"' - - @property - def reserved(self) -> QueryExpression: - """Return query for reserved jobs.""" - return self & 'status="reserved"' - - @property - def errors(self) -> QueryExpression: - """Return query for error jobs.""" - return self & 'status="error"' - - @property - def ignored(self) -> QueryExpression: - """Return query for ignored jobs.""" - return self & 'status="ignore"' - - @property - def completed(self) -> QueryExpression: - """Return query for completed jobs.""" - return self & 'status="success"' - - def progress(self) -> dict: - """ - Return job status breakdown. - - Returns: - { - 'pending': int, # Jobs waiting to be processed - 'reserved': int, # Jobs currently being processed - 'success': int, # Completed jobs (if keep_completed=True) - 'error': int, # Failed jobs - 'ignore': int, # Ignored jobs - 'total': int # Total jobs in table - } - """ - ... 
-``` - -### AutoPopulate Integration - -The `populate()` method is updated to use the new jobs table: - -```python -def populate( - self, - *restrictions, - suppress_errors: bool = False, - return_exception_objects: bool = False, - reserve_jobs: bool = False, - max_calls: int = None, - display_progress: bool = False, - processes: int = 1, - make_kwargs: dict = None, - # New parameters - priority: int = None, # Only process jobs at this priority or more urgent (lower values) - refresh: bool = None, # Refresh jobs queue before processing (default from config) -) -> dict: - """ - Populate the table by calling make() for each missing entry. - - Behavior depends on reserve_jobs parameter: - - When reserve_jobs=False (default, 1.0 compatibility mode): - - Jobs table is NOT used - - Keys computed directly from: (key_source & restrictions) - target - - No job reservation, no status tracking - - Suitable for single-worker scenarios - - When reserve_jobs=True (distributed mode): - 1. If refresh=True (or config['jobs.auto_refresh'] when refresh=None): - Call self.jobs.refresh(*restrictions) to sync jobs queue - 2. Fetch pending jobs ordered by (priority ASC, scheduled_time ASC) - Apply max_calls limit to fetched keys (total across all processes) - 3. For each pending job where scheduled_time <= now: - a. Mark job as 'reserved' - b. Call make(key) - c. On success: mark job as 'success' or delete (based on keep_completed) - d. On error: mark job as 'error' with message/stack - 4. Continue until all fetched jobs processed - - Args: - restrictions: Conditions to filter key_source - suppress_errors: If True, collect errors instead of raising - return_exception_objects: Return exception objects vs strings - reserve_jobs: Enable job reservation for distributed processing - max_calls: Maximum number of make() calls (total across all processes) - display_progress: Show progress bar - processes: Number of worker processes - make_kwargs: Non-computation kwargs passed to make() - priority: Only process jobs at this priority or more urgent (lower values) - refresh: Refresh jobs queue before processing. Default from config['jobs.auto_refresh'] - - Deprecated parameters (removed in 2.0): - - 'order': Job ordering now controlled by priority. Use refresh(priority=N). - - 'limit': Use max_calls instead. The distinction was confusing (see #1203). - - 'keys': Use restrictions instead. Direct key specification bypassed job tracking. - """ - ... -``` - -### Progress and Monitoring - -```python -# Current progress reporting -remaining, total = MyTable.progress() - -# Enhanced progress with jobs table -MyTable.jobs.progress() # Returns detailed status breakdown - -# Example output: -# { -# 'pending': 150, -# 'reserved': 3, -# 'success': 847, -# 'error': 12, -# 'ignore': 5, -# 'total': 1017 -# } -``` - -### Priority and Scheduling - -Priority and scheduling are handled via `refresh()` parameters. Lower priority values are more urgent (0 = highest priority). Scheduling uses relative time (seconds from now) based on database server time. 
- -```python -# Add urgent jobs (priority=0 is most urgent) -MyTable.jobs.refresh(priority=0) - -# Add normal jobs (default priority=5) -MyTable.jobs.refresh() - -# Add low-priority background jobs -MyTable.jobs.refresh(priority=10) - -# Schedule jobs for future processing (2 hours from now) -MyTable.jobs.refresh(delay=2*60*60) # 7200 seconds - -# Schedule jobs for tomorrow (24 hours from now) -MyTable.jobs.refresh(delay=24*60*60) - -# Combine: urgent jobs with 1-hour delay -MyTable.jobs.refresh(priority=0, delay=3600) - -# Add urgent jobs for specific subjects -MyTable.jobs.refresh(Subject & 'priority="urgent"', priority=0) -``` - -## Implementation Details - -### Table Naming Convention - -Jobs tables use the `~~` prefix (double tilde): -- Table `FilteredImage` (stored as `__filtered_image`) -- Jobs table: `~~filtered_image` (stored as `~~filtered_image`) - -The `~~` prefix distinguishes jobs tables from other hidden tables (`~jobs`, `~lineage`) while keeping names short. - -### Primary Key Constraint - -**New tables**: Auto-populated tables (`dj.Computed`, `dj.Imported`) must have primary keys composed entirely of foreign key references. Non-FK primary key attributes are prohibited. - -```python -# ALLOWED - all PK attributes come from foreign keys -@schema -class FilteredImage(dj.Computed): - definition = """ - -> Image - --- - filtered_image : - """ - -# ALLOWED - multiple FKs in primary key -@schema -class Analysis(dj.Computed): - definition = """ - -> Recording - -> AnalysisMethod # method comes from FK to lookup table - --- - result : float64 - """ - -# NOT ALLOWED - raises error on table declaration -@schema -class Analysis(dj.Computed): - definition = """ - -> Recording - method : varchar(32) # ERROR: non-FK primary key attribute - --- - result : float64 - """ -``` - -**Rationale**: This constraint ensures 1:1 correspondence between jobs and target rows, simplifying job status tracking and eliminating ambiguity. - -**Legacy table support**: Existing tables with non-FK primary key attributes continue to work. The jobs table uses only the FK-derived attributes, treating additional PK attributes as if they were secondary attributes. This means: -- One job entry may correspond to multiple target rows -- Job marked `success` when ANY matching target row exists -- Job marked `pending` only when NO matching target rows exist - -```python -# Legacy table (created before 2.0) -# Jobs table primary key: (recording_id) only -# One job covers all 'method' values for a given recording -@schema -class LegacyAnalysis(dj.Computed): - definition = """ - -> Recording - method : varchar(32) # Non-FK attribute (legacy, not recommended) - --- - result : float64 - """ -``` - -The jobs table has **no foreign key constraints** for performance reasons. - -### Stale Job Handling - -Stale jobs are jobs (any status except `ignore`) whose keys no longer exist in `key_source`. Since there are no FK constraints on jobs tables, these jobs remain until cleaned up by `refresh()`: - -```python -# refresh() handles stale jobs automatically -result = FilteredImage.jobs.refresh() -# Returns: {'added': 10, 'removed': 3, 'orphaned': 0, 're_pended': 0} - -# Stale detection logic: -# 1. Find jobs where created_time < (now - stale_timeout) -# 2. Check if their keys still exist in key_source -# 3. Remove jobs (pending, reserved, success, error) whose keys no longer exist -# 4. 
Jobs with status='ignore' are never removed (permanent until manual delete) -``` - -**Why not use foreign key cascading deletes?** -- FK constraints add overhead on every insert/update/delete operation -- Jobs tables are high-traffic (frequent reservations and status updates) -- Stale jobs are harmless until refresh—they simply won't match key_source -- The `refresh()` approach is more efficient for batch cleanup - -### Orphaned Job Handling - -Orphaned jobs are `reserved` jobs whose worker is no longer running. Unlike stale jobs, orphaned jobs reference valid keys—only the worker has disappeared. - -```python -# Automatic orphan cleanup (use with caution) -result = FilteredImage.jobs.refresh(orphan_timeout=3600) # 1 hour -# Jobs reserved more than 1 hour ago are deleted and re-added as pending -# Returns: {'added': 0, 'removed': 0, 'orphaned': 5, 're_pended': 0} - -# Manual orphan cleanup (more control) -(FilteredImage.jobs.reserved & 'reserved_time < NOW() - INTERVAL 2 HOUR').delete() -FilteredImage.jobs.refresh() # Re-adds as pending if key still in key_source -``` - -**When to use orphan_timeout**: -- In automated pipelines where job duration is predictable -- When workers are known to have failed (cluster node died) -- Set timeout > expected max job duration to avoid killing active jobs - -**When NOT to use orphan_timeout**: -- When job durations are highly variable -- When you need to coordinate with external orchestration -- Default is None (disabled) for safety - -### Table Drop and Alter Behavior - -When an auto-populated table is **dropped**, its associated jobs table is automatically dropped: - -```python -# Dropping FilteredImage also drops ~~filtered_image -FilteredImage.drop() -``` - -When an auto-populated table is **altered** (e.g., primary key changes), the jobs table is dropped and can be recreated via `refresh()`: - -```python -# Alter that changes primary key structure -# Jobs table is dropped since its structure no longer matches -FilteredImage.alter() - -# Recreate jobs table with new structure -FilteredImage.jobs.refresh() -``` - -### Lazy Table Creation - -Jobs tables are created automatically on first use: - -```python -# First call to populate with reserve_jobs=True creates the jobs table -FilteredImage.populate(reserve_jobs=True) -# Creates ~~filtered_image if it doesn't exist, then populates - -# Alternatively, explicitly create/refresh the jobs table -FilteredImage.jobs.refresh() -``` - -The jobs table is created with a primary key derived from the target table's foreign key attributes. - -### Conflict Resolution - -Conflict resolution relies on the transaction surrounding each `make()` call: - -- With `reserve_jobs=False`: Workers query `key_source` directly and may attempt the same key -- With `reserve_jobs=True`: Job reservation reduces conflicts but doesn't eliminate them entirely - -When two workers attempt to populate the same key: -1. Both workers attempt to reserve the same job (near-simultaneous) -2. Both reservation attempts succeed (no locking used) -3. Both call `make()` for the same key -4. First worker's `make()` transaction commits successfully -5. Second worker's `make()` transaction fails with duplicate key error -6. Second worker silently moves to next job (no status update) -7. First worker marks job `success` or deletes it - -**Important**: Only errors inside `make()` are logged with `error` status. Duplicate key errors from collisions are coordination artifacts handled silently—the first worker's completion takes precedence. 
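The worker-side behavior described above can be sketched as follows. This is illustrative only: `reserve()`, `error()`, and `complete()` are the jobs-table methods proposed in this spec, `DuplicateError` is DataJoint's existing duplicate-entry exception, and in practice `populate()` performs these steps (including wrapping `make()` in a transaction) rather than user code:

```python
import traceback
import datajoint as dj


def process_one(table, key):
    """Illustrative worker step: reserve the job, run make(), record the outcome."""
    if not table.jobs.reserve(key):
        return  # no longer pending; another worker claimed it first
    try:
        table.make(key)  # populate() wraps this call in a transaction
    except dj.errors.DuplicateError:
        pass  # collision artifact: the other worker's insert won; stay silent
    except Exception as e:
        table.jobs.error(key, error_message=str(e), error_stack=traceback.format_exc())
    else:
        table.jobs.complete(key)
```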
- -**Edge case - first worker crashes after insert**: -- Job stays `reserved` (orphaned) -- Row exists in table (insert succeeded) -- Resolution: `refresh(orphan_timeout=...)` sees key exists in table, removes orphaned job - -**Why this is acceptable**: -- The `make()` transaction guarantees data integrity -- Duplicate key error is a clean, expected signal (not a real error) -- With `reserve_jobs=True`, conflicts are rare -- Wasted computation is minimal compared to locking complexity - -### Job Reservation vs Pre-Partitioning - -The job reservation mechanism (`reserve_jobs=True`) allows workers to dynamically claim jobs from a shared queue. However, some orchestration systems may prefer to **pre-partition** jobs before distributing them to workers: - -```python -# Pre-partitioning example: orchestrator divides work explicitly -all_pending = FilteredImage.jobs.pending.fetch("KEY") - -# Split jobs among workers (e.g., by worker index) -n_workers = 4 -for worker_id in range(n_workers): - worker_keys = all_pending[worker_id::n_workers] # Round-robin assignment - # Send worker_keys to worker via orchestration system (Slurm, K8s, etc.) - -# Worker receives its assigned keys and processes them directly -# Pass keys as restrictions to filter key_source -for key in assigned_keys: - FilteredImage.populate(key) # key acts as restriction, reserve_jobs=False by default -``` - -**When to use each approach**: - -| Approach | Use Case | -|----------|----------| -| **Dynamic reservation** (`reserve_jobs=True`) | Simple setups, variable job durations, workers that start/stop dynamically | -| **Pre-partitioning** | Batch schedulers (Slurm, PBS), predictable job counts, avoiding reservation overhead | - -Both approaches benefit from the same transaction-based conflict resolution as a safety net. - -## Configuration Options - -New configuration settings for job management: - -```python -# In datajoint config -dj.config['jobs.auto_refresh'] = True # Auto-refresh on populate (default: True) -dj.config['jobs.keep_completed'] = False # Keep success records (default: False) -dj.config['jobs.stale_timeout'] = 3600 # Seconds before pending job is considered stale (default: 3600) -dj.config['jobs.default_priority'] = 5 # Default priority for new jobs (lower = more urgent) -dj.config['jobs.version'] = None # Version string for jobs (default: None) - # Special values: 'git' = auto-detect git hash -``` - -### Config vs Parameter Precedence - -When both config and method parameters are available, **explicit parameters override config values**: - -```python -# Config sets defaults -dj.config['jobs.auto_refresh'] = True -dj.config['jobs.default_priority'] = 5 - -# Parameter overrides config -MyTable.populate(reserve_jobs=True, refresh=False) # refresh=False wins -MyTable.jobs.refresh(priority=0) # priority=0 wins -``` - -Parameters set to `None` use the config default. This allows per-call customization while maintaining global defaults. 
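The resolution rule amounts to a simple fallback. A minimal sketch (illustrative only, not the actual implementation; the config keys are the ones proposed above):

```python
import datajoint as dj


def effective(argument, config_key):
    """An explicit argument wins; None falls back to the config default."""
    return dj.config[config_key] if argument is None else argument


priority = effective(None, "jobs.default_priority")  # -> 5, the config default
refresh = effective(False, "jobs.auto_refresh")      # -> False, explicit override wins
```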
- -## Usage Examples - -### Basic Distributed Computing - -```python -# Worker 1 -FilteredImage.populate(reserve_jobs=True) - -# Worker 2 (can run simultaneously) -FilteredImage.populate(reserve_jobs=True) - -# Monitor progress -print(FilteredImage.jobs.progress()) -``` - -### Priority-Based Processing - -```python -# Add urgent jobs (priority=0 is most urgent) -urgent_subjects = Subject & 'priority="urgent"' -FilteredImage.jobs.refresh(urgent_subjects, priority=0) - -# Workers will process lowest-priority-value jobs first -FilteredImage.populate(reserve_jobs=True) -``` - -### Scheduled Processing - -```python -# Schedule jobs for overnight processing (8 hours from now) -FilteredImage.jobs.refresh('subject_id > 100', delay=8*60*60) - -# Only jobs whose scheduled_time <= now will be processed -FilteredImage.populate(reserve_jobs=True) -``` - -### Error Recovery - -```python -# View errors -errors = FilteredImage.jobs.errors.fetch(as_dict=True) -for err in errors: - print(f"Key: {err['subject_id']}, Error: {err['error_message']}") - -# Delete specific error jobs after fixing the issue -(FilteredImage.jobs & 'subject_id=42').delete() - -# Delete all error jobs -FilteredImage.jobs.errors.delete() - -# Re-add deleted jobs as pending (if keys still in key_source) -FilteredImage.jobs.refresh() -``` - -### Dashboard Queries - -```python -# Get pipeline-wide status using schema.jobs -def pipeline_status(schema): - return { - jt.table_name: jt.progress() - for jt in schema.jobs - } - -# Example output: -# { -# 'FilteredImage': {'pending': 150, 'reserved': 3, 'success': 847, 'error': 12}, -# 'Analysis': {'pending': 500, 'reserved': 0, 'success': 0, 'error': 0}, -# } - -# Refresh all jobs tables in the schema -for jobs_table in schema.jobs: - jobs_table.refresh() - -# Get all errors across the pipeline -all_errors = [] -for jt in schema.jobs: - errors = jt.errors.fetch(as_dict=True) - for err in errors: - err['_table'] = jt.table_name - all_errors.append(err) -``` - -## Backward Compatibility - -### Migration - -This is a major release. The legacy schema-level `~jobs` table is replaced by per-table jobs tables: - -- **Legacy `~jobs` table**: No longer used; can be dropped manually if present -- **New jobs tables**: Created automatically on first `populate(reserve_jobs=True)` call -- **No parallel support**: Teams should migrate cleanly to the new system - -### API Compatibility - -The `schema.jobs` property returns a list of all jobs table objects for auto-populated tables in the schema: - -```python -# Returns list of JobsTable objects -schema.jobs -# [FilteredImage.jobs, Analysis.jobs, ...] - -# Iterate over all jobs tables -for jobs_table in schema.jobs: - print(f"{jobs_table.table_name}: {jobs_table.progress()}") - -# Query all errors across the schema -all_errors = [job for jt in schema.jobs for job in jt.errors.fetch(as_dict=True)] - -# Refresh all jobs tables -for jobs_table in schema.jobs: - jobs_table.refresh() -``` - -This replaces the legacy single `~jobs` table with direct access to per-table jobs. - -## Hazard Analysis - -This section identifies potential hazards and their mitigations. 
- -### Race Conditions - -| Hazard | Description | Mitigation | -|--------|-------------|------------| -| **Simultaneous reservation** | Two workers reserve the same pending job at nearly the same time | Acceptable: duplicate `make()` calls are resolved by transaction—second worker gets duplicate key error | -| **Reserve during refresh** | Worker reserves a job while another process is running `refresh()` | No conflict: `refresh()` adds new jobs and removes stale ones; reservation updates existing rows | -| **Concurrent refresh calls** | Multiple processes call `refresh()` simultaneously | Acceptable: may result in duplicate insert attempts, but primary key constraint prevents duplicates | -| **Complete vs delete race** | One process completes a job while another deletes it | Acceptable: one operation succeeds, other becomes no-op (row not found) | - -### State Transitions - -| Hazard | Description | Mitigation | -|--------|-------------|------------| -| **Invalid state transition** | Code attempts illegal transition (e.g., pending → success) | Implementation enforces valid transitions; invalid attempts raise error | -| **Stuck in reserved** | Worker crashes while job is reserved (orphaned job) | Manual intervention required: `jobs.reserved.delete()` (see Orphaned Job Handling) | -| **Success re-pended unexpectedly** | `refresh()` re-pends a success job when user expected it to stay | Only occurs if `keep_completed=True` AND key exists in `key_source` but not in target; document clearly | -| **Ignore not respected** | Ignored jobs get processed anyway | Implementation must skip `status='ignore'` in `populate()` job fetching | - -### Data Integrity - -| Hazard | Description | Mitigation | -|--------|-------------|------------| -| **Stale job processed** | Job references deleted upstream data | `make()` will fail or produce invalid results; `refresh()` cleans stale jobs before processing | -| **Jobs table out of sync** | Jobs table doesn't match `key_source` | `refresh()` synchronizes; call periodically or rely on `populate(refresh=True)` | -| **Partial make failure** | `make()` partially succeeds then fails | DataJoint transaction rollback ensures atomicity; job marked as error | -| **Error message truncation** | Error details exceed `varchar(2047)` | Full stack stored in `error_stack` (mediumblob); `error_message` is summary only | - -### Performance - -| Hazard | Description | Mitigation | -|--------|-------------|------------| -| **Large jobs table** | Jobs table grows very large with `keep_completed=True` | Default is `keep_completed=False`; provide guidance on periodic cleanup | -| **Slow refresh on large key_source** | `refresh()` queries entire `key_source` | Can restrict refresh to subsets: `jobs.refresh(Subject & 'lab="smith"')` | -| **Many jobs tables per schema** | Schema with many computed tables has many jobs tables | Jobs tables are lightweight; only created on first use | - -### Operational - -| Hazard | Description | Mitigation | -|--------|-------------|------------| -| **Accidental job deletion** | User runs `jobs.delete()` without restriction | `delete()` inherits from `delete_quick()` (no confirmation); users must apply restrictions carefully | -| **Clearing active jobs** | User clears reserved jobs while workers are still running | May cause duplicated work if job is refreshed and picked up again; coordinate with orchestrator | -| **Priority confusion** | User expects higher number = higher priority | Document clearly: lower values are more urgent (0 = highest priority) 
|

### Migration

| Hazard | Description | Mitigation |
|--------|-------------|------------|
| **Legacy ~jobs table conflict** | Old `~jobs` table exists alongside new per-table jobs | Systems are independent; legacy table can be dropped manually |
| **Mixed version workers** | Some workers use old system, some use new | Major release; do not support mixed operation—require full migration |
| **Lost error history** | Migrating loses error records from legacy table | Document migration procedure; users can export legacy errors before migration |

## Future Extensions

- [ ] Web-based dashboard for job monitoring
- [ ] Webhook notifications for job completion/failure
- [ ] Job dependencies (job B waits for job A)
- [ ] Resource tagging (GPU required, high memory, etc.)
- [ ] Retry policies (max retries, exponential backoff)
- [ ] Job grouping/batching for efficiency
- [ ] Integration with external schedulers (Slurm, PBS, etc.)

## Rationale

### Why Not External Orchestration?

The team considered integrating external tools like Airflow or Flyte but rejected this approach because:

1. **Deployment complexity**: External orchestrators require significant infrastructure
2. **Maintenance burden**: Additional systems to maintain and monitor
3. **Accessibility**: Not all DataJoint users have access to orchestration platforms
4. **Tight integration**: DataJoint's transaction model requires close coordination

The built-in jobs system provides 80% of the value with minimal additional complexity.

### Why Per-Table Jobs?

Per-table jobs tables provide:

1. **Better isolation**: Jobs for one table don't affect others
2. **Simpler queries**: No need to filter by table_name
3. **Native keys**: Primary keys are readable, not hashed
4. **High performance**: No FK constraints means minimal overhead on job operations
5. **Scalability**: Each table's jobs can be indexed independently

### Why Remove Key Hashing?

The current system hashes primary keys to support arbitrary key types. The new system uses native keys because:

1. **Readability**: Debugging is much easier with readable keys
2. **Query efficiency**: Native keys can use table indexes
3. **Foreign keys**: Hash-based keys cannot participate in foreign key relationships
4. **Simplicity**: No need for hash computation and comparison

### Why FK-Derived Primary Keys Only?

The jobs table primary key includes only attributes derived from foreign keys in the target table's primary key. This design:

1. **Aligns with key_source**: The `key_source` query naturally produces keys matching the FK-derived attributes
2. **Simplifies job identity**: A job's identity is determined by its upstream dependencies
3. **Handles additional PK attributes**: When targets have additional PK attributes (e.g., `method`), one job covers all values for that attribute

diff --git a/docs/src/archive/compute/distributed.md b/docs/src/archive/compute/distributed.md
deleted file mode 100644
index 68c31f093..000000000
--- a/docs/src/archive/compute/distributed.md
+++ /dev/null
@@ -1,166 +0,0 @@
# Distributed Computing

## Job reservations

Running `populate` on the same table on multiple computers will cause them to attempt
to compute the same data all at once.
This will not corrupt the data since DataJoint will reject any duplication.
One partial remedy is to have the different computing nodes populate the table in
random order, as sketched below.
This would reduce some collisions but not completely prevent them.
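A sketch of this workaround, using the `order` argument of `populate` from the 1.x API and the `JobResults` table that serves as the example below:

```python
# Each worker shuffles the order in which it attempts the missing keys.
# Duplicate results are rejected by the primary key, so a collision wastes
# some computation but does not corrupt the data.
JobResults.populate(order="random")
```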
- -To allow efficient distributed computing, DataJoint provides a built-in job reservation -process. -When `dj.Computed` tables are auto-populated using job reservation, a record of each -ongoing computation is kept in a schema-wide `jobs` table, which is used internally by -DataJoint to coordinate the auto-population effort among multiple computing processes. - -Job reservations are activated by setting the keyword argument `reserve_jobs=True` in -`populate` calls. - -With job management enabled, the `make` method of each table class will also consult -the `jobs` table for reserved jobs as part of determining the next record to compute -and will create an entry in the `jobs` table as part of the attempt to compute the -resulting record for that key. -If the operation is a success, the record is removed. -In the event of failure, the job reservation entry is updated to indicate the details -of failure. -Using this simple mechanism, multiple processes can participate in the auto-population -effort without duplicating computational effort, and any errors encountered during the -course of the computation can be individually inspected to determine the cause of the -issue. - -As part of DataJoint, the jobs table can be queried using native DataJoint syntax. For -example, to list the jobs currently being run: - -```python -In [1]: schema.jobs -Out[1]: -*table_name *key_hash status error_message user host pid connection_id timestamp key error_stack -+------------+ +------------+ +----------+ +------------+ +------------+ +------------+ +-------+ +------------+ +------------+ +--------+ +------------+ -__job_results e4da3b7fbbce23 reserved datajoint@localhos localhost 15571 59 2017-09-04 14: -(2 tuples) -``` - -The above output shows that a record for the `JobResults` table is currently reserved -for computation, along with various related details of the reservation, such as the -MySQL connection ID, client user and host, process ID on the remote system, timestamp, -and the key for the record that the job is using for its computation. -Since DataJoint table keys can be of varying types, the key is stored in a binary -format to allow the table to store arbitrary types of record key data. -The subsequent sections will discuss querying the jobs table for key data. - -As mentioned above, jobs encountering errors during computation will leave their record -reservations in place, and update the reservation record with details of the error. - -For example, if a Python process is interrupted via the keyboard, a KeyboardError will -be logged to the database as follows: - -```python -In [2]: schema.jobs -Out[2]: -*table_name *key_hash status error_message user host pid connection_id timestamp key error_stack -+------------+ +------------+ +--------+ +------------+ +------------+ +------------+ +-------+ +------------+ +------------+ +--------+ +------------+ -__job_results 3416a75f4cea91 error KeyboardInterr datajoint@localhos localhost 15571 59 2017-09-04 14: -(1 tuples) -``` - -By leaving the job reservation record in place, the error can be inspected, and if -necessary the corresponding `dj.Computed` update logic can be corrected. -From there the jobs entry can be cleared, and the computation can then be resumed. -In the meantime, the presence of the job reservation will prevent this particular -record from being processed during subsequent auto-population calls. -Inspecting the job record for failure details can proceed much like any other DataJoint -query. 
- -For example, given the above table, errors can be inspected as follows: - -```python -In [3]: (schema.jobs & 'status="error"' ).fetch(as_dict=True) -Out[3]: -[OrderedDict([('table_name', '__job_results'), - ('key_hash', 'c81e728d9d4c2f636f067f89cc14862c'), - ('status', 'error'), - ('key', rec.array([(2,)], - dtype=[('id', 'O')])), - ('error_message', 'KeyboardInterrupt'), - ('error_stack', None), - ('user', 'datajoint@localhost'), - ('host', 'localhost'), - ('pid', 15571), - ('connection_id', 59), - ('timestamp', datetime.datetime(2017, 9, 4, 15, 3, 53))])] -``` - -This particular error occurred when processing the record with ID `2`, resulted from a -`KeyboardInterrupt`, and has no additional -error trace. - -After any system or code errors have been resolved, the table can simply be cleaned of -errors and the computation rerun. - -For example: - -```python -In [4]: (schema.jobs & 'status="error"' ).delete() -``` - -In some cases, it may be preferable to inspect the jobs table records using populate -keys. -Since job keys are hashed and stored as a blob in the jobs table to support the varying -types of keys, we need to query using the key hash instead of simply using the raw key -data. - -This can be done by using `dj.key_hash` to convert the key as follows: - -```python -In [4]: jk = {'table_name': JobResults.table_name, 'key_hash' : dj.key_hash({'id': 2})} - -In [5]: schema.jobs & jk -Out[5]: -*table_name *key_hash status key error_message error_stac user host pid connection_id timestamp -+------------+ +------------+ +--------+ +--------+ +------------+ +--------+ +------------+ +-------+ +--------+ +------------+ +------------+ -__job_results c81e728d9d4c2f error =BLOB= KeyboardInterr =BLOB= datajoint@localhost localhost 15571 59 2017-09-04 14: -(Total: 1) - -In [6]: (schema.jobs & jk).delete() - -In [7]: schema.jobs & jk -Out[7]: -*table_name *key_hash status key error_message error_stac user host pid connection_id timestamp -+------------+ +----------+ +--------+ +--------+ +------------+ +--------+ +------+ +------+ +-----+ +------------+ +-----------+ - -(Total: 0) -``` - -## Managing connections - -The DataJoint method `dj.kill` allows for viewing and termination of database -connections. -Restrictive conditions can be used to identify specific connections. -Restrictions are specified as strings and can involve any of the attributes of -`information_schema.processlist`: `ID`, `USER`, `HOST`, `DB`, `COMMAND`, `TIME`, -`STATE`, and `INFO`. - -Examples: - - `dj.kill('HOST LIKE "%compute%"')` lists only connections from hosts containing "compute". - `dj.kill('TIME > 600')` lists only connections older than 10 minutes. - -A list of connections meeting the restriction conditions (if present) are presented to -the user, along with the option to kill processes. By default, output is ordered by -ascending connection ID. To change the output order of dj.kill(), an additional -order_by argument can be provided. 
- -For example, to sort the output by hostname in descending order: - -```python -In [3]: dj.kill(None, None, 'host desc') -Out[3]: - ID USER HOST STATE TIME INFO -+--+ +----------+ +-----------+ +-----------+ +-----+ - 33 chris localhost:54840 1261 None - 17 chris localhost:54587 3246 None - 4 event_scheduler localhost Waiting on empty queue 187180 None -process to kill or "q" to quit > q -``` diff --git a/docs/src/archive/compute/key-source.md b/docs/src/archive/compute/key-source.md deleted file mode 100644 index c9b5d2ce7..000000000 --- a/docs/src/archive/compute/key-source.md +++ /dev/null @@ -1,51 +0,0 @@ -# Key Source - -## Default key source - -**Key source** refers to the set of primary key values over which -[autopopulate](./populate.md) iterates, calling the `make` method at each iteration. -Each `key` from the key source is passed to the table's `make` call. -By default, the key source for a table is the [join](../query/join.md) of its primary -[dependencies](../design/tables/dependencies.md). - -For example, consider a schema with three tables. -The `Stimulus` table contains one attribute `stimulus_type` with one of two values, -"Visual" or "Auditory". -The `Modality` table contains one attribute `modality` with one of three values, "EEG", -"fMRI", and "PET". -The `Protocol` table has primary dependencies on both the `Stimulus` and `Modality` tables. - -The key source for `Protocol` will then be all six combinations of `stimulus_type` and -`modality` as shown in the figure below. - -![Combination of stimulus_type and modality](../images/key_source_combination.png){: style="align:center"} - -## Custom key source - -A custom key source can be configured by setting the `key_source` property within a -table class, after the `definition` string. - -Any [query object](../query/fetch.md) can be used as the key source. -In most cases the new key source will be some alteration of the default key source. -Custom key sources often involve restriction to limit the key source to only relevant -entities. -Other designs may involve using only one of a table's primary dependencies. - -In the example below, the `EEG` table depends on the `Recording` table that lists all -recording sessions. -However, the `populate` method of `EEG` should only ingest recordings where the -`recording_type` is `EEG`. -Setting a custom key source prevents the `populate` call from iterating over recordings -of the wrong type. - -```python -@schema -class EEG(dj.Imported): -definition = """ --> Recording ---- -sample_rate : float -eeg_data : -""" -key_source = Recording & 'recording_type = "EEG"' -``` diff --git a/docs/src/archive/compute/make.md b/docs/src/archive/compute/make.md deleted file mode 100644 index 390be3b7b..000000000 --- a/docs/src/archive/compute/make.md +++ /dev/null @@ -1,215 +0,0 @@ -# Transactions in Make - -Each call of the [make](../compute/make.md) method is enclosed in a transaction. -DataJoint users do not need to explicitly manage transactions but must be aware of -their use. - -Transactions produce two effects: - -First, the state of the database appears stable within the `make` call throughout the -transaction: -two executions of the same query will yield identical results within the same `make` -call. - -Second, any changes to the database (inserts) produced by the `make` method will not -become visible to other processes until the `make` call completes execution. 
-If the `make` method raises an exception, all changes made so far will be discarded and -will never become visible to other processes. - -Transactions are particularly important in maintaining -[group integrity](../design/integrity.md#group-integrity) with -[master-part relationships](../design/tables/master-part.md). -The `make` call of a master table first inserts the master entity and then inserts all -the matching part entities in the part tables. -None of the entities become visible to other processes until the entire `make` call -completes, at which point they all become visible. - -### Three-Part Make Pattern for Long Computations - -For long-running computations, DataJoint provides an advanced pattern called the -**three-part make** that separates the `make` method into three distinct phases. -This pattern is essential for maintaining database performance and data integrity -during expensive computations. - -#### The Problem: Long Transactions - -Traditional `make` methods perform all operations within a single database transaction: - -```python -def make(self, key): - # All within one transaction - data = (ParentTable & key).fetch1() # Fetch - result = expensive_computation(data) # Compute (could take hours) - self.insert1(dict(key, result=result)) # Insert -``` - -This approach has significant limitations: -- **Database locks**: Long transactions hold locks on tables, blocking other operations -- **Connection timeouts**: Database connections may timeout during long computations -- **Memory pressure**: All fetched data must remain in memory throughout the computation -- **Failure recovery**: If computation fails, the entire transaction is rolled back - -#### The Solution: Three-Part Make Pattern - -The three-part make pattern splits the `make` method into three distinct phases, -allowing the expensive computation to occur outside of database transactions: - -```python -def make_fetch(self, key): - """Phase 1: Fetch all required data from parent tables""" - fetched_data = ((ParentTable1 & key).fetch1(), (ParentTable2 & key).fetch1()) - return fetched_data # must be a sequence, eg tuple or list - -def make_compute(self, key, *fetched_data): - """Phase 2: Perform expensive computation (outside transaction)""" - computed_result = expensive_computation(*fetched_data) - return computed_result # must be a sequence, eg tuple or list - -def make_insert(self, key, *computed_result): - """Phase 3: Insert results into the current table""" - self.insert1(dict(key, result=computed_result)) -``` - -#### Execution Flow - -To achieve data intensity without long transactions, the three-part make pattern follows this sophisticated execution sequence: - -```python -# Step 1: Fetch data outside transaction -fetched_data1 = self.make_fetch(key) -computed_result = self.make_compute(key, *fetched_data1) - -# Step 2: Begin transaction and verify data consistency -begin transaction: - fetched_data2 = self.make_fetch(key) - if fetched_data1 != fetched_data2: # deep comparison - cancel transaction # Data changed during computation - else: - self.make_insert(key, *computed_result) - commit_transaction -``` - -#### Key Benefits - -1. **Reduced Database Lock Time**: Only the fetch and insert operations occur within transactions, minimizing lock duration -2. **Connection Efficiency**: Database connections are only used briefly for data transfer -3. **Memory Management**: Fetched data can be processed and released during computation -4. **Fault Tolerance**: Computation failures don't affect database state -5. 
**Scalability**: Multiple computations can run concurrently without database contention - -#### Referential Integrity Protection - -The pattern includes a critical safety mechanism: **referential integrity verification**. -Before inserting results, the system: - -1. Re-fetches the source data within the transaction -2. Compares it with the originally fetched data using deep hashing -3. Only proceeds with insertion if the data hasn't changed - -This prevents the "phantom read" problem where source data changes during long computations, -ensuring that results remain consistent with their inputs. - -#### Implementation Details - -The pattern is implemented using Python generators in the `AutoPopulate` class: - -```python -def make(self, key): - # Step 1: Fetch data from parent tables - fetched_data = self.make_fetch(key) - computed_result = yield fetched_data - - # Step 2: Compute if not provided - if computed_result is None: - computed_result = self.make_compute(key, *fetched_data) - yield computed_result - - # Step 3: Insert the computed result - self.make_insert(key, *computed_result) - yield -``` -Therefore, it is possible to override the `make` method to implement the three-part make pattern by using the `yield` statement to return the fetched data and computed result as above. - -#### Use Cases - -This pattern is particularly valuable for: - -- **Machine learning model training**: Hours-long training sessions -- **Image processing pipelines**: Large-scale image analysis -- **Statistical computations**: Complex statistical analyses -- **Data transformations**: ETL processes with heavy computation -- **Simulation runs**: Time-consuming simulations - -#### Example: Long-Running Image Analysis - -Here's an example of how to implement the three-part make pattern for a -long-running image analysis task: - -```python -@schema -class ImageAnalysis(dj.Computed): - definition = """ - # Complex image analysis results - -> Image - --- - analysis_result : - processing_time : float - """ - - def make_fetch(self, key): - """Fetch the image data needed for analysis""" - image_data = (Image & key).fetch1('image') - params = (Params & key).fetch1('params') - return (image_data, params) # pack fetched_data - - def make_compute(self, key, image_data, params): - """Perform expensive image analysis outside transaction""" - import time - start_time = time.time() - - # Expensive computation that could take hours - result = complex_image_analysis(image_data, params) - processing_time = time.time() - start_time - return result, processing_time - - def make_insert(self, key, analysis_result, processing_time): - """Insert the analysis results""" - self.insert1(dict(key, - analysis_result=analysis_result, - processing_time=processing_time)) -``` - -The exact same effect may be achieved by overriding the `make` method as a generator function using the `yield` statement to return the fetched data and computed result as above: - -```python -@schema -class ImageAnalysis(dj.Computed): - definition = """ - # Complex image analysis results - -> Image - --- - analysis_result : - processing_time : float - """ - - def make(self, key): - image_data = (Image & key).fetch1('image') - params = (Params & key).fetch1('params') - computed_result = yield (image, params) # pack fetched_data - - if computed_result is None: - # Expensive computation that could take hours - import time - start_time = time.time() - result = complex_image_analysis(image_data, params) - processing_time = time.time() - start_time - computed_result = result, 
processing_time #pack - yield computed_result - - result, processing_time = computed_result # unpack - self.insert1(dict(key, - analysis_result=result, - processing_time=processing_time)) - yield # yield control back to the caller -``` -We expect that most users will prefer to use the three-part implementation over the generator function implementation due to its conceptual complexity. \ No newline at end of file diff --git a/docs/src/archive/compute/populate.md b/docs/src/archive/compute/populate.md deleted file mode 100644 index 91db7b176..000000000 --- a/docs/src/archive/compute/populate.md +++ /dev/null @@ -1,317 +0,0 @@ -# Auto-populate - -Auto-populated tables are used to define, execute, and coordinate computations in a -DataJoint pipeline. - -Tables in the initial portions of the pipeline are populated from outside the pipeline. -In subsequent steps, computations are performed automatically by the DataJoint pipeline -in auto-populated tables. - -Computed tables belong to one of the two auto-populated -[data tiers](../design/tables/tiers.md): `dj.Imported` and `dj.Computed`. -DataJoint does not enforce the distinction between imported and computed tables: the -difference is purely semantic, a convention for developers to follow. -If populating a table requires access to external files such as raw storage that is not -part of the database, the table is designated as **imported**. -Otherwise it is **computed**. - -Auto-populated tables are defined and queried exactly as other tables. -(See [Manual Tables](../design/tables/manual.md).) -Their data definition follows the same [definition syntax](../design/tables/declare.md). - -## Make - -For auto-populated tables, data should never be entered using -[insert](../manipulation/insert.md) directly. -Instead these tables must define the callback method `make(self, key)`. -The `insert` method then can only be called on `self` inside this callback method. - -Imagine that there is a table `test.Image` that contains 2D grayscale images in its -`image` attribute. -Let us define the computed table, `test.FilteredImage` that filters the image in some -way and saves the result in its `filtered_image` attribute. - -The class will be defined as follows. - -```python -@schema -class FilteredImage(dj.Computed): - definition = """ - # Filtered image - -> Image - --- - filtered_image : - """ - - def make(self, key): - img = (test.Image & key).fetch1('image') - key['filtered_image'] = myfilter(img) - self.insert1(key) -``` - -The `make` method receives one argument: the dict `key` containing the primary key -value of an element of [key source](key-source.md) to be worked on. - -The key represents the partially filled entity, usually already containing the -[primary key](../design/tables/primary.md) attributes of the key source. - -The `make` callback does three things: - -1. [Fetches](../query/fetch.md) data from tables upstream in the pipeline using the -`key` for [restriction](../query/restrict.md). -2. Computes and adds any missing attributes to the fields already in `key`. -3. Inserts the entire entity into `self`. - -A single `make` call may populate multiple entities when `key` does not specify the -entire primary key of the populated table, when the definition adds new attributes to the primary key. -This design is uncommon and not recommended. -The standard practice for autopopulated tables is to have its primary key composed of -foreign keys pointing to parent tables. 
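For clarity, the following is a minimal sketch of what such a multi-entity `make` might look like; the `DetectedCell` table and the `detect_cells` routine are hypothetical, and the example is shown only to illustrate the design described above, not to endorse it:

```python
@schema
class DetectedCell(dj.Computed):
    definition = """
    # One entry per cell detected in a filtered image
    -> FilteredImage
    cell_id : int          # additional primary key attribute beyond the foreign key
    ---
    center_x : float
    center_y : float
    """

    def make(self, key):
        # `key` identifies one FilteredImage entry; this single call
        # inserts one row per detected cell for that image.
        img = (FilteredImage & key).fetch1('filtered_image')
        cells = detect_cells(img)   # hypothetical detection routine
        self.insert(
            dict(key, cell_id=i, center_x=x, center_y=y)
            for i, (x, y) in enumerate(cells)
        )
```

In most pipelines, this kind of one-to-many result is more commonly modeled with a master table and a part table populated within the master's `make` call.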
### Three-Part Make Pattern for Long Computations

For long-running computations, DataJoint provides an advanced pattern called the **three-part make** that separates the `make` method into three distinct phases. This pattern is essential for maintaining database performance and data integrity during expensive computations.

#### The Problem: Long Transactions

Traditional `make` methods perform all operations within a single database transaction:

```python
def make(self, key):
    # All within one transaction
    data = (ParentTable & key).fetch1()       # Fetch
    result = expensive_computation(data)      # Compute (could take hours)
    self.insert1(dict(key, result=result))    # Insert
```

This approach has significant limitations:

- **Database locks**: Long transactions hold locks on tables, blocking other operations
- **Connection timeouts**: Database connections may time out during long computations
- **Memory pressure**: All fetched data must remain in memory throughout the computation
- **Failure recovery**: If the computation fails, the entire transaction is rolled back

#### The Solution: Three-Part Make Pattern

The three-part make pattern splits the `make` method into three distinct phases, allowing the expensive computation to occur outside of database transactions:

```python
def make_fetch(self, key):
    """Phase 1: Fetch all required data from parent tables"""
    fetched_data = ((ParentTable & key).fetch1(),)
    return fetched_data  # must be a sequence, e.g. a tuple or list

def make_compute(self, key, *fetched_data):
    """Phase 2: Perform expensive computation (outside transaction)"""
    computed_result = expensive_computation(*fetched_data)
    return computed_result  # must be a sequence, e.g. a tuple or list

def make_insert(self, key, *computed_result):
    """Phase 3: Insert results into the current table"""
    self.insert1(dict(key, result=computed_result))
```

#### Execution Flow

To achieve data integrity without long transactions, the three-part make pattern follows this execution sequence:

```python
# Pseudocode, not executable Python:
# Step 1: Fetch data outside the transaction
fetched_data1 = self.make_fetch(key)
computed_result = self.make_compute(key, *fetched_data1)

# Step 2: Begin transaction and verify data consistency
begin transaction:
    fetched_data2 = self.make_fetch(key)
    if fetched_data1 != fetched_data2:  # deep comparison
        cancel transaction  # data changed during computation
    else:
        self.make_insert(key, *computed_result)
        commit transaction
```

#### Key Benefits

1. **Reduced Database Lock Time**: Only the fetch and insert operations occur within transactions, minimizing lock duration
2. **Connection Efficiency**: Database connections are only used briefly for data transfer
3. **Memory Management**: Fetched data can be processed and released during computation
4. **Fault Tolerance**: Computation failures don't affect database state
5. **Scalability**: Multiple computations can run concurrently without database contention

#### Referential Integrity Protection

The pattern includes a critical safety mechanism: **referential integrity verification**. Before inserting results, the system:

1. Re-fetches the source data within the transaction
2. Compares it with the originally fetched data using deep hashing
3. Only proceeds with insertion if the data has not changed

This prevents the "phantom read" problem where source data changes during long computations, ensuring that results remain consistent with their inputs.
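To make the comparison in step 2 concrete, here is a small illustrative sketch of a deep comparison by hashing. It assumes the fetched values are picklable and is only meant to make the idea tangible; it is not DataJoint's internal implementation:

```python
import hashlib
import pickle


def deep_hash(obj) -> str:
    """Hash an arbitrary picklable Python object for equality checks.

    Note: pickling is used here only for illustration; it is not a
    guaranteed-stable serialization for every data type.
    """
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()


def inputs_unchanged(fetched_before, fetched_now) -> bool:
    """Conceptual check performed inside the insert transaction:
    proceed with the insert only if the re-fetched inputs match."""
    return deep_hash(fetched_before) == deep_hash(fetched_now)
```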
- -#### Implementation Details - -The pattern is implemented using Python generators in the `AutoPopulate` class: - -```python -def make(self, key): - # Step 1: Fetch data from parent tables - fetched_data = self.make_fetch(key) - computed_result = yield fetched_data - - # Step 2: Compute if not provided - if computed_result is None: - computed_result = self.make_compute(key, *fetched_data) - yield computed_result - - # Step 3: Insert the computed result - self.make_insert(key, *computed_result) - yield -``` -Therefore, it is possible to override the `make` method to implement the three-part make pattern by using the `yield` statement to return the fetched data and computed result as above. - -#### Use Cases - -This pattern is particularly valuable for: - -- **Machine learning model training**: Hours-long training sessions -- **Image processing pipelines**: Large-scale image analysis -- **Statistical computations**: Complex statistical analyses -- **Data transformations**: ETL processes with heavy computation -- **Simulation runs**: Time-consuming simulations - -#### Example: Long-Running Image Analysis - -Here's an example of how to implement the three-part make pattern for a -long-running image analysis task: - -```python -@schema -class ImageAnalysis(dj.Computed): - definition = """ - # Complex image analysis results - -> Image - --- - analysis_result : - processing_time : float - """ - - def make_fetch(self, key): - """Fetch the image data needed for analysis""" - return (Image & key).fetch1('image'), - - def make_compute(self, key, image_data): - """Perform expensive image analysis outside transaction""" - import time - start_time = time.time() - - # Expensive computation that could take hours - result = complex_image_analysis(image_data) - processing_time = time.time() - start_time - return result, processing_time - - def make_insert(self, key, analysis_result, processing_time): - """Insert the analysis results""" - self.insert1(dict(key, - analysis_result=analysis_result, - processing_time=processing_time)) -``` - -The exact same effect may be achieved by overriding the `make` method as a generator function using the `yield` statement to return the fetched data and computed result as above: - -```python -@schema -class ImageAnalysis(dj.Computed): - definition = """ - # Complex image analysis results - -> Image - --- - analysis_result : - processing_time : float - """ - - def make(self, key): - image_data = (Image & key).fetch1('image') - computed_result = yield (image_data, ) # pack fetched_data - - if computed_result is None: - # Expensive computation that could take hours - import time - start_time = time.time() - result = complex_image_analysis(image_data) - processing_time = time.time() - start_time - computed_result = result, processing_time #pack - yield computed_result - - result, processing_time = computed_result # unpack - self.insert1(dict(key, - analysis_result=result, - processing_time=processing_time)) - yield # yield control back to the caller -``` -We expect that most users will prefer to use the three-part implementation over the generator function implementation due to its conceptual complexity. - -## Populate - -The inherited `populate` method of `dj.Imported` and `dj.Computed` automatically calls -`make` for every key for which the auto-populated table is missing data. 
- -The `FilteredImage` table can be populated as - -```python -FilteredImage.populate() -``` - -The progress of long-running calls to `populate()` in datajoint-python can be -visualized by adding the `display_progress=True` argument to the populate call. - -Note that it is not necessary to specify which data needs to be computed. -DataJoint will call `make`, one-by-one, for every key in `Image` for which -`FilteredImage` has not yet been computed. - -Chains of auto-populated tables form computational pipelines in DataJoint. - -## Populate options - -The `populate` method accepts a number of optional arguments that provide more features -and allow greater control over the method's behavior. - -- `restrictions` - A list of restrictions, restricting as -`(tab.key_source & AndList(restrictions)) - tab.proj()`. - Here `target` is the table to be populated, usually `tab` itself. -- `suppress_errors` - If `True`, encountering an error will cancel the current `make` -call, log the error, and continue to the next `make` call. - Error messages will be logged in the job reservation table (if `reserve_jobs` is - `True`) and returned as a list. - See also `return_exception_objects` and `reserve_jobs`. - Defaults to `False`. -- `return_exception_objects` - If `True`, error objects are returned instead of error - messages. - This applies only when `suppress_errors` is `True`. - Defaults to `False`. -- `reserve_jobs` - If `True`, reserves job to indicate to other distributed processes. - The job reservation table may be access as `schema.jobs`. - Errors are logged in the jobs table. - Defaults to `False`. -- `order` - The order of execution, either `"original"`, `"reverse"`, or `"random"`. - Defaults to `"original"`. -- `display_progress` - If `True`, displays a progress bar. - Defaults to `False`. -- `limit` - If not `None`, checks at most this number of keys. - Defaults to `None`. -- `max_calls` - If not `None`, populates at most this many keys. - Defaults to `None`, which means no limit. - -## Progress - -The method `table.progress` reports how many `key_source` entries have been populated -and how many remain. -Two optional parameters allow more advanced use of the method. -A parameter of restriction conditions can be provided, specifying which entities to -consider. -A Boolean parameter `display` (default is `True`) allows disabling the output, such -that the numbers of remaining and total entities are returned but not printed. diff --git a/docs/src/archive/concepts/data-model.md b/docs/src/archive/concepts/data-model.md deleted file mode 100644 index 90460361a..000000000 --- a/docs/src/archive/concepts/data-model.md +++ /dev/null @@ -1,172 +0,0 @@ -# Data Model - -## What is a data model? - -A **data model** is a conceptual framework that defines how data is organized, -represented, and transformed. It gives us the components for creating blueprints for the -structure and operations of data management systems, ensuring consistency and efficiency -in data handling. - -Data management systems are built to accommodate these models, allowing us to manage -data according to the principles laid out by the model. If you’re studying data science -or engineering, you’ve likely encountered different data models, each providing a unique -approach to organizing and manipulating data. - -A data model is defined by considering the following key aspects: - -+ What are the fundamental elements used to structure the data? -+ What operations are available for defining, creating, and manipulating the data? 
-+ What mechanisms exist to enforce the structure and rules governing valid data interactions? - -## Types of data models - -Among the most familiar data models are those based on files and folders: data of any -kind are lumped together into binary strings called **files**, files are collected into -folders, and folders can be nested within other folders to create a folder hierarchy. - -Another family of data models are various **tabular models**. -For example, items in CSV files are listed in rows, and the attributes of each item are -stored in columns. -Various **spreadsheet** models allow forming dependencies between cells and groups of -cells, including complex calculations. - -The **object data model** is common in programming, where data are represented as -objects in memory with properties and methods for transformations of such data. - -## Relational data model - -The **relational model** is a way of thinking about data as sets and operations on sets. -Formalized almost a half-century ago ([Codd, -1969](https://dl.acm.org/citation.cfm?doid=362384.362685)). The relational data model is -one of the most powerful and precise ways to store and manage structured data. At its -core, this model organizes all data into tables--representing mathematical -relations---where each table consists of rows (representing mathematical tuples) and -columns (often called attributes). - -### Core principles of the relational data model - -**Data representation:** - Data are represented and manipulated in the form of relations. - A relation is a set (i.e. an unordered collection) of entities of values for each of - the respective named attributes of the relation. - Base relations represent stored data while derived relations are formed from base - relations through query expressions. - A collection of base relations with their attributes, domain constraints, uniqueness - constraints, and referential constraints is called a schema. - -**Domain constraints:** - Each attribute (column) in a table is associated with a specific attribute domain (or - datatype, a set of possible values), ensuring that the data entered is valid. - Attribute domains may not include relations, which keeps the data model - flat, i.e. free of nested structures. - -**Uniqueness constraints:** - Entities within relations are addressed by values of their attributes. - To identify and relate data elements, uniqueness constraints are imposed on subsets - of attributes. - Such subsets are then referred to as keys. - One key in a relation is designated as the primary key used for referencing its elements. - -**Referential constraints:** - Associations among data are established by means of referential constraints with the - help of foreign keys. - A referential constraint on relation A referencing relation B allows only those - entities in A whose foreign key attributes match the key attributes of an entity in B. - -**Declarative queries:** - Data queries are formulated through declarative, as opposed to imperative, - specifications of sought results. - This means that query expressions convey the logic for the result rather than the - procedure for obtaining it. - Formal languages for query expressions include relational algebra, relational - calculus, and SQL. - -The relational model has many advantages over both hierarchical file systems and -tabular models for maintaining data integrity and providing flexible access to -interesting subsets of the data. 
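As a concrete sketch of these principles in code, the following hypothetical schema (table names invented for illustration, anticipating the DataJoint syntax described later on this page) shows a domain constraint, a uniqueness constraint via the primary key, a referential constraint via a foreign key, and a declarative query:

```python
import datajoint as dj

schema = dj.Schema('example_relational')   # hypothetical schema name

@schema
class Subject(dj.Manual):
    definition = """
    subject_id : int                   # primary key: uniqueness constraint
    ---
    species : enum('mouse', 'rat')     # domain constraint on valid values
    """

@schema
class Session(dj.Manual):
    definition = """
    -> Subject                         # foreign key: referential constraint
    session : int
    ---
    session_date : date
    """

# A declarative query: specify *what* is wanted, not *how* to compute it.
mouse_sessions = Session & (Subject & 'species = "mouse"')
```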
- -Popular implementations of the relational data model rely on the Structured Query -Language (SQL). -SQL comprises distinct sublanguages for schema definition, data manipulation, and data -queries. -SQL thoroughly dominates in the space of relational databases and is often conflated -with the relational data model in casual discourse. -Various terminologies are used to describe related concepts from the relational data -model. -Similar to spreadsheets, relations are often visualized as tables with *attributes* -corresponding to *columns* and *entities* corresponding to *rows*. -In particular, SQL uses the terms *table*, *column*, and *row*. - -## The DataJoint Model - -DataJoint is a conceptual refinement of the relational data model offering a more -expressive and rigorous framework for database programming ([Yatsenko et al., -2018](https://arxiv.org/abs/1807.11104)). The DataJoint model facilitates conceptual -clarity, efficiency, workflow management, and precise and flexible data -queries. By enforcing entity normalization, -simplifying dependency declarations, offering a rich query algebra, and visualizing -relationships through schema diagrams, DataJoint makes relational database programming -more intuitive and robust for complex data pipelines. - -The model has emerged over a decade of continuous development of complex data -pipelines for neuroscience experiments ([Yatsenko et al., -2015](https://www.biorxiv.org/content/early/2015/11/14/031658)). DataJoint has allowed -researchers with no prior knowledge of databases to collaborate effectively on common -data pipelines sustaining data integrity and supporting flexible access. DataJoint is -currently implemented as client libraries in MATLAB and Python. These libraries work by -transpiling DataJoint queries into SQL before passing them on to conventional relational -database systems that serve as the backend, in combination with bulk storage systems for -storing large contiguous data objects. - -DataJoint comprises: - -+ a schema [definition](../design/tables/declare.md) language -+ a data [manipulation](../manipulation/index.md) language -+ a data [query](../query/principles.md) language -+ a [diagramming](../design/diagrams.md) notation for visualizing relationships between -modeled entities - -The key refinement of DataJoint over other relational data models and their -implementations is DataJoint's support of -[entity normalization](../design/normalization.md). - -### Core principles of the DataJoint model - -**Entity Normalization** - DataJoint enforces entity normalization, ensuring that every entity set (table) is - well-defined, with each element belonging to the same type, sharing the same - attributes, and distinguished by the same primary key. This principle reduces - redundancy and avoids data anomalies, similar to Boyce-Codd Normal Form, but with a - more intuitive structure than traditional SQL. - -**Simplified Schema Definition and Dependency Management** - DataJoint introduces a schema definition language that is more expressive and less - error-prone than SQL. Dependencies are explicitly declared using arrow notation - (->), making referential constraints easier to understand and visualize. The - dependency structure is enforced as an acyclic directed graph, which simplifies - workflows by preventing circular dependencies. 
- -**Integrated Query Operators producing a Relational Algebra** - DataJoint introduces five query operators (restrict, join, project, aggregate, and - union) with algebraic closure, allowing them to be combined seamlessly. These - operators are designed to maintain operational entity normalization, ensuring query - outputs remain valid entity sets. - -**Diagramming Notation for Conceptual Clarity** - DataJoint’s schema diagrams simplify the representation of relationships between - entity sets compared to ERM diagrams. Relationships are expressed as dependencies - between entity sets, which are visualized using solid or dashed lines for primary - and secondary dependencies, respectively. - -**Unified Logic for Binary Operators** - DataJoint simplifies binary operations by requiring attributes involved in joins or - comparisons to be homologous (i.e., sharing the same origin). This avoids the - ambiguity and pitfalls of natural joins in SQL, ensuring more predictable query - results. - -**Optimized Data Pipelines for Scientific Workflows** - DataJoint treats the database as a data pipeline where each entity set defines a - step in the workflow. This makes it ideal for scientific experiments and complex - data processing, such as in neuroscience. Its MATLAB and Python libraries transpile - DataJoint queries into SQL, bridging the gap between scientific programming and - relational databases. diff --git a/docs/src/archive/concepts/data-pipelines.md b/docs/src/archive/concepts/data-pipelines.md deleted file mode 100644 index cf20b075b..000000000 --- a/docs/src/archive/concepts/data-pipelines.md +++ /dev/null @@ -1,166 +0,0 @@ -# Data Pipelines - -## What is a data pipeline? - -A scientific **data pipeline** is a collection of processes and systems for organizing -the data, computations, and workflows used by a research group as they jointly perform -complex sequences of data acquisition, processing, and analysis. - -A variety of tools can be used for supporting shared data pipelines: - -Data repositories - Research teams set up a shared **data repository**. - This minimal data management tool allows depositing and retrieving data and managing - user access. - For example, this may include a collection of files with standard naming conventions - organized into folders and sub-folders. - Or a data repository might reside on the cloud, for example in a collection of S3 - buckets. - This image of data management -- where files are warehoused and retrieved from a - hierarchically-organized system of folders -- is an approach that is likely familiar - to most scientists. - -Database systems - **Databases** are a form of data repository providing additional capabilities: - - 1. Defining, communicating, and enforcing structure in the stored data. - 2. Maintaining data integrity: correct identification of data and consistent cross-references, dependencies, and groupings among the data. - 3. Supporting queries that retrieve various cross-sections and transformation of the deposited data. - - Most scientists have some familiarity with these concepts, for example the notion of maintaining consistency between data and the metadata that describes it, or applying a filter to an Excel spreadsheet to retrieve specific subsets of information. - However, usually the more advanced concepts involved in building and using relational databases fall under the specific expertise of data scientists. 
- -Data pipelines - **Data pipeline** frameworks may include all the features of a database system along - with additional functionality: - - 1. Integrating computations to perform analyses and manage intermediate results in a principled way. - 2. Supporting distributed computations without conflict. - 3. Defining, communicating, and enforcing **workflow**, making clear the sequence of steps that must be performed for data entry, acquisition, and processing. - - Again, the informal notion of an analysis "workflow" will be familiar to most scientists, along with the logistical difficulties associated with managing a workflow that is shared by multiple scientists within or across labs. - - Therefore, a full-featured data pipeline framework may also be described as a [scientific workflow system](https://en.wikipedia.org/wiki/Scientific_workflow_system). - -Major features of data management frameworks: data repositories, databases, and data pipelines. - -![data pipelines vs databases vs data repositories](../images/pipeline-database.png){: style="align:center"} - -## What is DataJoint? - -DataJoint is a free open-source framework for creating scientific data pipelines -directly from MATLAB or Python (or any mixture of the two). -The data are stored in a language-independent way that allows interoperability between -MATLAB and Python, with additional languages in the works. -DataJoint pipelines become the central tool in the operations of data-intensive labs or -consortia as they organize participants with different roles and skills around a common -framework. - -In DataJoint, a data pipeline is a sequence of steps (more generally, a directed -acyclic graph) with integrated data storage at each step. -The pipeline may have some nodes requiring manual data entry or import from external -sources, some that read from raw data files, and some that perform computations on data -stored in other database nodes. -In a typical scenario, experimenters and acquisition instruments feed data into nodes -at the head of the pipeline, while downstream nodes perform automated computations for -data processing and analysis. - -For example, this is the pipeline for a simple mouse experiment involving calcium -imaging in mice. - -![A data pipeline](../images/pipeline.png){: style="width:250px; align:center"} - -In this example, the experimenter first enters information about a mouse, then enters -information about each imaging session in that mouse, and then each scan performed in -each imaging session. -Next the automated portion of the pipeline takes over to import the raw imaging data, -perform image alignment to compensate for motion, image segmentation to identify cells -in the images, and extraction of calcium traces. -Finally, the receptive field (RF) computation is performed by relating the calcium -signals to the visual stimulus information. - -## How DataJoint works - -DataJoint enables data scientists to build and operate scientific data pipelines. - -Conceptual overview of DataJoint operation. - -![DataJoint operation](../images/how-it-works.png){: style="align:center"} - -DataJoint provides a simple and powerful data model, which is detailed more formally in [Yatsenko D, Walker EY, Tolias AS (2018). DataJoint: A Simpler Relational Data Model.](https://arxiv.org/abs/1807.11104). -Put most generally, a "data model" defines how to think about data and the operations -that can be performed on them. 
-DataJoint's model is a refinement of the relational data model: all nodes in the -pipeline are simple tables storing data, tables are related by their shared attributes, -and query operations can combine the contents of multiple tables. -DataJoint enforces specific constraints on the relationships between tables that help -maintain data integrity and enable flexible access. -DataJoint uses a succinct data definition language, a powerful data query language, and -expressive visualizations of the pipeline. -A well-defined and principled approach to data organization and computation enables -teams of scientists to work together efficiently. -The data become immediately available to all participants with appropriate access privileges. -Some of the "participants" may be computational agents that perform processing and -analysis, and so DataJoint features a built-in distributed job management process to -allow distributing analysis between any number of computers. - -From a practical point of view, the back-end data architecture may vary depending on -project requirements. -Typically, the data architecture includes a relational database server (e.g. MySQL) and -a bulk data storage system (e.g. [AWS S3](https://aws.amazon.com/s3/) or a filesystem). -However, users need not interact with the database directly, but via MATLAB or Python -objects that are each associated with an individual table in the database. -One of the main advantages of this approach is that DataJoint clearly separates the -data model facing the user from the data architecture implementing data management and -computing. DataJoint works well in combination with good code sharing (e.g. with -[git](https://git-scm.com/)) and environment sharing (e.g. with -[Docker](https://www.docker.com/)). - -DataJoint is designed for quick prototyping and continuous exploration as experimental -designs change or evolve. -New analysis methods can be added or removed at any time, and the structure of the -workflow itself can change over time, for example as new data acquisition methods are -developed. - -With DataJoint, data sharing and publishing is no longer a separate step at the end of -the project. -Instead data sharing is an inherent feature of the process: to share data with other -collaborators or to publish the data to the world, one only needs to set the access -privileges. - -## Real-life example - -The [Mesoscale Activity Project](https://www.simonsfoundation.org/funded-project/%20multi-regional-neuronal-dynamics-of-memory-guided-flexible-behavior/) -(MAP) is a collaborative project between four neuroscience labs. -MAP uses DataJoint for data acquisition, processing, analysis, interfaces, and external sharing. - -The DataJoint pipeline for the MAP project. - -![A data pipeline for the MAP project](../images/map-dataflow.png){: style="align:center"} - -The pipeline is hosted in the cloud through [Amazon Web Services](https://aws.amazon.com/) (AWS). -MAP data scientists at the Janelia Research Campus and Baylor College of Medicine -defined the data pipeline. -Experimental scientists enter manual data directly into the pipeline using the -[Helium web interface](https://github.com/mattbdean/Helium). -The raw data are preprocessed using the DataJoint client libraries in MATLAB and Python; -the preprocessed data are ingested into the pipeline while the bulky and raw data are -shared using [Globus](https://globus.org) transfer through the -[PETREL](https://www.alcf.anl.gov/petrel) storage servers provided by the Argonne -National Lab. 
-Data are made immediately available for exploration and analysis to collaborating labs, -and the analysis results are also immediately shared. -Analysis data may be visualized through web interfaces. -Intermediate results may be exported into the [NWB](https://nwb.org) format for sharing -with external groups. - -## Summary of DataJoint features - -1. A free, open-source framework for scientific data pipelines and workflow management -2. Data hosting in cloud or in-house -3. MySQL, filesystems, S3, and Globus for data management -4. Define, visualize, and query data pipelines from MATLAB or Python -5. Enter and view data through GUIs -6. Concurrent access by multiple users and computational agents -7. Data integrity: identification, dependencies, groupings -8. Automated distributed computation diff --git a/docs/src/archive/concepts/principles.md b/docs/src/archive/concepts/principles.md deleted file mode 100644 index 2bf491590..000000000 --- a/docs/src/archive/concepts/principles.md +++ /dev/null @@ -1,136 +0,0 @@ -# Principles - -## Theoretical Foundations - -*DataJoint Core* implements a systematic framework for the joint management of -structured scientific data and its associated computations. -The framework builds on the theoretical foundations of the -[Relational Model](https://en.wikipedia.org/wiki/Relational_model) and -the [Entity-Relationship Model](https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model), -introducing a number of critical clarifications for the effective use of databases as -scientific data pipelines. -Notably, DataJoint introduces the concept of *computational dependencies* as a native -first-class citizen of the data model. -This integration of data structure and computation into a single model, defines a new -class of *computational scientific databases*. - -This page defines the key principles of this model without attachment to a specific -implementation while a more complete description of the model can be found in -[Yatsenko et al, 2018](https://doi.org/10.48550/arXiv.1807.11104). - -DataJoint developers are developing these principles into an -[open standard](https://en.wikipedia.org/wiki/Open_standard) to allow multiple -alternative implementations. - -## Data Representation - -### Tables = Entity Sets - -DataJoint uses only one data structure in all its operations—the *entity set*. - -1. All data are represented in the form of *entity sets*, i.e. an ordered collection of -*entities*. -2. All entities of an entity set belong to the same well-defined entity class and have -the same set of named attributes. -3. Attributes in an entity set has a *data type* (or *domain*), representing the set of -its valid values. -4. Each entity in an entity set provides the *attribute values* for all of the -attributes of its entity class. -5. Each entity set has a *primary key*, *i.e.* a subset of attributes that, jointly, -uniquely identify any entity in the set. - -These formal terms have more common (even if less precise) variants: - -| formal | common | -|:-:|:--:| -| entity set | *table* | -| attribute | *column* | -| attribute value | *field* | - -A collection of *stored tables* make up a *database*. -*Derived tables* are formed through *query expressions*. - -### Table Definition - -DataJoint introduces a streamlined syntax for defining a stored table. 
- -Each line in the definition defines an attribute with its name, data type, an optional -default value, and an optional comment in the format: - -```python -name [=default] : type [# comment] -``` - -Primary attributes come first and are separated from the rest of the attributes with -the divider `---`. - -For example, the following code defines the entity set for entities of class `Employee`: - -```python -employee_id : int ---- -ssn = null : int # optional social security number -date_of_birth : date -gender : enum('male', 'female', 'other') -home_address="" : varchar(1000) -primary_phone="" : varchar(12) -``` - -### Data Tiers - -Stored tables are designated into one of four *tiers* indicating how their data -originates. - -| table tier | data origin | -| --- | --- | -| lookup | contents are part of the table definition, defined *a priori* rather than entered externally. Typical stores general facts, parameters, options, *etc.* | -| manual | contents are populated by external mechanisms such as manual entry through web apps or by data ingest scripts | -| imported | contents are populated automatically by pipeline computations accessing data from upstream in the pipeline **and** from external data sources such as raw data stores.| -| computed | contents are populated automatically by pipeline computations accessing data from upstream in the pipeline. | - -### Object Serialization - -### Data Normalization - -A collection of data is considered normalized when organized into a collection of -entity sets, where each entity set represents a well-defined entity class with all its -attributes applicable to each entity in the set and the same primary key identifying - -The normalization procedure often includes splitting data from one table into several -tables, one for each proper entity set. - -### Databases and Schemas - -Stored tables are named and grouped into namespaces called *schemas*. -A collection of schemas make up a *database*. -A *database* has a globally unique address or name. -A *schema* has a unique name within its database. -Within a *connection* to a particular database, a stored table is identified as -`schema.Table`. -A schema typically groups tables that are logically related. - -## Dependencies - -Entity sets can form referential dependencies that express and - -### Diagramming - -## Data integrity - -### Entity integrity - -*Entity integrity* is the guarantee made by the data management process of the 1:1 -mapping between real-world entities and their digital representations. -In practice, entity integrity is ensured when it is made clear - -### Referential integrity - -### Group integrity - -## Data manipulations - -## Data queries - -### Query Operators - -## Pipeline computations diff --git a/docs/src/archive/concepts/teamwork.md b/docs/src/archive/concepts/teamwork.md deleted file mode 100644 index a0a782dde..000000000 --- a/docs/src/archive/concepts/teamwork.md +++ /dev/null @@ -1,97 +0,0 @@ -# Teamwork - -## Data management in a science project - -Science labs organize their projects as a sequence of activities of experiment design, -data acquisition, and processing and analysis. - -![data science in a science lab](../images/data-science-before.png){: style="width:510px; display:block; margin: 0 auto;"} - -
Workflow and dataflow in a common findings-centered approach to data science in a science lab.
- -Many labs lack a uniform data management strategy that would span longitudinally across -the entire project lifecycle as well as laterally across different projects. - -Prior to publishing their findings, the research team may need to publish the data to -support their findings. -Without a data management system, this requires custom repackaging of the data to -conform to the [FAIR principles](https://www.nature.com/articles/sdata201618) for -scientific data management. - -## Data-centric project organization - -DataJoint is designed to support a data-centric approach to large science projects in -which data are viewed as a principal output of the research project and are managed -systematically throughout in a single framework through the entire process. - -This approach requires formulating a general data science plan and upfront investment -for setting up resources and processes and training the teams. -The team uses DataJoint to build data pipelines to support multiple projects. - -![data science in a science lab](../images/data-science-after.png){: style="width:510px; display:block; margin: 0 auto;"} - -
Workflow and dataflow in a data pipeline-centered approach.
- -Data pipelines support project data across their entire lifecycle, including the -following functions - -- experiment design -- animal colony management -- electronic lab book: manual data entry during experiments through graphical user interfaces. -- acquisition from instrumentation in the course of experiments -- ingest from raw acquired data -- computations for data analysis -- visualization of analysis results -- export for sharing and publishing - -Through all these activities, all these data are made accessible to all authorized -participants and distributed computations can be done in parallel without compromising -data integrity. - -## Team roles - -The adoption of a uniform data management framework allows separation of roles and -division of labor among team members, leading to greater efficiency and better scaling. - -![data science in a science lab](../images/data-engineering.png){: style="width:510px; display:block; margin: 0 auto;"} - -
Distinct responsibilities of data science and data engineering.
- -### Scientists - -Design and conduct experiments, collecting data. -They interact with the data pipeline through graphical user interfaces designed by -others. -They understand what analysis is used to test their hypotheses. - -### Data scientists - -Have the domain expertise and select and implement the processing and analysis -methods for experimental data. -Data scientists are in charge of defining and managing the data pipeline using -DataJoint's data model, but they may not know the details of the underlying -architecture. -They interact with the pipeline using client programming interfaces directly from -languages such as MATLAB and Python. - -The bulk of this manual is written for working data scientists, except for System -Administration. - -### Data engineers - -Work with the data scientists to support the data pipeline. -They rely on their understanding of the DataJoint data model to configure and -administer the required IT resources such as database servers, data storage -servers, networks, cloud instances, [Globus](https://globus.org) endpoints, etc. -Data engineers can provide general solutions such as web hosting, data publishing, -interfaces, exports and imports. - -The System Administration section of this tutorial contains materials helpful in -accomplishing these tasks. - -DataJoint is designed to delineate a clean boundary between **data science** and **data -engineering**. -This allows data scientists to use the same uniform data model for data pipelines -backed by a variety of information technologies. -This delineation also enables economies of scale as a single data engineering team can -support a wide spectrum of science projects. diff --git a/docs/src/archive/concepts/terminology.md b/docs/src/archive/concepts/terminology.md deleted file mode 100644 index 0fdc41e96..000000000 --- a/docs/src/archive/concepts/terminology.md +++ /dev/null @@ -1,127 +0,0 @@ - - -# Terminology - -DataJoint introduces a principled data model, which is described in detail in -[Yatsenko et al., 2018](https://arxiv.org/abs/1807.11104). -This data model is a conceptual refinement of the Relational Data Model and also draws -on the Entity-Relationship Model (ERM). - -The Relational Data Model was inspired by the concepts of relations in Set Theory. -When the formal relational data model was formulated, it introduced additional -terminology (e.g. *relation*, *attribute*, *tuple*, *domain*). -Practical programming languages such as SQL do not precisely follow the relational data -model and introduce other terms to approximate relational concepts (e.g. *table*, -*column*, *row*, *datatype*). -Subsequent data models (e.g. ERM) refined the relational data model and introduced -their own terminology to describe analogous concepts (e.g. *entity set*, -*relationship set*, *attribute set*). -As a result, similar concepts may be described using different sets of terminologies, -depending on the context and the speaker's background. - -For example, what is known as a **relation** in the formal relational model is called a -**table** in SQL; the analogous concept in ERM and DataJoint is called an **entity -set**. - -The DataJoint documentation follows the terminology defined in -[Yatsenko et al, 2018](https://arxiv.org/abs/1807.11104), except *entity set* is -replaced with the more colloquial *table* or *query result* in most cases. - -The table below summarizes the terms used for similar concepts across the related data -models. 
- -Data model terminology -| Relational | ERM | SQL | DataJoint (formal) | This manual | -| -- | -- | -- | -- | -- | -| relation | entity set | table | entity set | table | -| tuple | entity | row | entity | entity | -| domain | value set | datatype | datatype | datatype | -| attribute | attribute | column | attribute | attribute | -| attribute value | attribute value | field value | attribute value | attribute value | -| primary key | primary key | primary key | primary key | primary key | -| foreign key | foreign key | foreign key | foreign key | foreign key | -| schema | schema | schema or database | schema | schema | -| relational expression | data query | `SELECT` statement | query expression | query expression | - -## DataJoint: databases, schemas, packages, and modules - -A **database** is collection of tables on the database server. -DataJoint users do not interact with it directly. - -A **DataJoint schema** is - - - a database on the database server containing tables with data *and* - - a collection of classes (in MATLAB or Python) associated with the database, one - class for each table. - -In MATLAB, the collection of classes is organized as a **package**, i.e. a file folder -starting with a `+`. - -In Python, the collection of classes is any set of classes decorated with the -appropriate `schema` object. -Very commonly classes for tables in one database are organized as a distinct Python -module. -Thus, typical DataJoint projects have one module per database. -However, this organization is up to the user's discretion. - -## Base tables - -**Base tables** are tables stored in the database, and are often referred to simply as -*tables* in DataJoint. -Base tables are distinguished from **derived tables**, which result from relational -[operators](../query/operators.md). - -## Relvars and relation values - -Early versions of the DataJoint documentation referred to the relation objects as -[relvars](https://en.wikipedia.org/wiki/Relvar). -This term emphasizes the fact that relational variables and expressions do not contain -actual data but are rather symbolic representations of data to be retrieved from the -database. -The specific value of a relvar would then be referred to as the **relation value**. -The value of a relvar can change with changes in the state of the database. - -The more recent iteration of the documentation has grown less pedantic and more often -uses the term *table* instead. - -## Metadata - -The vocabulary of DataJoint does not include this term. - -In data science, the term **metadata** commonly means "data about the data" rather than -the data themselves. -For example, metadata could include data sizes, timestamps, data types, indexes, -keywords. - -In contrast, neuroscientists often use the term to refer to conditions and annotations -about experiments. -This distinction arose when such information was stored separately from experimental -recordings, such as in physical notebooks. -Such "metadata" are used to search and to classify the data and are in fact an integral -part of the *actual* data. - -In DataJoint, all data other than blobs can be used in searches and categorization. -These fields may originate from manual annotations, preprocessing, or analyses just as -easily as from recordings or behavioral performance. -Since "metadata" in the neuroscience sense are not distinguished from any other data in -a pipeline, DataJoint avoids the term entirely. -Instead, DataJoint differentiates data into [data tiers](../design/tables/tiers.md). 

## Glossary

We have taken care to use consistent terminology throughout.

| Term | Definition |
| --- | --- |
| DAG | A directed acyclic graph (DAG) is a set of nodes connected by directed edges that form no cycles. This means that there is never a path back to a node after passing through it by following the directed edges. Formal workflow management systems represent workflows in the form of DAGs. |
| data pipeline | A sequence of data transformation steps from data sources through multiple intermediate structures. More generally, a data pipeline is a directed acyclic graph. In DataJoint, each step is represented by a table in a relational database. |
| DataJoint | A software framework for database programming directly from MATLAB and Python. Thanks to its support of automated computational dependencies, DataJoint serves as a workflow management system. |
| DataJoint Elements | Software modules implementing portions of experiment workflows, designed for ease of integration into diverse custom workflows. |
| DataJoint pipeline | The data schemas and transformations underlying a DataJoint workflow. DataJoint allows defining code that specifies both the workflow and the data pipeline, and we have used the words "pipeline" and "workflow" almost interchangeably. |
| DataJoint schema | A software module implementing a portion of an experiment workflow. Includes database table definitions, dependencies, and associated computations. |
| foreign key | A field that is linked to another table's primary key. |
| primary key | The subset of table attributes that uniquely identify each entity in the table. |
| secondary attribute | Any field in a table not in the primary key. |
| workflow | A formal representation of the steps for executing an experiment from data collection to analysis. Also the software configured for performing these steps. A typical workflow is composed of tables with inter-dependencies and processes to compute and insert data into the tables. |
diff --git a/docs/src/archive/design/alter.md b/docs/src/archive/design/alter.md
deleted file mode 100644
index 70ed39341..000000000
--- a/docs/src/archive/design/alter.md
+++ /dev/null
@@ -1,53 +0,0 @@
# Altering Populated Pipelines

Tables can be altered after they have been declared and populated. This is useful when
you want to add new secondary attributes or change the data type of existing attributes.
Users can use the `definition` property to update a table's attributes and then use
`alter` to apply the changes in the database. Currently, `alter` does not support
changes to primary key attributes.

Let's say we have a table `Student` with the following attributes:

```python
@schema
class Student(dj.Manual):
    definition = """
    student_id: int
    ---
    first_name: varchar(40)
    last_name: varchar(40)
    home_address: varchar(100)
    """
```

We can modify the table to include a new attribute `email`:

```python
Student.definition = """
student_id: int
---
first_name: varchar(40)
last_name: varchar(40)
home_address: varchar(100)
email: varchar(100)
"""
Student.alter()
```

The `alter` method will update the table in the database to include the new attribute
`email` added by the user in the table's `definition` property.

Similarly, you can modify the data type or length of an existing attribute.
For example, -to alter the `home_address` attribute to have a length of 200 characters: - -```python -Student.definition = """ -student_id: int ---- -first_name: varchar(40) -last_name: varchar(40) -home_address: varchar(200) -email: varchar(100) -""" -Student.alter() -``` diff --git a/docs/src/archive/design/diagrams.md b/docs/src/archive/design/diagrams.md deleted file mode 100644 index 826f78926..000000000 --- a/docs/src/archive/design/diagrams.md +++ /dev/null @@ -1,110 +0,0 @@ -# Diagrams - -Diagrams are a great way to visualize the pipeline and understand the flow -of data. DataJoint diagrams are based on **entity relationship diagram** (ERD). -Objects of type `dj.Diagram` allow visualizing portions of the data pipeline in -graphical form. -Tables are depicted as nodes and [dependencies](./tables/dependencies.md) as directed -edges between them. -The `draw` method plots the graph. - -## Diagram notation - -Consider the following diagram - -![mp-diagram](../images/mp-diagram.png){: style="align:center"} - -DataJoint uses the following conventions: - -- Tables are indicated as nodes in the graph. - The corresponding class name is indicated by each node. -- [Data tiers](./tables/tiers.md) are indicated as colors and symbols: - - Lookup=gray rectangle - - Manual=green rectangle - - Imported=blue oval - - Computed=red circle - - Part=black text - The names of [part tables](./tables/master-part.md) are indicated in a smaller font. -- [Dependencies](./tables/dependencies.md) are indicated as edges in the graph and -always directed downward, forming a **directed acyclic graph**. -- Foreign keys contained within the primary key are indicated as solid lines. - This means that the referenced table becomes part of the primary key of the dependent table. -- Foreign keys that are outside the primary key are indicated by dashed lines. -- If the primary key of the dependent table has no other attributes besides the foreign -key, the foreign key is a thick solid line, indicating a 1:{0,1} relationship. -- Foreign keys made without renaming the foreign key attributes are in black whereas -foreign keys that rename the attributes are indicated in red. - -## Diagramming an entire schema - -To plot the Diagram for an entire schema, an Diagram object can be initialized with the -schema object (which is normally used to decorate table objects) - -```python -import datajoint as dj -schema = dj.Schema('my_database') -dj.Diagram(schema).draw() -``` - -or alternatively an object that has the schema object as an attribute, such as the -module defining a schema: - -```python -import datajoint as dj -import seq # import the sequence module defining the seq database -dj.Diagram(seq).draw() # draw the Diagram -``` - -Note that calling the `.draw()` method is not necessary when working in a Jupyter -notebook. -You can simply let the object display itself, for example by entering `dj.Diagram(seq)` -in a notebook cell. -The Diagram will automatically render in the notebook by calling its `_repr_html_` -method. -A Diagram displayed without `.draw()` will be rendered as an SVG, and hovering the -mouse over a table will reveal a compact version of the output of the `.describe()` -method. - -### Initializing with a single table - -A `dj.Diagram` object can be initialized with a single table. - -```python -dj.Diagram(seq.Genome).draw() -``` - -A single node makes a rather boring graph but ERDs can be added together or subtracted -from each other using graph algebra. 
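As a quick preview of this graph algebra before the next sections elaborate, a single-table diagram can be subtracted from a larger one to hide that table; this sketch assumes the `seq` module from the examples above:

```python
# Remove the Genome node (and its edges) from the full schema diagram.
(dj.Diagram(seq) - dj.Diagram(seq.Genome)).draw()
```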

### Adding diagrams together

However, two diagrams can be added, resulting in a new diagram containing the union of
the sets of nodes from the two original diagrams.
The corresponding foreign keys will be automatically included as edges.

```python
# plot the Diagram with tables Genome and Species from module seq.
(dj.Diagram(seq.Genome) + dj.Diagram(seq.Species)).draw()
```

### Expanding diagrams upstream and downstream

Adding a number to a Diagram object adds nodes downstream in the pipeline, while
subtracting a number from a Diagram object adds nodes upstream in the pipeline.

Examples:

```python
# Plot all the tables directly downstream from `seq.Genome`
(dj.Diagram(seq.Genome)+1).draw()
```

```python
# Plot all the tables directly upstream from `seq.Genome`
(dj.Diagram(seq.Genome)-1).draw()
```

```python
# Plot the local neighborhood of `seq.Genome`
(dj.Diagram(seq.Genome)+1-1+1-1).draw()
```
diff --git a/docs/src/archive/design/drop.md b/docs/src/archive/design/drop.md
deleted file mode 100644
index 35a9ac513..000000000
--- a/docs/src/archive/design/drop.md
+++ /dev/null
@@ -1,23 +0,0 @@
# Drop

The `drop` method completely removes a table from the database, including its
definition.
It also removes all dependent tables, recursively.
DataJoint will first display the tables being dropped and the number of entities in
each before prompting the user for confirmation to proceed.

The `drop` method is often used during initial design to allow altered table
definitions to take effect.

```python
# drop the Person table from its schema
Person.drop()
```

## Dropping part tables

A [part table](../design/tables/master-part.md) is usually removed as a consequence of
calling `drop` on its master table.
To enforce this workflow, calling `drop` directly on a part table produces an error.
In some cases, it may be necessary to override this behavior.
To remove a part table without removing its master, use the argument `force=True`.
diff --git a/docs/src/archive/design/fetch-api-2.0-spec.md b/docs/src/archive/design/fetch-api-2.0-spec.md
deleted file mode 100644
index a996a5f08..000000000
--- a/docs/src/archive/design/fetch-api-2.0-spec.md
+++ /dev/null
@@ -1,302 +0,0 @@
# DataJoint 2.0 Fetch API Specification

## Overview

DataJoint 2.0 replaces the complex `fetch()` method with a set of explicit, composable output methods. This provides better discoverability, clearer intent, and more efficient iteration.

## Design Principles

1. **Explicit over implicit**: Each output format has its own method
2. **Composable**: Use existing `.proj()` for column selection
3. **Lazy iteration**: Single cursor streaming instead of fetch-all-keys
4. 
**Modern formats**: First-class support for polars and Arrow - ---- - -## New API Reference - -### Output Methods - -| Method | Returns | Description | -|--------|---------|-------------| -| `to_dicts()` | `list[dict]` | All rows as list of dictionaries | -| `to_pandas()` | `DataFrame` | pandas DataFrame with primary key as index | -| `to_polars()` | `polars.DataFrame` | polars DataFrame (requires `datajoint[polars]`) | -| `to_arrow()` | `pyarrow.Table` | PyArrow Table (requires `datajoint[arrow]`) | -| `to_arrays()` | `np.ndarray` | numpy structured array (recarray) | -| `to_arrays('a', 'b')` | `tuple[array, array]` | Tuple of arrays for specific columns | -| `keys()` | `list[dict]` | Primary key values only | -| `fetch1()` | `dict` | Single row as dict (raises if not exactly 1) | -| `fetch1('a', 'b')` | `tuple` | Single row attribute values | - -### Common Parameters - -All output methods accept these optional parameters: - -```python -table.to_dicts( - order_by=None, # str or list: column(s) to sort by, e.g. "KEY", "name DESC" - limit=None, # int: maximum rows to return - offset=None, # int: rows to skip - squeeze=False, # bool: remove singleton dimensions from arrays - download_path="." # str: path for downloading external data -) -``` - -### Iteration - -```python -# Lazy streaming - yields one dict per row from database cursor -for row in table: - process(row) # row is a dict -``` - ---- - -## Migration Guide - -### Basic Fetch Operations - -| Old Pattern (1.x) | New Pattern (2.0) | -|-------------------|-------------------| -| `table.fetch()` | `table.to_arrays()` or `table.to_dicts()` | -| `table.fetch(format="array")` | `table.to_arrays()` | -| `table.fetch(format="frame")` | `table.to_pandas()` | -| `table.fetch(as_dict=True)` | `table.to_dicts()` | - -### Attribute Fetching - -| Old Pattern (1.x) | New Pattern (2.0) | -|-------------------|-------------------| -| `table.fetch('a')` | `table.to_arrays('a')` | -| `a, b = table.fetch('a', 'b')` | `a, b = table.to_arrays('a', 'b')` | -| `table.fetch('a', 'b', as_dict=True)` | `table.proj('a', 'b').to_dicts()` | - -### Primary Key Fetching - -| Old Pattern (1.x) | New Pattern (2.0) | -|-------------------|-------------------| -| `table.fetch('KEY')` | `table.keys()` | -| `table.fetch(dj.key)` | `table.keys()` | -| `keys, a = table.fetch('KEY', 'a')` | See note below | - -For mixed KEY + attribute fetch: -```python -# Old: keys, a = table.fetch('KEY', 'a') -# New: Combine keys() with to_arrays() -keys = table.keys() -a = table.to_arrays('a') -# Or use to_dicts() which includes all columns -``` - -### Ordering, Limiting, Offset - -| Old Pattern (1.x) | New Pattern (2.0) | -|-------------------|-------------------| -| `table.fetch(order_by='name')` | `table.to_arrays(order_by='name')` | -| `table.fetch(limit=10)` | `table.to_arrays(limit=10)` | -| `table.fetch(order_by='KEY', limit=10, offset=5)` | `table.to_arrays(order_by='KEY', limit=10, offset=5)` | - -### Single Row Fetch (fetch1) - -| Old Pattern (1.x) | New Pattern (2.0) | -|-------------------|-------------------| -| `table.fetch1()` | `table.fetch1()` (unchanged) | -| `a, b = table.fetch1('a', 'b')` | `a, b = table.fetch1('a', 'b')` (unchanged) | -| `table.fetch1('KEY')` | `table.fetch1()` then extract pk columns | - -### Configuration - -| Old Pattern (1.x) | New Pattern (2.0) | -|-------------------|-------------------| -| `dj.config['fetch_format'] = 'frame'` | Use `.to_pandas()` explicitly | -| `with dj.config.override(fetch_format='frame'):` | Use `.to_pandas()` in the 
block | - -### Iteration - -| Old Pattern (1.x) | New Pattern (2.0) | -|-------------------|-------------------| -| `for row in table:` | `for row in table:` (same syntax, now lazy!) | -| `list(table)` | `table.to_dicts()` | - -### Column Selection with proj() - -Use `.proj()` for column selection, then apply output method: - -```python -# Select specific columns -table.proj('col1', 'col2').to_pandas() -table.proj('col1', 'col2').to_dicts() - -# Computed columns -table.proj(total='price * quantity').to_pandas() -``` - ---- - -## Removed Features - -### Removed Methods and Parameters - -- `fetch()` method - use explicit output methods -- `fetch('KEY')` - use `keys()` -- `dj.key` class - use `keys()` method -- `format=` parameter - use explicit methods -- `as_dict=` parameter - use `to_dicts()` -- `config['fetch_format']` setting - use explicit methods - -### Removed Imports - -```python -# Old (removed) -from datajoint import key -result = table.fetch(dj.key) - -# New -result = table.keys() -``` - ---- - -## Examples - -### Example 1: Basic Data Retrieval - -```python -# Get all data as DataFrame -df = Experiment().to_pandas() - -# Get all data as list of dicts -rows = Experiment().to_dicts() - -# Get all data as numpy array -arr = Experiment().to_arrays() -``` - -### Example 2: Filtered and Sorted Query - -```python -# Get recent experiments, sorted by date -recent = (Experiment() & 'date > "2024-01-01"').to_pandas( - order_by='date DESC', - limit=100 -) -``` - -### Example 3: Specific Columns - -```python -# Fetch specific columns as arrays -names, dates = Experiment().to_arrays('name', 'date') - -# Or with primary key included -names, dates = Experiment().to_arrays('name', 'date', include_key=True) -``` - -### Example 4: Primary Keys for Iteration - -```python -# Get keys for restriction -keys = Experiment().keys() -for key in keys: - process(Session() & key) -``` - -### Example 5: Single Row - -```python -# Get one row as dict -row = (Experiment() & key).fetch1() - -# Get specific attributes -name, date = (Experiment() & key).fetch1('name', 'date') -``` - -### Example 6: Lazy Iteration - -```python -# Stream rows efficiently (single database cursor) -for row in Experiment(): - if should_process(row): - process(row) - if done: - break # Early termination - no wasted fetches -``` - -### Example 7: Modern DataFrame Libraries - -```python -# Polars (fast, modern) -import polars as pl -df = Experiment().to_polars() -result = df.filter(pl.col('value') > 100).group_by('category').agg(pl.mean('value')) - -# PyArrow (zero-copy interop) -table = Experiment().to_arrow() -# Can convert to pandas or polars with zero copy -``` - ---- - -## Performance Considerations - -### Lazy Iteration - -The new iteration is significantly more efficient: - -```python -# Old (1.x): N+1 queries -# 1. fetch("KEY") gets ALL keys -# 2. fetch1() for EACH key - -# New (2.0): Single query -# Streams rows from one cursor -for row in table: - ... 
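    # Each `row` arrives as a plain dict from a single server-side cursor;
    # breaking out of the loop ends the stream early, so no unused rows are
    # transferred (see Example 6 above).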
-``` - -### Memory Efficiency - -- `to_dicts()`: Returns full list in memory -- `for row in table:`: Streams one row at a time -- `to_arrays(limit=N)`: Fetches only N rows - -### Format Selection - -| Use Case | Recommended Method | -|----------|-------------------| -| Data analysis | `to_pandas()` or `to_polars()` | -| JSON API responses | `to_dicts()` | -| Numeric computation | `to_arrays()` | -| Large datasets | `for row in table:` (streaming) | -| Interop with other tools | `to_arrow()` | - ---- - -## Error Messages - -When attempting to use removed methods, users see helpful error messages: - -```python ->>> table.fetch() -AttributeError: fetch() has been removed in DataJoint 2.0. -Use to_dicts(), to_pandas(), to_arrays(), or keys() instead. -See table.fetch.__doc__ for details. -``` - ---- - -## Optional Dependencies - -Install optional dependencies for additional output formats: - -```bash -# For polars support -pip install datajoint[polars] - -# For PyArrow support -pip install datajoint[arrow] - -# For both -pip install datajoint[polars,arrow] -``` diff --git a/docs/src/archive/design/hidden-job-metadata-spec.md b/docs/src/archive/design/hidden-job-metadata-spec.md deleted file mode 100644 index a33a8d51d..000000000 --- a/docs/src/archive/design/hidden-job-metadata-spec.md +++ /dev/null @@ -1,355 +0,0 @@ -# Hidden Job Metadata in Computed Tables - -## Overview - -Job execution metadata (start time, duration, code version) should be persisted in computed tables themselves, not just in ephemeral job entries. This is accomplished using hidden attributes. - -## Motivation - -The current job table (`~~table_name`) tracks execution metadata, but: -1. Job entries are deleted after completion (unless `keep_completed=True`) -2. Users often need to know when and with what code version each row was computed -3. This metadata should be transparent - not cluttering the user-facing schema - -Hidden attributes (prefixed with `_`) provide the solution: stored in the database but filtered from user-facing APIs. - -## Hidden Job Metadata Attributes - -| Attribute | Type | Description | -|-----------|------|-------------| -| `_job_start_time` | datetime(3) | When computation began | -| `_job_duration` | float32 | Computation duration in seconds | -| `_job_version` | varchar(64) | Code version (e.g., git commit hash) | - -**Design notes:** -- `_job_duration` (elapsed time) rather than `_job_completed_time` because duration is more informative for performance analysis -- `varchar(64)` for version is sufficient for git hashes (40 chars for SHA-1, 7-8 for short hash) -- `datetime(3)` provides millisecond precision - -## Configuration - -### Settings Structure - -Job metadata is controlled via `config.jobs` settings: - -```python -class JobsSettings(BaseSettings): - """Job queue configuration for AutoPopulate 2.0.""" - - model_config = SettingsConfigDict( - env_prefix="DJ_JOBS_", - case_sensitive=False, - extra="forbid", - validate_assignment=True, - ) - - # Existing settings - auto_refresh: bool = Field(default=True, ...) - keep_completed: bool = Field(default=False, ...) - stale_timeout: int = Field(default=3600, ...) - default_priority: int = Field(default=5, ...) - version_method: Literal["git", "none"] | None = Field(default=None, ...) - allow_new_pk_fields_in_computed_tables: bool = Field(default=False, ...) 
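    # Note: with env_prefix="DJ_JOBS_", each field above can also be set via an
    # environment variable, e.g. DJ_JOBS_KEEP_COMPLETED=true or
    # DJ_JOBS_STALE_TIMEOUT=7200 (example values shown here are illustrative).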
- - # New setting for hidden job metadata - add_job_metadata: bool = Field( - default=False, - description="Add hidden job metadata attributes (_job_start_time, _job_duration, _job_version) " - "to Computed and Imported tables during declaration. Tables created without this setting " - "will not receive metadata updates during populate." - ) -``` - -### Access Patterns - -```python -import datajoint as dj - -# Read setting -dj.config.jobs.add_job_metadata # False (default) - -# Enable programmatically -dj.config.jobs.add_job_metadata = True - -# Enable via environment variable -# DJ_JOBS_ADD_JOB_METADATA=true - -# Enable in config file (dj_config.yaml) -# jobs: -# add_job_metadata: true - -# Temporary override -with dj.config.override(jobs={"add_job_metadata": True}): - schema(MyComputedTable) # Declared with metadata columns -``` - -### Setting Interactions - -| Setting | Effect on Job Metadata | -|---------|----------------------| -| `add_job_metadata=True` | New Computed/Imported tables get hidden metadata columns | -| `add_job_metadata=False` | Tables declared without metadata columns (default) | -| `version_method="git"` | `_job_version` populated with git short hash | -| `version_method="none"` | `_job_version` left empty | -| `version_method=None` | `_job_version` left empty (same as "none") | - -### Behavior at Declaration vs Populate - -| `add_job_metadata` at declare | `add_job_metadata` at populate | Result | -|------------------------------|-------------------------------|--------| -| True | True | Metadata columns created and populated | -| True | False | Metadata columns exist but not populated | -| False | True | No metadata columns, populate skips silently | -| False | False | No metadata columns, normal behavior | - -### Retrofitting Existing Tables - -Tables created before enabling `add_job_metadata` do not have the hidden metadata columns. -To add metadata columns to existing tables, use the migration utility (not automatic): - -```python -from datajoint.migrate import add_job_metadata_columns - -# Add hidden metadata columns to specific table -add_job_metadata_columns(MyComputedTable) - -# Add to all Computed/Imported tables in a schema -add_job_metadata_columns(schema) -``` - -This utility: -- ALTERs the table to add the three hidden columns -- Does NOT populate existing rows (metadata remains NULL) -- Future `populate()` calls will populate metadata for new rows - -## Behavior - -### Declaration-time - -When `config.jobs.add_job_metadata=True` and a Computed/Imported table is declared: -- Hidden metadata columns are added to the table definition -- Only master tables receive metadata columns; Part tables never get them - -### Population-time - -After `make()` completes successfully: -1. Check if the table has hidden metadata columns -2. If yes: UPDATE the just-inserted rows with start_time, duration, version -3. If no: Silently skip (no error, no ALTER) - -This applies to both: -- **Direct mode** (`reserve_jobs=False`): Single-process populate -- **Distributed mode** (`reserve_jobs=True`): Multi-worker with job table coordination - -## Excluding Hidden Attributes from Binary Operators - -### Problem Statement - -If two tables have hidden attributes with the same name (e.g., both have `_job_start_time`), SQL's NATURAL JOIN would incorrectly match on them: - -```sql --- NATURAL JOIN matches ALL common attributes including hidden -SELECT * FROM table_a NATURAL JOIN table_b --- Would incorrectly match on _job_start_time! 
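-- The remedy specified below: join explicitly on the visible common attributes
-- with a USING clause; the column name here is a hypothetical illustration:
--   SELECT * FROM table_a JOIN table_b USING (subject_id)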
-``` - -### Solution: Replace NATURAL JOIN with USING Clause - -Hidden attributes must be excluded from all binary operator considerations. The result of a join does not preserve hidden attributes from its operands. - -**Current implementation:** -```python -def from_clause(self): - clause = next(support) - for s, left in zip(support, self._left): - clause += " NATURAL{left} JOIN {clause}".format(...) -``` - -**Proposed implementation:** -```python -def from_clause(self): - clause = next(support) - for s, (left, using_attrs) in zip(support, self._joins): - if using_attrs: - using = "USING ({})".format(", ".join(f"`{a}`" for a in using_attrs)) - clause += " {left}JOIN {s} {using}".format( - left="LEFT " if left else "", - s=s, - using=using - ) - else: - # Cross join (no common non-hidden attributes) - clause += " CROSS JOIN " + s if not left else " LEFT JOIN " + s + " ON TRUE" - return clause -``` - -### Changes Required - -#### 1. `QueryExpression._left` → `QueryExpression._joins` - -Replace `_left: List[bool]` with `_joins: List[Tuple[bool, List[str]]]` - -Each join stores: -- `left`: Whether it's a left join -- `using_attrs`: Non-hidden common attributes to join on - -```python -# Before -result._left = self._left + [left] + other._left - -# After -join_attributes = [n for n in self.heading.names if n in other.heading.names] -result._joins = self._joins + [(left, join_attributes)] + other._joins -``` - -#### 2. `heading.names` (existing behavior) - -Already filters out hidden attributes: -```python -@property -def names(self): - return [k for k in self.attributes] # attributes excludes is_hidden=True -``` - -This ensures join attribute computation automatically excludes hidden attributes. - -### Behavior Summary - -| Scenario | Hidden Attributes | Result | -|----------|-------------------|--------| -| `A * B` (join) | Same hidden attr in both | NOT matched - excluded from USING | -| `A & B` (semijoin) | Same hidden attr in both | NOT matched | -| `A - B` (antijoin) | Same hidden attr in both | NOT matched | -| `A.proj()` | Hidden attrs in A | NOT projected (unless explicitly named) | -| `A.fetch()` | Hidden attrs in A | NOT returned by default | - -## Implementation Details - -### 1. Declaration (declare.py) - -```python -def declare(full_table_name, definition, context): - # ... existing code ... - - # Add hidden job metadata for auto-populated tables - if config.jobs.add_job_metadata and table_tier in (TableTier.COMPUTED, TableTier.IMPORTED): - # Only for master tables, not parts - if not is_part_table: - job_metadata_sql = [ - "`_job_start_time` datetime(3) DEFAULT NULL", - "`_job_duration` float DEFAULT NULL", - "`_job_version` varchar(64) DEFAULT ''", - ] - attribute_sql.extend(job_metadata_sql) -``` - -### 2. Population (autopopulate.py) - -```python -def _populate1(self, key, callback, use_jobs, jobs): - start_time = datetime.now() - version = _get_job_version() - - # ... call make() ... 
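    # If make() raises, the enclosing transaction rolls back and no rows remain
    # to update, so the metadata update below is never reached
    # (see "Failed makes" in the design-decision summary).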
- - duration = time.time() - start_time.timestamp() - - # Update job metadata if table has the hidden attributes - if self._has_job_metadata_attrs(): - self._update_job_metadata( - key, - start_time=start_time, - duration=duration, - version=version - ) - -def _has_job_metadata_attrs(self): - """Check if table has hidden job metadata columns.""" - hidden_attrs = self.heading._attributes # includes hidden - return '_job_start_time' in hidden_attrs - -def _update_job_metadata(self, key, start_time, duration, version): - """Update hidden job metadata for the given key.""" - # UPDATE using primary key - pk_condition = make_condition(self, key, set()) - self.connection.query( - f"UPDATE {self.full_table_name} SET " - f"`_job_start_time`=%s, `_job_duration`=%s, `_job_version`=%s " - f"WHERE {pk_condition}", - args=(start_time, duration, version[:64]) - ) -``` - -### 3. Job table (jobs.py) - -Update version field length: -```python -version="" : varchar(64) -``` - -### 4. Version helper - -```python -def _get_job_version() -> str: - """Get version string, truncated to 64 chars.""" - from .settings import config - - method = config.jobs.version_method - if method is None or method == "none": - return "" - elif method == "git": - try: - result = subprocess.run( - ["git", "rev-parse", "--short", "HEAD"], - capture_output=True, - text=True, - timeout=5, - ) - return result.stdout.strip()[:64] if result.returncode == 0 else "" - except Exception: - return "" - return "" -``` - -## Example Usage - -```python -# Enable job metadata for new tables -dj.config.jobs.add_job_metadata = True - -@schema -class ProcessedData(dj.Computed): - definition = """ - -> RawData - --- - result : float - """ - - def make(self, key): - # User code - unaware of hidden attributes - self.insert1({**key, 'result': compute(key)}) - -# Job metadata automatically added and populated: -# _job_start_time, _job_duration, _job_version - -# User-facing API unaffected: -ProcessedData().heading.names # ['raw_data_id', 'result'] -ProcessedData().fetch() # Returns only visible attributes - -# Access hidden attributes explicitly if needed: -ProcessedData().fetch('_job_start_time', '_job_duration', '_job_version') -``` - -## Summary of Design Decisions - -| Decision | Resolution | -|----------|------------| -| Configuration | `config.jobs.add_job_metadata` (default False) | -| Environment variable | `DJ_JOBS_ADD_JOB_METADATA` | -| Existing tables | No automatic ALTER - silently skip metadata if columns absent | -| Retrofitting | Manual via `datajoint.migrate.add_job_metadata_columns()` utility | -| Populate modes | Record metadata in both direct and distributed modes | -| Part tables | No metadata columns - only master tables | -| Version length | varchar(64) in both jobs table and computed tables | -| Binary operators | Hidden attributes excluded via USING clause instead of NATURAL JOIN | -| Failed makes | N/A - transaction rolls back, no rows to update | diff --git a/docs/src/archive/design/integrity.md b/docs/src/archive/design/integrity.md deleted file mode 100644 index 393103522..000000000 --- a/docs/src/archive/design/integrity.md +++ /dev/null @@ -1,218 +0,0 @@ -# Data Integrity - -The term **data integrity** describes guarantees made by the data management process -that prevent errors and corruption in data due to technical failures and human errors -arising in the course of continuous use by multiple agents. 
-DataJoint pipelines respect the following forms of data integrity: **entity -integrity**, **referential integrity**, and **group integrity** as described in more -detail below. - -## Entity integrity - -In a proper relational design, each table represents a collection of discrete -real-world entities of some kind. -**Entity integrity** is the guarantee made by the data management process that entities -from the real world are reliably and uniquely represented in the database system. -Entity integrity states that the data management process must prevent duplicate -representations or misidentification of entities. -DataJoint enforces entity integrity through the use of -[primary keys](./tables/primary.md). - -Entity integrity breaks down when a process allows data pertaining to the same -real-world entity to be entered into the database system multiple times. -For example, a school database system may use unique ID numbers to distinguish students. -Suppose the system automatically generates an ID number each time a student record is -entered into the database without checking whether a record already exists for that -student. -Such a system violates entity integrity, because the same student may be assigned -multiple ID numbers. -The ID numbers succeed in uniquely identifying each student record but fail to do so -for the actual students. - -Note that a database cannot guarantee or enforce entity integrity by itself. -Entity integrity is a property of the entire data management process as a whole, -including institutional practices and user actions in addition to database -configurations. - -## Referential integrity - -**Referential integrity** is the guarantee made by the data management process that -related data across the database remain present, correctly associated, and mutually -consistent. -Guaranteeing referential integrity means enforcing the constraint that no entity can -exist in the database without all the other entities on which it depends. -Referential integrity cannot exist without entity integrity: references to entity -cannot be validated if the identity of the entity itself is not guaranteed. - -Referential integrity fails when a data management process allows new data to be -entered that refers to other data missing from the database. -For example, assume that each electrophysiology recording must refer to the mouse -subject used during data collection. -Perhaps an experimenter attempts to insert ephys data into the database that refers to -a nonexistent mouse, due to a misspelling. -A system guaranteeing referential integrity, such as DataJoint, will refuse the -erroneous data. - -Enforcement of referential integrity does not stop with data ingest. -[Deleting](../manipulation/delete.md) data in DataJoint also deletes any dependent -downstream data. -Such cascading deletions are necessary to maintain referential integrity. -Consider the deletion of a mouse subject without the deletion of the experimental -sessions involving that mouse. -A database that allows such deletion will break referential integrity, as the -experimental sessions for the removed mouse depend on missing data. -Any data management process that allows data to be deleted with no consideration of -dependent data cannot maintain referential integrity. - -[Updating](../manipulation/update.md) data already present in a database system also -jeopardizes referential integrity. -For this reason, the DataJoint workflow does not include updates to entities once they -have been ingested into a pipeline. 

Allowing updates to upstream entities would break the referential integrity of any
dependent data downstream.
For example, permitting a user to change the name of a mouse subject would invalidate
any experimental sessions that used that mouse, presuming the mouse name was part of
the primary key.
The proper way to change data in DataJoint is to delete the existing entities and to
insert corrected ones, preserving referential integrity.

## Group integrity

**Group integrity** denotes the guarantee made by the data management process that
entities composed of multiple parts always appear in their complete form.
Group integrity in DataJoint is formalized through
[master-part](./tables/master-part.md) relationships.
The master-part relationship has important implications for dependencies, because a
downstream entity depending on a master entity set may be considered to depend on the
parts as well.

## Relationships

In DataJoint, the term **relationship** is used rather generally to describe the
effects of particular configurations of [dependencies](./tables/dependencies.md)
between multiple entity sets.
It is often useful to classify relationships as one-to-one, many-to-one, one-to-many,
and many-to-many.

In a **one-to-one relationship**, each entity in a downstream table has exactly one
corresponding entity in the upstream table.
A dependency of an entity set containing the death dates of mice on an entity set
describing the mice themselves would obviously be a one-to-one relationship, as in the
example below.

```python
@schema
class Mouse(dj.Manual):
    definition = """
    mouse_name : varchar(64)
    ---
    mouse_dob : datetime
    """

@schema
class MouseDeath(dj.Manual):
    definition = """
    -> Mouse
    ---
    death_date : datetime
    """
```

![doc_1-1](../images/doc_1-1.png){: style="align:center"}

In a **one-to-many relationship**, multiple entities in a downstream table may depend
on the same entity in the upstream table.
The example below shows a table containing individual channel data from multi-channel
recordings, representing a one-to-many relationship.

```python
@schema
class EEGRecording(dj.Manual):
    definition = """
    -> Session
    eeg_recording_id : int
    ---
    eeg_system : varchar(64)
    num_channels : int
    """

@schema
class ChannelData(dj.Imported):
    definition = """
    -> EEGRecording
    channel_idx : int
    ---
    channel_data : longblob
    """
```

![doc_1-many](../images/doc_1-many.png){: style="align:center"}

In a **many-to-one relationship**, each entity in a table is associated with multiple
entities from another table.
Many-to-one relationships between two tables are usually established using a separate
membership table.
The example below includes a table of mouse subjects, a table of subject groups, and a
membership [part table](./tables/master-part.md) listing the subjects in each group.
A many-to-one relationship exists between the `Mouse` table and the `SubjectGroup`
table, which is expressed through entities in `GroupMember`.

```python
@schema
class Mouse(dj.Manual):
    definition = """
    mouse_name : varchar(64)
    ---
    mouse_dob : datetime
    """

@schema
class SubjectGroup(dj.Manual):
    definition = """
    group_number : int
    ---
    group_name : varchar(64)
    """

    class GroupMember(dj.Part):
        definition = """
        -> master
        -> Mouse
        """
```

![doc_many-1](../images/doc_many-1.png){: style="align:center"}

In a **many-to-many relationship**, multiple entities in one table may each relate to
multiple entities in another upstream table.
Many-to-many relationships between two tables are usually established using a separate
association table.
Each entity in the association table links one entity from each of the two upstream
tables it depends on.
The example below of a many-to-many relationship contains a table of recording
modalities and a table of multimodal recording sessions.
Entities in a third table represent the modes used for each session.

```python
@schema
class RecordingModality(dj.Lookup):
    definition = """
    modality : varchar(64)
    """

@schema
class MultimodalSession(dj.Manual):
    definition = """
    -> Session
    modes : int
    """

    class SessionMode(dj.Part):
        definition = """
        -> master
        -> RecordingModality
        """
```

![doc_many-many](../images/doc_many-many.png){: style="align:center"}

The types of relationships between entity sets are expressed in the
[Diagram](diagrams.md) of a schema.
diff --git a/docs/src/archive/design/normalization.md b/docs/src/archive/design/normalization.md
deleted file mode 100644
index 000028396..000000000
--- a/docs/src/archive/design/normalization.md
+++ /dev/null
@@ -1,117 +0,0 @@
# Entity Normalization

DataJoint uses a uniform way of representing any data.
It does so in the form of **entity sets**, unordered collections of entities of the
same type.
The term **entity normalization** describes the commitment to represent all data as
well-formed entity sets.
Entity normalization is a conceptual refinement of the
[relational data model](../concepts/data-model.md) and is the central principle of the
DataJoint model ([Yatsenko et al., 2018](https://arxiv.org/abs/1807.11104)).
Entity normalization leads to clear and logical database designs and to easily
comprehensible data queries.

Entity sets are a type of **relation**
(from the [relational data model](../concepts/data-model.md)) and are often visualized
as **tables**.
Hence the terms **relation**, **entity set**, and **table** can be used interchangeably
when entity normalization is assumed.

## Criteria of a well-formed entity set

1. All elements of an entity set belong to the same well-defined and readily identified
**entity type** from the model world.
2. All attributes of an entity set are applicable directly to each of its elements,
although some attribute values may be missing (set to null).
3. All elements of an entity set must be distinguishable from each other by the same
primary key.
4. Primary key attribute values cannot be missing, i.e. set to null.
5. All elements of an entity set participate in the same types of relationships with
other entity sets.

## Entity normalization in schema design

Entity normalization applies to schema design in that the designer is responsible for
the identification of the essential entity types in their model world and of the
dependencies among the entity types.

The term entity normalization may also apply to a procedure for refactoring a schema
design that does not meet the above criteria into one that does.
-In some cases, this may require breaking up some entity sets into multiple entity sets, -which may cause some entities to be represented across multiple entity sets. -In other cases, this may require converting attributes into their own entity sets. -Technically speaking, entity normalization entails compliance with the -[Boyce-Codd normal form](https://en.wikipedia.org/wiki/Boyce%E2%80%93Codd_normal_form) -while lacking the representational power for the applicability of more complex normal -forms ([Kent, 1983](https://dl.acm.org/citation.cfm?id=358054)). -Adherence to entity normalization prevents redundancies in storage and data -manipulation anomalies. -The same criteria originally motivated the formulation of the classical relational -normal forms. - -## Entity normalization in data queries - -Entity normalization applies to data queries as well. -DataJoint's [query operators](../query/operators.md) are designed to preserve the -entity normalization of their inputs. -For example, the outputs of operators [restriction](../query/restrict.md), -[proj](../query/project.md), and [aggr](../query/aggregation.md) retain the same entity -type as the (first) input. -The [join](../query/join.md) operator produces a new entity type comprising the pairing -of the entity types of its inputs. -[Universal sets](../query/universals.md) explicitly introduce virtual entity sets when -necessary to accomplish a query. - -## Examples of poor normalization - -Design choices lacking entity normalization may lead to data inconsistencies or -anomalies. -Below are several examples of poorly normalized designs and their normalized -alternatives. - -### Indirect attributes - -All attributes should apply to the entity itself. -Avoid attributes that actually apply to one of the entity's other attributes. -For example, consider the table `Author` with attributes `author_name`, `institution`, -and `institution_address`. -The attribute `institution_address` should really be held in a separate `Institution` -table that `Author` depends on. - -### Repeated attributes - -Avoid tables with repeated attributes of the same category. -A better solution is to create a separate table that depends on the first (often a -[part table](../design/tables/master-part.md)), with multiple individual entities -rather than repeated attributes. -For example, consider the table `Protocol` that includes the attributes `equipment1`, -`equipment2`, and `equipment3`. -A better design would be to create a `ProtocolEquipment` table that links each entity -in `Protocol` with multiple entities in `Equipment` through -[dependencies](../design/tables/dependencies.md). - -### Attributes that do not apply to all entities - -All attributes should be relevant to every entity in a table. -Attributes that apply only to a subset of entities in a table likely belong in a -separate table containing only that subset of entities. -For example, a table `Protocol` should include the attribute `stimulus` only if all -experiment protocols include stimulation. -If the not all entities in `Protocol` involve stimulation, then the `stimulus` -attribute should be moved to a part table that has `Protocol` as its master. -Only protocols using stimulation will have an entry in this part table. - -### Transient attributes - -Attributes should be relevant to all entities in a table at all times. -Attributes that do not apply to all entities should be moved to another dependent table -containing only the appropriate entities. 
-This principle also applies to attributes that have not yet become meaningful for some -entities or that will not remain meaningful indefinitely. -For example, consider the table `Mouse` with attributes `birth_date` and `death_date`, -where `death_date` is set to `NULL` for living mice. -Since the `death_date` attribute is not meaningful for mice that are still living, -the proper design would include a separate table `DeceasedMouse` that depends on -`Mouse`. -`DeceasedMouse` would only contain entities for dead mice, which improves integrity and -averts the need for [updates](../manipulation/update.md). diff --git a/docs/src/archive/design/pk-rules-spec.md b/docs/src/archive/design/pk-rules-spec.md deleted file mode 100644 index c6e2dc8ea..000000000 --- a/docs/src/archive/design/pk-rules-spec.md +++ /dev/null @@ -1,318 +0,0 @@ -# Primary Key Rules in Relational Operators - -In DataJoint, the result of each query operator produces a valid **entity set** with a well-defined **entity type** and **primary key**. This section specifies how the primary key is determined for each relational operator. - -## General Principle - -The primary key of a query result identifies unique entities in that result. For most operators, the primary key is preserved from the left operand. For joins, the primary key depends on the functional dependencies between the operands. - -## Integration with Semantic Matching - -Primary key determination is applied **after** semantic compatibility is verified. The evaluation order is: - -1. **Semantic Check**: `assert_join_compatibility()` ensures all namesakes are homologous (same lineage) -2. **PK Determination**: The "determines" relationship is computed using attribute names -3. **Left Join Validation**: If `left=True`, verify A → B - -This ordering is important because: -- After semantic matching passes, namesakes represent semantically equivalent attributes -- The name-based "determines" check is therefore semantically valid -- Attribute names in the context of a semantically-valid join represent the same entity - -The "determines" relationship uses attribute **names** (not lineages directly) because: -- Lineage ensures namesakes are homologous -- Once verified, checking by name is equivalent to checking by semantic identity -- Aliased attributes (same lineage, different names) don't participate in natural joins anyway - -## Notation - -In the examples below, `*` marks primary key attributes: -- `A(x*, y*, z)` means A has primary key `{x, y}` and secondary attribute `z` -- `A → B` means "A determines B" (defined below) - -### Rules by Operator - -| Operator | Primary Key Rule | -|----------|------------------| -| `A & B` (restriction) | PK(A) — preserved from left operand | -| `A - B` (anti-restriction) | PK(A) — preserved from left operand | -| `A.proj(...)` (projection) | PK(A) — preserved from left operand | -| `A.aggr(B, ...)` (aggregation) | PK(A) — preserved from left operand | -| `A.extend(B)` (extension) | PK(A) — requires A → B | -| `A * B` (join) | Depends on functional dependencies (see below) | - -### Join Primary Key Rule - -The join operator requires special handling because it combines two entity sets. The primary key of `A * B` depends on the **functional dependency relationship** between the operands. - -#### Definitions - -**A determines B** (written `A → B`): Every attribute in PK(B) is in A. 
- -``` -A → B iff ∀b ∈ PK(B): b ∈ A -``` - -Since `PK(A) ∪ secondary(A) = all attributes in A`, this is equivalent to saying every attribute in B's primary key exists somewhere in A (as either a primary key or secondary attribute). - -Intuitively, `A → B` means that knowing A's primary key is sufficient to determine B's primary key through the functional dependencies implied by A's structure. - -**B determines A** (written `B → A`): Every attribute in PK(A) is in B. - -``` -B → A iff ∀a ∈ PK(A): a ∈ B -``` - -#### Join Primary Key Algorithm - -For `A * B`: - -| Condition | PK(A * B) | Attribute Order | -|-----------|-----------|-----------------| -| A → B | PK(A) | A's attributes first | -| B → A (and not A → B) | PK(B) | B's attributes first | -| Neither | PK(A) ∪ PK(B) | PK(A) first, then PK(B) − PK(A) | - -When both `A → B` and `B → A` hold, the left operand takes precedence (use PK(A)). - -#### Examples - -**Example 1: B → A** -``` -A: x*, y* -B: x*, z*, y (y is secondary in B, so z → y) -``` -- A → B? PK(B) = {x, z}. Is z in PK(A) or secondary in A? No (z not in A). **No.** -- B → A? PK(A) = {x, y}. Is y in PK(B) or secondary in B? Yes (secondary). **Yes.** -- Result: **PK(A * B) = {x, z}** with B's attributes first. - -**Example 2: Both directions (bijection-like)** -``` -A: x*, y*, z (z is secondary in A) -B: y*, z*, x (x is secondary in B) -``` -- A → B? PK(B) = {y, z}. Is z in PK(A) or secondary in A? Yes (secondary). **Yes.** -- B → A? PK(A) = {x, y}. Is x in PK(B) or secondary in B? Yes (secondary). **Yes.** -- Both hold, prefer left operand: **PK(A * B) = {x, y}** with A's attributes first. - -**Example 3: Neither direction** -``` -A: x*, y* -B: z*, x (x is secondary in B) -``` -- A → B? PK(B) = {z}. Is z in PK(A) or secondary in A? No. **No.** -- B → A? PK(A) = {x, y}. Is y in PK(B) or secondary in B? No (y not in B). **No.** -- Result: **PK(A * B) = {x, y, z}** (union) with A's attributes first. - -**Example 4: A → B (subordinate relationship)** -``` -Session: session_id* -Trial: session_id*, trial_num* (references Session) -``` -- A → B? PK(Trial) = {session_id, trial_num}. Is trial_num in PK(Session) or secondary? No. **No.** -- B → A? PK(Session) = {session_id}. Is session_id in PK(Trial)? Yes. **Yes.** -- Result: **PK(Session * Trial) = {session_id, trial_num}** with Trial's attributes first. - -**Join primary key determination**: - - `A * B` where `A → B`: result has PK(A) - - `A * B` where `B → A` (not `A → B`): result has PK(B), B's attributes first - - `A * B` where both `A → B` and `B → A`: result has PK(A) (left preference) - - `A * B` where neither direction: result has PK(A) ∪ PK(B) - - Verify attribute ordering matches primary key source - - Verify non-commutativity: `A * B` vs `B * A` may differ in PK and order - -### Design Tradeoff: Predictability vs. Minimality - -The join primary key rule prioritizes **predictability** over **minimality**. In some cases, the resulting primary key may not be minimal (i.e., it may contain functionally redundant attributes). 
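Before the worked example of this tradeoff below, the join primary-key algorithm above can be condensed into a small sketch; this is a hypothetical helper working over attribute-name sets, not DataJoint's internal implementation:

```python
# Hypothetical helper (not DataJoint's internal code): primary key of A * B,
# following the algorithm above, expressed over attribute-name sets.
def join_primary_key(pk_a, attrs_a, pk_b, attrs_b):
    a_determines_b = set(pk_b) <= set(attrs_a)  # A -> B: PK(B) within A's attributes
    b_determines_a = set(pk_a) <= set(attrs_b)  # B -> A: PK(A) within B's attributes
    if a_determines_b:
        # Also covers the case where both directions hold: the left operand wins.
        return list(pk_a)
    if b_determines_a:
        return list(pk_b)
    # Neither direction: union of the keys, PK(A) first.
    return list(pk_a) + [b for b in pk_b if b not in pk_a]


# Example 1 above: A(x*, y*), B(x*, z*, y)  ->  PK(A * B) = {x, z}
assert join_primary_key(["x", "y"], ["x", "y"], ["x", "z"], ["x", "z", "y"]) == ["x", "z"]
```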
- -**Example of non-minimal result:** -``` -A: x*, y* -B: z*, x (x is secondary in B, so z → x) -``` - -The mathematically minimal primary key for `A * B` would be `{y, z}` because: -- `z → x` (from B's structure) -- `{y, z} → {x, y, z}` (z gives us x, and we have y) - -However, `{y, z}` is problematic: -- It is **not the primary key of either operand** (A has `{x, y}`, B has `{z}`) -- It is **not the union** of the primary keys -- It represents a **novel entity type** that doesn't correspond to A, B, or their natural pairing - -This creates confusion: what kind of entity does `{y, z}` identify? - -**The simplified rule produces `{x, y, z}`** (the union), which: -- Is immediately recognizable as "one A entity paired with one B entity" -- Contains A's full primary key and B's full primary key -- May have redundancy (`x` is determined by `z`) but is semantically clear - -**Rationale:** Users can always project away redundant attributes if they need the minimal key. But starting with a predictable, interpretable primary key reduces confusion and errors. - -### Attribute Ordering - -The primary key attributes always appear **first** in the result's attribute list, followed by secondary attributes. When `B → A` (and not `A → B`), the join is conceptually reordered as `B * A` to maintain this invariant: - -- If PK = PK(A): A's attributes appear first -- If PK = PK(B): B's attributes appear first -- If PK = PK(A) ∪ PK(B): PK(A) attributes first, then PK(B) − PK(A), then secondaries - -### Non-Commutativity - -With these rules, join is **not commutative** in terms of: -1. **Primary key selection**: `A * B` may have a different PK than `B * A` when one direction determines but not the other -2. **Attribute ordering**: The left operand's attributes appear first (unless B → A) - -The **result set** (the actual rows returned) remains the same regardless of order, but the **schema** (primary key and attribute order) may differ. - -### Left Join Constraint - -For left joins (`A.join(B, left=True)`), the functional dependency **A → B is required**. - -**Why this constraint exists:** - -In a left join, all rows from A are retained even if there's no matching row in B. For unmatched rows, B's attributes are NULL. This creates a problem for primary key validity: - -| Scenario | PK by inner join rule | Left join problem | -|----------|----------------------|-------------------| -| A → B | PK(A) | ✅ Safe — A's attrs always present | -| B → A | PK(B) | ❌ B's PK attrs could be NULL | -| Neither | PK(A) ∪ PK(B) | ❌ B's PK attrs could be NULL | - -**Example of invalid left join:** -``` -A: x*, y* PK(A) = {x, y} -B: x*, z*, y PK(B) = {x, z}, y is secondary - -Inner join: PK = {x, z} (B → A rule) -Left join attempt: FAILS because z could be NULL for unmatched A rows -``` - -**Valid left join example:** -``` -Session: session_id*, date -Trial: session_id*, trial_num*, stimulus (references Session) - -Session.join(Trial, left=True) # OK: Session → Trial -# PK = {session_id}, all sessions retained even without trials -``` - -**Error message:** -``` -DataJointError: Left join requires the left operand to determine the right operand (A → B). -The following attributes from the right operand's primary key are not determined by -the left operand: ['z']. Use an inner join or restructure the query. -``` - -### Conceptual Note: Left Join as Extension - -When `A → B`, the left join `A.join(B, left=True)` is conceptually distinct from the general join operator `A * B`. 
It is better understood as an **extension** operation rather than a join: - -| Aspect | General Join (A * B) | Left Join when A → B | -|--------|---------------------|----------------------| -| Conceptual model | Cartesian product restricted to matching rows | Extend A with attributes from B | -| Row count | May increase, decrease, or stay same | Always equals len(A) | -| Primary key | Depends on functional dependencies | Always PK(A) | -| Relation to projection | Different operation | Variation of projection | - -**The extension perspective:** - -The operation `A.join(B, left=True)` when `A → B` is closer to **projection** than to **join**: -- It adds new attributes to A (like `A.proj(..., new_attr=...)`) -- It preserves all rows of A -- It preserves A's primary key -- It lacks the Cartesian product aspect that defines joins - -DataJoint provides an explicit `extend()` method for this pattern: - -```python -# These are equivalent when A → B: -A.join(B, left=True) -A.extend(B) # clearer intent: extend A with B's attributes -``` - -The `extend()` method: -- Requires `A → B` (raises `DataJointError` otherwise) -- Does not expose `allow_nullable_pk` (that's an internal mechanism) -- Expresses the semantic intent: "add B's attributes to A's entities" - -**Relationship to aggregation:** - -A similar argument applies to `A.aggr(B, ...)`: -- It preserves A's primary key -- It adds computed attributes derived from B -- It's conceptually a variation of projection with grouping - -Both `A.join(B, left=True)` (when A → B) and `A.aggr(B, ...)` can be viewed as **projection-like operations** that extend A's attributes while preserving its entity identity. - -### Bypassing the Left Join Constraint - -For special cases where the user takes responsibility for handling the potentially nullable primary key, the constraint can be bypassed using `allow_nullable_pk=True`: - -```python -# Normally blocked - A does not determine B -A.join(B, left=True) # Error: A → B not satisfied - -# Bypass the constraint - user takes responsibility -A.join(B, left=True, allow_nullable_pk=True) # Allowed, PK = PK(A) ∪ PK(B) -``` - -When bypassed, the resulting primary key is the union of both operands' primary keys (PK(A) ∪ PK(B)). The user must ensure that subsequent operations (such as `GROUP BY` or projection) establish a valid primary key. The parameter name `allow_nullable_pk` reflects the specific issue: primary key attributes from the right operand could be NULL for unmatched rows. - -This mechanism is used internally by aggregation (`aggr`) with `keep_all_rows=True`, which resets the primary key via the `GROUP BY` clause. - -### Aggregation Exception - -`A.aggr(B, keep_all_rows=True)` uses a left join internally but has the **opposite requirement**: **B → A** (the group expression B must have all of A's primary key attributes). - -This apparent contradiction is resolved by the `GROUP BY` clause: - -1. Aggregation requires B → A so that B can be grouped by A's primary key -2. The intermediate left join `A LEFT JOIN B` would have an invalid PK under the normal left join rules -3. Aggregation internally allows the invalid PK, producing PK(A) ∪ PK(B) -4. The `GROUP BY PK(A)` clause then **resets** the primary key to PK(A) -5. The final result has PK(A), which consists entirely of non-NULL values from A - -Note: The semantic check (homologous namesake validation) is still performed for aggregation's internal join. Only the primary key validity constraint is bypassed. 
- -**Example:** -``` -Session: session_id*, date -Trial: session_id*, trial_num*, response_time (references Session) - -# Aggregation with keep_all_rows=True -Session.aggr(Trial, keep_all_rows=True, avg_rt='avg(response_time)') - -# Internally: Session LEFT JOIN Trial (with invalid PK allowed) -# Intermediate PK would be {session_id} ∪ {session_id, trial_num} = {session_id, trial_num} -# But GROUP BY session_id resets PK to {session_id} -# Result: All sessions, with avg_rt=NULL for sessions without trials -``` - -## Universal Set `dj.U` - -`dj.U()` or `dj.U('attr1', 'attr2', ...)` represents the universal set of all possible values and lineages. - -### Homology with `dj.U` -Since `dj.U` conceptually contains all possible lineages, its attributes are **homologous to any namesake attribute** in other expressions. - -### Valid Operations - -```python -# Restriction: promotes a, b to PK; lineage transferred from A -dj.U('a', 'b') & A - -# Aggregation: groups by a, b -dj.U('a', 'b').aggr(A, count='count(*)') -``` - -### Invalid Operations - -```python -# Anti-restriction: produces infinite set -dj.U('a', 'b') - A # DataJointError - -# Join: deprecated, use & instead -dj.U('a', 'b') * A # DataJointError with migration guidance -``` - diff --git a/docs/src/archive/design/recall.md b/docs/src/archive/design/recall.md deleted file mode 100644 index 56226cabd..000000000 --- a/docs/src/archive/design/recall.md +++ /dev/null @@ -1,207 +0,0 @@ -# Work with Existing Pipelines - -## Loading Classes - -This section describes how to work with database schemas without access to the -original code that generated the schema. These situations often arise when the -database is created by another user who has not shared the generating code yet -or when the database schema is created from a programming language other than -Python. - -```python -import datajoint as dj -``` - -### Working with schemas and their modules - -Typically a DataJoint schema is created as a dedicated Python module. This -module defines a schema object that is used to link classes declared in the -module to tables in the database schema. As an example, examine the university -module: [university.py](https://github.com/datajoint-company/db-programming-with-datajoint/blob/master/notebooks/university.py). - -You may then import the module to interact with its tables: - -```python -import university as uni -dj.Diagram(uni) -``` - -![query object preview](../images/virtual-module-ERD.svg){: style="align:center"} - -Note that dj.Diagram can extract the diagram from a schema object or from a -Python module containing its schema object, lending further support to the -convention of one-to-one correspondence between database schemas and Python -modules in a DataJoint project: - -`dj.Diagram(uni)` - -is equivalent to - -`dj.Diagram(uni.schema)` - -```python -# students without majors -uni.Student - uni.StudentMajor -``` - -![query object preview](../images/StudentTable.png){: style="align:center"} - -### Spawning missing classes - -Now imagine that you do not have access to `university.py` or you do not have -its latest version. You can still connect to the database schema but you will -not have classes declared to interact with it. - -So let's start over in this scenario. - -You may use the `dj.list_schemas` function (new in DataJoint 0.12.0) to -list the names of database schemas available to you. 
- -```python -import datajoint as dj -dj.list_schemas() -``` - -```text -*['dimitri_alter','dimitri_attach','dimitri_blob','dimitri_blobs', -'dimitri_nphoton','dimitri_schema','dimitri_university','dimitri_uuid', -'university']* -``` - -Just as with a new schema, we start by creating a schema object to connect to -the chosen database schema: - -```python -schema = dj.Schema('dimitri_university') -``` - -If the schema already exists, `dj.Schema` is initialized as usual and you may plot -the schema diagram. But instead of seeing class names, you will see the raw -table names as they appear in the database. - -```python -# let's plot its diagram -dj.Diagram(schema) -``` - -![query object preview](../images/dimitri-ERD.svg){: style="align:center"} - -You may view the diagram but, at this point, there is no way to interact with -these tables. A similar situation arises when another developer has added new -tables to the schema but has not yet shared the updated module code with you. -Then the diagram will show a mixture of class names and database table names. - -Now you may use the `spawn_missing_classes` method to spawn classes into -the local namespace for any tables missing their classes: - -```python -schema.spawn_missing_classes() -dj.Diagram(schema) -``` - -![query object preview](../images/spawned-classes-ERD.svg){: style="align:center"} - -Now you may interact with these tables as if they were declared right here in -this namespace: - -```python -# students without majors -Student - StudentMajor -``` - -![query object preview](../images/StudentTable.png){: style="align:center"} - -### Creating a virtual module - -Virtual modules provide a way to access the classes corresponding to tables in a -DataJoint schema without having to create local files. - -`spawn_missing_classes` creates the new classes in the local namespace. -However, it is often more convenient to import a schema with its Python module, -equivalent to the Python command: - -```python -import university as uni -``` - -We can mimic this import without having access to `university.py` using the -`VirtualModule` class object: - -```python -import datajoint as dj - -uni = dj.VirtualModule(module_name='university.py', schema_name='dimitri_university') -``` - -Now `uni` behaves as an imported module complete with the schema object and all -the table classes. - -```python -dj.Diagram(uni) -``` - -![query object preview](../images/added-example-ERD.svg){: style="align:center"} - -```python -uni.Student - uni.StudentMajor -``` - -![query object preview](../images/StudentTable.png){: style="align:center"} - -`dj.VirtualModule` takes required arguments - -- `module_name`: displayed module name. - -- `schema_name`: name of the database in MySQL. - -And `dj.VirtualModule` takes optional arguments. - -First, `create_schema=False` assures that an error is raised when the schema -does not already exist. Set it to `True` if you want to create an empty schema. - -```python -dj.VirtualModule('what', 'nonexistent') -``` - -Returns - -```python ---------------------------------------------------------------------------- -DataJointError Traceback (most recent call last) -. -. -. -DataJointError: Database named `nonexistent` was not defined. Set argument create_schema=True to create it. -``` - -The other optional argument, `create_tables=False` is passed to the schema -object. It prevents the use of the schema object of the virtual module for -creating new tables in the existing schema. 
This is a precautionary measure -since virtual modules are often used for completed schemas. You may set this -argument to `True` if you wish to add new tables to the existing schema. A -more common approach in this scenario would be to create a new schema object and -to use the `spawn_missing_classes` function to make the classes available. - -However, you if do decide to create new tables in an existing tables using the -virtual module, you may do so by using the schema object from the module as the -decorator for declaring new tables: - -```python -uni = dj.VirtualModule('university.py', 'dimitri_university', create_tables=True) -``` - -```python -@uni.schema -class Example(dj.Manual): - definition = """ - -> uni.Student - --- - example : varchar(255) - """ -``` - -```python -dj.Diagram(uni) -``` - -![query object preview](../images/added-example-ERD.svg){: style="align:center"} diff --git a/docs/src/archive/design/schema.md b/docs/src/archive/design/schema.md deleted file mode 100644 index 94bf6cdcc..000000000 --- a/docs/src/archive/design/schema.md +++ /dev/null @@ -1,49 +0,0 @@ -# Schema Creation - -## Schemas - -On the database server, related tables are grouped into a named collection called a **schema**. -This grouping organizes the data and allows control of user access. -A database server may contain multiple schemas each containing a subset of the tables. -A single pipeline may comprise multiple schemas. -Tables are defined within a schema, so a schema must be created before the creation of -any tables. - -By convention, the `datajoint` package is imported as `dj`. - The documentation refers to the package as `dj` throughout. - -Create a new schema using the `dj.Schema` class object: - -```python -import datajoint as dj -schema = dj.Schema('alice_experiment') -``` - -This statement creates the database schema `alice_experiment` on the server. - -The returned object `schema` will then serve as a decorator for DataJoint classes, as -described in [table declaration syntax](./tables/declare.md). - -It is a common practice to have a separate Python module for each schema. -Therefore, each such module has only one `dj.Schema` object defined and is usually -named `schema`. - -The `dj.Schema` constructor can take a number of optional parameters after the schema -name. - -- `context` - Dictionary for looking up foreign key references. - Defaults to `None` to use local context. -- `connection` - Specifies the DataJoint connection object. - Defaults to `dj.conn()`. -- `create_schema` - When `False`, the schema object will not create a schema on the -database and will raise an error if one does not already exist. - Defaults to `True`. -- `create_tables` - When `False`, the schema object will not create tables on the -database and will raise errors when accessing missing tables. - Defaults to `True`. - -## Working with existing data - -See the chapter [recall](recall.md) for how to work with data in -existing pipelines, including accessing a pipeline from one language when the pipeline -was developed using another. diff --git a/docs/src/archive/design/semantic-matching-spec.md b/docs/src/archive/design/semantic-matching-spec.md deleted file mode 100644 index b3333a873..000000000 --- a/docs/src/archive/design/semantic-matching-spec.md +++ /dev/null @@ -1,540 +0,0 @@ -# Semantic Matching for Joins - Specification - -## Overview - -This document specifies **semantic matching** for joins in DataJoint 2.0, replacing the current name-based matching rules. 
Semantic matching ensures that attributes are only matched when they share both the same name and the same **lineage** (origin), preventing accidental joins on unrelated attributes that happen to share names. - -### Goals - -1. **Prevent incorrect joins** on attributes that share names but represent different entities -2. **Enable valid joins** that are currently blocked due to overly restrictive rules -3. **Maintain backward compatibility** for well-designed schemas -4. **Provide clear error messages** when semantic conflicts are detected - ---- - -## User Guide - -### Quick Start - -Semantic matching is enabled by default in DataJoint 2.0. For most well-designed schemas, no changes are required. - -#### When You Might See Errors - -```python -# Two tables with generic 'id' attribute -class Student(dj.Manual): - definition = """ - id : uint32 - --- - name : varchar(100) - """ - -class Course(dj.Manual): - definition = """ - id : uint32 - --- - title : varchar(100) - """ - -# This will raise an error because 'id' has different lineages -Student() * Course() # DataJointError! -``` - -#### How to Resolve - -**Option 1: Rename attributes using projection** -```python -Student() * Course().proj(course_id='id') # OK -``` - -**Option 2: Bypass semantic check (use with caution)** -```python -Student().join(Course(), semantic_check=False) # OK, but be careful! -``` - -**Option 3: Use descriptive names (best practice)** -```python -class Student(dj.Manual): - definition = """ - student_id : uint32 - --- - name : varchar(100) - """ -``` - -### Migrating from DataJoint 1.x - -#### Removed Operators - -| Old Syntax | New Syntax | -|------------|------------| -| `A @ B` | `A.join(B, semantic_check=False)` | -| `A ^ B` | `A.restrict(B, semantic_check=False)` | -| `dj.U('a') * B` | `dj.U('a') & B` | - -#### Rebuilding Lineage for Existing Schemas - -If you have existing schemas created before DataJoint 2.0, rebuild their lineage tables: - -```python -import datajoint as dj - -# Connect and get your schema -schema = dj.Schema('my_database') - -# Rebuild lineage (do this once per schema) -schema.rebuild_lineage() - -# Restart Python kernel to pick up changes -``` - -**Important**: If your schema references tables in other schemas, rebuild those upstream schemas first. - ---- - -## API Reference - -### Schema Methods - -#### `schema.rebuild_lineage()` - -Rebuild the `~lineage` table for all tables in this schema. - -```python -schema.rebuild_lineage() -``` - -**Description**: Recomputes lineage for all attributes by querying FK relationships from the database's `information_schema`. Use this to restore lineage for schemas that predate the lineage system or after corruption. - -**Requirements**: -- Schema must exist -- Upstream schemas (referenced via cross-schema FKs) must have their lineage rebuilt first - -**Side Effects**: -- Creates `~lineage` table if it doesn't exist -- Deletes and repopulates all lineage entries for tables in the schema - -**Post-Action**: Restart Python kernel and reimport to pick up new lineage information. - -#### `schema.lineage_table_exists` - -Property indicating whether the `~lineage` table exists in this schema. - -```python -if schema.lineage_table_exists: - print("Lineage tracking is enabled") -``` - -**Returns**: `bool` - `True` if `~lineage` table exists, `False` otherwise. - -#### `schema.lineage` - -Property returning all lineage entries for the schema. 
- -```python -schema.lineage -# {'myschema.session.session_id': 'myschema.session.session_id', -# 'myschema.trial.session_id': 'myschema.session.session_id', -# 'myschema.trial.trial_num': 'myschema.trial.trial_num'} -``` - -**Returns**: `dict` - Maps `'schema.table.attribute'` to its lineage origin - -### Join Methods - -#### `expr.join(other, semantic_check=True)` - -Join two expressions with optional semantic checking. - -```python -result = A.join(B) # semantic_check=True (default) -result = A.join(B, semantic_check=False) # bypass semantic check -``` - -**Parameters**: -- `other`: Another query expression to join with -- `semantic_check` (bool): If `True` (default), raise error on non-homologous namesakes. If `False`, perform natural join without lineage checking. - -**Raises**: `DataJointError` if `semantic_check=True` and namesake attributes have different lineages. - -#### `expr.restrict(other, semantic_check=True)` - -Restrict expression with optional semantic checking. - -```python -result = A.restrict(B) # semantic_check=True (default) -result = A.restrict(B, semantic_check=False) # bypass semantic check -``` - -**Parameters**: -- `other`: Restriction condition (expression, dict, string, etc.) -- `semantic_check` (bool): If `True` (default), raise error on non-homologous namesakes when restricting by another expression. If `False`, no lineage checking. - -**Raises**: `DataJointError` if `semantic_check=True` and namesake attributes have different lineages. - -### Operators - -#### `A * B` (Join) - -Equivalent to `A.join(B, semantic_check=True)`. - -#### `A & B` (Restriction) - -Equivalent to `A.restrict(B, semantic_check=True)`. - -#### `A - B` (Anti-restriction) - -Restriction with negation. Semantic checking applies. - -To bypass semantic checking: `A.restrict(dj.Not(B), semantic_check=False)` - -#### `A + B` (Union) - -Union of expressions. Requires all namesake attributes to have matching lineage. - -### Removed Operators - -#### `A @ B` (Removed) - -Raises `DataJointError` with migration guidance to use `.join(semantic_check=False)`. - -#### `A ^ B` (Removed) - -Raises `DataJointError` with migration guidance to use `.restrict(semantic_check=False)`. - -#### `dj.U(...) * A` (Removed) - -Raises `DataJointError` with migration guidance to use `dj.U(...) & A`. - -### Universal Set (`dj.U`) - -#### Valid Operations - -```python -dj.U('a', 'b') & A # Restriction: promotes a, b to PK -dj.U('a', 'b').aggr(A, ...) # Aggregation: groups by a, b -dj.U() & A # Distinct primary keys of A -``` - -#### Invalid Operations - -```python -dj.U('a', 'b') - A # DataJointError: produces infinite set -dj.U('a', 'b') * A # DataJointError: use & instead -``` - ---- - -## Concepts - -### Attribute Lineage - -Lineage identifies the **origin** of an attribute - where it was first defined. 
It is represented as a string: - -``` -schema_name.table_name.attribute_name -``` - -#### Lineage Assignment Rules - -| Attribute Type | Lineage Value | -|----------------|---------------| -| Native primary key | `this_schema.this_table.attr_name` | -| FK-inherited (primary or secondary) | Traced to original definition | -| Native secondary | `None` | -| Computed (in projection) | `None` | - -#### Example - -```python -class Session(dj.Manual): # table: session - definition = """ - session_id : uint32 - --- - session_date : date - """ - -class Trial(dj.Manual): # table: trial - definition = """ - -> Session - trial_num : uint16 - --- - stimulus : varchar(100) - """ -``` - -Lineages: -- `Session.session_id` → `myschema.session.session_id` (native PK) -- `Session.session_date` → `None` (native secondary) -- `Trial.session_id` → `myschema.session.session_id` (inherited via FK) -- `Trial.trial_num` → `myschema.trial.trial_num` (native PK) -- `Trial.stimulus` → `None` (native secondary) - -### Terminology - -| Term | Definition | -|------|------------| -| **Lineage** | The origin of an attribute: `schema.table.attribute` | -| **Homologous attributes** | Attributes with the same lineage | -| **Namesake attributes** | Attributes with the same name | -| **Homologous namesakes** | Same name AND same lineage — used for join matching | -| **Non-homologous namesakes** | Same name BUT different lineage — cause join errors | - -### Semantic Matching Rules - -| Scenario | Action | -|----------|--------| -| Same name, same lineage (both non-null) | **Match** | -| Same name, different lineage | **Error** | -| Same name, either lineage is null | **Error** | -| Different names | **No match** | - ---- - -## Implementation Details - -### `~lineage` Table - -Each schema has a hidden `~lineage` table storing lineage information: - -```sql -CREATE TABLE `schema_name`.`~lineage` ( - table_name VARCHAR(64) NOT NULL, - attribute_name VARCHAR(64) NOT NULL, - lineage VARCHAR(255) NOT NULL, - PRIMARY KEY (table_name, attribute_name) -) -``` - -### Lineage Population - -**At table declaration**: -1. Delete any existing lineage entries for the table -2. For FK attributes: copy lineage from parent (with warning if parent lineage missing) -3. For native PK attributes: set lineage to `schema.table.attribute` -4. Native secondary attributes: no entry (lineage = None) - -**At table drop**: -- Delete all lineage entries for the table - -### Missing Lineage Handling - -**If `~lineage` table doesn't exist**: -- Warning issued during semantic check -- Semantic checking disabled (join proceeds as natural join) - -**If parent lineage missing during declaration**: -- Warning issued -- Parent attribute used as origin -- Recommend rebuilding lineage after parent schema is fixed - -### Heading's `lineage_available` Property - -The `Heading` class tracks whether lineage information is available: - -```python -heading.lineage_available # True if ~lineage table exists for this schema -``` - -This property is: -- Set when heading is loaded from database -- Propagated through projections, joins, and other operations -- Used by `assert_join_compatibility` to decide whether to perform semantic checking - ---- - -## Error Messages - -### Non-Homologous Namesakes - -``` -DataJointError: Cannot join on attribute `id`: different lineages -(university.student.id vs university.course.id). -Use .proj() to rename one of the attributes. -``` - -### Removed `@` Operator - -``` -DataJointError: The @ operator has been removed in DataJoint 2.0. 
-Use .join(other, semantic_check=False) for permissive joins. -``` - -### Removed `^` Operator - -``` -DataJointError: The ^ operator has been removed in DataJoint 2.0. -Use .restrict(other, semantic_check=False) for permissive restrictions. -``` - -### Removed `dj.U * table` - -``` -DataJointError: dj.U(...) * table is no longer supported in DataJoint 2.0. -Use dj.U(...) & table instead. -``` - -### Missing Lineage Warning - -``` -WARNING: Semantic check disabled: ~lineage table not found. -To enable semantic matching, rebuild lineage with: schema.rebuild_lineage() -``` - -### Parent Lineage Missing Warning - -``` -WARNING: Lineage for `parent_db`.`parent_table`.`attr` not found -(parent schema's ~lineage table may be missing or incomplete). -Using it as origin. Once the parent schema's lineage is rebuilt, -run schema.rebuild_lineage() on this schema to correct the lineage. -``` - ---- - -## Examples - -### Example 1: Valid Join (Shared Lineage) - -```python -class Student(dj.Manual): - definition = """ - student_id : uint32 - --- - name : varchar(100) - """ - -class Enrollment(dj.Manual): - definition = """ - -> Student - -> Course - --- - grade : varchar(2) - """ - -# Works: student_id has same lineage in both -Student() * Enrollment() -``` - -### Example 2: Invalid Join (Different Lineage) - -```python -class TableA(dj.Manual): - definition = """ - id : uint32 - --- - value_a : int32 - """ - -class TableB(dj.Manual): - definition = """ - id : uint32 - --- - value_b : int32 - """ - -# Error: 'id' has different lineages -TableA() * TableB() - -# Solution 1: Rename -TableA() * TableB().proj(b_id='id') - -# Solution 2: Bypass (use with caution) -TableA().join(TableB(), semantic_check=False) -``` - -### Example 3: Multi-hop FK Inheritance - -```python -class Session(dj.Manual): - definition = """ - session_id : uint32 - --- - session_date : date - """ - -class Trial(dj.Manual): - definition = """ - -> Session - trial_num : uint16 - """ - -class Response(dj.Computed): - definition = """ - -> Trial - --- - response_time : float64 - """ - -# All work: session_id traces back to Session in all tables -Session() * Trial() -Session() * Response() -Trial() * Response() -``` - -### Example 4: Secondary FK Attribute - -```python -class Course(dj.Manual): - definition = """ - course_id : int unsigned - --- - title : varchar(100) - """ - -class FavoriteCourse(dj.Manual): - definition = """ - student_id : int unsigned - --- - -> Course - """ - -class RequiredCourse(dj.Manual): - definition = """ - major_id : int unsigned - --- - -> Course - """ - -# Works: course_id is secondary in both, but has same lineage -FavoriteCourse() * RequiredCourse() -``` - -### Example 5: Aliased Foreign Key - -```python -class Person(dj.Manual): - definition = """ - person_id : int unsigned - --- - full_name : varchar(100) - """ - -class Marriage(dj.Manual): - definition = """ - -> Person.proj(husband='person_id') - -> Person.proj(wife='person_id') - --- - marriage_date : date - """ - -# husband and wife both have lineage: schema.person.person_id -# They are homologous (same lineage) but have different names -``` - ---- - -## Best Practices - -1. **Use descriptive attribute names**: Prefer `student_id` over generic `id` - -2. **Leverage foreign keys**: Inherited attributes maintain lineage automatically - -3. **Rebuild lineage for legacy schemas**: Run `schema.rebuild_lineage()` once - -4. **Rebuild upstream schemas first**: For cross-schema FKs, rebuild parent schemas before child schemas - -5. 
**Restart after rebuilding**: Restart Python kernel to pick up new lineage information - -6. **Use `semantic_check=False` sparingly**: Only when you're certain the natural join is correct diff --git a/docs/src/archive/design/tables/attach.md b/docs/src/archive/design/tables/attach.md deleted file mode 100644 index c4950ffdf..000000000 --- a/docs/src/archive/design/tables/attach.md +++ /dev/null @@ -1,67 +0,0 @@ -# External Data - -## File Attachment Datatype - -### Configuration & Usage - -Corresponding to issue -[#480](https://github.com/datajoint/datajoint-python/issues/480), -the `attach` attribute type allows users to `attach` files into DataJoint -schemas as DataJoint-managed files. This is in contrast to traditional `blobs` -which are encodings of programming language data structures such as arrays. - -The functionality is modeled after email attachments, where users `attach` -a file along with a message and message recipients have access to a -copy of that file upon retrieval of the message. - -For DataJoint `attach` attributes, DataJoint will copy the input -file into a DataJoint store, hash the file contents, and track -the input file name. Subsequent `fetch` operations will transfer a -copy of the file to the local directory of the Python process and -return a pointer to it's location for subsequent client usage. This -allows arbitrary files to be `uploaded` or `attached` to a DataJoint -schema for later use in processing. File integrity is preserved by -checksum comparison against the attachment data and verifying the contents -during retrieval. - -For example, given a `localattach` store: - -```python -dj.config['stores'] = { - 'localattach': { - 'protocol': 'file', - 'location': '/data/attach' - } -} -``` - -A `ScanAttachment` table can be created: - -```python -@schema -class ScanAttachment(dj.Manual): - definition = """ - -> Session - --- - scan_image: attach@localattach # attached image scans - """ -``` - -Files can be added using an insert pointing to the source file: - -```python ->>> ScanAttachment.insert1((0, '/input/image0.tif')) -``` - -And then retrieved to the current directory using `fetch`: - -```python ->>> s0 = (ScanAttachment & {'session_id': 0}).fetch1() ->>> s0 -{'session_id': 0, 'scan_image': './image0.tif'} ->>> fh = open(s0['scan_image'], 'rb') ->>> fh -<_io.BufferedReader name='./image0.tif') -``` - - diff --git a/docs/src/archive/design/tables/attributes.md b/docs/src/archive/design/tables/attributes.md deleted file mode 100644 index 3753621d5..000000000 --- a/docs/src/archive/design/tables/attributes.md +++ /dev/null @@ -1,181 +0,0 @@ -# Datatypes - -DataJoint supports the following datatypes. -To conserve database resources, use the smallest and most restrictive datatype -sufficient for your data. -This also ensures that only valid data are entered into the pipeline. - -## Core datatypes (recommended) - -Use these portable, scientist-friendly types for cross-database compatibility. - -### Integers - -- `int8`: 8-bit signed integer (-128 to 127) -- `uint8`: 8-bit unsigned integer (0 to 255) -- `int16`: 16-bit signed integer (-32,768 to 32,767) -- `uint16`: 16-bit unsigned integer (0 to 65,535) -- `int32`: 32-bit signed integer -- `uint32`: 32-bit unsigned integer -- `int64`: 64-bit signed integer -- `uint64`: 64-bit unsigned integer -- `bool`: boolean value (True/False, stored as 0/1) - -### Floating-point - -- `float32`: 32-bit single-precision floating-point. Sufficient for many measurements. -- `float64`: 64-bit double-precision floating-point. 
- Avoid using floating-point types in primary keys due to equality comparison issues. -- `decimal(n,f)`: fixed-point number with *n* total digits and *f* fractional digits. - Use for exact decimal representation (e.g., currency, coordinates). - Safe for primary keys due to well-defined precision. - -### Strings - -- `char(n)`: fixed-length string of exactly *n* characters. -- `varchar(n)`: variable-length string up to *n* characters. -- `enum(...)`: one of several enumerated values, e.g., `enum("low", "medium", "high")`. - Do not use enums in primary keys due to difficulty changing definitions. - -> **Note:** For unlimited text, use `varchar` with a generous limit, `json` for structured content, -> or `` for large text files. Native SQL `text` types are supported but not portable. - -**Encoding policy:** All strings use UTF-8 encoding (`utf8mb4` in MySQL, `UTF8` in PostgreSQL). -Character encoding and collation are database-level configuration, not part of type definitions. -Comparisons are case-sensitive by default. - -### Date/Time - -- `date`: date as `'YYYY-MM-DD'`. -- `datetime`: date and time as `'YYYY-MM-DD HH:MM:SS'`. - Use `CURRENT_TIMESTAMP` as default for auto-populated timestamps. - -**Timezone policy:** All `datetime` values should be stored as **UTC**. Timezone -conversion is a presentation concern handled by the application layer. This ensures -reproducible computations regardless of server location or timezone settings. - -### Binary - -- `bytes`: raw binary data (up to 4 GiB). Stores and returns raw bytes without - serialization. For serialized Python objects (arrays, dicts, etc.), use ``. - -### Other - -- `uuid`: 128-bit universally unique identifier. -- `json`: JSON document for structured data. - -## Native datatypes (advanced) - -Native database types are available for advanced use cases but are **not recommended** -for portable pipelines. Using native types will generate a warning. - -- `tinyint`, `smallint`, `int`, `bigint` (with optional `unsigned`) -- `float`, `double`, `real` -- `tinyblob`, `blob`, `mediumblob`, `longblob` -- `tinytext`, `mediumtext`, `longtext` (size variants) -- `time`, `timestamp`, `year` -- `mediumint`, `serial`, `int auto_increment` - -See the [storage types spec](storage-types-spec.md) for complete mappings. - -## Codec types (special datatypes) - -Codecs provide `encode()`/`decode()` semantics for complex data that doesn't -fit native database types. They are denoted with angle brackets: ``. - -### Storage mode: `@` convention - -The `@` character indicates **external storage** (object store vs database): - -- **No `@`**: Internal storage (database) - e.g., ``, `` -- **`@` present**: External storage (object store) - e.g., ``, `` -- **`@` alone**: Use default store - e.g., `` -- **`@name`**: Use named store - e.g., `` - -### Built-in codecs - -**Serialization types** - for Python objects: - -- ``: DataJoint's native serialization format for Python objects. Supports - NumPy arrays, dicts, lists, datetime objects, and nested structures. Stores in - database. Compatible with MATLAB. See [custom codecs](codecs.md) for details. - -- `` / ``: Like `` but stores externally with hash- - addressed deduplication. Use for large arrays that may be duplicated across rows. - -**File storage types** - for managed files: - -- `` / ``: Managed file and folder storage with path derived - from primary key. Supports Zarr, HDF5, and direct writes via fsspec. Returns - `ObjectRef` for lazy access. External only. See [object storage](object.md). 
- -- `` / ``: Hash-addressed storage for raw bytes with - MD5 deduplication. External only. Use via `` or `` rather than directly. - -**File attachment types** - for file transfer: - -- ``: File attachment stored in database with filename preserved. Similar - to email attachments. Good for small files (<16MB). See [attachments](attach.md). - -- `` / ``: Like `` but stores externally with - deduplication. Use for large files. - -**File reference types** - for external files: - -- ``: Reference to existing file in a configured store. No file - copying occurs. Returns `ObjectRef` for lazy access. External only. See [filepath](filepath.md). - -### User-defined codecs - -- ``: Define your own [custom codec](codecs.md) with - bidirectional conversion between Python objects and database storage. Use for - graphs, domain-specific objects, or custom data structures. - -## Core type aliases - -DataJoint provides convenient type aliases that map to standard database types. -These aliases use familiar naming conventions from NumPy and other numerical computing -libraries, making table definitions more readable and portable across database backends. - -| Alias | MySQL | PostgreSQL | Description | -|-------|-------|------------|-------------| -| `bool` | `TINYINT` | `BOOLEAN` | Boolean value (0 or 1) | -| `int8` | `TINYINT` | `SMALLINT` | 8-bit signed integer (-128 to 127) | -| `uint8` | `TINYINT UNSIGNED` | `SMALLINT` | 8-bit unsigned integer (0 to 255) | -| `int16` | `SMALLINT` | `SMALLINT` | 16-bit signed integer | -| `uint16` | `SMALLINT UNSIGNED` | `INTEGER` | 16-bit unsigned integer | -| `int32` | `INT` | `INTEGER` | 32-bit signed integer | -| `uint32` | `INT UNSIGNED` | `BIGINT` | 32-bit unsigned integer | -| `int64` | `BIGINT` | `BIGINT` | 64-bit signed integer | -| `uint64` | `BIGINT UNSIGNED` | `NUMERIC(20)` | 64-bit unsigned integer | -| `float32` | `FLOAT` | `REAL` | 32-bit single-precision float | -| `float64` | `DOUBLE` | `DOUBLE PRECISION` | 64-bit double-precision float | -| `bytes` | `LONGBLOB` | `BYTEA` | Raw binary data | - -Example usage: - -```python -@schema -class Measurement(dj.Manual): - definition = """ - measurement_id : int32 - --- - temperature : float32 # single-precision temperature reading - precise_value : float64 # double-precision measurement - sample_count : uint32 # unsigned 32-bit counter - sensor_flags : uint8 # 8-bit status flags - is_valid : bool # boolean flag - raw_data : bytes # raw binary data - processed : # serialized Python object - large_array : # external storage with deduplication - """ -``` - -## Datatypes not (yet) supported - -- `binary(n)` / `varbinary(n)` - use `bytes` instead -- `bit(n)` - use `int` types with bitwise operations -- `set(...)` - use `json` for multiple selections - -For additional information about these datatypes, see -http://dev.mysql.com/doc/refman/5.6/en/data-types.html diff --git a/docs/src/archive/design/tables/blobs.md b/docs/src/archive/design/tables/blobs.md deleted file mode 100644 index 9f73d54d4..000000000 --- a/docs/src/archive/design/tables/blobs.md +++ /dev/null @@ -1,26 +0,0 @@ -# Blobs - -DataJoint provides functionality for serializing and deserializing complex data types -into binary blobs for efficient storage and compatibility with MATLAB's mYm -serialization. This includes support for: - -+ Basic Python data types (e.g., integers, floats, strings, dictionaries). -+ NumPy arrays and scalars. -+ Specialized data types like UUIDs, decimals, and datetime objects. 
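As a minimal sketch of the round trip (the schema, table, and attribute names here are hypothetical, not taken from this documentation), such values can be inserted into a `longblob` attribute and are reconstructed on fetch:

```python
import datajoint as dj
import numpy as np

schema = dj.Schema('demo_blobs')  # hypothetical schema name

@schema
class Result(dj.Manual):
    definition = """
    result_id : int
    ---
    payload : longblob   # serialized Python object
    """

# Nested structures are serialized on insert and reconstructed on fetch
Result.insert1({'result_id': 1,
                'payload': {'trace': np.arange(5), 'label': 'baseline'}})
restored = (Result & 'result_id = 1').fetch1('payload')  # dict containing a NumPy array
```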
- -## Serialization and Deserialization Process - -Serialization converts Python objects into a binary representation for efficient storage -within the database. Deserialization converts the binary representation back into the -original Python object. - -Blobs over 1 KiB are compressed using the zlib library to reduce storage requirements. - -## Supported Data Types - -DataJoint supports the following data types for serialization: - -+ Scalars: Integers, floats, booleans, strings. -+ Collections: Lists, tuples, sets, dictionaries. -+ NumPy: Arrays, structured arrays, and scalars. -+ Custom Types: UUIDs, decimals, datetime objects, MATLAB cell and struct arrays. diff --git a/docs/src/archive/design/tables/codec-spec.md b/docs/src/archive/design/tables/codec-spec.md deleted file mode 100644 index a3eefa578..000000000 --- a/docs/src/archive/design/tables/codec-spec.md +++ /dev/null @@ -1,766 +0,0 @@ -# Codec Specification - -This document specifies the DataJoint Codec API for creating custom attribute types -that extend DataJoint's native type system. - -## Overview - -Codecs define bidirectional conversion between Python objects and database storage. -They enable storing complex data types (graphs, models, custom formats) while -maintaining DataJoint's query capabilities. - -``` -┌─────────────────┐ ┌─────────────────┐ -│ Python Object │ ──── encode ────► │ Storage Type │ -│ (e.g. Graph) │ │ (e.g. bytes) │ -│ │ ◄─── decode ──── │ │ -└─────────────────┘ └─────────────────┘ -``` - -## Quick Start - -```python -import datajoint as dj -import networkx as nx - -class GraphCodec(dj.Codec): - """Store NetworkX graphs.""" - - name = "graph" # Use as in definitions - - def get_dtype(self, is_external: bool) -> str: - return "" # Delegate to blob for serialization - - def encode(self, graph, *, key=None, store_name=None): - return { - 'nodes': list(graph.nodes(data=True)), - 'edges': list(graph.edges(data=True)), - } - - def decode(self, stored, *, key=None): - G = nx.Graph() - G.add_nodes_from(stored['nodes']) - G.add_edges_from(stored['edges']) - return G - -# Use in table definition -@schema -class Connectivity(dj.Manual): - definition = ''' - conn_id : int - --- - network : - ''' -``` - -## The Codec Base Class - -All custom codecs inherit from `dj.Codec`: - -```python -class Codec(ABC): - """Base class for codec types.""" - - name: str | None = None # Required: unique identifier - - def get_dtype(self, is_external: bool) -> str: - """Return the storage dtype.""" - raise NotImplementedError - - @abstractmethod - def encode(self, value, *, key=None, store_name=None) -> Any: - """Encode Python value for storage.""" - ... - - @abstractmethod - def decode(self, stored, *, key=None) -> Any: - """Decode stored value back to Python.""" - ... - - def validate(self, value) -> None: - """Optional: validate value before encoding.""" - pass -``` - -## Required Components - -### 1. The `name` Attribute - -The `name` class attribute is a unique identifier used in table definitions with -`` syntax: - -```python -class MyCodec(dj.Codec): - name = "mycodec" # Use as in definitions -``` - -Naming conventions: -- Use lowercase with underscores: `spike_train`, `graph_embedding` -- Avoid generic names that might conflict: prefer `lab_model` over `model` -- Names must be unique across all registered codecs - -### 2. The `get_dtype()` Method - -Returns the underlying storage type. 
The `is_external` parameter indicates whether -the `@` modifier is present in the table definition: - -```python -def get_dtype(self, is_external: bool) -> str: - """ - Args: - is_external: True if @ modifier present (e.g., ) - - Returns: - - A core type: "bytes", "json", "varchar(N)", "int32", etc. - - Another codec: "", "", etc. - - Raises: - DataJointError: If external storage not supported but @ is present - """ -``` - -Examples: - -```python -# Simple: always store as bytes -def get_dtype(self, is_external: bool) -> str: - return "bytes" - -# Different behavior for internal/external -def get_dtype(self, is_external: bool) -> str: - return "" if is_external else "bytes" - -# External-only codec -def get_dtype(self, is_external: bool) -> str: - if not is_external: - raise DataJointError(" requires @ (external storage only)") - return "json" -``` - -### 3. The `encode()` Method - -Converts Python objects to the format expected by `get_dtype()`: - -```python -def encode(self, value: Any, *, key: dict | None = None, store_name: str | None = None) -> Any: - """ - Args: - value: The Python object to store - key: Primary key values (for context-dependent encoding) - store_name: Target store name (for external storage) - - Returns: - Value in the format expected by get_dtype() - """ -``` - -### 4. The `decode()` Method - -Converts stored values back to Python objects: - -```python -def decode(self, stored: Any, *, key: dict | None = None) -> Any: - """ - Args: - stored: Data retrieved from storage - key: Primary key values (for context-dependent decoding) - - Returns: - The reconstructed Python object - """ -``` - -### 5. The `validate()` Method (Optional) - -Called automatically before `encode()` during INSERT operations: - -```python -def validate(self, value: Any) -> None: - """ - Args: - value: The value to validate - - Raises: - TypeError: If the value has an incompatible type - ValueError: If the value fails domain validation - """ - if not isinstance(value, ExpectedType): - raise TypeError(f"Expected ExpectedType, got {type(value).__name__}") -``` - -## Auto-Registration - -Codecs automatically register when their class is defined. No decorator needed: - -```python -# This codec is registered automatically when the class is defined -class MyCodec(dj.Codec): - name = "mycodec" - # ... -``` - -### Skipping Registration - -For abstract base classes that shouldn't be registered: - -```python -class BaseCodec(dj.Codec, register=False): - """Abstract base - not registered.""" - name = None # Or omit entirely - -class ConcreteCodec(BaseCodec): - name = "concrete" # This one IS registered - # ... -``` - -### Registration Timing - -Codecs are registered at class definition time. Ensure your codec classes are -imported before any table definitions that use them: - -```python -# myproject/codecs.py -class GraphCodec(dj.Codec): - name = "graph" - ... - -# myproject/tables.py -import myproject.codecs # Ensure codecs are registered - -@schema -class Networks(dj.Manual): - definition = ''' - id : int - --- - network : - ''' -``` - -## Codec Composition (Chaining) - -Codecs can delegate to other codecs by returning `` from `get_dtype()`. 
-This enables layered functionality: - -```python -class CompressedJsonCodec(dj.Codec): - """Compress JSON data with zlib.""" - - name = "zjson" - - def get_dtype(self, is_external: bool) -> str: - return "" # Delegate serialization to blob codec - - def encode(self, value, *, key=None, store_name=None): - import json, zlib - json_bytes = json.dumps(value).encode('utf-8') - return zlib.compress(json_bytes) - - def decode(self, stored, *, key=None): - import json, zlib - json_bytes = zlib.decompress(stored) - return json.loads(json_bytes.decode('utf-8')) -``` - -### How Chaining Works - -When DataJoint encounters ``: - -1. Calls `ZjsonCodec.get_dtype(is_external=False)` → returns `""` -2. Calls `BlobCodec.get_dtype(is_external=False)` → returns `"bytes"` -3. Final storage type is `bytes` (LONGBLOB in MySQL) - -During INSERT: -1. `ZjsonCodec.encode()` converts Python dict → compressed bytes -2. `BlobCodec.encode()` packs bytes → DJ blob format -3. Stored in database - -During FETCH: -1. Read from database -2. `BlobCodec.decode()` unpacks DJ blob → compressed bytes -3. `ZjsonCodec.decode()` decompresses → Python dict - -### Built-in Codec Chains - -DataJoint's built-in codecs form these chains: - -``` - → bytes (internal) - → json (external) - - → bytes (internal) - → json (external) - - → json (external only) - → json (external only) - → json (external only) -``` - -### Store Name Propagation - -When using external storage (`@`), the store name propagates through the chain: - -```python -# Table definition -data : - -# Resolution: -# 1. MyCodec.get_dtype(is_external=True) → "" -# 2. BlobCodec.get_dtype(is_external=True) → "" -# 3. HashCodec.get_dtype(is_external=True) → "json" -# 4. store_name="coldstore" passed to HashCodec.encode() -``` - -## Plugin System (Entry Points) - -Codecs can be distributed as installable packages using Python entry points. 
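Entry points are ordinary package metadata. As an illustration only (this is not DataJoint's own discovery code), the codec classes that installed packages advertise under the `datajoint.codecs` group can be listed with the standard library:

```python
from importlib.metadata import entry_points

# Enumerate codec classes advertised under the "datajoint.codecs" entry-point group
for ep in entry_points(group="datajoint.codecs"):
    codec_cls = ep.load()  # imports the defining module and returns the class
    print(f"{ep.name} -> {codec_cls.__module__}.{codec_cls.__qualname__}")
```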
- -### Package Structure - -``` -dj-graph-codecs/ -├── pyproject.toml -└── src/ - └── dj_graph_codecs/ - ├── __init__.py - └── codecs.py -``` - -### pyproject.toml - -```toml -[project] -name = "dj-graph-codecs" -version = "1.0.0" -dependencies = ["datajoint>=2.0", "networkx"] - -[project.entry-points."datajoint.codecs"] -graph = "dj_graph_codecs.codecs:GraphCodec" -weighted_graph = "dj_graph_codecs.codecs:WeightedGraphCodec" -``` - -### Codec Implementation - -```python -# src/dj_graph_codecs/codecs.py -import datajoint as dj -import networkx as nx - -class GraphCodec(dj.Codec): - name = "graph" - - def get_dtype(self, is_external: bool) -> str: - return "" - - def encode(self, graph, *, key=None, store_name=None): - return { - 'nodes': list(graph.nodes(data=True)), - 'edges': list(graph.edges(data=True)), - } - - def decode(self, stored, *, key=None): - G = nx.Graph() - G.add_nodes_from(stored['nodes']) - G.add_edges_from(stored['edges']) - return G - -class WeightedGraphCodec(dj.Codec): - name = "weighted_graph" - - def get_dtype(self, is_external: bool) -> str: - return "" - - def encode(self, graph, *, key=None, store_name=None): - return { - 'nodes': list(graph.nodes(data=True)), - 'edges': [(u, v, d) for u, v, d in graph.edges(data=True)], - } - - def decode(self, stored, *, key=None): - G = nx.Graph() - G.add_nodes_from(stored['nodes']) - for u, v, d in stored['edges']: - G.add_edge(u, v, **d) - return G -``` - -### Usage After Installation - -```bash -pip install dj-graph-codecs -``` - -```python -# Codecs are automatically discovered and available -@schema -class Networks(dj.Manual): - definition = ''' - network_id : int - --- - topology : - weights : - ''' -``` - -### Entry Point Discovery - -DataJoint loads entry points lazily when a codec is first requested: - -1. Check explicit registry (codecs defined in current process) -2. Load entry points from `datajoint.codecs` group -3. Also checks legacy `datajoint.types` group for compatibility - -## API Reference - -### Module Functions - -```python -import datajoint as dj - -# List all registered codec names -dj.list_codecs() # Returns: ['blob', 'hash', 'object', 'attach', 'filepath', ...] 
- -# Get a codec instance by name -codec = dj.get_codec("blob") -codec = dj.get_codec("") # Angle brackets are optional -codec = dj.get_codec("") # Store parameter is stripped -``` - -### Internal Functions (for advanced use) - -```python -from datajoint.codecs import ( - is_codec_registered, # Check if codec exists - unregister_codec, # Remove codec (testing only) - resolve_dtype, # Resolve codec chain - parse_type_spec, # Parse "" syntax -) -``` - -## Built-in Codecs - -DataJoint provides these built-in codecs: - -| Codec | Internal | External | Description | -|-------|----------|----------|-------------| -| `` | `bytes` | `` | DataJoint serialization for Python objects | -| `` | N/A | `json` | Content-addressed storage with MD5 deduplication | -| `` | N/A | `json` | Path-addressed storage for files/folders | -| `` | `bytes` | `` | File attachments with filename preserved | -| `` | N/A | `json` | Reference to existing files in store | - -## Complete Examples - -### Example 1: Simple Serialization - -```python -import datajoint as dj -import numpy as np - -class SpikeTrainCodec(dj.Codec): - """Efficient storage for sparse spike timing data.""" - - name = "spike_train" - - def get_dtype(self, is_external: bool) -> str: - return "" - - def validate(self, value): - if not isinstance(value, np.ndarray): - raise TypeError("Expected numpy array of spike times") - if value.ndim != 1: - raise ValueError("Spike train must be 1-dimensional") - if len(value) > 1 and not np.all(np.diff(value) >= 0): - raise ValueError("Spike times must be sorted") - - def encode(self, spike_times, *, key=None, store_name=None): - # Store as differences (smaller values, better compression) - return np.diff(spike_times, prepend=0).astype(np.float32) - - def decode(self, stored, *, key=None): - # Reconstruct original spike times - return np.cumsum(stored).astype(np.float64) -``` - -### Example 2: External Storage - -```python -import datajoint as dj -import pickle - -class ModelCodec(dj.Codec): - """Store ML models with optional external storage.""" - - name = "model" - - def get_dtype(self, is_external: bool) -> str: - # Use hash-addressed storage for large models - return "" if is_external else "" - - def encode(self, model, *, key=None, store_name=None): - return pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL) - - def decode(self, stored, *, key=None): - return pickle.loads(stored) - - def validate(self, value): - # Check that model has required interface - if not hasattr(value, 'predict'): - raise TypeError("Model must have a predict() method") -``` - -Usage: -```python -@schema -class Models(dj.Manual): - definition = ''' - model_id : int - --- - small_model : # Internal storage - large_model : # External (default store) - archive_model : # External (specific store) - ''' -``` - -### Example 3: JSON with Schema Validation - -```python -import datajoint as dj -import jsonschema - -class ConfigCodec(dj.Codec): - """Store validated JSON configuration.""" - - name = "config" - - SCHEMA = { - "type": "object", - "properties": { - "version": {"type": "integer", "minimum": 1}, - "settings": {"type": "object"}, - }, - "required": ["version", "settings"], - } - - def get_dtype(self, is_external: bool) -> str: - return "json" - - def validate(self, value): - jsonschema.validate(value, self.SCHEMA) - - def encode(self, config, *, key=None, store_name=None): - return config # JSON type handles serialization - - def decode(self, stored, *, key=None): - return stored -``` - -### Example 4: Context-Dependent Encoding - 
-```python -import datajoint as dj - -class VersionedDataCodec(dj.Codec): - """Handle different encoding versions based on primary key.""" - - name = "versioned" - - def get_dtype(self, is_external: bool) -> str: - return "" - - def encode(self, value, *, key=None, store_name=None): - version = key.get("schema_version", 1) if key else 1 - if version >= 2: - return {"v": 2, "data": self._encode_v2(value)} - return {"v": 1, "data": self._encode_v1(value)} - - def decode(self, stored, *, key=None): - version = stored.get("v", 1) - if version >= 2: - return self._decode_v2(stored["data"]) - return self._decode_v1(stored["data"]) - - def _encode_v1(self, value): - return value - - def _decode_v1(self, data): - return data - - def _encode_v2(self, value): - # New encoding format - return {"optimized": True, "payload": value} - - def _decode_v2(self, data): - return data["payload"] -``` - -### Example 5: External-Only Codec - -```python -import datajoint as dj -from pathlib import Path - -class ZarrCodec(dj.Codec): - """Store Zarr arrays in object storage.""" - - name = "zarr" - - def get_dtype(self, is_external: bool) -> str: - if not is_external: - raise dj.DataJointError(" requires @ (external storage only)") - return "" # Delegate to object storage - - def encode(self, value, *, key=None, store_name=None): - import zarr - import tempfile - - # If already a path, pass through - if isinstance(value, (str, Path)): - return str(value) - - # If zarr array, save to temp and return path - if isinstance(value, zarr.Array): - tmpdir = tempfile.mkdtemp() - path = Path(tmpdir) / "data.zarr" - zarr.save(path, value) - return str(path) - - raise TypeError(f"Expected zarr.Array or path, got {type(value)}") - - def decode(self, stored, *, key=None): - # ObjectCodec returns ObjectRef, use its fsmap for zarr - import zarr - return zarr.open(stored.fsmap, mode='r') -``` - -## Best Practices - -### 1. Choose Appropriate Storage Types - -| Data Type | Recommended `get_dtype()` | -|-----------|---------------------------| -| Python objects (dicts, arrays) | `""` | -| Large binary data | `""` (external) | -| Files/folders (Zarr, HDF5) | `""` (external) | -| Simple JSON-serializable | `"json"` | -| Short strings | `"varchar(N)"` | -| Numeric identifiers | `"int32"`, `"int64"` | - -### 2. Handle None Values - -Nullable columns may pass `None` to your codec: - -```python -def encode(self, value, *, key=None, store_name=None): - if value is None: - return None # Pass through for nullable columns - return self._actual_encode(value) - -def decode(self, stored, *, key=None): - if stored is None: - return None - return self._actual_decode(stored) -``` - -### 3. Test Round-Trips - -Always verify that `decode(encode(x)) == x`: - -```python -def test_codec_roundtrip(): - codec = MyCodec() - - test_values = [ - {"key": "value"}, - [1, 2, 3], - np.array([1.0, 2.0]), - ] - - for original in test_values: - encoded = codec.encode(original) - decoded = codec.decode(encoded) - assert decoded == original or np.array_equal(decoded, original) -``` - -### 4. Include Validation - -Catch errors early with `validate()`: - -```python -def validate(self, value): - if not isinstance(value, ExpectedType): - raise TypeError(f"Expected ExpectedType, got {type(value).__name__}") - - if not self._is_valid(value): - raise ValueError("Value fails validation constraints") -``` - -### 5. Document Expected Formats - -Include docstrings explaining input/output formats: - -```python -class MyCodec(dj.Codec): - """ - Store MyType objects. 
- - Input format (encode): - MyType instance with attributes: x, y, z - - Storage format: - Dict with keys: 'x', 'y', 'z' - - Output format (decode): - MyType instance reconstructed from storage - """ -``` - -### 6. Consider Versioning - -If your encoding format might change: - -```python -def encode(self, value, *, key=None, store_name=None): - return { - "_version": 2, - "_data": self._encode_v2(value), - } - -def decode(self, stored, *, key=None): - version = stored.get("_version", 1) - data = stored.get("_data", stored) - - if version == 1: - return self._decode_v1(data) - return self._decode_v2(data) -``` - -## Error Handling - -### Common Errors - -| Error | Cause | Solution | -|-------|-------|----------| -| `Unknown codec: ` | Codec not registered | Import module defining codec before table definition | -| `Codec already registered` | Duplicate name | Use unique names; check for conflicts | -| ` requires @` | External-only codec used without @ | Add `@` or `@store` to attribute type | -| `Circular codec reference` | Codec chain forms a loop | Check `get_dtype()` return values | - -### Debugging - -```python -# Check what codecs are registered -print(dj.list_codecs()) - -# Inspect a codec -codec = dj.get_codec("mycodec") -print(f"Name: {codec.name}") -print(f"Internal dtype: {codec.get_dtype(is_external=False)}") -print(f"External dtype: {codec.get_dtype(is_external=True)}") - -# Resolve full chain -from datajoint.codecs import resolve_dtype -final_type, chain, store = resolve_dtype("") -print(f"Final storage type: {final_type}") -print(f"Codec chain: {[c.name for c in chain]}") -print(f"Store: {store}") -``` diff --git a/docs/src/archive/design/tables/codecs.md b/docs/src/archive/design/tables/codecs.md deleted file mode 100644 index ccc9db1f7..000000000 --- a/docs/src/archive/design/tables/codecs.md +++ /dev/null @@ -1,553 +0,0 @@ -# Custom Codecs - -In modern scientific research, data pipelines often involve complex workflows that -generate diverse data types. From high-dimensional imaging data to machine learning -models, these data types frequently exceed the basic representations supported by -traditional relational databases. For example: - -+ A lab working on neural connectivity might use graph objects to represent brain - networks. -+ Researchers processing raw imaging data might store custom objects for pre-processing - configurations. -+ Computational biologists might store fitted machine learning models or parameter - objects for downstream predictions. - -To handle these diverse needs, DataJoint provides the **Codec** system. It -enables researchers to store and retrieve complex, non-standard data types—like Python -objects or data structures—in a relational database while maintaining the -reproducibility, modularity, and query capabilities required for scientific workflows. - -## Overview - -Custom codecs define bidirectional conversion between: - -- **Python objects** (what your code works with) -- **Storage format** (what gets stored in the database) - -``` -┌─────────────────┐ encode() ┌─────────────────┐ -│ Python Object │ ───────────────► │ Storage Type │ -│ (e.g. Graph) │ │ (e.g. bytes) │ -└─────────────────┘ decode() └─────────────────┘ - ◄─────────────── -``` - -## Defining Custom Codecs - -Create a custom codec by subclassing `dj.Codec` and implementing the required -methods. 
Codecs auto-register when their class is defined: - -```python -import datajoint as dj -import networkx as nx - -class GraphCodec(dj.Codec): - """Custom codec for storing networkx graphs.""" - - # Required: unique identifier used in table definitions - name = "graph" - - def get_dtype(self, is_external: bool) -> str: - """Return the underlying storage type.""" - return "" # Delegate to blob for serialization - - def encode(self, graph, *, key=None, store_name=None): - """Convert graph to storable format (called on INSERT).""" - return { - 'nodes': list(graph.nodes(data=True)), - 'edges': list(graph.edges(data=True)), - } - - def decode(self, stored, *, key=None): - """Convert stored data back to graph (called on FETCH).""" - G = nx.Graph() - G.add_nodes_from(stored['nodes']) - G.add_edges_from(stored['edges']) - return G -``` - -### Required Components - -| Component | Description | -|-----------|-------------| -| `name` | Unique identifier used in table definitions with `` syntax | -| `get_dtype(is_external)` | Returns underlying storage type (e.g., `""`, `"bytes"`, `"json"`) | -| `encode(value, *, key=None, store_name=None)` | Converts Python object to storable format | -| `decode(stored, *, key=None)` | Converts stored data back to Python object | - -### Using Custom Codecs in Tables - -Once defined, use the codec in table definitions with angle brackets: - -```python -@schema -class Connectivity(dj.Manual): - definition = """ - conn_id : int - --- - conn_graph = null : # Uses the GraphCodec we defined - """ -``` - -Insert and fetch work seamlessly: - -```python -import networkx as nx - -# Insert - encode() is called automatically -g = nx.lollipop_graph(4, 2) -Connectivity.insert1({"conn_id": 1, "conn_graph": g}) - -# Fetch - decode() is called automatically -result = (Connectivity & "conn_id = 1").fetch1("conn_graph") -assert isinstance(result, nx.Graph) -``` - -## Auto-Registration - -Codecs automatically register when their class is defined. No decorator needed: - -```python -# This codec is registered automatically when the class is defined -class MyCodec(dj.Codec): - name = "mycodec" - ... -``` - -### Skipping Registration - -For abstract base classes that shouldn't be registered: - -```python -class BaseCodec(dj.Codec, register=False): - """Abstract base - not registered.""" - name = None - -class ConcreteCodec(BaseCodec): - name = "concrete" # This one IS registered - ... -``` - -### Listing Registered Codecs - -```python -# List all registered codec names -print(dj.list_codecs()) -``` - -## Validation - -Add data validation by overriding the `validate()` method. It's called automatically -before `encode()` during INSERT operations: - -```python -class PositiveArrayCodec(dj.Codec): - name = "positive_array" - - def get_dtype(self, is_external: bool) -> str: - return "" - - def validate(self, value): - """Ensure all values are positive.""" - import numpy as np - if not isinstance(value, np.ndarray): - raise TypeError(f"Expected numpy array, got {type(value).__name__}") - if np.any(value < 0): - raise ValueError("Array must contain only positive values") - - def encode(self, array, *, key=None, store_name=None): - return array - - def decode(self, stored, *, key=None): - return stored -``` - -## The `get_dtype()` Method - -The `get_dtype()` method specifies how data is stored. 
The `is_external` parameter -indicates whether the `@` modifier is present: - -```python -def get_dtype(self, is_external: bool) -> str: - """ - Args: - is_external: True if @ modifier present (e.g., ) - - Returns: - - A core type: "bytes", "json", "varchar(N)", etc. - - Another codec: "", "", etc. - """ -``` - -### Storage Type Options - -| Return Value | Use Case | Database Type | -|--------------|----------|---------------| -| `"bytes"` | Raw binary data | LONGBLOB | -| `"json"` | JSON-serializable data | JSON | -| `"varchar(N)"` | String representations | VARCHAR(N) | -| `"int32"` | Integer identifiers | INT | -| `""` | Serialized Python objects | Depends on internal/external | -| `""` | Large objects with deduplication | JSON (external only) | -| `""` | Chain to another codec | Varies | - -### External Storage - -For large data, use external storage with the `@` modifier: - -```python -class LargeArrayCodec(dj.Codec): - name = "large_array" - - def get_dtype(self, is_external: bool) -> str: - # Use hash-addressed external storage for large data - return "" if is_external else "" - - def encode(self, array, *, key=None, store_name=None): - import pickle - return pickle.dumps(array) - - def decode(self, stored, *, key=None): - import pickle - return pickle.loads(stored) -``` - -Usage: -```python -@schema -class Data(dj.Manual): - definition = ''' - id : int - --- - small_array : # Internal (in database) - big_array : # External (default store) - archive : # External (specific store) - ''' -``` - -## Codec Chaining - -Custom codecs can build on other codecs by returning `` from `get_dtype()`: - -```python -class CompressedGraphCodec(dj.Codec): - name = "compressed_graph" - - def get_dtype(self, is_external: bool) -> str: - return "" # Chain to the GraphCodec - - def encode(self, graph, *, key=None, store_name=None): - # Compress before passing to GraphCodec - return self._compress(graph) - - def decode(self, stored, *, key=None): - # GraphCodec's decode already ran, decompress result - return self._decompress(stored) -``` - -DataJoint automatically resolves the chain to find the final storage type. - -### How Chaining Works - -When DataJoint encounters ``: - -1. `CompressedGraphCodec.get_dtype()` returns `""` -2. `GraphCodec.get_dtype()` returns `""` -3. `BlobCodec.get_dtype()` returns `"bytes"` -4. Final storage type is `bytes` (LONGBLOB in MySQL) - -During INSERT, encoders run outer → inner: -1. `CompressedGraphCodec.encode()` → compressed graph -2. `GraphCodec.encode()` → edge list dict -3. `BlobCodec.encode()` → serialized bytes - -During FETCH, decoders run inner → outer (reverse order). - -## The Key Parameter - -The `key` parameter provides access to primary key values during encode/decode -operations. This is useful when the conversion depends on record context: - -```python -class ContextAwareCodec(dj.Codec): - name = "context_aware" - - def get_dtype(self, is_external: bool) -> str: - return "" - - def encode(self, value, *, key=None, store_name=None): - if key and key.get("version") == 2: - return self._encode_v2(value) - return self._encode_v1(value) - - def decode(self, stored, *, key=None): - if key and key.get("version") == 2: - return self._decode_v2(stored) - return self._decode_v1(stored) -``` - -## Publishing Codecs as Packages - -Custom codecs can be distributed as installable packages using Python entry points. -This allows codecs to be automatically discovered when the package is installed. 
- -### Package Structure - -``` -dj-graph-codecs/ -├── pyproject.toml -└── src/ - └── dj_graph_codecs/ - ├── __init__.py - └── codecs.py -``` - -### pyproject.toml - -```toml -[project] -name = "dj-graph-codecs" -version = "1.0.0" -dependencies = ["datajoint>=2.0", "networkx"] - -[project.entry-points."datajoint.codecs"] -graph = "dj_graph_codecs.codecs:GraphCodec" -weighted_graph = "dj_graph_codecs.codecs:WeightedGraphCodec" -``` - -### Codec Implementation - -```python -# src/dj_graph_codecs/codecs.py -import datajoint as dj -import networkx as nx - -class GraphCodec(dj.Codec): - name = "graph" - - def get_dtype(self, is_external: bool) -> str: - return "" - - def encode(self, graph, *, key=None, store_name=None): - return { - 'nodes': list(graph.nodes(data=True)), - 'edges': list(graph.edges(data=True)), - } - - def decode(self, stored, *, key=None): - G = nx.Graph() - G.add_nodes_from(stored['nodes']) - G.add_edges_from(stored['edges']) - return G - -class WeightedGraphCodec(dj.Codec): - name = "weighted_graph" - - def get_dtype(self, is_external: bool) -> str: - return "" - - def encode(self, graph, *, key=None, store_name=None): - return [(u, v, d) for u, v, d in graph.edges(data=True)] - - def decode(self, edges, *, key=None): - g = nx.Graph() - for u, v, d in edges: - g.add_edge(u, v, **d) - return g -``` - -### Usage After Installation - -```bash -pip install dj-graph-codecs -``` - -```python -# Codecs are automatically available after package installation -@schema -class MyTable(dj.Manual): - definition = """ - id : int - --- - network : - weighted_network : - """ -``` - -## Complete Example - -Here's a complete example demonstrating custom codecs for a neuroscience workflow: - -```python -import datajoint as dj -import numpy as np - -# Define custom codecs -class SpikeTrainCodec(dj.Codec): - """Efficient storage for sparse spike timing data.""" - name = "spike_train" - - def get_dtype(self, is_external: bool) -> str: - return "" - - def validate(self, value): - if not isinstance(value, np.ndarray): - raise TypeError("Expected numpy array of spike times") - if value.ndim != 1: - raise ValueError("Spike train must be 1-dimensional") - if len(value) > 1 and not np.all(np.diff(value) >= 0): - raise ValueError("Spike times must be sorted") - - def encode(self, spike_times, *, key=None, store_name=None): - # Store as differences (smaller values, better compression) - return np.diff(spike_times, prepend=0).astype(np.float32) - - def decode(self, stored, *, key=None): - # Reconstruct original spike times - return np.cumsum(stored).astype(np.float64) - - -class WaveformCodec(dj.Codec): - """Storage for spike waveform templates with metadata.""" - name = "waveform" - - def get_dtype(self, is_external: bool) -> str: - return "" - - def encode(self, waveform_dict, *, key=None, store_name=None): - return { - "data": waveform_dict["data"].astype(np.float32), - "sampling_rate": waveform_dict["sampling_rate"], - "channel_ids": list(waveform_dict["channel_ids"]), - } - - def decode(self, stored, *, key=None): - return { - "data": stored["data"].astype(np.float64), - "sampling_rate": stored["sampling_rate"], - "channel_ids": np.array(stored["channel_ids"]), - } - - -# Create schema and tables -schema = dj.schema("ephys_analysis") - -@schema -class Unit(dj.Manual): - definition = """ - unit_id : int - --- - spike_times : - waveform : - quality : enum('good', 'mua', 'noise') - """ - - -# Usage -spike_times = np.array([0.1, 0.15, 0.23, 0.45, 0.67, 0.89]) -waveform = { - "data": np.random.randn(82, 
4), - "sampling_rate": 30000, - "channel_ids": [10, 11, 12, 13], -} - -Unit.insert1({ - "unit_id": 1, - "spike_times": spike_times, - "waveform": waveform, - "quality": "good", -}) - -# Fetch - automatically decoded -result = (Unit & "unit_id = 1").fetch1() -print(f"Spike times: {result['spike_times']}") -print(f"Waveform shape: {result['waveform']['data'].shape}") -``` - -## Built-in Codecs - -DataJoint includes several built-in codecs: - -### `` - DataJoint Blob Serialization - -The `` codec provides DataJoint's native binary serialization. It supports: - -- NumPy arrays (compatible with MATLAB) -- Python dicts, lists, tuples, sets -- datetime objects, Decimals, UUIDs -- Nested data structures -- Optional compression - -```python -@schema -class ProcessedData(dj.Manual): - definition = """ - data_id : int - --- - results : # Internal (serialized in database) - large_results : # External (hash-addressed storage) - """ -``` - -### `` - Content-Addressed Storage - -Stores raw bytes using MD5 content hashing with automatic deduplication. -External storage only. - -### `` - Path-Addressed Storage - -Stores files and folders at paths derived from primary keys. Ideal for -Zarr arrays, HDF5 files, and multi-file outputs. External storage only. - -### `` - File Attachments - -Stores files with filename preserved. Supports internal and external storage. - -### `` - File References - -References existing files in configured stores without copying. -External storage only. - -## Best Practices - -1. **Choose descriptive codec names**: Use lowercase with underscores (e.g., `spike_train`, `graph_embedding`) - -2. **Select appropriate storage types**: Use `` for complex objects, `json` for simple structures, `` or `` for large data - -3. **Add validation**: Use `validate()` to catch data errors early - -4. **Document your codecs**: Include docstrings explaining the expected input/output formats - -5. **Handle None values**: Your encode/decode methods may receive `None` for nullable attributes - -6. **Consider versioning**: If your encoding format might change, include version information - -7. **Test round-trips**: Ensure `decode(encode(x)) == x` for all valid inputs - -```python -def test_graph_codec_roundtrip(): - import networkx as nx - g = nx.lollipop_graph(4, 2) - codec = GraphCodec() - - encoded = codec.encode(g) - decoded = codec.decode(encoded) - - assert set(g.edges) == set(decoded.edges) -``` - -## API Reference - -```python -import datajoint as dj - -# List all registered codecs -dj.list_codecs() - -# Get a codec instance -codec = dj.get_codec("blob") -codec = dj.get_codec("") # Angle brackets optional -codec = dj.get_codec("") # Store parameter stripped -``` - -For the complete Codec API specification, see [Codec Specification](codec-spec.md). diff --git a/docs/src/archive/design/tables/declare.md b/docs/src/archive/design/tables/declare.md deleted file mode 100644 index d4fb070a2..000000000 --- a/docs/src/archive/design/tables/declare.md +++ /dev/null @@ -1,242 +0,0 @@ -# Declaration Syntax - -## Creating Tables - -### Classes represent tables - -To make it easy to work with tables in MATLAB and Python, DataJoint programs create a -separate class for each table. -Computer programmers refer to this concept as -[object-relational mapping](https://en.wikipedia.org/wiki/Object-relational_mapping). -For example, the class `experiment.Subject` in the DataJoint client language may -correspond to the table called `subject` on the database server. 
-Users never need to see the database directly; they only interact with data in the -database by creating and interacting with DataJoint classes. - -#### Data tiers - -The table class must inherit from one of the following superclasses to indicate its -data tier: `dj.Lookup`, `dj.Manual`, `dj.Imported`, `dj.Computed`, or `dj.Part`. -See [tiers](tiers.md) and [master-part](./master-part.md). - -### Defining a table - -To define a DataJoint table in Python: - -1. Define a class inheriting from the appropriate DataJoint class: `dj.Lookup`, -`dj.Manual`, `dj.Imported` or `dj.Computed`. - -2. Decorate the class with the schema object (see [schema](../schema.md)) - -3. Define the class property `definition` to define the table heading. - -For example, the following code defines the table `Person`: - -```python -import datajoint as dj -schema = dj.Schema('alice_experiment') - -@schema -class Person(dj.Manual): - definition = ''' - username : varchar(20) # unique user name - --- - first_name : varchar(30) - last_name : varchar(30) - ''' -``` - -The `@schema` decorator uses the class name and the data tier to check whether an -appropriate table exists on the database. -If a table does not already exist, the decorator creates one on the database using the -definition property. -The decorator attaches the information about the table to the class, and then returns -the class. - -The class will become usable after you define the `definition` property as described in -[Table definition](#table-definition). - -#### DataJoint classes in Python - -DataJoint for Python is implemented through the use of classes providing access to the -actual tables stored on the database. -Since only a single table exists on the database for any class, interactions with all -instances of the class are equivalent. -As such, most methods can be called on the classes themselves rather than on an object, -for convenience. -Whether calling a DataJoint method on a class or on an instance, the result will only -depend on or apply to the corresponding table. -All of the basic functionality of DataJoint is built to operate on the classes -themselves, even when called on an instance. -For example, calling `Person.insert(...)` (on the class) and `Person.insert(...)` (on -an instance) both have the identical effect of inserting data into the table on the -database server. -DataJoint does not prevent a user from working with instances, but the workflow is -complete without the need for instantiation. -It is up to the user whether to implement additional functionality as class methods or -methods called on instances. - -### Valid class names - -Note that in both MATLAB and Python, the class names must follow the CamelCase compound -word notation: - -- start with a capital letter and -- contain only alphanumerical characters (no underscores). - -Examples of valid class names: - -`TwoPhotonScan`, `Scan2P`, `Ephys`, `MembraneVoltage` - -Invalid class names: - -`Two_photon_Scan`, `twoPhotonScan`, `2PhotonScan`, `membranePotential`, `membrane_potential` - -## Table Definition - -DataJoint models data as sets of **entities** with shared **attributes**, often -visualized as tables with rows and columns. -Each row represents a single entity and the values of all of its attributes. -Each column represents a single attribute with a name and a datatype, applicable to -entity in the table. -Unlike rows in a spreadsheet, entities in DataJoint don't have names or numbers: they -can only be identified by the values of their attributes. 
-Defining a table means defining the names and datatypes of the attributes as well as -the constraints to be applied to those attributes. -Both MATLAB and Python use the same syntax define tables. - -For example, the following code in defines the table `User`, that contains users of the -database: - -The table definition is contained in the `definition` property of the class. - -```python -@schema -class User(dj.Manual): - definition = """ - # database users - username : varchar(20) # unique user name - --- - first_name : varchar(30) - last_name : varchar(30) - role : enum('admin', 'contributor', 'viewer') - """ -``` - -This defines the class `User` that creates the table in the database and provides all -its data manipulation functionality. - -### Table creation on the database server - -Users do not need to do anything special to have a table created in the database. -Tables are created at the time of class definition. -In fact, table creation on the database is one of the jobs performed by the decorator -`@schema` of the class. - -### Changing the definition of an existing table - -Once the table is created in the database, the definition string has no further effect. -In other words, changing the definition string in the class of an existing table will -not actually update the table definition. -To change the table definition, one must first [drop](../drop.md) the existing table. -This means that all the data will be lost, and the new definition will be applied to -create the new empty table. - -Therefore, in the initial phases of designing a DataJoint pipeline, it is common to -experiment with variations of the design before populating it with substantial amounts -of data. - -It is possible to modify a table without dropping it. -This topic is covered separately. - -### Reverse-engineering the table definition - -DataJoint objects provide the `describe` method, which displays the table definition -used to define the table when it was created in the database. -This definition may differ from the definition string of the class if the definition -string has been edited after creation of the table. - -Examples - -```python -s = lab.User.describe() -``` - -## Definition Syntax - -The table definition consists of one or more lines. -Each line can be one of the following: - -- The optional first line starting with a `#` provides a description of the table's purpose. - It may also be thought of as the table's long title. -- A new attribute definition in any of the following forms (see -[Attributes](./attributes.md) for valid datatypes): - ``name : datatype`` - ``name : datatype # comment`` - ``name = default : datatype`` - ``name = default : datatype # comment`` -- The divider `---` (at least three hyphens) separating primary key attributes above -from secondary attributes below. -- A foreign key in the format `-> ReferencedTable`. - (See [Dependencies](dependencies.md).) - -For example, the table for Persons may have the following definition: - -```python -# Persons in the lab -username : varchar(16) # username in the database ---- -full_name : varchar(255) -start_date : date # date when joined the lab -``` - -This will define the table with attributes `username`, `full_name`, and `start_date`, -in which `username` is the [primary key](primary.md). - -### Attribute names - -Attribute names must be in lowercase and must start with a letter. -They can only contain alphanumerical characters and underscores. -The attribute name cannot exceed 64 characters. 
- -Valid attribute names - `first_name`, `two_photon_scan`, `scan_2p`, `two_photon_scan` - -Invalid attribute names - `firstName`, `first name`, `2photon_scan`, `two-photon_scan`, `TwoPhotonScan` - -Ideally, attribute names should be unique across all tables that are likely to be used -in queries together. -For example, tables often have attributes representing the start times of sessions, -recordings, etc. -Such attributes must be uniquely named in each table, such as `session_start_time` or -`recording_start_time`. - -### Default values - -Secondary attributes can be given default values. -A default value will be used for an attribute if no other value is given at the time -the entity is [inserted](../../manipulation/insert.md) into the table. -Generally, default values are numerical values or character strings. -Default values for dates must be given as strings as well, contained within quotes -(with the exception of `CURRENT_TIMESTAMP`). -Note that default values can only be used when inserting as a mapping. -Primary key attributes cannot have default values (with the exceptions of -`auto_increment` and `CURRENT_TIMESTAMP` attributes; see [primary-key](primary.md)). - -An attribute with a default value of `NULL` is called a **nullable attribute**. -A nullable attribute can be thought of as applying to all entities in a table but -having an optional *value* that may be absent in some entities. -Nullable attributes should *not* be used to indicate that an attribute is inapplicable -to some entities in a table (see [normalization](../normalization.md)). -Nullable attributes should be used sparingly to indicate optional rather than -inapplicable attributes that still apply to all entities in the table. -`NULL` is a special literal value and does not need to be enclosed in quotes. - -Here are some examples of attributes with default values: - -```python -failures = 0 : int -due_date = "2020-05-31" : date -additional_comments = NULL : varchar(256) -``` diff --git a/docs/src/archive/design/tables/dependencies.md b/docs/src/archive/design/tables/dependencies.md deleted file mode 100644 index e06278ee8..000000000 --- a/docs/src/archive/design/tables/dependencies.md +++ /dev/null @@ -1,241 +0,0 @@ -# Dependencies - -## Understanding dependencies - -A schema contains collections of tables of related data. -Accordingly, entities in one table often derive some of their meaning or context from -entities in other tables. -A **foreign key** defines a **dependency** of entities in one table on entities in -another within a schema. -In more complex designs, dependencies can even exist between entities in tables from -different schemas. -Dependencies play a functional role in DataJoint and do not simply label the structure -of a pipeline. -Dependencies provide entities in one table with access to data in another table and -establish certain constraints on entities containing a foreign key. - -A DataJoint pipeline, including the dependency relationships established by foreign -keys, can be visualized as a graph with nodes and edges. -The diagram of such a graph is called the **entity relationship diagram** or -[Diagram](../diagrams.md). -The nodes of the graph are tables and the edges connecting them are foreign keys. -The edges are directed and the overall graph is a **directed acyclic graph**, a graph -with no loops. 
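Such a diagram can be rendered directly from Python with `dj.Diagram`. A minimal sketch, assuming a schema object named `schema` (a single table class, such as `mp.Slice` defined further below, can also serve as the starting point):

```python
import datajoint as dj

# Render the dependency graph of an entire schema;
# in a Jupyter notebook the diagram displays inline.
dj.Diagram(schema)

# Start from one table and expand the view downstream;
# adding an integer expands the diagram that many levels.
dj.Diagram(mp.Slice) + 2
```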
- -For example, the Diagram below is the pipeline for multipatching experiments - -![mp-diagram](../../images/mp-diagram.png){: style="align:center"} - -The graph defines the direction of the workflow. -The tables at the top of the flow need to be populated first, followed by those tables -one step below and so forth until the last table is populated at the bottom of the -pipeline. -The top of the pipeline tends to be dominated by lookup tables (gray stars) and manual -tables (green squares). -The middle has many imported tables (blue triangles), and the bottom has computed -tables (red stars). - -## Defining a dependency - -Foreign keys are defined with arrows `->` in the [table definition](declare.md), -pointing to another table. - -A foreign key may be defined as part of the [primary-key](primary.md). - -In the Diagram, foreign keys from the primary key are shown as solid lines. -This means that the primary key of the referenced table becomes part of the primary key -of the new table. -A foreign key outside the primary key is indicated by dashed line in the ERD. - -For example, the following definition for the table `mp.Slice` has three foreign keys, -including one within the primary key. - -```python -# brain slice --> mp.Subject -slice_id : smallint # slice number within subject ---- --> mp.BrainRegion --> mp.Plane -slice_date : date # date of the slicing (not patching) -thickness : smallint unsigned # slice thickness in microns -experimenter : varchar(20) # person who performed this experiment -``` - -You can examine the resulting table heading with - -```python -mp.BrainSlice.heading -``` - -The heading of `mp.Slice` may look something like - -```python -subject_id : char(8) # experiment subject id -slice_id : smallint # slice number within subject ---- -brain_region : varchar(12) # abbreviated name for brain region -plane : varchar(12) # plane of section -slice_date : date # date of the slicing (not patching) -thickness : smallint unsigned # slice thickness in microns -experimenter : varchar(20) # person who performed this experiment -``` - -This displayed heading reflects the actual attributes in the table. -The foreign keys have been replaced by the primary key attributes of the referenced -tables, including their data types and comments. - -## How dependencies work - -The foreign key `-> A` in the definition of table `B` has the following effects: - -1. The primary key attributes of `A` are made part of `B`'s definition. -2. A referential constraint is created in `B` with reference to `A`. -3. If one does not already exist, an index is created to speed up searches in `B` for -matches to `A`. - (The reverse search is already fast because it uses the primary key of `A`.) - -A referential constraint means that an entity in `B` cannot exist without a matching -entity in `A`. -**Matching** means attributes in `B` that correspond to the primary key of `A` must -have the same values. -An attempt to insert an entity into `B` that does not have a matching counterpart in -`A` will fail. -Conversely, deleting an entity from `A` that has matching entities in `B` will result -in the deletion of those matching entities and so forth, recursively, downstream in the -pipeline. - -When `B` references `A` with a foreign key, one can say that `B` **depends** on `A`. -In DataJoint terms, `B` is the **dependent table** and `A` is the **referenced table** -with respect to the foreign key from `B` to `A`. 
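The following sketch illustrates the referential constraint with two toy tables (the names `A` and `B` and the schema object are illustrative only):

```python
@schema
class A(dj.Manual):
    definition = """
    a_id : int
    """

@schema
class B(dj.Manual):
    definition = """
    -> A
    b_id : int
    """

A.insert1({"a_id": 1})
B.insert1({"a_id": 1, "b_id": 0})    # succeeds: a matching entity exists in A
B.insert1({"a_id": 99, "b_id": 0})   # raises an error: no matching entity in A
(A & {"a_id": 1}).delete()           # deleting from A cascades to the matching entities in B
```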
- -Note to those already familiar with the theory of relational databases: The usage of -the words "depends" and "dependency" here should not be confused with the unrelated -concept of *functional dependencies* that is used to define normal forms. - -## Referential integrity - -Dependencies enforce the desired property of databases known as -**referential integrity**. -Referential integrity is the guarantee made by the data management process that related -data across the database remain present, correctly associated, and mutually consistent. -Guaranteeing referential integrity means enforcing the constraint that no entity can -exist in the database without all the other entities on which it depends. -An entity in table `B` depends on an entity in table `A` when they belong to them or -are computed from them. - -## Dependencies with renamed attributes - -In most cases, a dependency includes the primary key attributes of the referenced table -as they appear in its table definition. -Sometimes it can be helpful to choose a new name for a foreign key attribute that -better fits the context of the dependent table. -DataJoint provides the following [projection](../../query/project.md) syntax to rename -the primary key attributes when they are included in the new table. - -The dependency - -```python --> Table.project(new_attr='old_attr') -``` - -renames the primary key attribute `old_attr` of `Table` as `new_attr` before -integrating it into the table definition. -Any additional primary key attributes will retain their original names. -For example, the table `Experiment` may depend on table `User` but rename the `user` -attribute into `operator` as follows: - -```python --> User.proj(operator='user') -``` - -In the above example, an entity in the dependent table depends on exactly one entity in -the referenced table. -Sometimes entities may depend on multiple entities from the same table. -Such a design requires a way to distinguish between dependent attributes having the -same name in the reference table. -For example, a table for `Synapse` may reference the table `Cell` twice as -`presynaptic` and `postsynaptic`. -The table definition may appear as - -```python -# synapse between two cells --> Cell.proj(presynaptic='cell_id') --> Cell.proj(postsynaptic='cell_id') ---- -connection_strength : double # (pA) peak synaptic current -``` - -If the primary key of `Cell` is (`animal_id`, `slice_id`, `cell_id`), then the primary -key of `Synapse` resulting from the above definition will be (`animal_id`, `slice_id`, -`presynaptic`, `postsynaptic`). -Projection always returns all of the primary key attributes of a table, so `animal_id` -and `slice_id` are included, with their original names. - -Note that the design of the `Synapse` table above imposes the constraint that the -synapse can only be found between cells in the same animal and in the same slice. - -Allowing representation of synapses between cells from different slices requires the -renamimg of `slice_id` as well: - -```python -# synapse between two cells --> Cell(presynaptic_slice='slice_id', presynaptic_cell='cell_id') --> Cell(postsynaptic_slice='slice_id', postsynaptic_cell='cell_id') ---- -connection_strength : double # (pA) peak synaptic current -``` - -In this case, the primary key of `Synapse` will be (`animal_id`, `presynaptic_slice`, -`presynaptic_cell`, `postsynaptic_slice`, `postsynaptic_cell`). 
-This primary key still imposes the constraint that synapses can only form between cells -within the same animal but now allows connecting cells across different slices. - -In the Diagram, renamed foreign keys are shown as red lines with an additional dot node -in the middle to indicate that a renaming took place. - -## Foreign key options - -Note: Foreign key options are currently in development. - -Foreign keys allow the additional options `nullable` and `unique`, which can be -inserted in square brackets following the arrow. - -For example, in the following table definition - -```python -rig_id : char(4) # experimental rig ---- --> Person -``` - -each rig belongs to a person, but the table definition does not prevent one person -owning multiple rigs. -With the `unique` option, a person may only appear once in the entire table, which -means that no one person can own more than one rig. - -```python -rig_id : char(4) # experimental rig ---- --> [unique] Person -``` - -With the `nullable` option, a rig may not belong to anyone, in which case the foreign -key attributes for `Person` are set to `NULL`: - -```python -rig_id : char(4) # experimental rig ---- --> [nullable] Person -``` - -Finally with both `unique` and `nullable`, a rig may or may not be owned by anyone and -each person may own up to one rig. - -```python -rig_id : char(4) # experimental rig ---- --> [unique, nullable] Person -``` - -Foreign keys made from the primary key cannot be nullable but may be unique. diff --git a/docs/src/archive/design/tables/filepath.md b/docs/src/archive/design/tables/filepath.md deleted file mode 100644 index 05e9ca744..000000000 --- a/docs/src/archive/design/tables/filepath.md +++ /dev/null @@ -1,96 +0,0 @@ -# Filepath Datatype - -Note: Filepath Datatype is available as a preview feature in DataJoint Python v0.12. -This means that the feature is required to be explicitly enabled. To do so, make sure -to set the environment variable `FILEPATH_FEATURE_SWITCH=TRUE` prior to use. - -## Configuration & Usage - -Corresponding to issue -[#481](https://github.com/datajoint/datajoint-python/issues/481), -the `filepath` attribute type links DataJoint records to files already -managed outside of DataJoint. This can aid in sharing data with -other systems such as allowing an image viewer application to -directly use files from a DataJoint pipeline, or to allow downstream -tables to reference data which reside outside of DataJoint -pipelines. - -To define a table using the `filepath` datatype, an existing DataJoint -[store](../../sysadmin/external-store.md) should be created and then referenced in the -new table definition. For example, given a simple store: - -```python - dj.config['stores'] = { - 'data': { - 'protocol': 'file', - 'location': '/data', - 'stage': '/data' - } - } -``` - -we can define an `ScanImages` table as follows: - -```python -@schema -class ScanImages(dj.Manual): - definition = """ - -> Session - image_id: int - --- - image_path: filepath@data - """ -``` - -This table can now be used for tracking paths within the `/data` local directory. -For example: - -```python ->>> ScanImages.insert1((0, 0, '/data/images/image_0.tif')) ->>> (ScanImages() & {'session_id': 0}).fetch1(as_dict=True) -{'session_id': 0, 'image_id': 0, 'image_path': '/data/images/image_0.tif'} -``` - -As can be seen from the example, unlike [blob](blobs.md) records, file -paths are managed as path locations to the underlying file. 
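Because the fetched value is an ordinary path string (as in the example above), it can be handed directly to other file-based tools and libraries. A minimal sketch:

```python
# Fetch the stored path and open the file with ordinary file I/O;
# no DataJoint-specific handle is involved.
image_path = (ScanImages & {'session_id': 0, 'image_id': 0}).fetch1('image_path')

with open(image_path, 'rb') as f:
    header = f.read(16)   # e.g. peek at the TIFF header
```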
- -## Integrity Notes - -Unlike other data in DataJoint, data in `filepath` records are -deliberately intended for shared use outside of DataJoint. To help -ensure integrity of `filepath` records, DataJoint will record a -checksum of the file data on `insert`, and will verify this checksum -on `fetch`. However, since the underlying file data may be shared -with other applications, special care should be taken to ensure -records stored in `filepath` attributes are not modified outside -of the pipeline, or, if they are, that records in the pipeline are -updated accordingly. A safe method of changing `filepath` data is -as follows: - -1. Delete the `filepath` database record. - This will ensure that any downstream records in the pipeline depending - on the `filepath` record are purged from the database. -2. Modify `filepath` data. -3. Re-insert corresponding the `filepath` record. - This will add the record back to DataJoint with an updated file checksum. -4. Compute any downstream dependencies, if needed. - This will ensure that downstream results dependent on the `filepath` - record are updated to reflect the newer `filepath` contents. - -### Disable Fetch Verification - -Note: Skipping the checksum is not recommended as it ensures file integrity i.e. -downloaded files are not corrupted. With S3 stores, most of the time to complete a -`.fetch()` is from the file download itself as opposed to evaluating the checksum. This -option will primarily benefit `filepath` usage connected to a local `file` store. - -To disable checksums you can set a threshold in bytes -for when to stop evaluating checksums like in the example below: - -```python -dj.config["filepath_checksum_size_limit"] = 5 * 1024**3 # Skip for all files greater than 5GiB -``` - -The default is `None` which means it will always verify checksums. - - diff --git a/docs/src/archive/design/tables/indexes.md b/docs/src/archive/design/tables/indexes.md deleted file mode 100644 index 9d8148c36..000000000 --- a/docs/src/archive/design/tables/indexes.md +++ /dev/null @@ -1,97 +0,0 @@ -# Indexes - -Table indexes are data structures that allow fast lookups by an indexed attribute or -combination of attributes. - -In DataJoint, indexes are created by one of the three mechanisms: - -1. Primary key -2. Foreign key -3. Explicitly defined indexes - -The first two mechanisms are obligatory. Every table has a primary key, which serves as -an unique index. Therefore, restrictions by a primary key are very fast. Foreign keys -create additional indexes unless a suitable index already exists. - -## Indexes for single primary key tables - -Let’s say a mouse in the lab has a lab-specific ID but it also has a separate id issued -by the animal facility. - -```python -@schema -class Mouse(dj.Manual): - definition = """ - mouse_id : int # lab-specific ID - --- - tag_id : int # animal facility ID - """ -``` - -In this case, searching for a mouse by `mouse_id` is much faster than by `tag_id` -because `mouse_id` is a primary key, and is therefore indexed. - -To make searches faster on fields other than the primary key or a foreign key, you can -add a secondary index explicitly. - -Regular indexes are declared as `index(attr1, ..., attrN)` on a separate line anywhere in -the table declaration (below the primary key divide). - -Indexes can be declared with unique constraint as `unique index (attr1, ..., attrN)`. - -Let’s redeclare the table with a unique index on `tag_id`. 
- -```python -@schema -class Mouse(dj.Manual): - definition = """ - mouse_id : int # lab-specific ID - --- - tag_id : int # animal facility ID - unique index (tag_id) - """ -``` -Now, searches with `mouse_id` and `tag_id` are similarly fast. - -## Indexes for tables with multiple primary keys - -Let’s now imagine that rats in a lab are identified by the combination of `lab_name` and -`rat_id` in a table `Rat`. - -```python -@schema -class Rat(dj.Manual): - definition = """ - lab_name : char(16) - rat_id : int unsigned # lab-specific ID - --- - date_of_birth = null : date - """ -``` -Note that despite the fact that `rat_id` is in the index, searches by `rat_id` alone are not -helped by the index because it is not first in the index. This is similar to searching for -a word in a dictionary that orders words alphabetically. Searching by the first letters -of a word is easy but searching by the last few letters of a word requires scanning the -whole dictionary. - -In this table, the primary key is a unique index on the combination `(lab_name, rat_id)`. -Therefore searches on these attributes or on `lab_name` alone are fast. But this index -cannot help searches on `rat_id` alone. Similarly, searing by `date_of_birth` requires a -full-table scan and is inefficient. - -To speed up searches by the `rat_id` and `date_of_birth`, we can explicit indexes to -`Rat`: - -```python -@schema -class Rat2(dj.Manual): - definition = """ - lab_name : char(16) - rat_id : int unsigned # lab-specific ID - --- - date_of_birth = null : date - - index(rat_id) - index(date_of_birth) - """ -``` diff --git a/docs/src/archive/design/tables/lookup.md b/docs/src/archive/design/tables/lookup.md deleted file mode 100644 index 79b2c67ba..000000000 --- a/docs/src/archive/design/tables/lookup.md +++ /dev/null @@ -1,31 +0,0 @@ -# Lookup Tables - -Lookup tables contain basic facts that are not specific to an experiment and are fairly -persistent. -Their contents are typically small. -In GUIs, lookup tables are often used for drop-down menus or radio buttons. -In computed tables, they are often used to specify alternative methods for computations. -Lookup tables are commonly populated from their `contents` property. -In a [diagram](../diagrams.md) they are shown in gray. -The decision of which tables are lookup tables and which are manual can be somewhat -arbitrary. - -The table below is declared as a lookup table with its contents property provided to -generate entities. - -```python -@schema -class User(dj.Lookup): - definition = """ - # users in the lab - username : varchar(20) # user in the lab - --- - first_name : varchar(20) # user first name - last_name : varchar(20) # user last name - """ - contents = [ - ['cajal', 'Santiago', 'Cajal'], - ['hubel', 'David', 'Hubel'], - ['wiesel', 'Torsten', 'Wiesel'] - ] -``` diff --git a/docs/src/archive/design/tables/manual.md b/docs/src/archive/design/tables/manual.md deleted file mode 100644 index d97b6ce52..000000000 --- a/docs/src/archive/design/tables/manual.md +++ /dev/null @@ -1,47 +0,0 @@ -# Manual Tables - -Manual tables are populated during experiments through a variety of interfaces. -Not all manual information is entered by typing. -Automated software can enter it directly into the database. -What makes a manual table manual is that it does not perform any computations within -the DataJoint pipeline. 
- -The following code defines three manual tables `Animal`, `Session`, and `Scan`: - -```python -@schema -class Animal(dj.Manual): - definition = """ - # information about animal - animal_id : int # animal id assigned by the lab - --- - -> Species - date_of_birth=null : date # YYYY-MM-DD optional - sex='' : enum('M', 'F', '') # leave empty if unspecified - """ - -@schema -class Session(dj.Manual): - definition = """ - # Experiment Session - -> Animal - session : smallint # session number for the animal - --- - session_date : date # YYYY-MM-DD - -> User - -> Anesthesia - -> Rig - """ - -@schema -class Scan(dj.Manual): - definition = """ - # Two-photon imaging scan - -> Session - scan : smallint # scan number within the session - --- - -> Lens - laser_wavelength : decimal(5,1) # um - laser_power : decimal(4,1) # mW - """ -``` diff --git a/docs/src/archive/design/tables/master-part.md b/docs/src/archive/design/tables/master-part.md deleted file mode 100644 index d0f575e4d..000000000 --- a/docs/src/archive/design/tables/master-part.md +++ /dev/null @@ -1,112 +0,0 @@ -# Master-Part Relationship - -Often an entity in one table is inseparably associated with a group of entities in -another, forming a **master-part** relationship. -The master-part relationship ensures that all parts of a complex representation appear -together or not at all. -This has become one of the most powerful data integrity principles in DataJoint. - -As an example, imagine segmenting an image to identify regions of interest. -The resulting segmentation is inseparable from the ROIs that it produces. -In this case, the two tables might be called `Segmentation` and `Segmentation.ROI`. - -In Python, the master-part relationship is expressed by making the part a nested class -of the master. -The part is subclassed from `dj.Part` and does not need the `@schema` decorator. - -```python -@schema -class Segmentation(dj.Computed): - definition = """ # image segmentation - -> Image - """ - - class ROI(dj.Part): - definition = """ # Region of interest resulting from segmentation - -> Segmentation - roi : smallint # roi number - --- - roi_pixels : # indices of pixels - roi_weights : # weights of pixels - """ - - def make(self, key): - image = (Image & key).fetch1('image') - self.insert1(key) - count = itertools.count() - Segmentation.ROI.insert( - dict(key, roi=next(count), roi_pixel=roi_pixels, roi_weights=roi_weights) - for roi_pixels, roi_weights in mylib.segment(image)) -``` - -## Populating - -Master-part relationships can form in any data tier, but DataJoint observes them more -strictly for auto-populated tables. -To populate both the master `Segmentation` and the part `Segmentation.ROI`, it is -sufficient to call the `populate` method of the master: - -```python -Segmentation.populate() -``` - -Note that the entities in the master and the matching entities in the part are inserted -within a single `make` call of the master, which means that they are a processed inside -a single transactions: either all are inserted and committed or the entire transaction -is rolled back. -This ensures that partial results never appear in the database. - -For example, imagine that a segmentation is performed, but an error occurs halfway -through inserting the results. -If this situation were allowed to persist, then it might appear that 20 ROIs were -detected where 45 had actually been found. - -## Deleting - -To delete from a master-part pair, one should never delete from the part tables -directly. 
-The only valid method to delete from a part table is to delete the master. -This has been an unenforced rule, but upcoming versions of DataJoint will prohibit -direct deletes from the master table. -DataJoint's [delete](../../manipulation/delete.md) operation is also enclosed in a -transaction. - -Together, the rules of master-part relationships ensure a key aspect of data integrity: -results of computations involving multiple components and steps appear in their -entirety or not at all. - -## Multiple parts - -The master-part relationship cannot be chained or nested. -DataJoint does not allow part tables of other part tables per se. -However, it is common to have a master table with multiple part tables that depend on -each other. -For example: - -```python -@schema -class ArrayResponse(dj.Computed): -definition = """ -array: int -""" - -class ElectrodeResponse(dj.Part): -definition = """ --> master -electrode: int # electrode number on the probe -""" - -class ChannelResponse(dj.Part): -definition = """ --> ElectrodeResponse -channel: int ---- -response: # response of a channel -""" -``` - -Conceptually, one or more channels belongs to an electrode, and one or more electrodes -belong to an array. -This example assumes that information about an array's response (which consists -ultimately of the responses of multiple electrodes each consisting of multiple channel -responses) including it's electrodes and channels are entered together. diff --git a/docs/src/archive/design/tables/object.md b/docs/src/archive/design/tables/object.md deleted file mode 100644 index e2ed8bf25..000000000 --- a/docs/src/archive/design/tables/object.md +++ /dev/null @@ -1,357 +0,0 @@ -# Object Type - -The `object` type provides managed file and folder storage for DataJoint pipelines. Unlike `attach@store` and `filepath@store` which reference named stores, the `object` type uses a unified storage backend configured at the pipeline level. 
- -## Overview - -The `object` type supports both files and folders: - -- **Files**: Copied to storage at insert time, accessed via handle on fetch -- **Folders**: Entire directory trees stored as a unit (e.g., Zarr arrays) -- **Staged inserts**: Write directly to storage for large objects - -### Key Features - -- **Unified storage**: One storage backend per pipeline (local filesystem or cloud) -- **No hidden tables**: Metadata stored inline as JSON (simpler than `attach@store`) -- **fsspec integration**: Direct access for Zarr, xarray, and other array libraries -- **Immutable objects**: Content cannot be modified after insert - -## Configuration - -Configure object storage in `datajoint.json`: - -```json -{ - "object_storage": { - "project_name": "my_project", - "protocol": "s3", - "bucket": "my-bucket", - "location": "my_project", - "endpoint": "s3.amazonaws.com" - } -} -``` - -For local filesystem storage: - -```json -{ - "object_storage": { - "project_name": "my_project", - "protocol": "file", - "location": "/data/my_project" - } -} -``` - -### Configuration Options - -| Setting | Required | Description | -|---------|----------|-------------| -| `project_name` | Yes | Unique project identifier | -| `protocol` | Yes | Storage backend: `file`, `s3`, `gcs`, `azure` | -| `location` | Yes | Base path or bucket prefix | -| `bucket` | For cloud | Bucket name (S3, GCS, Azure) | -| `endpoint` | For S3 | S3 endpoint URL | -| `partition_pattern` | No | Path pattern with `{attribute}` placeholders | -| `token_length` | No | Random suffix length (default: 8, range: 4-16) | - -### Environment Variables - -Settings can be overridden via environment variables: - -```bash -DJ_OBJECT_STORAGE_PROTOCOL=s3 -DJ_OBJECT_STORAGE_BUCKET=my-bucket -DJ_OBJECT_STORAGE_LOCATION=my_project -``` - -## Table Definition - -Define an object attribute in your table: - -```python -@schema -class Recording(dj.Manual): - definition = """ - subject_id : int - session_id : int - --- - raw_data : object # managed file storage - processed : object # another object attribute - """ -``` - -Note: No `@store` suffix needed—storage is determined by pipeline configuration. - -## Insert Operations - -### Inserting Files - -Insert a file by providing its local path: - -```python -Recording.insert1({ - "subject_id": 123, - "session_id": 45, - "raw_data": "/local/path/to/recording.dat" -}) -``` - -The file is copied to object storage and the path is stored as JSON metadata. - -### Inserting Folders - -Insert an entire directory: - -```python -Recording.insert1({ - "subject_id": 123, - "session_id": 45, - "raw_data": "/local/path/to/data_folder/" -}) -``` - -### Inserting from Remote URLs - -Insert from cloud storage or HTTP sources—content is copied to managed storage: - -```python -# From S3 -Recording.insert1({ - "subject_id": 123, - "session_id": 45, - "raw_data": "s3://source-bucket/path/to/data.dat" -}) - -# From Google Cloud Storage (e.g., collaborator data) -Recording.insert1({ - "subject_id": 123, - "session_id": 45, - "neural_data": "gs://collaborator-bucket/shared/experiment.zarr" -}) - -# From HTTP/HTTPS -Recording.insert1({ - "subject_id": 123, - "session_id": 45, - "raw_data": "https://example.com/public/data.dat" -}) -``` - -Supported protocols: `s3://`, `gs://`, `az://`, `http://`, `https://` - -Remote sources may require credentials configured via environment variables or fsspec configuration files. 
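For S3 sources, one common approach (not specific to DataJoint) is to supply the standard AWS credential variables, which fsspec's S3 backend reads through boto's credential chain. The values below are placeholders and would normally be set in the shell or a credentials file rather than in code:

```python
import os

# Credentials for the *source* bucket being copied from.
os.environ["AWS_ACCESS_KEY_ID"] = "..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

Recording.insert1({
    "subject_id": 123,
    "session_id": 45,
    "raw_data": "s3://source-bucket/path/to/data.dat",
})
```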
- -### Inserting from Streams - -Insert from a file-like object with explicit extension: - -```python -with open("/local/path/data.bin", "rb") as f: - Recording.insert1({ - "subject_id": 123, - "session_id": 45, - "raw_data": (".bin", f) - }) -``` - -### Staged Insert (Direct Write) - -For large objects like Zarr arrays, use staged insert to write directly to storage without a local copy: - -```python -import zarr - -with Recording.staged_insert1 as staged: - # Set primary key values first - staged.rec['subject_id'] = 123 - staged.rec['session_id'] = 45 - - # Create Zarr array directly in object storage - z = zarr.open(staged.store('raw_data', '.zarr'), mode='w', shape=(10000, 10000)) - z[:] = compute_large_array() - - # Assign to record - staged.rec['raw_data'] = z - -# On successful exit: metadata computed, record inserted -# On exception: storage cleaned up, no record inserted -``` - -The `staged_insert1` context manager provides: - -- `staged.rec`: Dict for setting attribute values -- `staged.store(field, ext)`: Returns `fsspec.FSMap` for Zarr/xarray -- `staged.open(field, ext, mode)`: Returns file handle for writing -- `staged.fs`: Direct fsspec filesystem access - -## Fetch Operations - -Fetching an object attribute returns an `ObjectRef` handle: - -```python -record = Recording.fetch1() -obj = record["raw_data"] - -# Access metadata (no I/O) -print(obj.path) # Storage path -print(obj.size) # Size in bytes -print(obj.ext) # File extension (e.g., ".dat") -print(obj.is_dir) # True if folder -``` - -### Reading File Content - -```python -# Read entire file as bytes -content = obj.read() - -# Open as file object -with obj.open() as f: - data = f.read() -``` - -### Working with Folders - -```python -# List contents -contents = obj.listdir() - -# Walk directory tree -for root, dirs, files in obj.walk(): - print(root, files) - -# Open specific file in folder -with obj.open("subdir/file.dat") as f: - data = f.read() -``` - -### Downloading Files - -Download to local filesystem: - -```python -# Download entire object -local_path = obj.download("/local/destination/") - -# Download specific file from folder -local_path = obj.download("/local/destination/", "subdir/file.dat") -``` - -### Integration with Zarr and xarray - -The `ObjectRef` provides direct fsspec access: - -```python -import zarr -import xarray as xr - -record = Recording.fetch1() -obj = record["raw_data"] - -# Open as Zarr array -z = zarr.open(obj.store, mode='r') -print(z.shape) - -# Open with xarray -ds = xr.open_zarr(obj.store) - -# Access fsspec filesystem directly -fs = obj.fs -files = fs.ls(obj.full_path) -``` - -### Verifying Integrity - -Verify that stored content matches metadata: - -```python -try: - obj.verify() - print("Object integrity verified") -except IntegrityError as e: - print(f"Verification failed: {e}") -``` - -For files, this checks size (and hash if available). For folders, it validates the manifest. 
- -## Storage Structure - -Objects are stored with a deterministic path structure: - -``` -{location}/{schema}/{Table}/objects/{pk_attrs}/{field}_{token}{ext} -``` - -Example: -``` -my_project/my_schema/Recording/objects/subject_id=123/session_id=45/raw_data_Ax7bQ2kM.dat -``` - -### Partitioning - -Use `partition_pattern` to organize files by attributes: - -```json -{ - "object_storage": { - "partition_pattern": "{subject_id}/{session_id}" - } -} -``` - -This promotes specified attributes to the path root for better organization: - -``` -my_project/subject_id=123/session_id=45/my_schema/Recording/objects/raw_data_Ax7bQ2kM.dat -``` - -## Database Storage - -The `object` type is stored as a JSON column containing metadata: - -```json -{ - "path": "my_schema/Recording/objects/subject_id=123/raw_data_Ax7bQ2kM.dat", - "size": 12345, - "hash": null, - "ext": ".dat", - "is_dir": false, - "timestamp": "2025-01-15T10:30:00Z", - "mime_type": "application/octet-stream" -} -``` - -For folders, the metadata includes `item_count` and a manifest file is stored alongside the folder in object storage. - -## Comparison with Other Types - -| Feature | `attach@store` | `filepath@store` | `object` | -|---------|----------------|------------------|----------| -| Store config | Per-attribute | Per-attribute | Per-pipeline | -| Path control | DataJoint | User-managed | DataJoint | -| Hidden tables | Yes | Yes | **No** | -| Backend | File/S3 only | File/S3 only | fsspec (any) | -| Metadata storage | External table | External table | Inline JSON | -| Folder support | No | No | **Yes** | -| Direct write | No | No | **Yes** | - -## Delete Behavior - -When a record is deleted: - -1. Database record is deleted first (within transaction) -2. Storage file/folder deletion is attempted after commit -3. File deletion failures are logged but don't fail the transaction - -Orphaned files (from failed deletes or crashed inserts) can be cleaned up using maintenance utilities. - -## Best Practices - -1. **Use staged insert for large objects**: Avoid copying multi-gigabyte files through local storage -2. **Set primary keys before calling `store()`**: The storage path depends on primary key values -3. **Use meaningful extensions**: Extensions like `.zarr`, `.hdf5` help identify content type -4. **Verify after critical inserts**: Call `obj.verify()` for important data -5. **Configure partitioning for large datasets**: Improves storage organization and browsing diff --git a/docs/src/archive/design/tables/primary.md b/docs/src/archive/design/tables/primary.md deleted file mode 100644 index fc4f5b8e0..000000000 --- a/docs/src/archive/design/tables/primary.md +++ /dev/null @@ -1,178 +0,0 @@ -# Primary Key - -## Primary keys in DataJoint - -Entities in tables are neither named nor numbered. -DataJoint does not answer questions of the type "What is the 10th element of this table?" -Instead, entities are distinguished by the values of their attributes. -Furthermore, the entire entity is not required for identification. -In each table, a subset of its attributes are designated to be the **primary key**. -Attributes in the primary key alone are sufficient to differentiate any entity from any -other within the table. - -Each table must have exactly one -[primary key](http://en.wikipedia.org/wiki/Primary_key): a subset of its attributes -that uniquely identify each entity in the table. -The database uses the primary key to prevent duplicate entries, to relate data across -tables, and to accelerate data queries. 
-The choice of the primary key will determine how you identify entities. -Therefore, make the primary key **short**, **expressive**, and **persistent**. - -For example, mice in our lab are assigned unique IDs. -The mouse ID number `animal_id` of type `smallint` can serve as the primary key for the -table `Mice`. -An experiment performed on a mouse may be identified in the table `Experiments` by two -attributes: `animal_id` and `experiment_number`. - -DataJoint takes the concept of primary keys somewhat more seriously than other models -and query languages. -Even **table expressions**, i.e. those tables produced through operations on other -tables, have a well-defined primary key. -All operators on tables are designed in such a way that the results always have a -well-defined primary key. - -In all representations of tables in DataJoint, the primary key attributes are always -listed before other attributes and highlighted for emphasis (e.g. in a **bold** font or -marked with an asterisk \*) - -## Defining a primary key - -In table declarations, the primary key attributes always come first and are separated -from the other attributes with a line containing at least three hyphens. -For example, the following is the definition of a table containing database users where -`username` is the primary key. - -```python -# database users -username : varchar(20) # unique user name ---- -first_name : varchar(30) -last_name : varchar(30) -role : enum('admin', 'contributor', 'viewer') -``` - -## Entity integrity - -The primary key defines and enforces the desired property of databases known as -[entity integrity](../integrity.md). -**Entity integrity** ensures that there is a one-to-one and unambiguous mapping between -real-world entities and their representations in the database system. -The data management process must prevent any duplication or misidentification of -entities. - -To enforce entity integrity, DataJoint implements several rules: - -- Every table must have a primary key. -- Primary key attributes cannot have default values (with the exception of -`auto_increment` and `CURRENT_TIMESTAMP`; see below). -- Operators on tables are defined with respect to the primary key and preserve a -primary key in their results. - -## Datatypes in primary keys - -All integer types, dates, timestamps, and short character strings make good primary key -attributes. -Character strings are somewhat less suitable because they can be long and because they -may have invisible trailing spaces. -Floating-point numbers should be avoided because rounding errors may lead to -misidentification of entities. -Enums are okay as long as they do not need to be modified after -[dependencies](dependencies.md) are already created referencing the table. -Finally, DataJoint does not support blob types in primary keys. - -The primary key may be composite, i.e. comprising several attributes. -In DataJoint, hierarchical designs often produce tables whose primary keys comprise -many attributes. - -## Choosing primary key attributes - -A primary key comprising real-world attributes is a good choice when such real-world -attributes are already properly and permanently assigned. -Whatever characteristics are used to uniquely identify the actual entities can be used -to identify their representations in the database. - -If there are no attributes that could readily serve as a primary key, an artificial -attribute may be created solely for the purpose of distinguishing entities. 
-In such cases, the primary key created for management in the database must also be used -to uniquely identify the entities themselves. -If the primary key resides only in the database while entities remain indistinguishable -in the real world, then the process cannot ensure entity integrity. -When a primary key is created as part of data management rather than based on -real-world attributes, an institutional process must ensure the uniqueness and -permanence of such an identifier. - -For example, the U.S. government assigns every worker an identifying attribute, the -social security number. -However, the government must go to great lengths to ensure that this primary key is -assigned exactly once, by checking against other less convenient candidate keys (i.e. -the combination of name, parents' names, date of birth, place of birth, etc.). -Just like the SSN, well managed primary keys tend to get institutionalized and find -multiple uses. - -Your lab must maintain a system for uniquely identifying important entities. -For example, experiment subjects and experiment protocols must have unique IDs. -Use these as the primary keys in the corresponding tables in your DataJoint databases. - -### Using hashes as primary keys - -Some tables include too many attributes in their primary keys. -For example, the stimulus condition in a psychophysics experiment may have a dozen -parameters such that a change in any one of them makes a different valid stimulus -condition. -In such a case, all the attributes would need to be included in the primary key to -ensure entity integrity. -However, long primary keys make it difficult to reference individual entities. -To be most useful, primary keys need to be relatively short. - -This problem is effectively solved through the use of a hash of all the identifying -attributes as the primary key. -For example, MD5 or SHA-1 hash algorithms can be used for this purpose. -To keep their representations human-readable, they may be encoded in base-64 ASCII. -For example, the 128-bit MD5 hash can be represented by 21 base-64 ASCII characters, -but for many applications, taking the first 8 to 12 characters is sufficient to avoid -collisions. - -### `auto_increment` - -Some entities are created by the very action of being entered into the database. -The action of entering them into the database gives them their identity. -It is impossible to duplicate them since entering the same thing twice still means -creating two distinct entities. - -In such cases, the use of an auto-incremented primary key is warranted. -These are declared by adding the word `auto_increment` after the data type in the -declaration. -The datatype must be an integer. -Then the database will assign incrementing numbers at each insert. - -The example definition below defines an auto-incremented primary key - -```python -# log entries -entry_id : smallint auto_increment ---- -entry_text : varchar(4000) -entry_time = CURRENT_TIMESTAMP : timestamp(3) # automatic timestamp with millisecond precision -``` - -DataJoint passes `auto_increment` behavior to the underlying MySQL and therefore it has -the same limitation: it can only be used for tables with a single attribute in the -primary key. - -If you need to auto-increment an attribute in a composite primary key, you will need to -do so programmatically within a transaction to avoid collisions. - -For example, let’s say that you want to auto-increment `scan_idx` in a table called -`Scan` whose primary key is `(animal_id, session, scan_idx)`. 
-You must already have the values for `animal_id` and `session` in the dictionary `key`. -Then you can do the following: - -```python -U().aggr(Scan & key, next='max(scan_idx)+1') - -# or - -Session.aggr(Scan, next='max(scan_idx)+1') & key -``` - -Note that the first option uses a [universal set](../../query/universals.md). diff --git a/docs/src/archive/design/tables/storage-types-spec.md b/docs/src/archive/design/tables/storage-types-spec.md deleted file mode 100644 index 7157d4d42..000000000 --- a/docs/src/archive/design/tables/storage-types-spec.md +++ /dev/null @@ -1,892 +0,0 @@ -# Storage Types Redesign Spec - -## Overview - -This document defines a three-layer type architecture: - -1. **Native database types** - Backend-specific (`FLOAT`, `TINYINT UNSIGNED`, `LONGBLOB`). Discouraged for direct use. -2. **Core DataJoint types** - Standardized across backends, scientist-friendly (`float32`, `uint8`, `bool`, `json`). -3. **Codec Types** - Programmatic types with `encode()`/`decode()` semantics. Composable. - -``` -┌───────────────────────────────────────────────────────────────────┐ -│ Codec Types (Layer 3) │ -│ │ -│ Built-in: │ -│ User: ... │ -├───────────────────────────────────────────────────────────────────┤ -│ Core DataJoint Types (Layer 2) │ -│ │ -│ float32 float64 int64 uint64 int32 uint32 int16 uint16 │ -│ int8 uint8 bool uuid json bytes date datetime text │ -│ char(n) varchar(n) enum(...) decimal(n,f) │ -├───────────────────────────────────────────────────────────────────┤ -│ Native Database Types (Layer 1) │ -│ │ -│ MySQL: TINYINT SMALLINT INT BIGINT FLOAT DOUBLE ... │ -│ PostgreSQL: SMALLINT INTEGER BIGINT REAL DOUBLE PRECISION │ -│ (pass through with warning for non-standard types) │ -└───────────────────────────────────────────────────────────────────┘ -``` - -**Syntax distinction:** -- Core types: `int32`, `float64`, `varchar(255)` - no brackets -- Codec types: ``, ``, `` - angle brackets -- The `@` character indicates external storage (object store vs database) - -### OAS Storage Regions - -| Region | Path Pattern | Addressing | Use Case | -|--------|--------------|------------|----------| -| Object | `{schema}/{table}/{pk}/` | Primary key | Large objects, Zarr, HDF5 | -| Hash | `_hash/{hash}` | MD5 hash | Deduplicated blobs/files | - -### External References - -`` provides portable relative paths within configured stores with lazy ObjectRef access. -For arbitrary URLs that don't need ObjectRef semantics, use `varchar` instead. - -## Core DataJoint Types (Layer 2) - -Core types provide a standardized, scientist-friendly interface that works identically across -MySQL and PostgreSQL backends. Users should prefer these over native database types. 
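As an illustration of the intended usage under this proposal, a table definition written entirely with core types might look like the following sketch (the class and attribute names are hypothetical; the type mappings follow the tables below):

```python
class Session(dj.Manual):
    definition = """
    subject_id    : uint32          # core type; maps to INT UNSIGNED (MySQL) / BIGINT (PostgreSQL)
    session_idx   : uint16
    ---
    session_uuid  : uuid            # BINARY(16) in MySQL, native UUID in PostgreSQL
    start_time    : datetime        # stored as UTC by policy
    is_valid      : bool
    sampling_rate : float32
    notes = ""    : varchar(255)    # default value given with the = syntax
    """
```
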
- -**All core types are recorded in field comments using `:type:` syntax for reconstruction.** - -### Numeric Types - -| Core Type | Description | MySQL | PostgreSQL | -|-----------|-------------|-------|------------| -| `int8` | 8-bit signed | `TINYINT` | `SMALLINT` | -| `int16` | 16-bit signed | `SMALLINT` | `SMALLINT` | -| `int32` | 32-bit signed | `INT` | `INTEGER` | -| `int64` | 64-bit signed | `BIGINT` | `BIGINT` | -| `uint8` | 8-bit unsigned | `TINYINT UNSIGNED` | `SMALLINT` | -| `uint16` | 16-bit unsigned | `SMALLINT UNSIGNED` | `INTEGER` | -| `uint32` | 32-bit unsigned | `INT UNSIGNED` | `BIGINT` | -| `uint64` | 64-bit unsigned | `BIGINT UNSIGNED` | `NUMERIC(20)` | -| `float32` | 32-bit float | `FLOAT` | `REAL` | -| `float64` | 64-bit float | `DOUBLE` | `DOUBLE PRECISION` | -| `decimal(n,f)` | Fixed-point | `DECIMAL(n,f)` | `NUMERIC(n,f)` | - -### String Types - -| Core Type | Description | MySQL | PostgreSQL | -|-----------|-------------|-------|------------| -| `char(n)` | Fixed-length | `CHAR(n)` | `CHAR(n)` | -| `varchar(n)` | Variable-length | `VARCHAR(n)` | `VARCHAR(n)` | - -> **Note:** Native SQL `text` types (`text`, `tinytext`, `mediumtext`, `longtext`) are supported -> but not portable. Prefer `varchar(n)`, `json`, or `` for portable schemas. - -**Encoding:** All strings use UTF-8 (`utf8mb4` in MySQL, `UTF8` in PostgreSQL). -See [Encoding and Collation Policy](#encoding-and-collation-policy) for details. - -### Boolean - -| Core Type | Description | MySQL | PostgreSQL | -|-----------|-------------|-------|------------| -| `bool` | True/False | `TINYINT` | `BOOLEAN` | - -### Date/Time Types - -| Core Type | Description | MySQL | PostgreSQL | -|-----------|-------------|-------|------------| -| `date` | Date only | `DATE` | `DATE` | -| `datetime` | Date and time | `DATETIME` | `TIMESTAMP` | - -**Timezone policy:** All `datetime` values should be stored as **UTC**. Timezone conversion is a -presentation concern handled by the application layer, not the database. This ensures: -- Reproducible computations regardless of server or client timezone settings -- Simple arithmetic on temporal values (no DST ambiguity) -- Portable data across systems and regions - -Use `CURRENT_TIMESTAMP` for auto-populated creation times: -``` -created_at : datetime = CURRENT_TIMESTAMP -``` - -### Binary Types - -The core `bytes` type stores raw bytes without any serialization. Use the `` codec -for serialized Python objects. - -| Core Type | Description | MySQL | PostgreSQL | -|-----------|-------------|-------|------------| -| `bytes` | Raw bytes | `LONGBLOB` | `BYTEA` | - -### Other Types - -| Core Type | Description | MySQL | PostgreSQL | -|-----------|-------------|-------|------------| -| `json` | JSON document | `JSON` | `JSONB` | -| `uuid` | UUID | `BINARY(16)` | `UUID` | -| `enum(...)` | Enumeration | `ENUM(...)` | `CREATE TYPE ... AS ENUM` | - -### Native Passthrough Types - -Users may use native database types directly (e.g., `mediumint`, `tinyblob`), -but these will generate a warning about non-standard usage. Native types are not recorded -in field comments and may have portability issues across database backends. - -### Type Modifiers Policy - -DataJoint table definitions have their own syntax for constraints and metadata. 
SQL type -modifiers are **not allowed** in type specifications because they conflict with DataJoint's -declarative syntax: - -| Modifier | Status | DataJoint Alternative | -|----------|--------|----------------------| -| `NOT NULL` / `NULL` | ❌ Not allowed | Use `= NULL` for nullable; omit default for required | -| `DEFAULT value` | ❌ Not allowed | Use `= value` syntax before the type | -| `PRIMARY KEY` | ❌ Not allowed | Position above `---` line | -| `UNIQUE` | ❌ Not allowed | Use DataJoint index syntax | -| `COMMENT 'text'` | ❌ Not allowed | Use `# comment` syntax | -| `CHARACTER SET` | ❌ Not allowed | Database-level configuration | -| `COLLATE` | ❌ Not allowed | Database-level configuration | -| `AUTO_INCREMENT` | ⚠️ Discouraged | Allowed with native types only, generates warning | -| `UNSIGNED` | ✅ Allowed | Part of type semantics (use `uint*` core types) | - -**Nullability and defaults:** DataJoint handles nullability through the default value syntax. -An attribute is nullable if and only if its default is `NULL`: - -``` -# Required (NOT NULL, no default) -name : varchar(100) - -# Nullable (default is NULL) -nickname = NULL : varchar(100) - -# Required with default value -status = "active" : varchar(20) -``` - -**Auto-increment policy:** DataJoint discourages `AUTO_INCREMENT` / `SERIAL` because: -- Breaks reproducibility (IDs depend on insertion order) -- Makes pipelines non-deterministic -- Complicates data migration and replication -- Primary keys should be meaningful, not arbitrary - -If required, use native types: `int auto_increment` or `serial` (with warning). - -### Encoding and Collation Policy - -Character encoding and collation are **database-level configuration**, not part of type -definitions. This ensures consistent behavior across all tables and simplifies portability. - -**Configuration** (in `dj.config` or `datajoint.json`): -```json -{ - "database.charset": "utf8mb4", - "database.collation": "utf8mb4_bin" -} -``` - -**Defaults:** - -| Setting | MySQL | PostgreSQL | -|---------|-------|------------| -| Charset | `utf8mb4` | `UTF8` | -| Collation | `utf8mb4_bin` | `C` | - -**Policy:** -- **UTF-8 required**: DataJoint validates charset is UTF-8 compatible at connection time -- **Case-sensitive by default**: Binary collation (`utf8mb4_bin` / `C`) ensures predictable comparisons -- **No per-column overrides**: `CHARACTER SET` and `COLLATE` are rejected in type definitions -- **Like timezone**: Encoding is infrastructure configuration, not part of the data model - -## Codec Types (Layer 3) - -Codec types provide `encode()`/`decode()` semantics on top of core types. They are -composable and can be built-in or user-defined. - -### Storage Mode: `@` Convention - -The `@` character in codec syntax indicates **external storage** (object store): - -- **No `@`**: Internal storage (database) - e.g., ``, `` -- **`@` present**: External storage (object store) - e.g., ``, `` -- **`@` alone**: Use default store - e.g., `` -- **`@name`**: Use named store - e.g., `` - -Some codecs support both modes (``, ``), others are external-only (``, ``, ``). - -### Codec Base Class - -Codecs auto-register when subclassed using Python's `__init_subclass__` mechanism. -No decorator is needed. - -```python -from abc import ABC, abstractmethod -from typing import Any - -# Global codec registry -_codec_registry: dict[str, "Codec"] = {} - - -class Codec(ABC): - """ - Base class for codec types. Subclasses auto-register by name. - - Requires Python 3.10+. 
- """ - name: str | None = None # Must be set by concrete subclasses - - def __init_subclass__(cls, *, register: bool = True, **kwargs): - """Auto-register concrete codecs when subclassed.""" - super().__init_subclass__(**kwargs) - - if not register: - return # Skip registration for abstract bases - - if cls.name is None: - return # Skip registration if no name (abstract) - - if cls.name in _codec_registry: - existing = _codec_registry[cls.name] - if type(existing) is not cls: - raise DataJointError( - f"Codec <{cls.name}> already registered by {type(existing).__name__}" - ) - return # Same class, idempotent - - _codec_registry[cls.name] = cls() - - def get_dtype(self, is_external: bool) -> str: - """ - Return the storage dtype for this codec. - - Args: - is_external: True if @ modifier present (external storage) - - Returns: - A core type (e.g., "bytes", "json") or another codec (e.g., "") - """ - raise NotImplementedError - - @abstractmethod - def encode(self, value: Any, *, key: dict | None = None, store_name: str | None = None) -> Any: - """Encode Python value for storage.""" - ... - - @abstractmethod - def decode(self, stored: Any, *, key: dict | None = None) -> Any: - """Decode stored value back to Python.""" - ... - - def validate(self, value: Any) -> None: - """Optional validation before encoding. Override to add constraints.""" - pass - - -def list_codecs() -> list[str]: - """Return list of registered codec names.""" - return sorted(_codec_registry.keys()) - - -def get_codec(name: str) -> Codec: - """Get codec by name. Raises DataJointError if not found.""" - if name not in _codec_registry: - raise DataJointError(f"Unknown codec: <{name}>") - return _codec_registry[name] -``` - -**Usage - no decorator needed:** - -```python -class GraphCodec(dj.Codec): - """Auto-registered as .""" - name = "graph" - - def get_dtype(self, is_external: bool) -> str: - return "" - - def encode(self, graph, *, key=None, store_name=None): - return {'nodes': list(graph.nodes()), 'edges': list(graph.edges())} - - def decode(self, stored, *, key=None): - import networkx as nx - G = nx.Graph() - G.add_nodes_from(stored['nodes']) - G.add_edges_from(stored['edges']) - return G -``` - -**Skip registration for abstract bases:** - -```python -class ExternalOnlyCodec(dj.Codec, register=False): - """Abstract base for external-only codecs. Not registered.""" - - def get_dtype(self, is_external: bool) -> str: - if not is_external: - raise DataJointError(f"<{self.name}> requires @ (external only)") - return "json" -``` - -### Codec Resolution and Chaining - -Codecs resolve to core types through chaining. The `get_dtype(is_external)` method -returns the appropriate dtype based on storage mode: - -``` -Resolution at declaration time: - - → get_dtype(False) → "bytes" → LONGBLOB/BYTEA - → get_dtype(True) → "" → json → JSON/JSONB - → get_dtype(True) → "" → json (store=cold) - - → get_dtype(False) → "bytes" → LONGBLOB/BYTEA - → get_dtype(True) → "" → json → JSON/JSONB - - → get_dtype(True) → "json" → JSON/JSONB - → get_dtype(False) → ERROR (external only) - - → get_dtype(True) → "json" → JSON/JSONB - → get_dtype(True) → "json" → JSON/JSONB -``` - -### `` / `` - Path-Addressed Storage - -**Built-in codec. 
External only.** - -OAS (Object-Augmented Schema) storage for files and folders: - -- Path derived from primary key: `{schema}/{table}/{pk}/{attribute}/` -- One-to-one relationship with table row -- Deleted when row is deleted -- Returns `ObjectRef` for lazy access -- Supports direct writes (Zarr, HDF5) via fsspec -- **dtype**: `json` (stores path, store name, metadata) - -```python -class Analysis(dj.Computed): - definition = """ - -> Recording - --- - results : # default store - archive : # specific store - """ -``` - -#### Implementation - -```python -class ObjectCodec(dj.Codec): - """Path-addressed OAS storage. External only.""" - name = "object" - - def get_dtype(self, is_external: bool) -> str: - if not is_external: - raise DataJointError(" requires @ (external storage only)") - return "json" - - def encode(self, value, *, key=None, store_name=None) -> dict: - store = get_store(store_name or dj.config['stores']['default']) - path = self._compute_path(key) # {schema}/{table}/{pk}/{attr}/ - store.put(path, value) - return {"path": path, "store": store_name, ...} - - def decode(self, stored: dict, *, key=None) -> ObjectRef: - return ObjectRef(store=get_store(stored["store"]), path=stored["path"]) -``` - -### `` / `` - Hash-Addressed Storage - -**Built-in codec. External only.** - -Hash-addressed storage with deduplication: - -- **Single blob only**: stores a single file or serialized object (not folders) -- **Per-project scope**: content is shared across all schemas in a project (not per-schema) -- Path derived from content hash: `_hash/{hash[:2]}/{hash[2:4]}/{hash}` -- Many-to-one: multiple rows (even across schemas) can reference same content -- Reference counted for garbage collection -- Deduplication: identical content stored once across the entire project -- For folders/complex objects, use `object` type instead -- **dtype**: `json` (stores hash, store name, size, metadata) - -``` -store_root/ -├── {schema}/{table}/{pk}/ # object storage (path-addressed by PK) -│ └── {attribute}/ -│ -└── _hash/ # content storage (hash-addressed) - └── {hash[:2]}/{hash[2:4]}/{hash} -``` - -#### Implementation - -```python -class HashCodec(dj.Codec): - """Hash-addressed storage. External only.""" - name = "hash" - - def get_dtype(self, is_external: bool) -> str: - if not is_external: - raise DataJointError(" requires @ (external storage only)") - return "json" - - def encode(self, data: bytes, *, key=None, store_name=None) -> dict: - """Store content, return metadata as JSON.""" - hash_id = hashlib.md5(data).hexdigest() # 32-char hex - store = get_store(store_name or dj.config['stores']['default']) - path = f"_hash/{hash_id[:2]}/{hash_id[2:4]}/{hash_id}" - - if not store.exists(path): - store.put(path, data) - - # Metadata stored in JSON column (no separate registry) - return {"hash": hash_id, "store": store_name, "size": len(data)} - - def decode(self, stored: dict, *, key=None) -> bytes: - """Retrieve content by hash.""" - store = get_store(stored["store"]) - path = f"_hash/{stored['hash'][:2]}/{stored['hash'][2:4]}/{stored['hash']}" - return store.get(path) -``` - -#### Database Column - -The `` type stores JSON metadata: - -```sql --- content column (MySQL) -features JSON NOT NULL --- Contains: {"hash": "abc123...", "store": "main", "size": 12345} - --- content column (PostgreSQL) -features JSONB NOT NULL -``` - -### `` - Portable External Reference - -**Built-in codec. 
External only (store required).** - -Relative path references within configured stores: - -- **Relative paths**: paths within a configured store (portable across environments) -- **Store-aware**: resolves paths against configured store backend -- Returns `ObjectRef` for lazy access via fsspec -- Stores optional checksum for verification -- **dtype**: `json` (stores path, store name, checksum, metadata) - -**Key benefit**: Portability. The path is relative to the store, so pipelines can be moved -between environments (dev → prod, cloud → local) by changing store configuration without -updating data. - -```python -class RawData(dj.Manual): - definition = """ - session_id : int32 - --- - recording : # relative path within 'main' store - """ - -# Insert - user provides relative path within the store -table.insert1({ - 'session_id': 1, - 'recording': 'experiment_001/data.nwb' # relative to main store root -}) - -# Fetch - returns ObjectRef (lazy) -row = (table & 'session_id=1').fetch1() -ref = row['recording'] # ObjectRef -ref.download('/local/path') # explicit download -ref.open() # fsspec streaming access -``` - -#### When to Use `` vs `varchar` - -| Use Case | Recommended Type | -|----------|------------------| -| Need ObjectRef/lazy access | `` | -| Need portability (relative paths) | `` | -| Want checksum verification | `` | -| Just storing a URL string | `varchar` | -| External URLs you don't control | `varchar` | - -For arbitrary URLs (S3, HTTP, etc.) where you don't need ObjectRef semantics, -just use `varchar`. A string is simpler and more transparent. - -#### Implementation - -```python -class FilepathCodec(dj.Codec): - """Store-relative file references. External only.""" - name = "filepath" - - def get_dtype(self, is_external: bool) -> str: - if not is_external: - raise DataJointError(" requires @store") - return "json" - - def encode(self, relative_path: str, *, key=None, store_name=None) -> dict: - """Register reference to file in store.""" - store = get_store(store_name) # store_name required for filepath - return {'path': relative_path, 'store': store_name} - - def decode(self, stored: dict, *, key=None) -> ObjectRef: - """Return ObjectRef for lazy access.""" - return ObjectRef(store=get_store(stored['store']), path=stored['path']) -``` - -#### Database Column - -```sql --- filepath column (MySQL) -recording JSON NOT NULL --- Contains: {"path": "experiment_001/data.nwb", "store": "main", "checksum": "...", "size": ...} - --- filepath column (PostgreSQL) -recording JSONB NOT NULL -``` - -#### Key Differences from Legacy `filepath@store` (now ``) - -| Feature | Legacy | New | -|---------|--------|-----| -| Access | Copy to local stage | ObjectRef (lazy) | -| Copying | Automatic | Explicit via `ref.download()` | -| Streaming | No | Yes via `ref.open()` | -| Paths | Relative | Relative (unchanged) | -| Store param | Required (`@store`) | Required (`@store`) | - -## Database Types - -### `json` - Cross-Database JSON Type - -JSON storage compatible across MySQL and PostgreSQL: - -```sql --- MySQL -column_name JSON NOT NULL - --- PostgreSQL (uses JSONB for better indexing) -column_name JSONB NOT NULL -``` - -The `json` database type: -- Used as dtype by built-in codecs (``, ``, ``) -- Stores arbitrary JSON-serializable data -- Automatically uses appropriate type for database backend -- Supports JSON path queries where available - -## Built-in Codecs - -### `` / `` - Serialized Python Objects - -**Supports both internal and external storage.** - -Serializes Python objects (NumPy arrays, 
dicts, lists, etc.) using DataJoint's -blob format. Compatible with MATLAB. - -- **``**: Stored in database (`bytes` → `LONGBLOB`/`BYTEA`) -- **``**: Stored externally via `` with deduplication -- **``**: Stored in specific named store - -```python -class BlobCodec(dj.Codec): - """Serialized Python objects. Supports internal and external.""" - name = "blob" - - def get_dtype(self, is_external: bool) -> str: - return "" if is_external else "bytes" - - def encode(self, value, *, key=None, store_name=None) -> bytes: - from . import blob - return blob.pack(value, compress=True) - - def decode(self, stored, *, key=None) -> Any: - from . import blob - return blob.unpack(stored) -``` - -Usage: -```python -class ProcessedData(dj.Computed): - definition = """ - -> RawData - --- - small_result : # internal (in database) - large_result : # external (default store) - archive_result : # external (specific store) - """ -``` - -### `` / `` - File Attachments - -**Supports both internal and external storage.** - -Stores files with filename preserved. On fetch, extracts to configured download path. - -- **``**: Stored in database (`bytes` → `LONGBLOB`/`BYTEA`) -- **``**: Stored externally via `` with deduplication -- **``**: Stored in specific named store - -```python -class AttachCodec(dj.Codec): - """File attachment with filename. Supports internal and external.""" - name = "attach" - - def get_dtype(self, is_external: bool) -> str: - return "" if is_external else "bytes" - - def encode(self, filepath, *, key=None, store_name=None) -> bytes: - path = Path(filepath) - return path.name.encode() + b"\0" + path.read_bytes() - - def decode(self, stored, *, key=None) -> str: - filename, contents = stored.split(b"\0", 1) - filename = filename.decode() - download_path = Path(dj.config['download_path']) / filename - download_path.write_bytes(contents) - return str(download_path) -``` - -Usage: -```python -class Attachments(dj.Manual): - definition = """ - attachment_id : int32 - --- - config : # internal (small file in DB) - data_file : # external (default store) - archive : # external (specific store) - """ -``` - -## User-Defined Codecs - -Users can define custom codecs for domain-specific data: - -```python -class GraphCodec(dj.Codec): - """Store NetworkX graphs. Internal only (no external support).""" - name = "graph" - - def get_dtype(self, is_external: bool) -> str: - if is_external: - raise DataJointError(" does not support external storage") - return "" # Chain to blob for serialization - - def encode(self, graph, *, key=None, store_name=None): - return {'nodes': list(graph.nodes()), 'edges': list(graph.edges())} - - def decode(self, stored, *, key=None): - import networkx as nx - G = nx.Graph() - G.add_nodes_from(stored['nodes']) - G.add_edges_from(stored['edges']) - return G -``` - -Custom codecs can support both modes by returning different dtypes: - -```python -class ImageCodec(dj.Codec): - """Store images. 
Supports both internal and external.""" - name = "image" - - def get_dtype(self, is_external: bool) -> str: - return "" if is_external else "bytes" - - def encode(self, image, *, key=None, store_name=None) -> bytes: - # Convert PIL Image to PNG bytes - buffer = io.BytesIO() - image.save(buffer, format='PNG') - return buffer.getvalue() - - def decode(self, stored: bytes, *, key=None): - return PIL.Image.open(io.BytesIO(stored)) -``` - -## Storage Comparison - -| Type | get_dtype | Resolves To | Storage Location | Dedup | Returns | -|------|-----------|-------------|------------------|-------|---------| -| `` | `bytes` | `LONGBLOB`/`BYTEA` | Database | No | Python object | -| `` | `` | `json` | `_hash/{hash}` | Yes | Python object | -| `` | `` | `json` | `_hash/{hash}` | Yes | Python object | -| `` | `bytes` | `LONGBLOB`/`BYTEA` | Database | No | Local file path | -| `` | `` | `json` | `_hash/{hash}` | Yes | Local file path | -| `` | `` | `json` | `_hash/{hash}` | Yes | Local file path | -| `` | `json` | `JSON`/`JSONB` | `{schema}/{table}/{pk}/` | No | ObjectRef | -| `` | `json` | `JSON`/`JSONB` | `{schema}/{table}/{pk}/` | No | ObjectRef | -| `` | `json` | `JSON`/`JSONB` | `_hash/{hash}` | Yes | bytes | -| `` | `json` | `JSON`/`JSONB` | `_hash/{hash}` | Yes | bytes | -| `` | `json` | `JSON`/`JSONB` | Configured store | No | ObjectRef | - -## Garbage Collection for Hash Storage - -Hash metadata (hash, store, size) is stored directly in each table's JSON column - no separate -registry table is needed. Garbage collection scans all tables to find referenced hashes: - -```python -def garbage_collect(store_name): - """Remove hash-addressed data not referenced by any table.""" - # Scan store for all hash files - store = get_store(store_name) - all_hashes = set(store.list_hashes()) # from _hash/ directory - - # Scan all tables for referenced hashes - referenced = set() - for schema in project.schemas: - for table in schema.tables: - for attr in table.heading.attributes: - if uses_hash_storage(attr): # , , - for row in table.fetch(attr.name): - if row and row.get('store') == store_name: - referenced.add(row['hash']) - - # Delete orphaned files - for hash_id in (all_hashes - referenced): - store.delete(hash_path(hash_id)) -``` - -## Built-in Codec Comparison - -| Feature | `` | `` | `` | `` | `` | -|---------|----------|------------|-------------|--------------|---------------| -| Storage modes | Both | Both | External only | External only | External only | -| Internal dtype | `bytes` | `bytes` | N/A | N/A | N/A | -| External dtype | `` | `` | `json` | `json` | `json` | -| Addressing | Hash | Hash | Primary key | Hash | Relative path | -| Deduplication | Yes (external) | Yes (external) | No | Yes | No | -| Structure | Single blob | Single file | Files, folders | Single blob | Any | -| Returns | Python object | Local path | ObjectRef | bytes | ObjectRef | -| GC | Ref counted | Ref counted | With row | Ref counted | User managed | - -**When to use each:** -- **``**: Serialized Python objects (NumPy arrays, dicts). Use `` for large/duplicated data -- **``**: File attachments with filename preserved. Use `` for large files -- **``**: Large/complex file structures (Zarr, HDF5) where DataJoint controls organization -- **``**: Raw bytes with deduplication (typically used via `` or ``) -- **``**: Portable references to externally-managed files -- **`varchar`**: Arbitrary URLs/paths where ObjectRef semantics aren't needed - -## Key Design Decisions - -1. 
**Three-layer architecture**: - - Layer 1: Native database types (backend-specific, discouraged) - - Layer 2: Core DataJoint types (standardized, scientist-friendly) - - Layer 3: Codec types (encode/decode, composable) -2. **Core types are scientist-friendly**: `float32`, `uint8`, `bool`, `bytes` instead of `FLOAT`, `TINYINT UNSIGNED`, `LONGBLOB` -3. **Codecs use angle brackets**: ``, ``, `` - distinguishes from core types -4. **`@` indicates external storage**: No `@` = database, `@` present = object store -5. **`get_dtype(is_external)` method**: Codecs resolve dtype at declaration time based on storage mode -6. **Codecs are composable**: `` uses ``, which uses `json` -7. **Built-in external codecs use JSON dtype**: Stores metadata (path, hash, store name, etc.) -8. **Two OAS regions**: object (PK-addressed) and hash (hash-addressed) within managed stores -9. **Filepath for portability**: `` uses relative paths within stores for environment portability -10. **No `uri` type**: For arbitrary URLs, use `varchar`—simpler and more transparent -11. **Naming conventions**: - - `@` = external storage (object store) - - No `@` = internal storage (database) - - `@` alone = default store - - `@name` = named store -12. **Dual-mode codecs**: `` and `` support both internal and external storage -13. **External-only codecs**: ``, ``, `` require `@` -14. **Transparent access**: Codecs return Python objects or file paths -15. **Lazy access**: `` and `` return ObjectRef -16. **MD5 for content hashing**: See [Hash Algorithm Choice](#hash-algorithm-choice) below -17. **No separate registry**: Hash metadata stored in JSON columns, not a separate table -18. **Auto-registration via `__init_subclass__`**: Codecs register automatically when subclassed—no decorator needed. Use `register=False` for abstract bases. Requires Python 3.10+. - -### Hash Algorithm Choice - -Content-addressed storage uses **MD5** (128-bit, 32-char hex) rather than SHA256 (256-bit, 64-char hex). - -**Rationale:** - -1. **Practical collision resistance is sufficient**: The birthday bound for MD5 is ~2^64 operations - before 50% collision probability. No scientific project will store anywhere near 10^19 files. - For content deduplication (not cryptographic verification), MD5 provides adequate uniqueness. - -2. **Storage efficiency**: 32-char hashes vs 64-char hashes in every JSON metadata field. - With millions of records, this halves the storage overhead for hash identifiers. - -3. **Performance**: MD5 is ~2-3x faster than SHA256 for large files. While both are fast, - the difference is measurable when hashing large scientific datasets. - -4. **Legacy compatibility**: DataJoint's existing `uuid_from_buffer()` function uses MD5. - The new system changes only the storage format (hex string in JSON vs binary UUID), - not the underlying hash algorithm. This simplifies migration. - -5. **Consistency with existing codebase**: The `dj.hash` module already uses MD5 for - `key_hash()` (job reservation) and `uuid_from_buffer()` (query caching). - -**Why not SHA256?** - -SHA256 is the modern standard for content-addressable storage (Git, Docker, IPFS). 
However: -- These systems prioritize cryptographic security against adversarial collision attacks -- Scientific data pipelines face no adversarial threat model -- The practical benefits (storage, speed, compatibility) outweigh theoretical security gains - -**Note**: If cryptographic verification is ever needed (e.g., for compliance or reproducibility -audits), SHA256 checksums can be computed on-demand without changing the storage addressing scheme. - -## Migration from Legacy Types - -| Legacy | New Equivalent | -|--------|----------------| -| `longblob` (auto-serialized) | `` | -| `blob@store` | `` | -| `attach` | `` | -| `attach@store` | `` | -| `filepath@store` (copy-based) | `` (ObjectRef-based) | - -### Migration from Legacy `~external_*` Stores - -Legacy external storage used per-schema `~external_{store}` tables with UUID references. -Migration to the new JSON-based hash storage requires: - -```python -def migrate_external_store(schema, store_name): - """ - Migrate legacy ~external_{store} to new HashRegistry. - - 1. Read all entries from ~external_{store} - 2. For each entry: - - Fetch content from legacy location - - Compute MD5 hash - - Copy to _hash/{hash}/ if not exists - - Update table column to new hash format - 3. After all schemas migrated, drop ~external_{store} tables - """ - external_table = schema.external[store_name] - - for entry in external_table.fetch(as_dict=True): - legacy_uuid = entry['hash'] - - # Fetch content from legacy location - content = external_table.get(legacy_uuid) - - # Compute new content hash - hash_id = hashlib.md5(content).hexdigest() - - # Store in new location if not exists - new_path = f"_hash/{hash_id[:2]}/{hash_id[2:4]}/{hash_id}" - store = get_store(store_name) - if not store.exists(new_path): - store.put(new_path, content) - - # Update referencing tables: convert UUID column to JSON with hash metadata - # The JSON column stores {"hash": hash_id, "store": store_name, "size": len(content)} - # ... update all tables that reference this UUID ... - - # After migration complete for all schemas: - # DROP TABLE `{schema}`.`~external_{store}` -``` - -**Migration considerations:** -- Legacy UUIDs were based on MD5 content hash stored as `binary(16)` (UUID format) -- New system uses `char(32)` MD5 hex strings stored in JSON -- The hash algorithm is unchanged (MD5), only the storage format differs -- Migration can be done incrementally per schema -- Backward compatibility layer can read both formats during transition - -## Open Questions - -1. How long should the backward compatibility layer support legacy `~external_*` format? -2. Should `` (without store name) use a default store or require explicit store name? diff --git a/docs/src/archive/design/tables/tiers.md b/docs/src/archive/design/tables/tiers.md deleted file mode 100644 index 2cf1f9428..000000000 --- a/docs/src/archive/design/tables/tiers.md +++ /dev/null @@ -1,68 +0,0 @@ -# Data Tiers - -DataJoint assigns all tables to one of the following data tiers that differentiate how -the data originate. - -## Table tiers - -| Tier | Superclass | Description | -| -- | -- | -- | -| Lookup | `dj.Lookup` | Small tables containing general facts and settings of the data pipeline; not specific to any experiment or dataset. | -| Manual | `dj.Manual` | Data entered from outside the pipeline, either by hand or with external helper scripts. | -| Imported | `dj.Imported` | Data ingested automatically inside the pipeline but requiring access to data outside the pipeline. 
| -| Computed | `dj.Computed` | Data computed automatically entirely inside the pipeline. | - -Table data tiers indicate to database administrators how valuable the data are. -Manual data are the most valuable, as re-entry may be tedious or impossible. -Computed data are safe to delete, as the data can always be recomputed from within DataJoint. -Imported data are safer than manual data but less safe than computed data because of -dependency on external data sources. -With these considerations, database administrators may opt not to back up computed -data, for example, or to back up imported data less frequently than manual data. - -The data tier of a table is specified by the superclass of its class. -For example, the User class in [definitions](declare.md) uses the `dj.Manual` -superclass. -Therefore, the corresponding User table on the database would be of the Manual tier. -Furthermore, the classes for **imported** and **computed** tables have additional -capabilities for automated processing as described in -[Auto-populate](../../compute/populate.md). - -## Internal conventions for naming tables - -On the server side, DataJoint uses a naming scheme to generate a table name -corresponding to a given class. -The naming scheme includes prefixes specifying each table's data tier. - -First, the name of the class is converted from `CamelCase` to `snake_case` -([separation by underscores](https://en.wikipedia.org/wiki/Snake_case)). -Then the name is prefixed according to the data tier. - -- `Manual` tables have no prefix. -- `Lookup` tables are prefixed with `#`. -- `Imported` tables are prefixed with `_`, a single underscore. -- `Computed` tables are prefixed with `__`, two underscores. - -For example: - -The table for the class `StructuralScan` subclassing `dj.Manual` will be named -`structural_scan`. - -The table for the class `SpatialFilter` subclassing `dj.Lookup` will be named -`#spatial_filter`. - -Again, the internal table names including prefixes are used only on the server side. -These are never visible to the user, and DataJoint users do not need to know these -conventions -However, database administrators may use these naming patterns to set backup policies -or to restrict access based on data tiers. - -## Part tables - -[Part tables](master-part.md) do not have their own tier. -Instead, they share the same tier as their master table. -The prefix for part tables also differs from the other tiers. -They are prefixed by the name of their master table, separated by two underscores. - -For example, the table for the class `Channel(dj.Part)` with the master -`Ephys(dj.Imported)` will be named `_ephys__channel`. diff --git a/docs/src/archive/faq.md b/docs/src/archive/faq.md deleted file mode 100644 index c4c82d014..000000000 --- a/docs/src/archive/faq.md +++ /dev/null @@ -1,192 +0,0 @@ -# Frequently Asked Questions - -## How do I use DataJoint with a GUI? - -It is common to enter data during experiments using a graphical user interface. - -1. The [DataJoint platform](https://works.datajoint.com) platform is a web-based, - end-to-end platform to host and execute data pipelines. - -2. [DataJoint LabBook](https://github.com/datajoint/datajoint-labbook) is an open -source project for data entry but is no longer actively maintained. - -## Does DataJoint support other programming languages? - -DataJoint [Python](https://docs.datajoint.com/core/datajoint-python/) is the most -up-to-date version and all future development will focus on the Python API. 
The -[Matlab](https://datajoint.com/docs/core/datajoint-matlab/) API was actively developed -through 2023. Previous projects implemented some DataJoint features in -[Julia](https://github.com/BrainCOGS/neuronex_workshop_2018/tree/julia/julia) and -[Rust](https://github.com/datajoint/datajoint-core). DataJoint's data model and data -representation are largely language independent, which means that any language with a -DataJoint client can work with a data pipeline defined in any other language. DataJoint -clients for other programming languages will be implemented based on demand. All -languages must comply to the same data model and computation approach as defined in -[DataJoint: a simpler relational data model](https://arxiv.org/abs/1807.11104). - -## Can I use DataJoint with my current database? - -Researchers use many different tools to keep records, from simple formalized file -hierarchies to complete software packages for colony management and standard file types -like NWB. Existing projects have built interfaces with many such tools, such as -[PyRAT](https://github.com/SFB1089/adamacs/blob/main/notebooks/03_pyrat_insert.ipynb). -The only requirement for interface is that tool has an open API. Contact -[support@datajoint.com](mailto:Support@DataJoint.com) with inquiries. The DataJoint -team will consider development requests based on community demand. - -## Is DataJoint an ORM? - -Programmers are familiar with object-relational mappings (ORM) in various programming -languages. Python in particular has several popular ORMs such as -[SQLAlchemy](https://www.sqlalchemy.org/) and [Django ORM](https://tutorial.djangogirls.org/en/django_orm/). -The purpose of ORMs is to allow representations and manipulations of objects from the -host programming language as data in a relational database. ORMs allow making objects -persistent between program executions by creating a bridge (i.e., mapping) between the -object model used by the host language and the relational model allowed by the database. -The result is always a compromise, usually toward the object model. ORMs usually forgo -key concepts, features, and capabilities of the relational model for the sake of -convenient programming constructs in the language. - -In contrast, DataJoint implements a data model that is a refinement of the relational -data model without compromising its core principles of data representation and queries. -DataJoint supports data integrity (entity integrity, referential integrity, and group -integrity) and provides a fully capable relational query language. DataJoint remains -absolutely data-centric, with the primary focus on the structure and integrity of the -data pipeline. Other ORMs are more application-centric, primarily focusing on the -application design while the database plays a secondary role supporting the application -with object persistence and sharing. - -## What is the difference between DataJoint and Alyx? - -[Alyx](https://github.com/cortex-lab/alyx) is an experiment management database -application developed in Kenneth Harris' lab at UCL. - -Alyx is an application with a fixed pipeline design with a nice graphical user -interface. In contrast, DataJoint is a general-purpose library for designing and -building data processing pipelines. - -Alyx is geared towards ease of data entry and tracking for a specific workflow -(e.g. mouse colony information and some pre-specified experiments) and data types. 
-DataJoint could be used as a more general purposes tool to design, implement, and -execute processing on such workflows/pipelines from scratch, and DataJoint focuses on -flexibility, data integrity, and ease of data analysis. The purposes are partly -overlapping and complementary. The -[International Brain Lab project](https://internationalbrainlab.com) is developing a -bridge from Alyx to DataJoint, hosted as an -[open-source project](https://github.com/datajoint-company/ibl-pipeline). It -implements a DataJoint schema that replicates the major features of the Alyx -application and a synchronization script from an existing Alyx database to its -DataJoint counterpart. - -## Where is my data? - -New users often ask this question thinking of passive **data repositories** -- -collections of files and folders and a separate collection of metadata -- information -about how the files were collected and what they contain. -Let's address metadata first, since the answer there is easy: Everything goes in the -database! -Any information about the experiment that would normally be stored in a lab notebook, -in an Excel spreadsheet, or in a Word document is entered into tables in the database. -These tables can accommodate numbers, strings, dates, or numerical arrays. -The entry of metadata can be manual, or it can be an automated part of data acquisition -(in this case the acquisition software itself is modified to enter information directly -into the database). - -Depending on their size and contents, raw data files can be stored in a number of ways. -In the simplest and most common scenario, raw data continues to be stored in either a -local filesystem or in the cloud as collections of files and folders. -The paths to these files are entered in the database (again, either manually or by -automated processes). -This is the point at which the notion of a **data pipeline** begins. -Below these "manual tables" that contain metadata and file paths are a series of tables -that load raw data from these files, process it in some way, and insert derived or -summarized data directly into the database. -For example, in an imaging application, the very large raw `.TIFF` stacks would reside on -the filesystem, but the extracted fluorescent trace timeseries for each cell in the -image would be stored as a numerical array directly in the database. -Or the raw video used for animal tracking might be stored in a standard video format on -the filesystem, but the computed X/Y positions of the animal would be stored in the -database. -Storing these intermediate computations in the database makes them easily available for -downstream analyses and queries. - -## Do I have to manually enter all my data into the database? - -No! While some of the data will be manually entered (the same way that it would be -manually recorded in a lab notebook), the advantage of DataJoint is that standard -downstream processing steps can be run automatically on all new data with a single -command. -This is where the notion of a **data pipeline** comes into play. -When the workflow of cleaning and processing the data, extracting important features, -and performing basic analyses is all implemented in a DataJoint pipeline, minimal -effort is required to analyze newly-collected data. -Depending on the size of the raw files and the complexity of analysis, useful results -may be available in a matter of minutes or hours. 
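For example, once the downstream tables are defined, refreshing them for all newly entered data is typically a single call (the table name here is hypothetical):

```python
# compute and insert results for every upstream entry that has not been processed yet
ProcessedData.populate(display_progress=True)
```
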
-Because these results are stored in the database, they can be made available to anyone -who is given access credentials for additional downstream analyses. - -## Won't the database get too big if all my data are there? - -Typically, this is not a problem. -If you find that your database is getting larger than a few dozen TB, DataJoint -provides transparent solutions for storing very large chunks of data (larger than the 4 -GB that can be natively stored as a LONGBLOB in MySQL). -However, in many scenarios even long time series or images can be stored directly in -the database with little effect on performance. - -## Why not just process the data and save them back to a file? - -There are two main advantages to storing results in the database. -The first is data integrity. -Because the relationships between data are enforced by the structure of the database, -DataJoint ensures that the metadata in the upstream nodes always correctly describes -the computed results downstream in the pipeline. -If a specific experimental session is deleted, for example, all the data extracted from -that session are automatically removed as well, so there is no chance of "orphaned" -data. -Likewise, the database ensures that computations are atomic. -This means that any computation performed on a dataset is performed in an all-or-none -fashion. -Either all of the data are processed and inserted, or none at all. -This ensures that there are no incomplete data. -Neither of these important features of data integrity can be guaranteed by a file -system. - -The second advantage of storing intermediate results in a data pipeline is flexible -access. -Accessing arbitrarily complex subsets of the data can be achieved with DataJoint's -flexible query language. -When data are stored in files, collecting the desired data requires trawling through -the file hierarchy, finding and loading the files of interest, and selecting the -interesting parts of the data. - -This brings us to the final important question: - -## How do I get my data out? - -This is the fun part. See [queries](query/operators.md) for details of the DataJoint -query language directly from Python. - -## Interfaces - -Multiple interfaces may be used to get the data into and out of the pipeline. - -Some labs use third-party GUI applications such as -[HeidiSQL](https://www.heidisql.com/) and -[Navicat](https://www.navicat.com/), for example. These applications allow entering -and editing data in tables similarly to spreadsheets. - -The Helium Application (https://mattbdean.github.io/Helium/ and -https://github.com/mattbdean/Helium) is web application for browsing DataJoint -pipelines and entering new data. -Matt Dean develops and maintains Helium under the direction of members of Karel -Svoboda's lab at Janelia Research Campus and Vathes LLC. - -Data may also be imported or synchronized into a DataJoint pipeline from existing LIMS -(laboratory information management systems). -For example, the [International Brain Lab](https://internationalbrainlab.com) -synchronizes data from an [Alyx database](https://github.com/cortex-lab/alyx). -For implementation details, see https://github.com/int-brain-lab/IBL-pipeline. - -Other labs (e.g. Sinz Lab) have developed GUI interfaces using the Flask web framework -in Python. 
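Finally, the most direct interface is the DataJoint Python client itself. A minimal, hypothetical sketch of entering data and querying it back out programmatically:

```python
import datajoint as dj

schema = dj.schema('lab_sessions')  # hypothetical schema name


@schema
class Session(dj.Manual):
    definition = """
    subject_id   : int
    session_date : date
    ---
    experimenter : varchar(60)
    """


# enter data into the pipeline
Session.insert1(dict(subject_id=1, session_date='2024-01-15', experimenter='alice'))

# and retrieve it with a query
(Session & 'subject_id = 1').fetch(as_dict=True)
```
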
diff --git a/docs/src/archive/images/StudentTable.png b/docs/src/archive/images/StudentTable.png deleted file mode 100644 index c8623f2ab..000000000 Binary files a/docs/src/archive/images/StudentTable.png and /dev/null differ diff --git a/docs/src/archive/images/added-example-ERD.svg b/docs/src/archive/images/added-example-ERD.svg deleted file mode 100644 index 0884853f4..000000000 --- a/docs/src/archive/images/added-example-ERD.svg +++ /dev/null @@ -1,207 +0,0 @@ - - - -%3 - - - -uni.Term - - -uni.Term - - - - - -uni.Section - - -uni.Section - - - - - -uni.Term->uni.Section - - - - -uni.CurrentTerm - - -uni.CurrentTerm - - - - - -uni.Term->uni.CurrentTerm - - - - -uni.Student - - -uni.Student - - - - - -uni.Enroll - - -uni.Enroll - - - - - -uni.Student->uni.Enroll - - - - -uni.StudentMajor - - -uni.StudentMajor - - - - - -uni.Student->uni.StudentMajor - - - - -uni.Example - - -uni.Example - - - - - -uni.Student->uni.Example - - - - -uni.Grade - - -uni.Grade - - - - - -uni.Enroll->uni.Grade - - - - -uni.Section->uni.Enroll - - - - -uni.Course - - -uni.Course - - - - - -uni.Course->uni.Section - - - - -uni.Department - - -uni.Department - - - - - -uni.Department->uni.StudentMajor - - - - -uni.Department->uni.Course - - - - -uni.LetterGrade - - -uni.LetterGrade - - - - - -uni.LetterGrade->uni.Grade - - - - diff --git a/docs/src/archive/images/data-engineering.png b/docs/src/archive/images/data-engineering.png deleted file mode 100644 index e038ac299..000000000 Binary files a/docs/src/archive/images/data-engineering.png and /dev/null differ diff --git a/docs/src/archive/images/data-science-after.png b/docs/src/archive/images/data-science-after.png deleted file mode 100644 index e4f824cab..000000000 Binary files a/docs/src/archive/images/data-science-after.png and /dev/null differ diff --git a/docs/src/archive/images/data-science-before.png b/docs/src/archive/images/data-science-before.png deleted file mode 100644 index eb8ee311d..000000000 Binary files a/docs/src/archive/images/data-science-before.png and /dev/null differ diff --git a/docs/src/archive/images/diff-example1.png b/docs/src/archive/images/diff-example1.png deleted file mode 100644 index 2c8844b81..000000000 Binary files a/docs/src/archive/images/diff-example1.png and /dev/null differ diff --git a/docs/src/archive/images/diff-example2.png b/docs/src/archive/images/diff-example2.png deleted file mode 100644 index ab7465c7b..000000000 Binary files a/docs/src/archive/images/diff-example2.png and /dev/null differ diff --git a/docs/src/archive/images/diff-example3.png b/docs/src/archive/images/diff-example3.png deleted file mode 100644 index b4f511fec..000000000 Binary files a/docs/src/archive/images/diff-example3.png and /dev/null differ diff --git a/docs/src/archive/images/dimitri-ERD.svg b/docs/src/archive/images/dimitri-ERD.svg deleted file mode 100644 index 590b30887..000000000 --- a/docs/src/archive/images/dimitri-ERD.svg +++ /dev/null @@ -1,117 +0,0 @@ - - - -%3 - - - -`dimitri_university`.`course` - -`dimitri_university`.`course` - - - -`dimitri_university`.`section` - -`dimitri_university`.`section` - - - -`dimitri_university`.`course`->`dimitri_university`.`section` - - - - -`dimitri_university`.`current_term` - -`dimitri_university`.`current_term` - - - -`dimitri_university`.`department` - -`dimitri_university`.`department` - - - -`dimitri_university`.`department`->`dimitri_university`.`course` - - - - -`dimitri_university`.`student_major` - -`dimitri_university`.`student_major` - - - 
-`dimitri_university`.`department`->`dimitri_university`.`student_major` - - - - -`dimitri_university`.`enroll` - -`dimitri_university`.`enroll` - - - -`dimitri_university`.`grade` - -`dimitri_university`.`grade` - - - -`dimitri_university`.`enroll`->`dimitri_university`.`grade` - - - - -`dimitri_university`.`letter_grade` - -`dimitri_university`.`letter_grade` - - - -`dimitri_university`.`letter_grade`->`dimitri_university`.`grade` - - - - -`dimitri_university`.`section`->`dimitri_university`.`enroll` - - - - -`dimitri_university`.`student` - -`dimitri_university`.`student` - - - -`dimitri_university`.`student`->`dimitri_university`.`enroll` - - - - -`dimitri_university`.`student`->`dimitri_university`.`student_major` - - - - -`dimitri_university`.`term` - -`dimitri_university`.`term` - - - -`dimitri_university`.`term`->`dimitri_university`.`current_term` - - - - -`dimitri_university`.`term`->`dimitri_university`.`section` - - - - diff --git a/docs/src/archive/images/doc_1-1.png b/docs/src/archive/images/doc_1-1.png deleted file mode 100644 index 4f6f0fa0b..000000000 Binary files a/docs/src/archive/images/doc_1-1.png and /dev/null differ diff --git a/docs/src/archive/images/doc_1-many.png b/docs/src/archive/images/doc_1-many.png deleted file mode 100644 index 32fbbf15b..000000000 Binary files a/docs/src/archive/images/doc_1-many.png and /dev/null differ diff --git a/docs/src/archive/images/doc_many-1.png b/docs/src/archive/images/doc_many-1.png deleted file mode 100644 index 961a306dc..000000000 Binary files a/docs/src/archive/images/doc_many-1.png and /dev/null differ diff --git a/docs/src/archive/images/doc_many-many.png b/docs/src/archive/images/doc_many-many.png deleted file mode 100644 index 3aa484dd6..000000000 Binary files a/docs/src/archive/images/doc_many-many.png and /dev/null differ diff --git a/docs/src/archive/images/how-it-works.png b/docs/src/archive/images/how-it-works.png deleted file mode 100644 index 10c611f3d..000000000 Binary files a/docs/src/archive/images/how-it-works.png and /dev/null differ diff --git a/docs/src/archive/images/install-cmd-prompt.png b/docs/src/archive/images/install-cmd-prompt.png deleted file mode 100644 index 58c9fa964..000000000 Binary files a/docs/src/archive/images/install-cmd-prompt.png and /dev/null differ diff --git a/docs/src/archive/images/install-datajoint-1.png b/docs/src/archive/images/install-datajoint-1.png deleted file mode 100644 index 7aa0a7133..000000000 Binary files a/docs/src/archive/images/install-datajoint-1.png and /dev/null differ diff --git a/docs/src/archive/images/install-datajoint-2.png b/docs/src/archive/images/install-datajoint-2.png deleted file mode 100644 index 970e8c6d4..000000000 Binary files a/docs/src/archive/images/install-datajoint-2.png and /dev/null differ diff --git a/docs/src/archive/images/install-git-1.png b/docs/src/archive/images/install-git-1.png deleted file mode 100644 index 7503dbb61..000000000 Binary files a/docs/src/archive/images/install-git-1.png and /dev/null differ diff --git a/docs/src/archive/images/install-graphviz-1.png b/docs/src/archive/images/install-graphviz-1.png deleted file mode 100644 index dc79e58f1..000000000 Binary files a/docs/src/archive/images/install-graphviz-1.png and /dev/null differ diff --git a/docs/src/archive/images/install-graphviz-2a.png b/docs/src/archive/images/install-graphviz-2a.png deleted file mode 100644 index 394598db7..000000000 Binary files a/docs/src/archive/images/install-graphviz-2a.png and /dev/null differ diff --git 
a/docs/src/archive/images/install-graphviz-2b.png b/docs/src/archive/images/install-graphviz-2b.png deleted file mode 100644 index 790f88d40..000000000 Binary files a/docs/src/archive/images/install-graphviz-2b.png and /dev/null differ diff --git a/docs/src/archive/images/install-jupyter-1.png b/docs/src/archive/images/install-jupyter-1.png deleted file mode 100644 index 14d697942..000000000 Binary files a/docs/src/archive/images/install-jupyter-1.png and /dev/null differ diff --git a/docs/src/archive/images/install-jupyter-2.png b/docs/src/archive/images/install-jupyter-2.png deleted file mode 100644 index 0d69e6667..000000000 Binary files a/docs/src/archive/images/install-jupyter-2.png and /dev/null differ diff --git a/docs/src/archive/images/install-matplotlib.png b/docs/src/archive/images/install-matplotlib.png deleted file mode 100644 index d092376bb..000000000 Binary files a/docs/src/archive/images/install-matplotlib.png and /dev/null differ diff --git a/docs/src/archive/images/install-pydotplus.png b/docs/src/archive/images/install-pydotplus.png deleted file mode 100644 index 4a0b33f91..000000000 Binary files a/docs/src/archive/images/install-pydotplus.png and /dev/null differ diff --git a/docs/src/archive/images/install-python-advanced-1.png b/docs/src/archive/images/install-python-advanced-1.png deleted file mode 100644 index b07c70e94..000000000 Binary files a/docs/src/archive/images/install-python-advanced-1.png and /dev/null differ diff --git a/docs/src/archive/images/install-python-advanced-2.png b/docs/src/archive/images/install-python-advanced-2.png deleted file mode 100644 index b10be09cc..000000000 Binary files a/docs/src/archive/images/install-python-advanced-2.png and /dev/null differ diff --git a/docs/src/archive/images/install-python-simple.png b/docs/src/archive/images/install-python-simple.png deleted file mode 100644 index ec28cf8cc..000000000 Binary files a/docs/src/archive/images/install-python-simple.png and /dev/null differ diff --git a/docs/src/archive/images/install-run-jupyter-1.png b/docs/src/archive/images/install-run-jupyter-1.png deleted file mode 100644 index cd1e9cfb5..000000000 Binary files a/docs/src/archive/images/install-run-jupyter-1.png and /dev/null differ diff --git a/docs/src/archive/images/install-run-jupyter-2.png b/docs/src/archive/images/install-run-jupyter-2.png deleted file mode 100644 index 7fcee8ee7..000000000 Binary files a/docs/src/archive/images/install-run-jupyter-2.png and /dev/null differ diff --git a/docs/src/archive/images/install-verify-graphviz.png b/docs/src/archive/images/install-verify-graphviz.png deleted file mode 100644 index 6468a98c3..000000000 Binary files a/docs/src/archive/images/install-verify-graphviz.png and /dev/null differ diff --git a/docs/src/archive/images/install-verify-jupyter.png b/docs/src/archive/images/install-verify-jupyter.png deleted file mode 100644 index 73defac5d..000000000 Binary files a/docs/src/archive/images/install-verify-jupyter.png and /dev/null differ diff --git a/docs/src/archive/images/install-verify-python.png b/docs/src/archive/images/install-verify-python.png deleted file mode 100644 index 54ad47290..000000000 Binary files a/docs/src/archive/images/install-verify-python.png and /dev/null differ diff --git a/docs/src/archive/images/join-example1.png b/docs/src/archive/images/join-example1.png deleted file mode 100644 index a518896ef..000000000 Binary files a/docs/src/archive/images/join-example1.png and /dev/null differ diff --git a/docs/src/archive/images/join-example2.png 
b/docs/src/archive/images/join-example2.png deleted file mode 100644 index c219a6a02..000000000 Binary files a/docs/src/archive/images/join-example2.png and /dev/null differ diff --git a/docs/src/archive/images/join-example3.png b/docs/src/archive/images/join-example3.png deleted file mode 100644 index b2782469e..000000000 Binary files a/docs/src/archive/images/join-example3.png and /dev/null differ diff --git a/docs/src/archive/images/key_source_combination.png b/docs/src/archive/images/key_source_combination.png deleted file mode 100644 index 3db45de37..000000000 Binary files a/docs/src/archive/images/key_source_combination.png and /dev/null differ diff --git a/docs/src/archive/images/map-dataflow.png b/docs/src/archive/images/map-dataflow.png deleted file mode 100644 index 5a3bb34ce..000000000 Binary files a/docs/src/archive/images/map-dataflow.png and /dev/null differ diff --git a/docs/src/archive/images/matched_tuples1.png b/docs/src/archive/images/matched_tuples1.png deleted file mode 100644 index c27593e14..000000000 Binary files a/docs/src/archive/images/matched_tuples1.png and /dev/null differ diff --git a/docs/src/archive/images/matched_tuples2.png b/docs/src/archive/images/matched_tuples2.png deleted file mode 100644 index 673fa5865..000000000 Binary files a/docs/src/archive/images/matched_tuples2.png and /dev/null differ diff --git a/docs/src/archive/images/matched_tuples3.png b/docs/src/archive/images/matched_tuples3.png deleted file mode 100644 index f60e11b50..000000000 Binary files a/docs/src/archive/images/matched_tuples3.png and /dev/null differ diff --git a/docs/src/archive/images/mp-diagram.png b/docs/src/archive/images/mp-diagram.png deleted file mode 100644 index d834726fb..000000000 Binary files a/docs/src/archive/images/mp-diagram.png and /dev/null differ diff --git a/docs/src/archive/images/op-restrict.png b/docs/src/archive/images/op-restrict.png deleted file mode 100644 index e686ac94a..000000000 Binary files a/docs/src/archive/images/op-restrict.png and /dev/null differ diff --git a/docs/src/archive/images/outer-example1.png b/docs/src/archive/images/outer-example1.png deleted file mode 100644 index 0a7c7552f..000000000 Binary files a/docs/src/archive/images/outer-example1.png and /dev/null differ diff --git a/docs/src/archive/images/pipeline-database.png b/docs/src/archive/images/pipeline-database.png deleted file mode 100644 index 035df17cb..000000000 Binary files a/docs/src/archive/images/pipeline-database.png and /dev/null differ diff --git a/docs/src/archive/images/pipeline.png b/docs/src/archive/images/pipeline.png deleted file mode 100644 index 0d91f72e9..000000000 Binary files a/docs/src/archive/images/pipeline.png and /dev/null differ diff --git a/docs/src/archive/images/python_collection.png b/docs/src/archive/images/python_collection.png deleted file mode 100644 index 76fd1d7b0..000000000 Binary files a/docs/src/archive/images/python_collection.png and /dev/null differ diff --git a/docs/src/archive/images/queries_example_diagram.png b/docs/src/archive/images/queries_example_diagram.png deleted file mode 100644 index d6aae1377..000000000 Binary files a/docs/src/archive/images/queries_example_diagram.png and /dev/null differ diff --git a/docs/src/archive/images/query_object_preview.png b/docs/src/archive/images/query_object_preview.png deleted file mode 100644 index 16cedc8fc..000000000 Binary files a/docs/src/archive/images/query_object_preview.png and /dev/null differ diff --git a/docs/src/archive/images/restrict-example1.png 
b/docs/src/archive/images/restrict-example1.png deleted file mode 100644 index 451e68c58..000000000 Binary files a/docs/src/archive/images/restrict-example1.png and /dev/null differ diff --git a/docs/src/archive/images/restrict-example2.png b/docs/src/archive/images/restrict-example2.png deleted file mode 100644 index aa9a4636b..000000000 Binary files a/docs/src/archive/images/restrict-example2.png and /dev/null differ diff --git a/docs/src/archive/images/restrict-example3.png b/docs/src/archive/images/restrict-example3.png deleted file mode 100644 index e8de7f6ca..000000000 Binary files a/docs/src/archive/images/restrict-example3.png and /dev/null differ diff --git a/docs/src/archive/images/shapes_pipeline.svg b/docs/src/archive/images/shapes_pipeline.svg deleted file mode 100644 index 14572c4ce..000000000 --- a/docs/src/archive/images/shapes_pipeline.svg +++ /dev/null @@ -1,36 +0,0 @@ - - -%3 - - - -Area - - -Area - - - - - -Rectangle - - -Rectangle - - - - - -Rectangle->Area - - - - diff --git a/docs/src/archive/images/spawned-classes-ERD.svg b/docs/src/archive/images/spawned-classes-ERD.svg deleted file mode 100644 index 313841e81..000000000 --- a/docs/src/archive/images/spawned-classes-ERD.svg +++ /dev/null @@ -1,147 +0,0 @@ - - - -%3 - - - -Course - - -Course - - - - - -Section - - -Section - - - - - -Course->Section - - - - -Department - - -Department - - - - - -Department->Course - - - - -StudentMajor - - -StudentMajor - - - - - -Department->StudentMajor - - - - -Term - - -Term - - - - - -Term->Section - - - - -CurrentTerm - - -CurrentTerm - - - - - -Term->CurrentTerm - - - - -LetterGrade - - -LetterGrade - - - - - -Grade - - -Grade - - - - - -LetterGrade->Grade - - - - -Enroll - - -Enroll - - - - - -Enroll->Grade - - - - -Student - - -Student - - - - - -Student->Enroll - - - - -Student->StudentMajor - - - - -Section->Enroll - - - - diff --git a/docs/src/archive/images/union-example1.png b/docs/src/archive/images/union-example1.png deleted file mode 100644 index e693e7170..000000000 Binary files a/docs/src/archive/images/union-example1.png and /dev/null differ diff --git a/docs/src/archive/images/union-example2.png b/docs/src/archive/images/union-example2.png deleted file mode 100644 index 82cc5cc51..000000000 Binary files a/docs/src/archive/images/union-example2.png and /dev/null differ diff --git a/docs/src/archive/images/virtual-module-ERD.svg b/docs/src/archive/images/virtual-module-ERD.svg deleted file mode 100644 index 28eb0c481..000000000 --- a/docs/src/archive/images/virtual-module-ERD.svg +++ /dev/null @@ -1,147 +0,0 @@ - - - -%3 - - - -uni.LetterGrade - - -uni.LetterGrade - - - - - -uni.Grade - - -uni.Grade - - - - - -uni.LetterGrade->uni.Grade - - - - -uni.Course - - -uni.Course - - - - - -uni.Section - - -uni.Section - - - - - -uni.Course->uni.Section - - - - -uni.Term - - -uni.Term - - - - - -uni.Term->uni.Section - - - - -uni.CurrentTerm - - -uni.CurrentTerm - - - - - -uni.Term->uni.CurrentTerm - - - - -uni.Enroll - - -uni.Enroll - - - - - -uni.Section->uni.Enroll - - - - -uni.StudentMajor - - -uni.StudentMajor - - - - - -uni.Enroll->uni.Grade - - - - -uni.Department - - -uni.Department - - - - - -uni.Department->uni.Course - - - - -uni.Department->uni.StudentMajor - - - - -uni.Student - - -uni.Student - - - - - -uni.Student->uni.StudentMajor - - - - -uni.Student->uni.Enroll - - - - diff --git a/docs/src/archive/manipulation/delete.md b/docs/src/archive/manipulation/delete.md deleted file mode 100644 index e61e8a2b8..000000000 --- 
a/docs/src/archive/manipulation/delete.md +++ /dev/null @@ -1,31 +0,0 @@ -# Delete - -The `delete` method deletes entities from a table and all dependent entries in -dependent tables. - -Delete is often used in conjunction with the [restriction](../query/restrict.md) -operator to define the subset of entities to delete. -Delete is performed as an atomic transaction so that partial deletes never occur. - -## Examples - -```python -# delete all entries from tuning.VonMises -tuning.VonMises.delete() - -# delete entries from tuning.VonMises for mouse 1010 -(tuning.VonMises & 'mouse=1010').delete() - -# delete entries from tuning.VonMises except mouse 1010 -(tuning.VonMises - 'mouse=1010').delete() -``` - -## Deleting from part tables - -Entities in a [part table](../design/tables/master-part.md) are usually removed as a -consequence of deleting the master table. - -To enforce this workflow, calling `delete` directly on a part table produces an error. -In some cases, it may be necessary to override this behavior using the `part_integrity` parameter: -- `part_integrity="ignore"`: Remove entities from a part table without deleting from master (breaks integrity). -- `part_integrity="cascade"`: Delete from parts and also cascade up to delete the corresponding master entries. diff --git a/docs/src/archive/manipulation/index.md b/docs/src/archive/manipulation/index.md deleted file mode 100644 index 295195778..000000000 --- a/docs/src/archive/manipulation/index.md +++ /dev/null @@ -1,9 +0,0 @@ -# Data Manipulation - -Data **manipulation** operations change the state of the data stored in the database -without modifying the structure of the stored data. -These operations include [insert](insert.md), [delete](delete.md), and -[update](update.md). - -Data manipulation operations in DataJoint respect the -[integrity](../design/integrity.md) constraints. diff --git a/docs/src/archive/manipulation/insert.md b/docs/src/archive/manipulation/insert.md deleted file mode 100644 index 2db4157d6..000000000 --- a/docs/src/archive/manipulation/insert.md +++ /dev/null @@ -1,173 +0,0 @@ -# Insert - -The `insert` method of DataJoint table objects inserts entities into the table. - -In Python there is a separate method `insert1` to insert one entity at a time. -The entity may have the form of a Python dictionary with key names matching the -attribute names in the table. - -```python -lab.Person.insert1( - dict(username='alice', - first_name='Alice', - last_name='Cooper')) -``` - -The entity also may take the form of a sequence of values in the same order as the -attributes in the table. - -```python -lab.Person.insert1(['alice', 'Alice', 'Cooper']) -``` - -Additionally, the entity may be inserted as a -[NumPy record array](https://docs.scipy.org/doc/numpy/reference/generated/numpy.record.html#numpy.record) - or [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). - -The `insert` method accepts a sequence or a generator of multiple entities and is used -to insert multiple entities at once. - -```python -lab.Person.insert([ - ['alice', 'Alice', 'Cooper'], - ['bob', 'Bob', 'Dylan'], - ['carol', 'Carol', 'Douglas']]) -``` - -Several optional parameters can be used with `insert`: - - `replace` If `True`, replaces the existing entity. - (Default `False`.) - - `skip_duplicates` If `True`, silently skip duplicate inserts. - (Default `False`.) - - `ignore_extra_fields` If `False`, fields that are not in the heading raise an error. - (Default `False`.) 
- - `allow_direct_insert` If `True`, allows inserts outside of populate calls. - Applies only in auto-populated tables. - (Default `None`.) - -## Batched inserts - -Inserting a set of entities in a single `insert` differs from inserting the same set of -entities one-by-one in a `for` loop in two ways: - -1. Network overhead is reduced. - Network overhead can be tens of milliseconds per query. - Inserting 1000 entities in a single `insert` call may save a few seconds over - inserting them individually. -2. The insert is performed as an all-or-nothing transaction. - If even one insert fails because it violates any constraint, then none of the - entities in the set are inserted. - -However, inserting too many entities in a single query may run against buffer size or -packet size limits of the database server. -Due to these limitations, performing inserts of very large numbers of entities should -be broken up into moderately sized batches, such as a few hundred at a time. - -## Server-side inserts - -Data inserted into a table often come from other tables already present on the database server. -In such cases, data can be [fetched](../query/fetch.md) from the first table and then -inserted into another table, but this results in transfers back and forth between the -database and the local system. -Instead, data can be inserted from one table into another without transfers between the -database and the local system using [queries](../query/principles.md). - -In the example below, a new schema has been created in preparation for phase two of a -project. -Experimental protocols from the first phase of the project will be reused in the second -phase. -Since the entities are already present on the database in the `Protocol` table of the -`phase_one` schema, we can perform a server-side insert into `phase_two.Protocol` -without fetching a local copy. - -```python -# Server-side inserts are faster... -phase_two.Protocol.insert(phase_one.Protocol) - -# ...than fetching before inserting -protocols = phase_one.Protocol.fetch() -phase_two.Protocol.insert(protocols) -``` - -## Object attributes - -Tables with [`object`](../design/tables/object.md) type attributes can be inserted with -local file paths, folder paths, remote URLs, or streams. The content is automatically -copied to object storage. 
- -```python -# Insert with local file path -Recording.insert1({ - "subject_id": 123, - "session_id": 45, - "raw_data": "/local/path/to/data.dat" -}) - -# Insert with local folder path -Recording.insert1({ - "subject_id": 123, - "session_id": 45, - "raw_data": "/local/path/to/data_folder/" -}) - -# Insert from remote URL (S3, GCS, Azure, HTTP) -Recording.insert1({ - "subject_id": 123, - "session_id": 45, - "raw_data": "s3://source-bucket/path/to/data.dat" -}) - -# Insert remote Zarr store (e.g., from collaborator) -Recording.insert1({ - "subject_id": 123, - "session_id": 45, - "neural_data": "gs://collaborator-bucket/shared/experiment.zarr" -}) - -# Insert from stream with explicit extension -with open("/path/to/data.bin", "rb") as f: - Recording.insert1({ - "subject_id": 123, - "session_id": 45, - "raw_data": (".bin", f) - }) -``` - -Supported remote URL protocols: `s3://`, `gs://`, `az://`, `http://`, `https://` - -### Staged inserts - -For large objects like Zarr arrays, use `staged_insert1` to write directly to storage -without creating a local copy first: - -```python -import zarr - -with Recording.staged_insert1 as staged: - # Set primary key values first - staged.rec['subject_id'] = 123 - staged.rec['session_id'] = 45 - - # Create Zarr array directly in object storage - z = zarr.open(staged.store('raw_data', '.zarr'), mode='w', shape=(10000, 10000)) - z[:] = compute_large_array() - - # Assign to record - staged.rec['raw_data'] = z - -# On successful exit: metadata computed, record inserted -# On exception: storage cleaned up, no record inserted -``` - -The `staged_insert1` context manager provides: - -- `staged.rec`: Dict for setting attribute values -- `staged.store(field, ext)`: Returns fsspec store for Zarr/xarray -- `staged.open(field, ext, mode)`: Returns file handle for writing -- `staged.fs`: Direct fsspec filesystem access - -See the [object type documentation](../design/tables/object.md) for more details. diff --git a/docs/src/archive/manipulation/transactions.md b/docs/src/archive/manipulation/transactions.md deleted file mode 100644 index 58b9a3167..000000000 --- a/docs/src/archive/manipulation/transactions.md +++ /dev/null @@ -1,36 +0,0 @@ -# Transactions - -In some cases, a sequence of several operations must be performed as a single -operation: -interrupting the sequence of such operations halfway would leave the data in an invalid -state. -While the sequence is in progress, other processes accessing the database will not see -the partial results until the transaction is complete. -The sequence may include [data queries](../query/principles.md) and -[manipulations](index.md). - -In such cases, the sequence of operations may be enclosed in a transaction. - -Transactions are formed using the `transaction` property of the connection object. -The connection object may be obtained from any table object. -The `transaction` property can then be used as a context manager in Python's `with` -statement. - -For example, the following code inserts matching entries for the master table `Session` -and its part table `Session.Experimenter`. 
- -```python -# get the connection object -connection = Session.connection - -# insert Session and Session.Experimenter entries in a transaction -with connection.transaction: - key = {'subject_id': animal_id, 'session_time': session_time} - Session.insert1({**key, 'brain_region':region, 'cortical_layer':layer}) - Session.Experimenter.insert1({**key, 'experimenter': username}) -``` - -Here, to external observers, both inserts will take effect together upon exiting from -the `with` block or will not have any effect at all. -For example, if the second insert fails due to an error, the first insert will be -rolled back. diff --git a/docs/src/archive/manipulation/update.md b/docs/src/archive/manipulation/update.md deleted file mode 100644 index 7faa7cb87..000000000 --- a/docs/src/archive/manipulation/update.md +++ /dev/null @@ -1,48 +0,0 @@ -# Cautious Update - -In database programming, the **update** operation refers to modifying the values of -individual attributes in an entity within a table without replacing the entire entity. -Such an in-place update mechanism is not part of DataJoint's data manipulation model, -because it circumvents data -[dependency constraints](../design/integrity.md#referential-integrity). - -This is not to say that data cannot be changed once they are part of a pipeline. -In DataJoint, data is changed by replacing entire entities rather than by updating the -values of their attributes. -The process of deleting existing entities and inserting new entities with corrected -values ensures the [integrity](../design/integrity.md) of the data throughout the -pipeline. - -This approach applies specifically to automated tables -(see [Auto-populated tables](../compute/populate.md)). -However, manual tables are often edited outside DataJoint through other interfaces. -It is up to the user's discretion to allow updates in manual tables, and the user must -be cognizant of the fact that updates will not trigger re-computation of dependent data. - -## Usage - -For some cases, it becomes necessary to deliberately correct existing values where a -user has chosen to accept the above responsibility despite the caution. - -The `update1` method accomplishes this if the record already exists. Note that updates -to primary key values are not allowed. - -The method should only be used to fix problems, and not as part of a regular workflow. -When updating an entry, make sure that any information stored in dependent tables that -depends on the update values is properly updated as well. - -## Examples - -```python -# with record as a dict specifying the primary and -# secondary attribute values -table.update1(record) - -# update value in record with id as primary key -table.update1({'id': 1, 'value': 3}) - -# reset value to default with id as primary key -table.update1({'id': 1, 'value': None}) -# or -table.update1({'id': 1}) -``` diff --git a/docs/src/archive/publish-data.md b/docs/src/archive/publish-data.md deleted file mode 100644 index 3ec2d7211..000000000 --- a/docs/src/archive/publish-data.md +++ /dev/null @@ -1,34 +0,0 @@ -# Publishing Data - -DataJoint is a framework for building data pipelines that support rigorous flow of -structured data between experimenters, data scientists, and computing agents *during* -data acquisition and processing within a centralized project. -Publishing final datasets for the outside world may require additional steps and -conversion. 
- -## Provide access to a DataJoint server - -One approach for publishing data is to grant public access to an existing pipeline. -Then public users will be able to query the data pipelines using DataJoint's query -language and output interfaces just like any other users of the pipeline. -For security, this may require synchronizing the data onto a separate read-only public -server. - -## Containerizing as a DataJoint pipeline - -Containerization platforms such as [Docker](https://www.docker.com/) allow convenient -distribution of environments including database services and data. -It is convenient to publish DataJoint pipelines as a docker container that deploys the -populated DataJoint pipeline. -One example of publishing a DataJoint pipeline as a docker container is -> Sinz, F., Ecker, A.S., Fahey, P., Walker, E., Cobos, E., Froudarakis, E., Yatsenko, D., Pitkow, Z., Reimer, J. and Tolias, A., 2018. Stimulus domain transfer in recurrent models for large scale cortical population prediction on video. In Advances in Neural Information Processing Systems (pp. 7198-7209). https://www.biorxiv.org/content/early/2018/10/25/452672 - -The code and the data can be found at [https://github.com/sinzlab/Sinz2018_NIPS](https://github.com/sinzlab/Sinz2018_NIPS). - -## Exporting into a collection of files - -Another option for publishing and archiving data is to export the data from the -DataJoint pipeline into a collection of files. -DataJoint provides features for exporting and importing sections of the pipeline. -Several ongoing projects are implementing the capability to export from DataJoint -pipelines into [Neurodata Without Borders](https://www.nwb.org/) files. diff --git a/docs/src/archive/query/aggregation.md b/docs/src/archive/query/aggregation.md deleted file mode 100644 index e47fd0b33..000000000 --- a/docs/src/archive/query/aggregation.md +++ /dev/null @@ -1,29 +0,0 @@ -# Aggr - -**Aggregation**, performed with the `aggr` operator, is a special form of `proj` with -the additional feature of allowing aggregation calculations on another table. -It has the form `tab.aggr(other, ...)` where `other` is another table. -Without the argument `other`, `aggr` and `proj` are exactly equivalent. -Aggregation allows adding calculated attributes to each entity in `tab` based on -aggregation functions over attributes in the -[matching](./operators.md#matching-entities) entities of `other`. - -Aggregation functions include `count`, `sum`, `min`, `max`, `avg`, `std`, `variance`, -and others. -Aggregation functions can only be used in the definitions of new attributes within the -`aggr` operator. - -As with `proj`, the output of `aggr` has the same entity class, the same primary key, -and the same number of elements as `tab`. -Primary key attributes are always included in the output and may be renamed, just like -in `proj`. - -## Examples - -```python -# Number of students in each course section -Section.aggr(Enroll, n="count(*)") - -# Average grade in each course -Course.aggr(Grade * LetterGrade, avg_grade="avg(points)") -``` diff --git a/docs/src/archive/query/example-schema.md b/docs/src/archive/query/example-schema.md deleted file mode 100644 index 063e36574..000000000 --- a/docs/src/archive/query/example-schema.md +++ /dev/null @@ -1,112 +0,0 @@ -# Example Schema - -The example schema below contains data for a university enrollment system. -Information about students, departments, courses, etc. are organized in multiple tables. 
- -Warning: - Empty primary keys, such as in the `CurrentTerm` table, are not yet supported by - DataJoint. - This feature will become available in a future release. - See [Issue #113](https://github.com/datajoint/datajoint-python/issues/113) for more - information. - -```python -@schema -class Student (dj.Manual): -definition = """ -student_id : int unsigned # university ID ---- -first_name : varchar(40) -last_name : varchar(40) -sex : enum('F', 'M', 'U') -date_of_birth : date -home_address : varchar(200) # street address -home_city : varchar(30) -home_state : char(2) # two-letter abbreviation -home_zipcode : char(10) -home_phone : varchar(14) -""" - -@schema -class Department (dj.Manual): -definition = """ -dept : char(6) # abbreviated department name, e.g. BIOL ---- -dept_name : varchar(200) # full department name -dept_address : varchar(200) # mailing address -dept_phone : varchar(14) -""" - -@schema -class StudentMajor (dj.Manual): -definition = """ --> Student ---- --> Department -declare_date : date # when student declared her major -""" - -@schema -class Course (dj.Manual): -definition = """ --> Department -course : int unsigned # course number, e.g. 1010 ---- -course_name : varchar(200) # e.g. "Cell Biology" -credits : decimal(3,1) # number of credits earned by completing the course -""" - -@schema -class Term (dj.Manual): -definition = """ -term_year : year -term : enum('Spring', 'Summer', 'Fall') -""" - -@schema -class Section (dj.Manual): -definition = """ --> Course --> Term -section : char(1) ---- -room : varchar(12) # building and room code -""" - -@schema -class CurrentTerm (dj.Manual): -definition = """ ---- --> Term -""" - -@schema -class Enroll (dj.Manual): -definition = """ --> Section --> Student -""" - -@schema -class LetterGrade (dj.Manual): -definition = """ -grade : char(2) ---- -points : decimal(3,2) -""" - -@schema -class Grade (dj.Manual): -definition = """ --> Enroll ---- --> LetterGrade -""" -``` - -## Example schema diagram - -![University example schema](../images/queries_example_diagram.png){: style="align:center"} - -Example schema for a university database. -Tables contain data on students, departments, courses, etc. diff --git a/docs/src/archive/query/fetch.md b/docs/src/archive/query/fetch.md deleted file mode 100644 index 75a50fd0d..000000000 --- a/docs/src/archive/query/fetch.md +++ /dev/null @@ -1,174 +0,0 @@ -# Fetch - -Data queries in DataJoint comprise two distinct steps: - -1. Construct the `query` object to represent the required data using tables and -[operators](operators.md). -2. Fetch the data from `query` into the workspace of the host language -- described in -this section. - -Note that entities returned by `fetch` methods are not guaranteed to be sorted in any -particular order unless specifically requested. -Furthermore, the order is not guaranteed to be the same in any two queries, and the -contents of two identical queries may change between two sequential invocations unless -they are wrapped in a transaction. -Therefore, if you wish to fetch matching pairs of attributes, do so in one `fetch` call. - -The examples below are based on the [example schema](example-schema.md) for this part -of the documentation. - -## Entire table - -The following statement retrieves the entire table as a NumPy -[recarray](https://docs.scipy.org/doc/numpy/reference/generated/numpy.recarray.html). 
- -```python -data = query.fetch() -``` - -To retrieve the data as a list of `dict`: - -```python -data = query.fetch(as_dict=True) -``` - -In some cases, the amount of data returned by fetch can be quite large; in these cases -it can be useful to use the `size_on_disk` attribute to determine if running a bare -fetch would be wise. -Please note that it is only currently possible to query the size of entire tables -stored directly in the database at this time. - -## As separate variables - -```python -name, img = query.fetch1('name', 'image') # when query has exactly one entity -name, img = query.fetch('name', 'image') # [name, ...] [image, ...] -``` - -## Primary key values - -```python -keydict = tab.fetch1("KEY") # single key dict when tab has exactly one entity -keylist = tab.fetch("KEY") # list of key dictionaries [{}, ...] -``` - -`KEY` can also used when returning attribute values as separate variables, such that -one of the returned variables contains the entire primary keys. - -## Sorting and limiting the results - -To sort the result, use the `order_by` keyword argument. - -```python -# ascending order: -data = query.fetch(order_by='name') -# descending order: -data = query.fetch(order_by='name desc') -# by name first, year second: -data = query.fetch(order_by=('name desc', 'year')) -# sort by the primary key: -data = query.fetch(order_by='KEY') -# sort by name but for same names order by primary key: -data = query.fetch(order_by=('name', 'KEY desc')) -``` - -The `order_by` argument can be a string specifying the attribute to sort by. By default -the sort is in ascending order. Use `'attr desc'` to sort in descending order by -attribute `attr`. The value can also be a sequence of strings, in which case, the sort -performed on all the attributes jointly in the order specified. - -The special attribute name `'KEY'` represents the primary key attributes in order that -they appear in the index. Otherwise, this name can be used as any other argument. - -If an attribute happens to be a SQL reserved word, it needs to be enclosed in -backquotes. For example: - -```python -data = query.fetch(order_by='`select` desc') -``` - -The `order_by` value is eventually passed to the `ORDER BY` -[clause](https://dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html). - -Similarly, the `limit` and `offset` arguments can be used to limit the result to a -subset of entities. - -For example, one could do the following: - -```python -data = query.fetch(order_by='name', limit=10, offset=5) -``` - -Note that an `offset` cannot be used without specifying a `limit` as well. - -## Usage with Pandas - -The [pandas library](http://pandas.pydata.org/) is a popular library for data analysis -in Python which can easily be used with DataJoint query results. -Since the records returned by `fetch()` are contained within a `numpy.recarray`, they -can be easily converted to `pandas.DataFrame` objects by passing them into the -`pandas.DataFrame` constructor. -For example: - -```python -import pandas as pd -frame = pd.DataFrame(tab.fetch()) -``` - -Calling `fetch()` with the argument `format="frame"` returns results as -`pandas.DataFrame` objects indexed by the table's primary key attributes. - -```python -frame = tab.fetch(format="frame") -``` - -Returning results as a `DataFrame` is not possible when fetching a particular subset of -attributes or when `as_dict` is set to `True`. 
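If you do need a `DataFrame` limited to particular attributes, one simple workaround (a sketch reusing the `query` object and the `name` and `year` attributes from the examples above) is to fetch the attributes as separate arrays and assemble the frame yourself:

```python
import pandas as pd

# fetch only the desired attributes as parallel arrays
names, years = query.fetch('name', 'year')

# build the DataFrame manually from the fetched columns
frame = pd.DataFrame({'name': names, 'year': years})
```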
- -## Object Attributes - -When fetching [`object`](../design/tables/object.md) attributes, DataJoint returns an -`ObjectRef` handle instead of the raw data. This allows working with large files without -copying them locally. - -```python -record = Recording.fetch1() -obj = record["raw_data"] - -# Access metadata (no I/O) -print(obj.path) # Storage path -print(obj.size) # Size in bytes -print(obj.is_dir) # True if folder - -# Read content -content = obj.read() # Returns bytes for files - -# Open as file object -with obj.open() as f: - data = f.read() - -# Download to local path -local_path = obj.download("/local/destination/") -``` - -### Integration with Array Libraries - -`ObjectRef` provides direct fsspec access for Zarr and xarray: - -```python -import zarr -import xarray as xr - -obj = Recording.fetch1()["neural_data"] - -# Open as Zarr array -z = zarr.open(obj.store, mode='r') - -# Open with xarray -ds = xr.open_zarr(obj.store) - -# Direct filesystem access -fs = obj.fs -``` - -See the [object type documentation](../design/tables/object.md) for more details. diff --git a/docs/src/archive/query/iteration.md b/docs/src/archive/query/iteration.md deleted file mode 100644 index 60d95f107..000000000 --- a/docs/src/archive/query/iteration.md +++ /dev/null @@ -1,36 +0,0 @@ -# Iteration - -The DataJoint model primarily handles data as sets, in the form of tables. However, it -can sometimes be useful to access or to perform actions such as visualization upon -individual entities sequentially. In DataJoint this is accomplished through iteration. - -In the simple example below, iteration is used to display the names and values of the -attributes of each entity in the simple table or table expression. - -```python -for entity in table: - print(entity) -``` - -This example illustrates the function of the iterator: DataJoint iterates through the -whole table expression, returning the entire entity during each step. In this case, -each entity will be returned as a `dict` containing all attributes. - -At the start of the above loop, DataJoint internally fetches only the primary keys of -the entities. Since only the primary keys are needed to distinguish between entities, -DataJoint can then iterate over the list of primary keys to execute the loop. At each -step of the loop, DataJoint uses a single primary key to fetch an entire entity for use -in the iteration, such that `print(entity)` will print all attributes of each entity. -By first fetching only the primary keys and then fetching each entity individually, -DataJoint saves memory at the cost of network overhead. This can be particularly useful -for tables containing large amounts of data in secondary attributes. - -The memory savings of the above syntax may not be worth the additional network overhead -in all cases, such as for tables with little data stored as secondary attributes. In -the example below, DataJoint fetches all of the attributes of each entity in a single -call and then iterates over the list of entities stored in memory. - -```python -for entity in table.fetch(as_dict=True): - print(entity) -``` diff --git a/docs/src/archive/query/join.md b/docs/src/archive/query/join.md deleted file mode 100644 index d0ab0eae0..000000000 --- a/docs/src/archive/query/join.md +++ /dev/null @@ -1,37 +0,0 @@ -# Join - -## Join operator `*` - -The Join operator `A * B` combines the matching information in `A` and `B`. -The result contains all matching combinations of entities from both arguments. - -### Principles of joins - -1. 
The operands `A` and `B` must be **join-compatible**. -2. The primary key of the result is the union of the primary keys of the operands. - -### Examples of joins - -Example 1 : When the operands have no common attributes, the result is the cross -product -- all combinations of entities. - -![join-example1](../images/join-example1.png){: style="width:464px; align:center"} - -Example 2 : When the operands have common attributes, only entities with matching -values are kept. - -![join-example2](../images/join-example2.png){: style="width:689px; align:center"} - -Example 3 : Joining on secondary attribute. - -![join-example3](../images/join-example3.png){: style="width:689px; align:center"} - -### Properties of join - -1. When `A` and `B` have the same attributes, the join `A * B` becomes equivalent to -the set intersection `A` ∩ `B`. - Hence, DataJoint does not need a separate intersection operator. - -2. Commutativity: `A * B` is equivalent to `B * A`. - -3. Associativity: `(A * B) * C` is equivalent to `A * (B * C)`. diff --git a/docs/src/archive/query/operators.md b/docs/src/archive/query/operators.md deleted file mode 100644 index ee3549f35..000000000 --- a/docs/src/archive/query/operators.md +++ /dev/null @@ -1,395 +0,0 @@ -# Operators - -[Data queries](principles.md) have the form of expressions using operators to derive -the desired table. -The expressions themselves do not contain any data. -They represent the desired data symbolically. - -Once a query is formed, the [fetch](fetch.md) methods are used to bring the data into -the local workspace. -Since the expressions are only symbolic representations, repeated `fetch` calls may -yield different results as the state of the database is modified. - -DataJoint implements a complete algebra of operators on tables: - -| operator | notation | meaning | -|------------------------------|----------------|-------------------------------------------------------------------------| -| [join](#join) | A * B | All matching information from A and B | -| [restriction](#restriction) | A & cond | The subset of entities from A that meet the condition | -| [restriction](#restriction) | A - cond | The subset of entities from A that do not meet the condition | -| [proj](#proj) | A.proj(...) | Selects and renames attributes from A or computes new attributes | -| [aggr](#aggr) | A.aggr(B, ...) | Same as projection with computations based on matching information in B | -| [union](#union) | A + B | All unique entities from both A and B | -| [universal set](#universal-set)\*| dj.U() | All unique entities from both A and B | -| [top](#top)\*| dj.Top() | The top rows of A - -\*While not technically query operators, it is useful to discuss Universal Set and Top in the -same context. - -## Principles of relational algebra - -DataJoint's algebra improves upon the classical relational algebra and upon other query -languages to simplify and enhance the construction and interpretation of precise and -efficient data queries. - -1. **Entity integrity**: Data are represented and manipulated in the form of tables -representing [well-formed entity sets](../design/integrity.md). - This applies to the inputs and outputs of query operators. - The output of a query operator is an entity set with a well-defined entity type, a - primary key, unique attribute names, etc. -2. **Algebraic closure**: All operators operate on entity sets and yield entity sets. 
- Thus query expressions may be used as operands in other expressions or may be - assigned to variables to be used in other expressions. -3. **Attributes are identified by names**: All attributes have explicit names. - This includes results of queries. - Operators use attribute names to determine how to perform the operation. - The order of the attributes is not significant. - -## Matching entities - -Binary operators in DataJoint are based on the concept of **matching entities**; this -phrase will be used throughout the documentation. - - Two entities **match** when they have no common attributes or when their common - attributes contain the same values. - -Here **common attributes** are those that have the same names in both entities. -It is usually assumed that the common attributes are of compatible datatypes to allow -equality comparisons. - -Another way to phrase the same definition is - - Two entities match when they have no common attributes whose values differ. - -It may be conceptually convenient to imagine that all tables always have an additional -invisible attribute, `omega` whose domain comprises only one value, 1. -Then the definition of matching entities is simplified: - - Two entities match when their common attributes contain the same values. - -Matching entities can be **merged** into a single entity without any conflicts of -attribute names and values. - -### Examples - -This is a matching pair of entities: - -![matched_tuples1](../images/matched_tuples1.png){: style="width:366px"} - -and so is this one: - -![matched_tuples2](../images/matched_tuples2.png){: style="width:366px"} - -but these entities do *not* match: - -![matched_tuples3](../images/matched_tuples3.png){: style="width:366px"} - -## Join compatibility - -All binary operators with other tables as their two operands require that the operands -be **join-compatible**, which means that: - -1. All common attributes in both operands (attributes with the same name) must be part -of either the primary key or a foreign key. -2. All common attributes in the two relations must be of a compatible datatype for -equality comparisons. - -## Restriction - -The restriction operator `A & cond` selects the subset of entities from `A` that meet -the condition `cond`. The exclusion operator `A - cond` selects the complement of -restriction, i.e. the subset of entities from `A` that do not meet the condition -`cond`. This means that the restriction and exclusion operators are complementary. -The same query could be constructed using either `A & cond` or `A - Not(cond)`. - -
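For example (a brief sketch assuming a `Session` table with a `user` attribute, as in the restriction examples further below on this page), the two complementary forms produce the same result:

```python
# restriction: sessions performed by Alice
Session & 'user = "Alice"'

# the equivalent query expressed through exclusion and negation
Session - dj.Not('user = "Alice"')
```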
-![Restriction and exclusion.](../../../images/concepts-operators-restriction.png){: style="height:200px"} -
- -The condition `cond` may be one of the following: - -=== "Python" - - - another table - - a mapping, e.g. `dict` - - an expression in a character string - - a collection of conditions as a `list`, `tuple`, or Pandas `DataFrame` - - a Boolean expression (`True` or `False`) - - an `AndList` - - a `Not` object - - a query expression - -??? Warning "Permissive Operators" - - To circumvent compatibility checks, DataJoint offers permissive operators for - Restriction (`^`) and Join (`@`). Use with Caution. - -## Proj - -The `proj` operator represents **projection** and is used to select attributes -(columns) from a table, to rename them, or to create new calculated attributes. - -1. A simple projection *selects a subset of attributes* of the original -table, which may not include the [primary key](../concepts/glossary#primary-key). - -2. A more complex projection *renames an attribute* in another table. This could be -useful when one table should be referenced multiple times in another. A user table, -could contain all personnel. A project table references one person for the lead and -another the coordinator, both referencing the common personnel pool. - -3. Projection can also perform calculations (as available in -[MySQL](https://dev.mysql.com/doc/refman/5.7/en/functions.html)) on a single attribute. - -## Aggr - -**Aggregation** is a special form of `proj` with the added feature of allowing - aggregation calculations on another table. It has the form `table.aggr - (other, ...)` where `other` is another table. Aggregation allows adding calculated - attributes to each entity in `table` based on aggregation functions over attributes - in the matching entities of `other`. - -Aggregation functions include `count`, `sum`, `min`, `max`, `avg`, `std`, `variance`, -and others. - -## Union - -The result of the union operator `A + B` contains all the entities from both operands. - -[Entity normalization](../design/normalization) requires that `A` and `B` are of the same type, -with with the same [primary key](../concepts/glossary#primary-key), using homologous -attributes. Without secondary attributes, the result is the simple set union. With -secondary attributes, they must have the same names and datatypes. The two operands -must also be **disjoint**, without any duplicate primary key values across both inputs. -These requirements prevent ambiguity of attribute values and preserve entity identity. - -??? Note "Principles of union" - - 1. As in all operators, the order of the attributes in the operands is not - significant. - - 2. Operands `A` and `B` must have the same primary key attributes. Otherwise, an - error will be raised. - - 3. Operands `A` and `B` may not have any common non-key attributes. Otherwise, an - error will be raised. - - 4. The result `A + B` will have the same primary key as `A` and `B`. - - 5. The result `A + B` will have all the non-key attributes from both `A` and `B`. - - 6. For entities that are found in both `A` and `B` (based on the primary key), the - secondary attributes will be filled from the corresponding entities in `A` and - `B`. - - 7. For entities that are only found in either `A` or `B`, the other operand's - secondary attributes will filled with null values. - -For union, order does not matter. - -
-![Union Example 1](../../../images/concepts-operators-union1.png){: style="height:200px"} -
-
-![Union Example 2](../../../images/concepts-operators-union2.png){: style="height:200px"} -
- -??? Note "Properties of union" - - 1. Commutative: `A + B` is equivalent to `B + A`. - 2. Associative: `(A + B) + C` is equivalent to `A + (B + C)`. - -## Universal Set - -All of the above operators are designed to preserve their input type. Some queries may -require creating a new entity type not already represented by existing tables. This -means that the new type must be defined as part of the query. - -Universal sets fulfill this role using `dj.U` notation. They denote the set of all -possible entities with given attributes of any possible datatype. Attributes of -universal sets are allowed to be matched to any namesake attributes, even those that do -not come from the same initial source. - -Universal sets should be used sparingly when no suitable base tables already exist. In -some cases, defining a new base table can make queries clearer and more semantically -constrained. - -The examples below will use the table definitions in [table tiers](../reproduce/table-tiers). - - - -## Top - -Similar to the universal set operator, the top operator uses `dj.Top` notation. It is used to -restrict a query by the given `limit`, `order_by`, and `offset` parameters: - -```python -Session & dj.Top(limit=10, order_by='session_date') -``` - -The result of this expression returns the first 10 rows of `Session` and sorts them -by their `session_date` in ascending order. - -### `order_by` - -| Example | Description | -|-------------------------------------------|---------------------------------------------------------------------------------| -| `order_by="session_date DESC"` | Sort by `session_date` in *descending* order | -| `order_by="KEY"` | Sort by the primary key | -| `order_by="KEY DESC"` | Sort by the primary key in *descending* order | -| `order_by=["subject_id", "session_date"]` | Sort by `subject_id`, then sort matching `subject_id`s by their `session_date` | - -The default values for `dj.Top` parameters are `limit=1`, `order_by="KEY"`, and `offset=0`. - -## Restriction - -`&` and `-` operators permit restriction. - -### By a mapping - -For a [Session table](../reproduce/table-tiers#manual-tables), that has the attribute -`session_date`, we can restrict to sessions from January 1st, 2022: - -```python -Session & {'session_date': "2022-01-01"} -``` - -If there were any typos (e.g., using `sess_date` instead of `session_date`), our query -will return all of the entities of `Session`. - -### By a string - -Conditions may include arithmetic operations, functions, range tests, etc. Restriction -of table `A` by a string containing an attribute not found in table `A` produces an -error. - -```python -Session & 'user = "Alice"' # (1) -Session & 'session_date >= "2022-01-01"' # (2) -``` - -1. All the sessions performed by Alice -2. All of the sessions on or after January 1st, 2022 - -### By a collection - -When `cond` is a collection of conditions, the conditions are applied by logical -disjunction (logical OR). Restricting a table by a collection will return all entities -that meet *any* of the conditions in the collection. - -For example, if we restrict the `Session` table by a collection containing two -conditions, one for user and one for date, the query will return any sessions with a -matching user *or* date. - -A collection can be a list, a tuple, or a Pandas `DataFrame`. 
- -``` python -cond_list = ['user = "Alice"', 'session_date = "2022-01-01"'] # (1) -cond_tuple = ('user = "Alice"', 'session_date = "2022-01-01"') # (2) -import pandas as pd -cond_frame = pd.DataFrame(data={'user': ['Alice'], 'session_date': ['2022-01-01']}) # (3) - -Session() & ['user = "Alice"', 'session_date = "2022-01-01"'] -``` - -1. A list -2. A tuple -3. A data frame - -`dj.AndList` represents logical conjunction(logical AND). Restricting a table by an -`AndList` will return all entities that meet *all* of the conditions in the list. `A & -dj.AndList([c1, c2, c3])` is equivalent to `A & c1 & c2 & c3`. - -```python -Student() & dj.AndList(['user = "Alice"', 'session_date = "2022-01-01"']) -``` - -The above will show all the sessions that Alice conducted on the given day. - -### By a `Not` object - -The special function `dj.Not` represents logical negation, such that `A & dj.Not -(cond)` is equivalent to `A - cond`. - -### By a query - -Restriction by a query object is a generalization of restriction by a table. The example -below creates a query object corresponding to all the users named Alice. The `Session` -table is then restricted by the query object, returning all the sessions performed by -Alice. - -``` python -query = User & 'user = "Alice"' -Session & query -``` - -## Proj - -Renaming an attribute in python can be done via keyword arguments: - -```python -table.proj(new_attr='old_attr') -``` - -This can be done in the context of a table definition: - -```python -@schema -class Session(dj.Manual): - definition = """ - # Experiment Session - -> Animal - session : smallint # session number for the animal - --- - session_datetime : datetime # YYYY-MM-DD HH:MM:SS - session_start_time : float # seconds relative to session_datetime - session_end_time : float # seconds relative to session_datetime - -> User.proj(experimenter='username') - -> User.proj(supervisor='username') - """ -``` - -Or to rename multiple values in a table with the following syntax: -`Table.proj(*existing_attributes,*renamed_attributes)` - -```python -Session.proj('session','session_date',start='session_start_time',end='session_end_time') -``` - -Projection can also be used to to compute new attributes from existing ones. - -```python -Session.proj(duration='session_end_time-session_start_time') & 'duration > 10' -``` - -## Aggr - -For more complicated calculations, we can use aggregation. - -``` python -Subject.aggr(Session,n="count(*)") # (1) -Subject.aggr(Session,average_start="avg(session_start_time)") # (2) -``` - -1. Number of sessions per subject. -2. Average `session_start_time` for each subject - - - -## Universal set - -Universal sets offer the complete list of combinations of attributes. - -``` python -# All home cities of students -dj.U('laser_wavelength', 'laser_power') & Scan # (1) -dj.U('laser_wavelength', 'laser_power').aggr(Scan, n="count(*)") # (2) -dj.U().aggr(Session, n="max(session)") # (3) -``` - -1. All combinations of wavelength and power. -2. Total number of scans for each combination. -3. Largest session number. - -`dj.U()`, as shown in the last example above, is often useful for integer IDs. -For an example of this process, see the source code for -[Element Array Electrophysiology's `insert_new_params`](https://docs.datajoint.com/elements/element-array-ephys/latest/api/element_array_ephys/ephys_acute/#element_array_ephys.ephys_acute.ClusteringParamSet.insert_new_params). 
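As a rough sketch of that pattern (assuming a hypothetical manual table `ParamSet` with an integer primary key `paramset_id` and a secondary attribute `description`), `dj.U()` can supply the next available ID when inserting a new entry:

```python
# largest existing paramset_id; the aggregation returns None when the table is empty
max_id = dj.U().aggr(ParamSet, n="max(paramset_id)").fetch1("n")

# insert a new parameter set under the next available ID
ParamSet.insert1({
    "paramset_id": (max_id or 0) + 1,    # falls back to 1 on an empty table
    "description": "new parameter set",  # assumed secondary attribute
})
```

Because `max()` over an empty table yields `NULL`, the `or 0` fallback lets the same sketch handle the very first insert as well.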
diff --git a/docs/src/archive/query/principles.md b/docs/src/archive/query/principles.md deleted file mode 100644 index 9b9fd284d..000000000 --- a/docs/src/archive/query/principles.md +++ /dev/null @@ -1,81 +0,0 @@ -# Query Principles - -**Data queries** retrieve data from the database. -A data query is performed with the help of a **query object**, which is a symbolic -representation of the query that does not in itself contain any actual data. -The simplest query object is an instance of a **table class**, representing the -contents of an entire table. - -For example, if `experiment.Session` is a DataJoint table class, you can create a query -object to retrieve its entire contents as follows: - -```python -query = experiment.Session() -``` - -More generally, a query object may be formed as a **query expression** constructed by -applying [operators](operators.md) to other query objects. - -For example, the following query retrieves information about all experiments and scans -for mouse 102 (excluding experiments with no scans): - -```python -query = experiment.Session * experiment.Scan & 'animal_id = 102' -``` - -Note that for brevity, query operators can be applied directly to class objects rather -than instance objects so that `experiment.Session` may be used in place of -`experiment.Session()`. - -You can preview the contents of the query in Python, Jupyter Notebook, or MATLAB by -simply displaying the object. -In the image below, the object `query` is first defined as a restriction of the table -`EEG` by values of the attribute `eeg_sample_rate` greater than 1000 Hz. -Displaying the object gives a preview of the entities that will be returned by `query`. -Note that this preview only lists a few of the entities that will be returned. -Also, the preview does not contain any data for attributes of datatype `blob`. - -![Query object preview](../images/query_object_preview.png){: style="align:center"} - -Defining a query object and previewing the entities returned by the query. - -Once the desired query object is formed, the query can be executed using its -[fetch](fetch.md) methods. -To **fetch** means to transfer the data represented by the query object from the -database server into the workspace of the host language. - -```python -s = query.fetch() -``` - -Here fetching from the `query` object produces the NumPy record array `s` of the -queried data. - -## Checking for returned entities - -The preview of the query object shown above displayed only a few of the entities -returned by the query but also displayed the total number of entities that would be -returned. -It can be useful to know the number of entities returned by a query, or even whether a -query will return any entities at all, without having to fetch all the data themselves. - -The `bool` function applied to a query object evaluates to `True` if the query returns -any entities and to `False` if the query result is empty. - -The `len` function applied to a query object determines the number of entities returned -by the query. - -```python -# number of sessions since the start of 2018. -n = len(Session & 'session_date >= "2018-01-01"') -``` - -## Normalization in queries - -Query objects adhere to entity [entity normalization](../design/normalization.md) just -like the stored tables do. -The result of a query is a well-defined entity set with an readily identifiable entity -class and designated primary attributes that jointly distinguish any two entities from -each other. 
-The query [operators](operators.md) are designed to keep the result normalized even in -complex query expressions. diff --git a/docs/src/archive/query/project.md b/docs/src/archive/query/project.md deleted file mode 100644 index 99e5749c7..000000000 --- a/docs/src/archive/query/project.md +++ /dev/null @@ -1,68 +0,0 @@ -# Proj - -The `proj` operator represents **projection** and is used to select attributes -(columns) from a table, to rename them, or to create new calculated attributes. - -## Simple projection - -The simple projection selects a subset of attributes of the original table. -However, the primary key attributes are always included. - -Using the [example schema](example-schema.md), let table `department` have attributes -**dept**, *dept_name*, *dept_address*, and *dept_phone*. -The primary key attribute is in bold. - -Then `department.proj()` will have attribute **dept**. - -`department.proj('dept')` will have attribute **dept**. - -`department.proj('dept_name', 'dept_phone')` will have attributes **dept**, -*dept_name*, and *dept_phone*. - -## Renaming - -In addition to selecting attributes, `proj` can rename them. -Any attribute can be renamed, including primary key attributes. - -This is done using keyword arguments: -`tab.proj(new_attr='old_attr')` - -For example, let table `tab` have attributes **mouse**, **session**, *session_date*, -*stimulus*, and *behavior*. -The primary key attributes are in bold. - -Then - -```python -tab.proj(animal='mouse', 'stimulus') -``` - -will have attributes **animal**, **session**, and *stimulus*. - -Renaming is often used to control the outcome of a [join](join.md). -For example, let `tab` have attributes **slice**, and **cell**. -Then `tab * tab` will simply yield `tab`. -However, - -```python -tab * tab.proj(other='cell') -``` - -yields all ordered pairs of all cells in each slice. - -## Calculations - -In addition to selecting or renaming attributes, `proj` can compute new attributes from -existing ones. - -For example, let `tab` have attributes `mouse`, `scan`, `surface_z`, and `scan_z`. -To obtain the new attribute `depth` computed as `scan_z - surface_z` and then to -restrict to `depth > 500`: - -```python -tab.proj(depth='scan_z-surface_z') & 'depth > 500' -``` - -Calculations are passed to SQL and are not parsed by DataJoint. -For available functions, you may refer to the -[MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/functions.html). diff --git a/docs/src/archive/query/query-caching.md b/docs/src/archive/query/query-caching.md deleted file mode 100644 index 124381b63..000000000 --- a/docs/src/archive/query/query-caching.md +++ /dev/null @@ -1,42 +0,0 @@ -# Query Caching - -Query caching allows avoiding repeated queries to the database by caching the results -locally for faster retrieval. - -To enable queries, set the query cache local path in `dj.config`, create the directory, -and activate the query caching. 
- -```python -# set the query cache path -dj.config['query_cache'] = os.path.expanduser('~/dj_query_cache') - -# access the active connection object for the tables -conn = dj.conn() # if queries co-located with tables -conn = module.schema.connection # if schema co-located with tables -conn = module.table.connection # most flexible - -# activate query caching for a namespace called 'main' -conn.set_query_cache(query_cache='main') -``` - -The `query_cache` argument is an arbitrary string serving to differentiate cache -states; setting a new value will effectively start a new cache, triggering retrieval of -new values once. - -To turn off query caching, use the following: - -```python -conn.set_query_cache(query_cache=None) -# or -conn.set_query_cache() -``` - -While query caching is enabled, any insert or delete calls and any transactions are -disabled and will raise an error. This ensures that stale data are not used for -updating the database in violation of data integrity. - -To clear and remove the query cache, use the following: - -```python -conn.purge_query_cache() -``` diff --git a/docs/src/archive/query/restrict.md b/docs/src/archive/query/restrict.md deleted file mode 100644 index f8b61e641..000000000 --- a/docs/src/archive/query/restrict.md +++ /dev/null @@ -1,205 +0,0 @@ -# Restriction - -## Restriction operators `&` and `-` - -The restriction operator `A & cond` selects the subset of entities from `A` that meet -the condition `cond`. -The exclusion operator `A - cond` selects the complement of restriction, i.e. the -subset of entities from `A` that do not meet the condition `cond`. - -Restriction and exclusion. - -![Restriction and exclusion](../images/op-restrict.png){: style="width:400px; align:center"} - -The condition `cond` may be one of the following: - -+ another table -+ a mapping, e.g. `dict` -+ an expression in a character string -+ a collection of conditions as a `list`, `tuple`, or Pandas `DataFrame` -+ a Boolean expression (`True` or `False`) -+ an `AndList` -+ a `Not` object -+ a query expression - -As the restriction and exclusion operators are complementary, queries can be -constructed using both operators that will return the same results. -For example, the queries `A & cond` and `A - Not(cond)` will return the same entities. - -## Restriction by a table - -When restricting table `A` with another table, written `A & B`, the two tables must be -**join-compatible** (see `join-compatible` in [Operators](./operators.md)). -The result will contain all entities from `A` for which there exist a matching entity -in `B`. -Exclusion of table `A` with table `B`, or `A - B`, will contain all entities from `A` -for which there are no matching entities in `B`. - -Restriction by another table. - -![Restriction by another table](../images/restrict-example1.png){: style="width:546px; align:center"} - -Exclusion by another table. - -![Exclusion by another table](../images/diff-example1.png){: style="width:539px; align:center"} - -### Restriction by a table with no common attributes - -Restriction of table `A` with another table `B` having none of the same attributes as -`A` will simply return all entities in `A`, unless `B` is empty as described below. -Exclusion of table `A` with `B` having no common attributes will return no entities, -unless `B` is empty as described below. - -Restriction by a table having no common attributes. 
- -![Restriction by a table with no common attributes](../images/restrict-example2.png){: style="width:571px; align:center"} - -Exclusion by a table having no common attributes. - -![Exclusion by a table having no common attributes](../images/diff-example2.png){: style="width:571px; align:center"} - -### Restriction by an empty table - -Restriction of table `A` with an empty table will return no entities regardless of -whether there are any matching attributes. -Exclusion of table `A` with an empty table will return all entities in `A`. - -Restriction by an empty table. - -![Restriction by an empty table](../images/restrict-example3.png){: style="width:563px; align:center"} - -Exclusion by an empty table. - -![Exclusion by an empty table](../images/diff-example3.png){: style="width:571px; align:center"} - -## Restriction by a mapping - -A key-value mapping may be used as an operand in restriction. -For each key that is an attribute in `A`, the paired value is treated as part of an -equality condition. -Any key-value pairs without corresponding attributes in `A` are ignored. - -Restriction by an empty mapping or by a mapping with no keys matching the attributes in -`A` will return all the entities in `A`. -Exclusion by an empty mapping or by a mapping with no matches will return no entities. - -For example, let's say that table `Session` has the attribute `session_date` of -[datatype](../design/tables/attributes.md) `datetime`. -You are interested in sessions from January 1st, 2018, so you write the following -restriction query using a mapping. - -```python -Session & {'session_date': "2018-01-01"} -``` - -Our mapping contains a typo omitting the final `e` from `session_date`, so no keys in -our mapping will match any attribute in `Session`. -As such, our query will return all of the entities of `Session`. - -## Restriction by a string - -Restriction can be performed when `cond` is an explicit condition on attribute values, -expressed as a string. -Such conditions may include arithmetic operations, functions, range tests, etc. -Restriction of table `A` by a string containing an attribute not found in table `A` -produces an error. - -```python -# All the sessions performed by Alice -Session & 'user = "Alice"' - -# All the experiments at least one minute long -Experiment & 'duration >= 60' -``` - -## Restriction by a collection - -A collection can be a list, a tuple, or a Pandas `DataFrame`. - -```python -# a list: -cond_list = ['first_name = "Aaron"', 'last_name = "Aaronson"'] - -# a tuple: -cond_tuple = ('first_name = "Aaron"', 'last_name = "Aaronson"') - -# a dataframe: -import pandas as pd -cond_frame = pd.DataFrame( - data={'first_name': ['Aaron'], 'last_name': ['Aaronson']}) -``` - -When `cond` is a collection of conditions, the conditions are applied by logical -disjunction (logical OR). -Thus, restriction of table `A` by a collection will return all entities in `A` that -meet *any* of the conditions in the collection. -For example, if you restrict the `Student` table by a collection containing two -conditions, one for a first and one for a last name, your query will return any -students with a matching first name *or* a matching last name. - -```python -Student() & ['first_name = "Aaron"', 'last_name = "Aaronson"'] -``` - -Restriction by a collection, returning all entities matching any condition in the collection. - -![Restriction by collection](../images/python_collection.png){: style="align:center"} - -Restriction by an empty collection returns no entities. 
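For instance, a minimal sketch of these edge cases (assuming the `Student` table from the examples above):

```python
# an empty collection in a restriction matches nothing
assert len(Student & []) == 0

# exclusion by an empty collection leaves the table unchanged
assert len(Student - []) == len(Student)
```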
-Exclusion of table `A` by an empty collection returns all the entities of `A`. - -## Restriction by a Boolean expression - -`A & True` and `A - False` are equivalent to `A`. - -`A & False` and `A - True` are empty. - -## Restriction by an `AndList` - -The special function `dj.AndList` represents logical conjunction (logical AND). -Restriction of table `A` by an `AndList` will return all entities in `A` that meet -*all* of the conditions in the list. -`A & dj.AndList([c1, c2, c3])` is equivalent to `A & c1 & c2 & c3`. -Usually, it is more convenient to simply write out all of the conditions, as -`A & c1 & c2 & c3`. -However, when a list of conditions has already been generated, the list can simply be -passed as the argument to `dj.AndList`. - -Restriction of table `A` by an empty `AndList`, as in `A & dj.AndList([])`, will return -all of the entities in `A`. -Exclusion by an empty `AndList` will return no entities. - -## Restriction by a `Not` object - -The special function `dj.Not` represents logical negation, such that `A & dj.Not(cond)` -is equivalent to `A - cond`. - -## Restriction by a query - -Restriction by a query object is a generalization of restriction by a table (which is -also a query object), because DataJoint queries always produce well-defined entity -sets, as described in [entity normalization](../design/normalization.md). -As such, restriction by queries follows the same behavior as restriction by tables -described above. - -The example below creates a query object corresponding to all the sessions performed by -the user Alice. -The `Experiment` table is then restricted by the query object, returning all the -experiments that are part of sessions performed by Alice. - -```python -query = Session & 'user = "Alice"' -Experiment & query -``` - -## Restriction by `dj.Top` - -Restriction by `dj.Top` returns the number of entities specified by the `limit` -argument. These entities can be returned in the order specified by the `order_by` -argument. And finally, the `offset` argument can be used to offset the returned entities -which is useful for pagination in web applications. - -```python -# Return the first 10 sessions in descending order of session date -Session & dj.Top(limit=10, order_by='session_date DESC') -``` diff --git a/docs/src/archive/query/union.md b/docs/src/archive/query/union.md deleted file mode 100644 index 71f0fa687..000000000 --- a/docs/src/archive/query/union.md +++ /dev/null @@ -1,48 +0,0 @@ -# Union - -The union operator is not yet implemented -- this page serves as the specification for -the upcoming implementation. -Union is rarely needed in practice. - -## Union operator `+` - -The result of the union operator `A + B` contains all the entities from both operands. -[Entity normalization](../design/normalization.md) requires that the operands in a -union both belong to the same entity type with the same primary key using homologous -attributes. -In the absence of any secondary attributes, the result of a union is the simple set union. - -When secondary attributes are present, they must have the same names and datatypes in -both operands. -The two operands must also be **disjoint**, without any duplicate primary key values -across both inputs. -These requirements prevent ambiguity of attribute values and preserve entity identity. - -## Principles of union - -1. As in all operators, the order of the attributes in the operands is not significant. -2. Operands `A` and `B` must have the same primary key attributes. - Otherwise, an error will be raised. -3. 
Operands `A` and `B` may not have any common non-key attributes. -   Otherwise, an error will be raised. -4. The result `A + B` will have the same primary key as `A` and `B`. -5. The result `A + B` will have all the non-key attributes from both `A` and `B`. -6. For entities that are found in both `A` and `B` (based on the primary key), the -secondary attributes will be filled from the corresponding entities in `A` and `B`. -7. For entities that are only found in either `A` or `B`, the other operand's secondary -attributes will be filled with null values. - -## Examples of union - -Example 1: Note that the order of the attributes does not matter. - -![union-example1](../images/union-example1.png){: style="width:404px; align:center"} - -Example 2: Non-key attributes are combined from both tables and filled with NULLs when missing. - -![union-example2](../images/union-example2.png){: style="width:539px; align:center"} - -## Properties of union - -1. Commutative: `A + B` is equivalent to `B + A`. -2. Associative: `(A + B) + C` is equivalent to `A + (B + C)`. diff --git a/docs/src/archive/query/universals.md b/docs/src/archive/query/universals.md deleted file mode 100644 index a9f12dd96..000000000 --- a/docs/src/archive/query/universals.md +++ /dev/null @@ -1,46 +0,0 @@ -# Universal Sets - -All [query operators](operators.md) are designed to preserve the entity types of their -inputs. -However, some queries require creating a new entity type that is not represented by any -stored tables. -This means that a new entity type must be explicitly defined as part of the query. -Universal sets fulfill this role. - -**Universal sets** are used in DataJoint to define virtual tables with arbitrary -primary key structures for use in query expressions. -A universal set, defined using class `dj.U`, denotes the set of all possible entities -with given attributes of any possible datatype. -Universal sets allow query expressions using virtual tables when no suitable base table exists. -Attributes of universal sets are allowed to be matched to any namesake attributes, even -those that do not come from the same initial source. - -For example, you may wish to query the university database for the complete list of -students' home cities, along with the number of students from each city. -The [schema](example-schema.md) for the university database does not have a table for -cities and states. -A virtual table can fill the role of the nonexistent base table, allowing queries that -would not be possible otherwise. - -```python -# All home cities of students -dj.U('home_city', 'home_state') & Student - -# Total number of students from each city -dj.U('home_city', 'home_state').aggr(Student, n="count(*)") - -# Total number of students from each state -dj.U('home_state').aggr(Student, n="count(*)") - -# Total number of students in the database -dj.U().aggr(Student, n="count(*)") -``` - -The result of aggregation on a universal set is restricted to the entities with matches -in the aggregated table, such as `Student` in the example above. -In other words, `X.aggr(A, ...)` is interpreted as `(X & A).aggr(A, ...)` for universal -set `X`. -All attributes of a universal set are considered primary. - -Universal sets should be used sparingly, and only when no suitable base table already exists. -In some cases, defining a new base table can make queries clearer and more semantically constrained.
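As a further illustration, the result of aggregation over a universal set is an ordinary query expression and can be restricted again. The following is a hedged sketch assuming the `Student` table from the example schema; the threshold of five students is arbitrary:

```python
# cities with more than five students; the computed attribute `n`
# can be used in a follow-up restriction
populous_cities = (
    dj.U('home_city', 'home_state').aggr(Student, n='count(*)') & 'n > 5'
)

# equivalent form with the implicit restriction written out:
# X.aggr(A, ...) is interpreted as (X & A).aggr(A, ...)
populous_cities_explicit = (
    (dj.U('home_city', 'home_state') & Student).aggr(Student, n='count(*)') & 'n > 5'
)
```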
diff --git a/docs/src/archive/quick-start.md b/docs/src/archive/quick-start.md deleted file mode 100644 index 17f783405..000000000 --- a/docs/src/archive/quick-start.md +++ /dev/null @@ -1,466 +0,0 @@ -# Quick Start Guide - -## Tutorials - -The easiest way to get started is through the [DataJoint -Tutorials](https://github.com/datajoint/datajoint-tutorials). These tutorials are -configured to run using [GitHub Codespaces](https://github.com/features/codespaces) -where the full environment including the database is already set up. - -Advanced users can install DataJoint locally. Please see the installation instructions below. - -## Installation - -First, please [install Python](https://www.python.org/downloads/) version -3.10 or later. - -Next, please install DataJoint via one of the following: - -=== "conda" - - Pre-Requisites - - Ensure you have [conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html#regular-installation) - installed. - - To add the `conda-forge` channel: - - ```bash - conda config --add channels conda-forge - ``` - - To install: - - ```bash - conda install -c conda-forge datajoint - ``` - -=== "pip + :fontawesome-brands-windows:" - - Pre-Requisites - - Ensure you have [pip](https://pip.pypa.io/en/stable/installation/) installed. - - Install [graphviz](https://graphviz.org/download/#windows) pre-requisite for - diagram visualization. - - To install: - - ```bash - pip install datajoint - ``` - -=== "pip + :fontawesome-brands-apple:" - - Pre-Requisites - - Ensure you have [pip](https://pip.pypa.io/en/stable/installation/) installed. - - Install [graphviz](https://graphviz.org/download/#mac) pre-requisite for - diagram visualization. - - To install: - - ```bash - pip install datajoint - ``` - -=== "pip + :fontawesome-brands-linux:" - - Pre-Requisites - - Ensure you have [pip](https://pip.pypa.io/en/stable/installation/) installed. - - Install [graphviz](https://graphviz.org/download/#linux) pre-requisite for - diagram visualization. - - To install: - - ```bash - pip install datajoint - ``` - -## Connection - -=== "environment variables" - - Before using `datajoint`, set the following environment variables like so: - - ```bash linenums="1" - DJ_HOST={host_address} - DJ_USER={user} - DJ_PASS={password} - ``` - -=== "memory" - - To set connection settings within Python, perform: - - ```python linenums="1" - import datajoint as dj - - dj.config.database.host = "{host_address}" - dj.config.database.user = "{user}" - dj.config.database.password = "{password}" - ``` - - Note: Credentials set this way are not persisted. For persistent configuration, - use environment variables or a config file. - -=== "file" - - Create a file named `datajoint.json` in your project root: - - ```json linenums="1" - { - "database": { - "host": "{host_address}" - } - } - ``` - - **Important:** Never store credentials in config files. Use environment variables - (`DJ_USER`, `DJ_PASS`) or a `.secrets/` directory instead. - - DataJoint searches for `datajoint.json` starting from the current directory and - moving up through parent directories until it finds the file or reaches a `.git` - directory. - -## Data Pipeline Definition - -Let's definite a simple data pipeline. - -```python linenums="1" -import datajoint as dj -schema = dj.Schema(f"{dj.config['database.user']}_shapes") # This statement creates the database schema `{username}_shapes` on the server. - -@schema # The `@schema` decorator for DataJoint classes creates the table on the server. 
-class Rectangle(dj.Manual): - definition = """ # The table is defined by the the `definition` property. - shape_id: int - --- - shape_height: float - shape_width: float - """ - -@schema -class Area(dj.Computed): - definition = """ - -> Rectangle - --- - shape_area: float - """ - def make(self, key): - rectangle = (Rectangle & key).fetch1() - Area.insert1( - dict( - shape_id=rectangle["shape_id"], - shape_area=rectangle["shape_height"] * rectangle["shape_width"], - ) - ) -``` - -It is a common practice to have a separate Python module for each schema. Therefore, -each such module has only one `dj.Schema` object defined and is usually named -`schema`. - -The `dj.Schema` constructor can take a number of optional parameters -after the schema name. - -- `context` - Dictionary for looking up foreign key references. - Defaults to `None` to use local context. -- `connection` - Specifies the DataJoint connection object. Defaults - to `dj.conn()`. -- `create_schema` - When `False`, the schema object will not create a - schema on the database and will raise an error if one does not - already exist. Defaults to `True`. -- `create_tables` - When `False`, the schema object will not create - tables on the database and will raise errors when accessing missing - tables. Defaults to `True`. - -The `@schema` decorator uses the class name and the data tier to check whether an -appropriate table exists on the database. If a table does not already exist, the -decorator creates one on the database using the definition property. The decorator -attaches the information about the table to the class, and then returns the class. - -## Diagram - -### Display - -The diagram displays the relationship of the data model in the data pipeline. - -This can be done for an entire schema: - -```python -import datajoint as dj -schema = dj.Schema('my_database') -dj.Diagram(schema) -``` - -![pipeline](./images/shapes_pipeline.svg) - -Or for individual or sets of tables: -```python -dj.Diagram(schema.Rectangle) -dj.Diagram(schema.Rectangle) + dj.Diagram(schema.Area) -``` - -What if I don't see the diagram? - -Some Python interfaces may require additional `draw` method. - -```python -dj.Diagram(schema).draw() -``` - -Calling the `.draw()` method is not necessary when working in a Jupyter notebook by -entering `dj.Diagram(schema)` in a notebook cell. The Diagram will automatically -render in the notebook by calling its `_repr_html_` method. A Diagram displayed -without `.draw()` will be rendered as an SVG, and hovering the mouse over a table -will reveal a compact version of the output of the `.describe()` method. - -For more information about diagrams, see [this article](../design/diagrams). - -### Customize - -Adding or subtracting a number to a diagram object adds nodes downstream or upstream, -respectively, in the pipeline. - -```python -(dj.Diagram(schema.Rectangle)+1).draw() # Plot all the tables directly downstream from `schema.Rectangle` -``` - -```python -(dj.Diagram('my_schema')-1+1).draw() # Plot all tables directly downstream of those directly upstream of this schema. -``` - -### Save - -The diagram can be saved as either `png` or `svg`. - -```python -dj.Diagram(schema).save(filename='my-diagram', format='png') -``` - -## Insert data - -Data entry is as easy as providing the appropriate data structure to a permitted -[table](./design/tables/tiers.md). 
- -Let's add data for a rectangle: - -```python -Rectangle.insert1(dict(shape_id=1, shape_height=2, shape_width=4)) -``` - -Given the following [table definition](./design/tables/declare.md), we can insert data -as tuples, dicts, pandas dataframes, or pathlib `Path` relative paths to local CSV -files. - -```python -mouse_id: int # unique mouse id ---- -dob: date # mouse date of birth -sex: enum('M', 'F', 'U') # sex of mouse - Male, Female, or Unknown -``` - -=== "Tuple" - - ```python - mouse.insert1( (0, '2017-03-01', 'M') ) # Single entry - data = [ - (1, '2016-11-19', 'M'), - (2, '2016-11-20', 'U'), - (5, '2016-12-25', 'F') - ] - mouse.insert(data) # Multi-entry - ``` - -=== "Dict" - - ```python - mouse.insert1( dict(mouse_id=0, dob='2017-03-01', sex='M') ) # Single entry - data = [ - {'mouse_id':1, 'dob':'2016-11-19', 'sex':'M'}, - {'mouse_id':2, 'dob':'2016-11-20', 'sex':'U'}, - {'mouse_id':5, 'dob':'2016-12-25', 'sex':'F'} - ] - mouse.insert(data) # Multi-entry - ``` - -=== "Pandas" - - ```python - import pandas as pd - data = pd.DataFrame( - [[1, "2016-11-19", "M"], [2, "2016-11-20", "U"], [5, "2016-12-25", "F"]], - columns=["mouse_id", "dob", "sex"], - ) - mouse.insert(data) - ``` - -=== "CSV" - - Given the following CSV in the current working directory as `mice.csv` - - ```console - mouse_id,dob,sex - 1,2016-11-19,M - 2,2016-11-20,U - 5,2016-12-25,F - ``` - - We can import as follows: - - ```python - from pathlib import Path - mouse.insert(Path('./mice.csv')) - ``` - -## Run computation - -Let's start the computations on our entity: `Area`. - -```python -Area.populate(display_progress=True) -``` - -The `make` method populates automated tables from inserted data. Read more in the -full article [here](./compute/make.md) - -## Query - -Let's inspect the results. - -```python -Area & "shape_area >= 8" -``` - -| shaped_id | shape_area | -| --- | --- | -| 1 | 8.0 | - -## Fetch - -Data queries in DataJoint comprise two distinct steps: - -1. Construct the `query` object to represent the required data using - tables and [operators](../query/operators). -2. Fetch the data from `query` into the workspace of the host language. - -Note that entities returned by `fetch` methods are not guaranteed to be sorted in any -particular order unless specifically requested. Furthermore, the order is not -guaranteed to be the same in any two queries, and the contents of two identical queries -may change between two sequential invocations unless they are wrapped in a transaction. -Therefore, if you wish to fetch matching pairs of attributes, do so in one `fetch` -call. - -```python -data = query.fetch() -``` - -### Entire table - -A `fetch` command can either retrieve table data as a NumPy -[recarray](https://docs.scipy.org/doc/numpy/reference/generated/numpy.recarray.html) -or a as a list of `dict` - -```python -data = query.fetch() # NumPy recarray -data = query.fetch(as_dict=True) # List of `dict` -``` - -In some cases, the amount of data returned by fetch can be quite large; it can be -useful to use the `size_on_disk` attribute to determine if running a bare fetch -would be wise. Please note that it is only currently possible to query the size of -entire tables stored directly in the database at this time. - -### Separate variables - -```python -name, img = query.fetch1('mouse_id', 'dob') # when query has exactly one entity -name, img = query.fetch('mouse_id', 'dob') # [mouse_id, ...] [dob, ...] 
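# note: each fetch call returns one array per requested attribute, with rows in
# matching order, so attributes that must remain paired should be fetched together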
-``` - -### Primary key values - -```python -keydict = tab.fetch1("KEY") # single key dict when tab has exactly one entity -keylist = tab.fetch("KEY") # list of key dictionaries [{}, ...] -``` - -`KEY` can also used when returning attribute values as separate -variables, such that one of the returned variables contains the entire -primary keys. - -### Sorting results - -To sort the result, use the `order_by` keyword argument. - -```python -data = query.fetch(order_by='mouse_id') # ascending order -data = query.fetch(order_by='mouse_id desc') # descending order -data = query.fetch(order_by=('mouse_id', 'dob')) # by ID first, dob second -data = query.fetch(order_by='KEY') # sort by the primary key -``` - -The `order_by` argument can be a string specifying the attribute to sort by. By default -the sort is in ascending order. Use `'attr desc'` to sort in descending order by -attribute `attr`. The value can also be a sequence of strings, in which case, the sort -performed on all the attributes jointly in the order specified. - -The special attribute named `'KEY'` represents the primary key attributes in order that -they appear in the index. Otherwise, this name can be used as any other argument. - -If an attribute happens to be a SQL reserved word, it needs to be enclosed in -backquotes. For example: - -```python -data = query.fetch(order_by='`select` desc') -``` - -The `order_by` value is eventually passed to the `ORDER BY` -[clause](https://dev.mysql.com/doc/refman/8.0/en/order-by-optimization.html). - -### Limiting results - -Similar to sorting, the `limit` and `offset` arguments can be used to limit the result -to a subset of entities. - -```python -data = query.fetch(order_by='mouse_id', limit=10, offset=5) -``` - -Note that an `offset` cannot be used without specifying a `limit` as -well. - -### Usage with Pandas - -The `pandas` [library](http://pandas.pydata.org/) is a popular library for data analysis -in Python which can easily be used with DataJoint query results. Since the records -returned by `fetch()` are contained within a `numpy.recarray`, they can be easily -converted to `pandas.DataFrame` objects by passing them into the `pandas.DataFrame` -constructor. For example: - -```python -import pandas as pd -frame = pd.DataFrame(tab.fetch()) -``` - -Calling `fetch()` with the argument `format="frame"` returns results as -`pandas.DataFrame` objects indexed by the table's primary key attributes. - -```python -frame = tab.fetch(format="frame") -``` - -Returning results as a `DataFrame` is not possible when fetching a particular subset of -attributes or when `as_dict` is set to `True`. - -## Drop - -The `drop` method completely removes a table from the database, including its -definition. It also removes all dependent tables, recursively. DataJoint will first -display the tables being dropped and the number of entities in each before prompting -the user for confirmation to proceed. - -The `drop` method is often used during initial design to allow altered -table definitions to take effect. - -```python -# drop the Person table from its schema -Person.drop() -``` diff --git a/docs/src/archive/sysadmin/bulk-storage.md b/docs/src/archive/sysadmin/bulk-storage.md deleted file mode 100644 index 12af44791..000000000 --- a/docs/src/archive/sysadmin/bulk-storage.md +++ /dev/null @@ -1,104 +0,0 @@ -# Bulk Storage Systems - -## Why External Bulk Storage? - -DataJoint supports the storage of large data objects associated with -relational records externally from the MySQL Database itself. 
This is -significant and useful for a number of reasons. - -### Cost - -One reason is that the high-performance storage commonly used in database systems is -more expensive than typical commodity storage. Therefore, storing the smaller identifying -information typically used in queries on fast, relational database storage and storing -the larger bulk data used for analysis or processing on lower cost commodity storage -enables large savings in storage expense. - -### Flexibility - -Storing bulk data separately also facilitates more flexibility in -usage, since the bulk data can managed using separate maintenance -processes than those in the relational storage. - -For example, larger relational databases may require many hours to be -restored in the event of system failures. If the relational portion of -the data is stored separately, with the larger bulk data stored on -another storage system, this downtime can be reduced to a matter of -minutes. Similarly, due to the lower cost of bulk commodity storage, -more emphasis can be put into redundancy of this data and backups to -help protect the non-relational data. - -### Performance - -Storing the non-relational bulk data separately can have system -performance impacts by removing data transfer, disk I/O, and memory -load from the database server and shifting these to the bulk storage -system. Additionally, DataJoint supports caching of bulk data records -which can allow for faster processing of records which already have -been retrieved in previous queries. - -### Data Sharing - -DataJoint provides pluggable support for different external bulk storage backends, -allowing data sharing by publishing bulk data to S3-Protocol compatible data shares both -in the cloud and on locally managed systems and other common tools for data sharing, -such as Globus, etc. - -## Bulk Storage Scenarios - -Typical bulk storage considerations relate to the cost of the storage -backend per unit of storage, the amount of data which will be stored, -the desired focus of the shared data (system performance, data -flexibility, data sharing), and data access. Some common scenarios are -given in the following table: - -| Scenario | Storage Solution | System Requirements | Notes | -| -- | -- | -- | -- | -| Local Object Cache | Local External Storage | Local Hard Drive | Used to Speed Access to other Storage | -| LAN Object Cache | Network External Storage | Local Network Share | Used to Speed Access to other storage, reduce Cloud/Network Costs/Overhead | -| Local Object Store | Local/Network External Storage | Local/Network Storage | Used to store objects externally from the database | -| Local S3-Compatible Store | Local S3-Compatible Server | Network S3-Server | Used to host S3-Compatible services locally (e.g. minio) for internal use or to lower cloud costs | -| Cloud S3-Compatible Storage | Cloud Provider | Internet Connectivity | Used to reduce/remove requirement for external storage management, data sharing | -| Globus Storage | Globus Endpoint | Local/Local Network Storage, Internet Connectivity | Used for institutional data transfer or publishing. | - -## Bulk Storage Considerations - -Although external bulk storage provides a variety of advantages for -storage cost and data sharing, it also uses slightly different data -input/retrieval semantics and as such has different performance -characteristics. 
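For orientation, these scenarios map onto named stores (plus an optional local cache) declared in `dj.config`. The snippet below is a hedged sketch only; the endpoint, bucket, paths, and credentials are placeholders, and store configuration is covered in detail on the External Store page.

```python
import datajoint as dj

# placeholder endpoints, buckets, paths, and credentials -- for illustration only
dj.config['stores'] = {
    'external': dict(            # cloud S3-compatible storage scenario
        protocol='s3',
        endpoint='s3.amazonaws.com',
        bucket='my-lab-bucket',
        location='datajoint-store',
        access_key='...',
        secret_key='...',
    ),
    'external-raw': dict(        # local or network file storage scenario
        protocol='file',
        location='/net/bulk/datajoint',
    ),
}

# local object cache scenario: speeds up repeated fetches of external objects
dj.config['cache'] = '/scratch/dj-cache'
```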
- -### Performance Characteristics - -In the direct database connection scenario, entire result sets are -either added or retrieved from the database in a single stream -action. In the case of external storage, individual record components -are retrieved in a set of sequential actions per record, each one -subject to the network round trip to the given storage medium. As -such, tables using many small records may be ill suited to external -storage usage in the absence of a caching mechanism. While some of -these impacts may be addressed by code changes in a future release of -DataJoint, to some extent, the impact is directly related from needing -to coordinate the activities of the database data stream with the -external storage system, and so cannot be avoided. - -### Network Traffic - -Some of the external storage solutions mentioned above incur cost both -at a data volume and transfer bandwidth level. The number of users -querying the database, data access, and use of caches should be -considered in these cases to reduce this cost if applicable. - -### Data Coherency - -When storing all data directly in the relational data store, it is -relatively easy to ensure that all data in the database is consistent -in the event of system issues such as crash recoveries, since MySQL’s -relational storage engine manages this for you. When using external -storage however, it is important to ensure that any data recoveries of -the database system are paired with a matching point-in-time of the -external storage system. While DataJoint does use hashing to help -facilitate a guarantee that external files are uniquely named -throughout their lifecycle, the pairing of a given relational dataset -against a given filesystem state is loosely coupled, and so an -incorrect pairing could result in processing failures or other issues. diff --git a/docs/src/archive/sysadmin/database-admin.md b/docs/src/archive/sysadmin/database-admin.md deleted file mode 100644 index 352a3af11..000000000 --- a/docs/src/archive/sysadmin/database-admin.md +++ /dev/null @@ -1,364 +0,0 @@ -# Database Administration - -## Hosting - -Let’s say a person, a lab, or a multi-lab consortium decide to use DataJoint as their -data pipeline platform. -What IT resources and support will be required? - -DataJoint uses a MySQL-compatible database server such as MySQL, MariaDB, Percona -Server, or Amazon Aurora to store the structured data used for all relational -operations. -Large blocks of data associated with these records such as multidimensional numeric -arrays (signals, images, scans, movies, etc) can be stored within the database or -stored in additionally configured [bulk storage](../client/stores.md). - -The first decisions you need to make are where this server will be hosted and how it -will be administered. -The server may be hosted on your personal computer, on a dedicated machine in your lab, -or in a cloud-based database service. - -### Cloud hosting - -Increasingly, many teams make use of cloud-hosted database services, which allow great -flexibility and easy administration of the database server. -A cloud hosting option will be provided through https://works.datajoint.com. -DataJoint Works simplifies the setup for labs that wish to host their data pipelines in -the cloud and allows sharing pipelines between multiple groups and locations. -Being an open-source solution, other cloud services such as Amazon RDS can also be used -in this role, albeit with less DataJoint-centric customization. 
- -### Self hosting - -In the most basic configuration, the relational database management system (database -server) is installed on an individual user's personal computer. -To support a group of users, a specialized machine can be configured as a dedicated -database server. -This server can be accessed by multiple DataJoint clients to query the data and perform -computations. - -For larger groups and multi-site collaborations with heavy workloads, the database -server cluster may be configured in the cloud or on premises. -The following section provides some basic guidelines for these configurations here and -in the subsequent sections of the documentation. - -### General server / hardware support requirements - -The following table lists some likely scenarios for DataJoint database server -deployments and some reasonable estimates of the required computer hardware. -The required IT/systems support needed to ensure smooth operations in the absence of -local database expertise is also listed. - -#### IT infrastructures - -| Usage Scenario | DataJoint Database Computer | Required IT Support | -| -- | -- | -- | -| Single User | Personal Laptop or Workstation | Self-Supported or Ad-Hoc General IT Support | -| Small Group (e.g. 2-10 Users) | Workstation or Small Server | Ad-Hoc General or Experienced IT Support | -| Medium Group (e.g. 10-30 Users) | Small to Medium Server | Ad-Hoc/Part Time Experienced or Specialized IT Support | -| Large Group/Department (e.g. 30-50+ Users) | Medium/Large Server or Multi-Server Replication | Part Time/Dedicated Experienced or Specialized IT Support | -| Multi-Location Collaboration (30+ users, Geographically Distributed) | Large Server, Advanced Replication | Dedicated Specialized IT Support | - -## Configuration - -### Hardware considerations - -As in any computer system, CPU, RAM memory, disk storage, and network speed are -important components of performance. -The relational database component of DataJoint is no exception to this rule. -This section discusses the various factors relating to selecting a server for your -DataJoint pipelines. - -#### CPU - -CPU speed and parallelism (number of cores/threads) will impact the speed of queries -and the number of simultaneous queries which can be efficiently supported by the system. -It is a good rule of thumb to have enough cores to support the number of active users -and background tasks you expect to have running during a typical 'busy' day of usage. -For example, a team of 10 people might want to have 8 cores to support a few active -queries and background tasks. - -#### RAM - -The amount of RAM will impact the amount of DataJoint data kept in memory, allowing for -faster querying of data since the data can be searched and returned to the user without -needing to access the slower disk drives. -It is a good idea to get enough memory to fully store the more important and frequently -accessed portions of your dataset with room to spare, especially if in-database blob -storage is used instead of external [bulk storage](bulk-storage.md). - -#### Disk - -The disk storage for a DataJoint database server should have fast random access, -ideally with flash-based storage to eliminate the rotational delay of mechanical hard -drives. - -#### Networking - -When network connections are used, network speed and latency are important to ensure -that large query results can be quickly transferred across the network and that delays -due to data entry/query round-trip have minimal impact on the runtime of the program. 
- -#### General recommendations - -DataJoint datasets can consist of many thousands or even millions of records. -Generally speaking one would want to make sure that the relational database system has -sufficient CPU speed and parallelism to support a typical number of concurrent users -and to execute searches quickly. -The system should have enough RAM to store the primary key values of commonly used -tables and operating system caches. -Disk storage should be fast enough to support quick loading of and searching through -the data. -Lastly, network bandwidth must be sufficient to support transferring user records -quickly. - -### Large-scale installations - -Database replication may be beneficial if system downtime or precise database -responsiveness is a concern -Replication can allow for easier coordination of maintenance activities, faster -recovery in the event of system problems, and distribution of the database workload -across server machines to increase throughput and responsiveness. - -#### Multi-master replication - -Multi-master replication configurations allow for all replicas to be used in a read/ -write fashion, with the workload being distributed among all machines. -However, multi-master replication is also more complicated, requiring front-end -machines to distribute the workload, similar performance characteristics on all -replicas to prevent bottlenecks, and redundant network connections to ensure the -replicated machines are always in sync. - -### Recommendations - -It is usually best to go with the simplest solution which can suit the requirements of -the installation, adjusting workloads where possible and adding complexity only as -needs dictate. - -Resource requirements of course depend on the data collection and processing needs of -the given pipeline, but there are general size guidelines that can inform any system -configuration decisions. -A reasonably powerful workstation or small server should support the needs of a small -group (2-10 users). -A medium or large server should support the needs of a larger user community (10-30 -users). -A replicated or distributed setup of 2 or more medium or large servers may be required -in larger cases. -These requirements can be reduced through the use of external or cloud storage, which -is discussed in the subsequent section. - -| Usage Scenario | DataJoint Database Computer | Hardware Recommendation | -| -- | -- | -- | -| Single User | Personal Laptop or Workstation | 4 Cores, 8-16GB or more of RAM, SSD or better storage | -| Small Group (e.g. 2-10 Users) | Workstation or Small Server | 8 or more Cores, 16GB or more of RAM, SSD or better storage | -| Medium Group (e.g. 10-30 Users) | Small to Medium Server | 8-16 or more Cores, 32GB or more of RAM, SSD/RAID or better storage | -| Large Group/Department (e.g. 30-50+ Users) | Medium/Large Server or Multi-Server Replication | 16-32 or more Cores, 64GB or more of RAM, SSD Raid storage, multiple machines | -| Multi-Location Collaboration (30+ users, Geographically Distributed) | Large Server, Advanced Replication | 16-32 or more Cores, 64GB or more of RAM, SSD Raid storage, multiple machines; potentially multiple machines in multiple locations | - -### Docker - -A Docker image is available for a MySQL server configured to work with DataJoint: https://github.com/datajoint/mysql-docker. - -## User Management - -Create user accounts on the MySQL server. 
For example, if your -username is alice, the SQL code for this step is: - -```mysql -CREATE USER 'alice'@'%' IDENTIFIED BY 'alices-secret-password'; -``` - -Existing users can be listed using the following SQL: - -```mysql -SELECT user, host from mysql.user; -``` - -Teams that use DataJoint typically divide their data into schemas -grouped together by common prefixes. For example, a lab may have a -collection of schemas that begin with `common_`. Some common -processing may be organized into several schemas that begin with -`pipeline_`. Typically each user has all privileges to schemas that -begin with their username. - -For example, alice may have privileges to select and insert data from -the common schemas (but not create new tables), and have all -privileges to the pipeline schemas. - -Then the SQL code to grant her privileges might look like: - -```mysql -GRANT SELECT, INSERT ON `common\_%`.* TO 'alice'@'%'; -GRANT ALL PRIVILEGES ON `pipeline\_%`.* TO 'alice'@'%'; -GRANT ALL PRIVILEGES ON `alice\_%`.* TO 'alice'@'%'; -``` - -Note that the `ALL PRIVILEGES` option allows the user to create -and remove databases without administrator intervention. - -Once created, a user's privileges can be listed using the `SHOW GRANTS` -statement. - -```mysql -SHOW GRANTS FOR 'alice'@'%'; -``` - -### Grouping with Wildcards - -Depending on the complexity of your installation, using additional -wildcards to group access rules together might make managing user -access rules simpler. For example, the following naming -convention: - -```mysql -GRANT ALL PRIVILEGES ON `user_alice\_%`.* TO 'alice'@'%'; -``` - -could then facilitate using a rule like: - -```mysql -GRANT SELECT ON `user\_%\_%`.* TO 'bob'@'%'; -``` - -to enable `bob` to query all other users' tables using the -`user_username_database` convention without needing to explicitly -give him access to `alice\_%`, `charlie\_%`, and so on. - -This convention can be further expanded to create notions of groups -and protected schemas for background processing, etc. For example: - -```mysql -GRANT ALL PRIVILEGES ON `group\_shared\_%`.* TO 'alice'@'%'; -GRANT ALL PRIVILEGES ON `group\_shared\_%`.* TO 'bob'@'%'; - -GRANT ALL PRIVILEGES ON `group\_wonderland\_%`.* TO 'alice'@'%'; -GRANT SELECT ON `group\_wonderland\_%`.* TO 'bob'@'%'; -``` - -could allow both bob and alice to read and write the -`group\_shared` databases, while for the -`group\_wonderland` databases write access is restricted -to alice and bob is limited to read-only queries. - -## Backups and Recovery - -Backing up your DataJoint installation is critical to ensuring that your work is safe -and can be continued in the event of system failures, and several mechanisms are -available to use. - -Much like your live installation, your backup will consist of two portions: - -- Backup of the Relational Data -- Backup of optional external bulk storage - -This section primarily deals with backup of the relational data since most of the -optional bulk storage options use "regular" flat-files for storage and can be backed up -via any "normal" disk backup regime. - -There are many options for backing up MySQL; the subsequent sections discuss a few of them. - -### Cloud hosted backups - -In the case of cloud-hosted options, many cloud vendors provide automated backup of -your data, and some facility for downloading such backups externally. -Due to the wide variety of cloud-specific options, discussion of these options falls -outside of the scope of this documentation.
-However, since the cloud server is also a MySQL server, other options listed here may -work for your situation. - -### Disk-based backup - -The simplest option for many cases is to perform a disk-level backup of your MySQL -installation using standard disk backup tools. -It should be noted that all database activity should be stopped for the duration of the -backup to prevent errors with the backed up data. -This can be done in one of two ways: - -- Stopping the MySQL server program -- Using database locks - -These methods are required since MySQL data operations can be ongoing in the background -even when no user activity is ongoing. -To use a database lock to perform a backup, the following commands can be used as the -MySQL administrator: - -```mysql -FLUSH TABLES WITH READ LOCK; -UNLOCK TABLES; -``` - -The backup should be performed between the issuing of these two commands, ensuring the -database data is consistent on disk when it is backed up. - -### MySQLDump - -Disk based backups may not be feasible for every installation, or a database may -require constant activity such that stopping it for backups is not feasible. -In such cases, the simplest option is -[MySQLDump](https://dev.mysql.com/doc/mysql-backup-excerpt/8.0/en/using-mysqldump.html), - a command line tool that prints the contents of your database contents in SQL form. - -This tool is generally acceptable for most cases and is especially well suited for -smaller installations due to its simplicity and ease of use. - -For larger installations, the lower speed of MySQLDump can be a limitation, since it -has to convert the database contents to and from SQL rather than dealing with the -database files directly. -Additionally, since backups are performed within a transaction, the backup will be -valid up to the time the backup began rather than to its completion, which can make -ensuring that the latest data are fully backed up more difficult as the time it takes -to run a backup grows. - -### Percona XTraBackup - -The Percona `xtrabackup` tool provides near-realtime backup capability of a MySQL -installation, with extended support for replicated databases, and is a good tool for -backing up larger databases. - -However, this tool requires local disk access as well as reasonably fast backup media, -since it builds an ongoing transaction log in real time to ensure that backups are -valid up to the point of their completion. -This strategy fails if it cannot keep up with the write speed of the database. -Further, the backups it generates are in binary format and include incomplete database -transactions, which require careful attention to detail when restoring. - -As such, this solution is recommended only for advanced use cases or larger databases -where limitations of the other solutions may apply. - -### Locking and DDL issues - -One important thing to note is that at the time of writing, MySQL's transactional -system is not `data definition language` aware, meaning that changes to table -structures occurring during some backup schemes can result in corrupted backup copies. -If schema changes will be occurring during your backup window, it is a good idea to -ensure that appropriate locking mechanisms are used to prevent these changes during -critical steps of the backup process. - -However, on busy installations which cannot be stopped, the use of locks in many backup -utilities may cause issues if your programs expect to write data to the database during -the backup window. 
-In such cases it might make sense to review the given backup tools for locking related -options or to use other mechanisms such as replicas or alternate backup tools to -avoid interfering with the database. - -### Replication and snapshots for backup - -Larger databases consisting of many terabytes of data may take many hours or even days -to back up and restore, and so downtime resulting from system failure can create major -impacts on ongoing work. - -While not backup tools per se, use of MySQL replication and disk snapshots -can be useful to assist in reducing the downtime resulting from a full database outage. - -Replicas can be configured so that one copy of the data is immediately online in the -event of a server crash. -When a server fails in this case, users and programs simply restart and point to the -new server before resuming work. - -Replicas can also reduce the system load generated by regular backup procedures, since -they can be backed up instead of the main server. -Additionally they can allow more flexibility in a given backup scheme, such as allowing -for disk snapshots on a busy system that would not otherwise be able to be stopped. -A replica copy can be stopped temporarily and then resumed while a disk snapshot or -other backup operation occurs. diff --git a/docs/src/archive/sysadmin/external-store.md b/docs/src/archive/sysadmin/external-store.md deleted file mode 100644 index aac61fe24..000000000 --- a/docs/src/archive/sysadmin/external-store.md +++ /dev/null @@ -1,293 +0,0 @@ -# External Store - -DataJoint organizes most of its data in a relational database. -Relational databases excel at representing relationships between entities and storing -structured data. -However, relational databases are not particularly well-suited for storing large -continuous chunks of data such as images, signals, and movies. -An attribute of type `longblob` can contain an object up to 4 GiB in size (after -compression) but storing many such large objects may hamper the performance of queries -on the entire table. -A good rule of thumb is that objects over 10 MiB in size should not be put in the -relational database. -In addition, storing data in cloud-hosted relational databases (e.g. AWS RDS) may be -more expensive than in cloud-hosted simple storage systems (e.g. AWS S3). - -DataJoint allows the use of `external` storage to store large data objects within its -relational framework but outside of the main database. - -An externally stored attribute is defined using the notation `blob@storename` -(see also: [definition syntax](../design/tables/declare.md)) and works the same way as -a `longblob` attribute from the user's perspective. However, its data are stored in an -external storage system rather than in the relational database. - -Various systems can play the role of external storage, including a shared file system -accessible to all team members with access to these objects or a cloud storage -solution such as AWS S3. - -For example, the following table stores motion-aligned two-photon movies. - -```python -# Motion aligned movies --> twophoton.Scan ---- -aligned_movie : blob@external # motion-aligned movie in 'external' store -``` - -All [insert](../manipulation/insert.md) and [fetch](../query/fetch.md) operations work -identically for `external` attributes as they do for `blob` attributes, with the same -serialization protocol. -Similar to `blobs`, `external` attributes cannot be used in restriction conditions.
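For instance, a minimal sketch of round-tripping an external blob, assuming a hypothetical `AlignedMovie` class wrapping the definition above, with made-up primary key attributes and placeholder array contents:

```python
import numpy as np

key = dict(animal_id=1, session=1, scan_idx=1)  # made-up key inherited from twophoton.Scan
movie = np.zeros((256, 256, 100), dtype='float32')  # placeholder movie data

# insert serializes the array and writes it to the configured 'external' store
AlignedMovie.insert1(dict(key, aligned_movie=movie))

# fetch reads it back exactly as if it were an in-database longblob
retrieved = (AlignedMovie & key).fetch1('aligned_movie')
```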
- -Multiple external storage configurations may be used simultaneously with the -`@storename` portion of the attribute definition determining the storage location. - -```python -# Motion aligned movies --> twophoton.Scan ---- -aligned_movie : blob@external-raw # motion-aligned movie in 'external-raw' store -``` - -## Principles of operation - -External storage is organized to emulate individual attribute values in the relational -database. -DataJoint organizes external storage to preserve the same data integrity principles as -in relational storage. - -1. The external storage locations are specified in the DataJoint connection -configuration with one specification for each store. - - ```python - dj.config['stores'] = { - 'external': dict( # 'regular' external storage for this pipeline - protocol='s3', - endpoint='s3.amazonaws.com:9000', - bucket = 'testbucket', - location = 'datajoint-projects/lab1', - access_key='1234567', - secret_key='foaf1234'), - 'external-raw': dict( # 'raw' storage for this pipeline - protocol='file', - location='/net/djblobs/myschema') - } - # external object cache - see fetch operation below for details. - dj.config['cache'] = '/net/djcache' - ``` - -2. Each schema corresponds to a dedicated folder at the storage location with the same -name as the database schema. - -3. Stored objects are identified by the [SHA-256](https://en.wikipedia.org/wiki/SHA-2) -hashes (in web-safe base-64 ASCII) of their serialized contents. - This scheme allows for the same object—used multiple times in the same schema—to be - stored only once. - -4. In the `external-raw` storage, the objects are saved as files with the hash as the -filename. - -5. In the `external` storage, external files are stored in a directory layout -corresponding to the hash of the filename. By default, this corresponds to the first 2 -characters of the hash, followed by the second 2 characters of the hash, followed by -the actual file. - -6. Each database schema has an auxiliary table named `~external_` for each -configured external store. - - It is automatically created the first time external storage is used. - The primary key of `~external_` is the hash of the data (for blobs and - attachments) or of the relative paths to the files for filepath-based storage. - Other attributes are the `count` of references by tables in the schema, the `size` - of the object in bytes, and the `timestamp` of the last event (creation, update, or - deletion). - - Below are sample entries in `~external_`. - - | HASH | size | filepath | contents_hash | timestamp | - | -- | -- | -- | -- | -- | - | 1GEqtEU6JYEOLS4sZHeHDxWQ3JJfLlH VZio1ga25vd2 | 1039536788 | NULL | NULL | 2017-06-07 23:14:01 | - - The fields `filepath` and `contents_hash` relate to the - [filepath](../design/tables/filepath.md) datatype, which will be discussed - separately. - -7. Attributes of type `@` are declared as renamed -[foreign keys](../design/tables/dependencies.md) referencing the -`~external_` table (but are not shown as such to the user). - -8. The [insert](../manipulation/insert.md) operation encodes and hashes the blob data. -If an external object is not present in storage for the same hash, the object is saved -and if the save operation is successful, corresponding entities in table -`~external_` for that store are created. - -9. The [delete](../manipulation/delete.md) operation first deletes the foreign key -reference in the target table. The external table entry and actual external object is -not actually deleted at this time (`soft-delete`). - -10. 
The [fetch](../query/fetch.md) operation uses the hash values to find the data. - In order to prevent excessive network overhead, a special external store named - `cache` can be configured. - If the `cache` is enabled, the `fetch` operation need not access - `~external_` directly. - Instead `fetch` will retrieve the cached object without downloading directly from - the `real` external store. - -11. Cleanup is performed regularly when the database is in light use or off-line. - -12. DataJoint never removes objects from the local `cache` folder. - The `cache` folder may just be periodically emptied entirely or based on file - access date. - If dedicated `cache` folders are maintained for each schema, then a special - procedure will be provided to remove all objects that are no longer listed in - `~external_`. - -Data removal from external storage is separated from the delete operations to ensure -that data are not lost in race conditions between inserts and deletes of the same -objects, especially in cases of transactional processing or in processes that are -likely to get terminated. -The cleanup steps are performed in a separate process when the risks of race conditions -are minimal. -The process performing the cleanups must be isolated to prevent interruptions resulting -in loss of data integrity. - -## Configuration - -The following steps must be performed to enable external storage: - -1. Assign external location settings for each storage as shown in the -[Step 1](#principles-of-operation) example above. Use `dj.config` for configuration. - - - `protocol` [`s3`, `file`] Specifies whether `s3` or `file` external storage is - desired. - - `endpoint` [`s3`] Specifies the remote endpoint to the external data for all - schemas as well as the target port. - - `bucket` [`s3`] Specifies the appropriate `s3` bucket organization. - - `location` [`s3`, `file`] Specifies the subdirectory within the root or bucket of - store to preserve data. External objects are thus stored remotely with the following - path structure: - `////`. - - `access_key` [`s3`] Specifies the access key credentials for accessing the external - location. - - `secret_key` [`s3`] Specifies the secret key credentials for accessing the external - location. - - `secure` [`s3`] Optional specification to establish secure external storage - connection with TLS (aka SSL, HTTPS). Defaults to `False`. - -2. Optionally, for each schema specify the `cache` folder for local fetch cache. - - This is done by saving the path in the `cache` key of the DataJoint configuration - dictionary: - - ```python - dj.config['cache'] = '/temp/dj-cache' - ``` - -## Cleanup - -Deletion of records containing externally stored blobs is a `soft-delete` which only -removes the database-side records from the database. -To cleanup the external tracking table or the actual external files, a separate process -is provided as follows. - -To remove only the tracking entries in the external table, call `delete` -on the `~external_` table for the external configuration with the argument -`delete_external_files=False`. - -Note: Currently, cleanup operations on a schema's external table are not 100% - transaction safe and so must be run when there is no write activity occurring - in tables which use a given schema / external store pairing. 
- -```python -schema.external['external_raw'].delete(delete_external_files=False) -``` - -To remove the tracking entries as well as the underlying files, call `delete` -on the external table for the external configuration with the argument -`delete_external_files=True`. - -```python -schema.external['external_raw'].delete(delete_external_files=True) -``` - -Note: Setting `delete_external_files=True` will always attempt to delete - the underlying data file, and so should not typically be used with - the `filepath` datatype. - -## Migration between DataJoint v0.11 and v0.12 - -Note: Please read carefully if you have used external storage in DataJoint v0.11! - -The initial implementation of external storage was reworked for -DataJoint v0.12. These changes are backward-incompatible with DataJoint -v0.11 so care should be taken when upgrading. This section outlines -some details of the change and a general process for upgrading to a -format compatible with DataJoint v0.12 when a schema rebuild is not -desired. - -The primary changes to the external data implementation are: - -- The external object tracking mechanism was modified. Tracking tables -were extended for additional external datatypes and split into -per-store tables to improve database performance in schemas with -many external objects. - -- The external storage format was modified to use a nested subfolder -structure (`folding`) to improve performance and interoperability -with some filesystems that have limitations or performance problems -when storing large numbers of files in single directories. - -Depending on the circumstances, the simplest way to migrate data to -v0.12 may be to drop and repopulate the affected schemas. This will construct -the schema and storage structure in the v0.12 format and save the need for -database migration. When recreation is not possible or is not preferred -to upgrade to DataJoint v0.12, the following process should be followed: - - 1. Stop write activity to all schemas using external storage. - - 2. Perform a full backup of your database(s). - - 3. Upgrade your DataJoint installation to v0.12 - - 4. Adjust your external storage configuration (in `datajoint.config`) - to the new v0.12 configuration format (see above). - - 5. Migrate external tracking tables for each schema to use the new format. For - instance in Python: - - ```python - import datajoint.migrate as migrate - db_schema_name='schema_1' - external_store='raw' - migrate.migrate_dj011_external_blob_storage_to_dj012(db_schema_name, external_store) - ``` - - 6. Verify pipeline functionality after this process has completed. For instance in - Python: - - ```python - x = myschema.TableWithExternal.fetch('external_field', limit=1)[0] - ``` - -Note: This migration function is provided on a best-effort basis, and will - convert the external tracking tables into a format which is compatible - with DataJoint v0.12. While we have attempted to ensure correctness - of the process, all use-cases have not been heavily tested. Please be sure to fully - back-up your data and be prepared to investigate problems with the - migration, should they occur. - -Please note: - -- The migration only migrates the tracking table format and does not -modify the backing file structure to support `folding`. The DataJoint -v0.12 logic is able to work with this format, but to take advantage -of the new backend storage, manual adjustment of the tracking table -and files, or a full rebuild of the schema should be performed. 
- -- Additional care to ensure all clients are using v0.12 should be -taken after the upgrade. Legacy clients may incorrectly create data -in the old format which would then need to be combined or otherwise -reconciled with the data in v0.12 format. You might wish to take -the opportunity to version-pin your installations so that future -changes requiring controlled upgrades can be coordinated on a system -wide basis. diff --git a/docs/src/archive/tutorials/dj-top.ipynb b/docs/src/archive/tutorials/dj-top.ipynb deleted file mode 100644 index 5920a9f25..000000000 --- a/docs/src/archive/tutorials/dj-top.ipynb +++ /dev/null @@ -1,1015 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Using the dj.Top restriction" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://docs.datajoint.com/core/datajoint-python/latest/concepts/data-pipelines/#what-is-a-data-pipeline).\n", - "\n", - "Now let's start by importing the `datajoint` client." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[2024-12-20 11:10:20,120][INFO]: Connecting root@127.0.0.1:3306\n", - "[2024-12-20 11:10:20,259][INFO]: Connected root@127.0.0.1:3306\n" - ] - } - ], - "source": [ - "import datajoint as dj\n", - "\n", - "dj.config[\"database.host\"] = \"127.0.0.1\"\n", - "schema = dj.Schema(\"university\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Table Definition" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "@schema\n", - "class Student(dj.Manual):\n", - " definition = \"\"\"\n", - " student_id : int unsigned # university-wide ID number\n", - " ---\n", - " first_name : varchar(40)\n", - " last_name : varchar(40)\n", - " sex : enum('F', 'M', 'U')\n", - " date_of_birth : date\n", - " home_address : varchar(120) # mailing street address\n", - " home_city : varchar(60) # mailing address\n", - " home_state : char(2) # US state acronym: e.g. OH\n", - " home_zip : char(10) # zipcode e.g. 93979-4979\n", - " home_phone : varchar(20) # e.g. 414.657.6883x0881\n", - " \"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "@schema\n", - "class Department(dj.Manual):\n", - " definition = \"\"\"\n", - " dept : varchar(6) # abbreviated department name, e.g. BIOL\n", - " ---\n", - " dept_name : varchar(200) # full department name\n", - " dept_address : varchar(200) # mailing address\n", - " dept_phone : varchar(20)\n", - " \"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "@schema\n", - "class StudentMajor(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Student\n", - " ---\n", - " -> Department\n", - " declare_date : date # when student declared her major\n", - " \"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "@schema\n", - "class Course(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Department\n", - " course : int unsigned # course number, e.g. 1010\n", - " ---\n", - " course_name : varchar(200) # e.g. 
\"Neurobiology of Sensation and Movement.\"\n", - " credits : decimal(3,1) # number of credits earned by completing the course\n", - " \"\"\"\n", - "\n", - "\n", - "@schema\n", - "class Term(dj.Manual):\n", - " definition = \"\"\"\n", - " term_year : year\n", - " term : enum('Spring', 'Summer', 'Fall')\n", - " \"\"\"\n", - "\n", - "\n", - "@schema\n", - "class Section(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Course\n", - " -> Term\n", - " section : char(1)\n", - " ---\n", - " auditorium : varchar(12)\n", - " \"\"\"\n", - "\n", - "\n", - "@schema\n", - "class CurrentTerm(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Term\n", - " \"\"\"\n", - "\n", - "\n", - "@schema\n", - "class Enroll(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Student\n", - " -> Section\n", - " \"\"\"\n", - "\n", - "\n", - "@schema\n", - "class LetterGrade(dj.Lookup):\n", - " definition = \"\"\"\n", - " grade : char(2)\n", - " ---\n", - " points : decimal(3,2)\n", - " \"\"\"\n", - " contents = [\n", - " [\"A\", 4.00],\n", - " [\"A-\", 3.67],\n", - " [\"B+\", 3.33],\n", - " [\"B\", 3.00],\n", - " [\"B-\", 2.67],\n", - " [\"C+\", 2.33],\n", - " [\"C\", 2.00],\n", - " [\"C-\", 1.67],\n", - " [\"D+\", 1.33],\n", - " [\"D\", 1.00],\n", - " [\"F\", 0.00],\n", - " ]\n", - "\n", - "\n", - "@schema\n", - "class Grade(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Enroll\n", - " ---\n", - " -> LetterGrade\n", - " \"\"\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Insert" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "from tqdm import tqdm\n", - "import faker\n", - "import random\n", - "import datetime\n", - "\n", - "fake = faker.Faker()" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "def yield_students():\n", - " fake_name = {\"F\": fake.name_female, \"M\": fake.name_male}\n", - " while True: # ignore invalid values\n", - " try:\n", - " sex = random.choice((\"F\", \"M\"))\n", - " first_name, last_name = fake_name[sex]().split(\" \")[:2]\n", - " street_address, city = fake.address().split(\"\\n\")\n", - " city, state = city.split(\", \")\n", - " state, zipcode = state.split(\" \")\n", - " except ValueError:\n", - " continue\n", - " else:\n", - " yield dict(\n", - " first_name=first_name,\n", - " last_name=last_name,\n", - " sex=sex,\n", - " home_address=street_address,\n", - " home_city=city,\n", - " home_state=state,\n", - " home_zip=zipcode,\n", - " date_of_birth=str(fake.date_time_between(start_date=\"-35y\", end_date=\"-15y\").date()),\n", - " home_phone=fake.phone_number()[:20],\n", - " )" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "Student.insert(dict(k, student_id=i) for i, k in zip(range(100, 300), yield_students()))\n", - "\n", - "Department.insert(\n", - " dict(\n", - " dept=dept,\n", - " dept_name=name,\n", - " dept_address=fake.address(),\n", - " dept_phone=fake.phone_number()[:20],\n", - " )\n", - " for dept, name in [\n", - " [\"CS\", \"Computer Science\"],\n", - " [\"BIOL\", \"Life Sciences\"],\n", - " [\"PHYS\", \"Physics\"],\n", - " [\"MATH\", \"Mathematics\"],\n", - " ]\n", - ")\n", - "\n", - "StudentMajor.insert(\n", - " {**s, **d, \"declare_date\": fake.date_between(start_date=datetime.date(1999, 1, 1))}\n", - " for s, d in zip(Student.fetch(\"KEY\"), random.choices(Department.fetch(\"KEY\"), k=len(Student())))\n", - " if random.random() < 0.75\n", - ")\n", - 
"\n", - "# from https://www.utah.edu/\n", - "Course.insert(\n", - " [\n", - " [\"BIOL\", 1006, \"World of Dinosaurs\", 3],\n", - " [\"BIOL\", 1010, \"Biology in the 21st Century\", 3],\n", - " [\"BIOL\", 1030, \"Human Biology\", 3],\n", - " [\"BIOL\", 1210, \"Principles of Biology\", 4],\n", - " [\"BIOL\", 2010, \"Evolution & Diversity of Life\", 3],\n", - " [\"BIOL\", 2020, \"Principles of Cell Biology\", 3],\n", - " [\"BIOL\", 2021, \"Principles of Cell Science\", 4],\n", - " [\"BIOL\", 2030, \"Principles of Genetics\", 3],\n", - " [\"BIOL\", 2210, \"Human Genetics\", 3],\n", - " [\"BIOL\", 2325, \"Human Anatomy\", 4],\n", - " [\"BIOL\", 2330, \"Plants & Society\", 3],\n", - " [\"BIOL\", 2355, \"Field Botany\", 2],\n", - " [\"BIOL\", 2420, \"Human Physiology\", 4],\n", - " [\"PHYS\", 2040, \"Classcal Theoretical Physics II\", 4],\n", - " [\"PHYS\", 2060, \"Quantum Mechanics\", 3],\n", - " [\"PHYS\", 2100, \"General Relativity and Cosmology\", 3],\n", - " [\"PHYS\", 2140, \"Statistical Mechanics\", 4],\n", - " [\"PHYS\", 2210, \"Physics for Scientists and Engineers I\", 4],\n", - " [\"PHYS\", 2220, \"Physics for Scientists and Engineers II\", 4],\n", - " [\"PHYS\", 3210, \"Physics for Scientists I (Honors)\", 4],\n", - " [\"PHYS\", 3220, \"Physics for Scientists II (Honors)\", 4],\n", - " [\"MATH\", 1250, \"Calculus for AP Students I\", 4],\n", - " [\"MATH\", 1260, \"Calculus for AP Students II\", 4],\n", - " [\"MATH\", 1210, \"Calculus I\", 4],\n", - " [\"MATH\", 1220, \"Calculus II\", 4],\n", - " [\"MATH\", 2210, \"Calculus III\", 3],\n", - " [\"MATH\", 2270, \"Linear Algebra\", 4],\n", - " [\"MATH\", 2280, \"Introduction to Differential Equations\", 4],\n", - " [\"MATH\", 3210, \"Foundations of Analysis I\", 4],\n", - " [\"MATH\", 3220, \"Foundations of Analysis II\", 4],\n", - " [\"CS\", 1030, \"Foundations of Computer Science\", 3],\n", - " [\"CS\", 1410, \"Introduction to Object-Oriented Programming\", 4],\n", - " [\"CS\", 2420, \"Introduction to Algorithms & Data Structures\", 4],\n", - " [\"CS\", 2100, \"Discrete Structures\", 3],\n", - " [\"CS\", 3500, \"Software Practice\", 4],\n", - " [\"CS\", 3505, \"Software Practice II\", 3],\n", - " [\"CS\", 3810, \"Computer Organization\", 4],\n", - " [\"CS\", 4400, \"Computer Systems\", 4],\n", - " [\"CS\", 4150, \"Algorithms\", 3],\n", - " [\"CS\", 3100, \"Models of Computation\", 3],\n", - " [\"CS\", 3200, \"Introduction to Scientific Computing\", 3],\n", - " [\"CS\", 4000, \"Senior Capstone Project - Design Phase\", 3],\n", - " [\"CS\", 4500, \"Senior Capstone Project\", 3],\n", - " [\"CS\", 4940, \"Undergraduate Research\", 3],\n", - " [\"CS\", 4970, \"Computer Science Bachelors Thesis\", 3],\n", - " ]\n", - ")\n", - "\n", - "Term.insert(dict(term_year=year, term=term) for year in range(1999, 2019) for term in [\"Spring\", \"Summer\", \"Fall\"])\n", - "\n", - "Term().fetch(order_by=(\"term_year DESC\", \"term DESC\"), as_dict=True, limit=1)[0]\n", - "\n", - "CurrentTerm().insert1({**Term().fetch(order_by=(\"term_year DESC\", \"term DESC\"), as_dict=True, limit=1)[0]})\n", - "\n", - "\n", - "def make_section(prob):\n", - " for c in (Course * Term).proj():\n", - " for sec in \"abcd\":\n", - " if random.random() < prob:\n", - " break\n", - " yield {\n", - " **c,\n", - " \"section\": sec,\n", - " \"auditorium\": random.choice(\"ABCDEF\") + str(random.randint(1, 100)),\n", - " }\n", - "\n", - "\n", - "Section.insert(make_section(0.5))" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - 
"name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 200/200 [00:27<00:00, 7.17it/s]\n" - ] - } - ], - "source": [ - "# Enrollment\n", - "terms = Term().fetch(\"KEY\")\n", - "quit_prob = 0.1\n", - "for student in tqdm(Student.fetch(\"KEY\")):\n", - " start_term = random.randrange(len(terms))\n", - " for term in terms[start_term:]:\n", - " if random.random() < quit_prob:\n", - " break\n", - " else:\n", - " sections = ((Section & term) - (Course & (Enroll & student))).fetch(\"KEY\")\n", - " if sections:\n", - " Enroll.insert(\n", - " {**student, **section} for section in random.sample(sections, random.randrange(min(5, len(sections))))\n", - " )\n", - "\n", - "# assign random grades\n", - "grades = LetterGrade.fetch(\"grade\")\n", - "\n", - "grade_keys = Enroll.fetch(\"KEY\")\n", - "random.shuffle(grade_keys)\n", - "grade_keys = grade_keys[: len(grade_keys) * 9 // 10]\n", - "\n", - "Grade.insert({**key, \"grade\": grade} for key, grade in zip(grade_keys, random.choices(grades, k=len(grade_keys))))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# dj.Top Restriction" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "*student_id *dept *course *term_year *term *section *grade points \n", - "+------------+ +------+ +--------+ +-----------+ +--------+ +---------+ +-------+ +--------+\n", - "100 MATH 2280 2018 Fall a A- 3.67 \n", - "191 MATH 2210 2018 Spring b A 4.00 \n", - "211 CS 2100 2018 Fall a A 4.00 \n", - "273 PHYS 2100 2018 Spring a A 4.00 \n", - "282 BIOL 2021 2018 Spring d A 4.00 \n", - " (Total: 5)" - ] - }, - "execution_count": 29, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "(Grade * LetterGrade) & \"term_year='2018'\" & dj.Top(limit=5, order_by=\"points DESC\", offset=5)" - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "\"SELECT `grade`,`student_id`,`dept`,`course`,`term_year`,`term`,`section`,`points` FROM `university`.`#letter_grade` NATURAL JOIN `university`.`grade` WHERE ( (term_year='2018')) ORDER BY `points` DESC LIMIT 10\"" - ] - }, - "execution_count": 35, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "((LetterGrade * Grade) & \"term_year='2018'\" & dj.Top(limit=10, order_by=\"points DESC\", offset=0)).make_sql()" - ] - }, - { - "cell_type": "code", - "execution_count": 44, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "\"SELECT `student_id`,`dept`,`course`,`term_year`,`term`,`section`,`grade`,`points` FROM `university`.`grade` NATURAL JOIN `university`.`#letter_grade` WHERE ( (term_year='2018')) ORDER BY `points` DESC LIMIT 20\"" - ] - }, - "execution_count": 44, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "((Grade * LetterGrade) & \"term_year='2018'\" & dj.Top(limit=20, order_by=\"points DESC\", offset=0)).make_sql()" - ] - }, - { - "cell_type": "code", - "execution_count": 47, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "*student_id *dept *course *term_year *term *section *grade points \n", - "+------------+ +------+ +--------+ +-----------+ +--------+ +---------+ +-------+ +--------+\n", - "100 CS 3200 2018 Fall c A 4.00 \n", - "100 MATH 2280 2018 Fall a A- 3.67 \n", - "100 PHYS 2210 2018 Spring d A 4.00 \n", - "122 CS 1030 2018 Fall c B+ 3.33 \n", - "131 BIOL 2030 2018 Spring a A 4.00 \n", - "131 CS 3200 2018 Fall b B+ 3.33 \n", - "136 BIOL 2210 2018 Spring c B+ 3.33 \n", - "136 MATH 2210 2018 Fall b B+ 3.33 \n", - "141 BIOL 2010 2018 Summer c B+ 3.33 \n", - "141 CS 2420 2018 Fall b A 4.00 \n", - "141 CS 3200 2018 Fall b A- 3.67 \n", - "182 CS 1410 2018 Summer c A- 3.67 \n", - " ...\n", - " (Total: 20)" - ] - }, - "execution_count": 47, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "(Grade * LetterGrade) & \"term_year='2018'\" & dj.Top(limit=20, order_by=\"points DESC\", offset=0)" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "*grade *student_id *dept *course *term_year *term *section points \n", - "+-------+ +------------+ +------+ +--------+ +-----------+ +--------+ +---------+ +--------+\n", - "A 100 CS 3200 2018 Fall c 4.00 \n", - "A 100 PHYS 2210 2018 Spring d 4.00 \n", - "A 131 BIOL 2030 2018 Spring a 4.00 \n", - "A 141 CS 2420 2018 Fall b 4.00 \n", - "A 186 PHYS 2210 2018 Spring a 4.00 \n", - "A 191 MATH 2210 2018 Spring b 4.00 \n", - "A 211 CS 2100 2018 Fall a 4.00 \n", - "A 273 PHYS 2100 2018 Spring a 4.00 \n", - "A 282 BIOL 2021 2018 Spring d 4.00 \n", - "A- 100 MATH 2280 2018 Fall a 3.67 \n", - "A- 141 CS 3200 2018 Fall b 3.67 \n", - "A- 182 CS 1410 2018 Summer c 3.67 \n", - " ...\n", - " (Total: 20)" - ] - }, - "execution_count": 41, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "(LetterGrade * Grade) & \"term_year='2018'\" & dj.Top(limit=20, order_by=\"points DESC\", offset=0)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "elements", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.8" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/docs/src/archive/tutorials/json.ipynb b/docs/src/archive/tutorials/json.ipynb deleted file mode 100644 index 9c5feebf6..000000000 --- a/docs/src/archive/tutorials/json.ipynb +++ /dev/null @@ -1,1080 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "id": "7fe24127-c0d0-4ff8-96b4-6ab0d9307e73", - "metadata": {}, - "source": [ - "# Using the json type" - ] - }, - { - "cell_type": "markdown", - "id": "62450023", - "metadata": {}, - "source": [ - "> ⚠️ Note the following before using the `json` type\n", - "> - Supported only for MySQL >= 8.0 when [JSON_VALUE](https://dev.mysql.com/doc/refman/8.0/en/json-search-functions.html#function_json-value) introduced.\n", - "> - Equivalent Percona is fully-compatible.\n", - "> - MariaDB is not supported since [JSON_VALUE](https://mariadb.com/kb/en/json_value/#syntax) does not allow type specification like MySQL's.\n", - "> - Not yet supported in DataJoint MATLAB" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "67cf93d2", - "metadata": {}, - "source": [ - "First you will need to [install](../../getting-started/#installation) and [connect](../../getting-started/#connection) to a DataJoint [data pipeline](https://docs.datajoint.com/core/datajoint-python/latest/concepts/data-pipelines/#what-is-a-data-pipeline).\n", - "\n", - "Now let's start by importing the `datajoint` client." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "bc0b6f54-8f11-45f4-bf8d-e1058ee0056f", - "metadata": {}, - "outputs": [], - "source": [ - "import datajoint as dj" - ] - }, - { - "cell_type": "markdown", - "id": "3544cab9-f2db-458a-9431-939bea5affc5", - "metadata": {}, - "source": [ - "## Table Definition" - ] - }, - { - "cell_type": "markdown", - "id": "a2998c71", - "metadata": {}, - "source": [ - "For this exercise, let's imagine we work for an awesome company that is organizing a fun RC car race across various teams in the company. Let's see which team has the fastest car! 
🏎️\n", - "\n", - "This establishes 2 important entities: a `Team` and a `Car`. Normally the entities are mapped to their own dedicated table, however, let's assume that `Team` is well-structured but `Car` is less structured than we'd prefer. In other words, the structure for what makes up a *car* is varying too much between entries (perhaps because users of the pipeline haven't agreed yet on the definition? 🤷).\n", - "\n", - "This would make it a good use-case to keep `Team` as a table but make `Car` a `json` type defined within the `Team` table.\n", - "\n", - "Let's begin." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "dc318298-b819-4f06-abbd-7bb7544dd431", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[2023-02-12 00:14:33,027][INFO]: Connecting root@fakeservices.datajoint.io:3306\n", - "[2023-02-12 00:14:33,039][INFO]: Connected root@fakeservices.datajoint.io:3306\n" - ] - } - ], - "source": [ - "schema = dj.Schema(f\"{dj.config['database.user']}_json\")" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "4aaf96db-85d9-4e94-a4c3-3558f4cc6671", - "metadata": {}, - "outputs": [], - "source": [ - "@schema\n", - "class Team(dj.Lookup):\n", - " definition = \"\"\"\n", - " # A team within a company\n", - " name: varchar(40) # team name\n", - " ---\n", - " car=null: json # A car belonging to a team (null to allow registering first but specifying car later)\n", - " \n", - " unique index(car.length:decimal(4, 1)) # Add an index if this key is frequently accessed\n", - " \"\"\"" - ] - }, - { - "cell_type": "markdown", - "id": "640bf7a7-9e07-4953-9c8a-304e55c467f8", - "metadata": {}, - "source": [ - "## Insert" - ] - }, - { - "cell_type": "markdown", - "id": "7081e577", - "metadata": {}, - "source": [ - "Let's suppose that engineering is first up to register their car." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "30f0d62e", - "metadata": {}, - "outputs": [], - "source": [ - "Team.insert1(\n", - " {\n", - " \"name\": \"engineering\",\n", - " \"car\": {\n", - " \"name\": \"Rever\",\n", - " \"length\": 20.5,\n", - " \"inspected\": True,\n", - " \"tire_pressure\": [32, 31, 33, 34],\n", - " \"headlights\": [\n", - " {\n", - " \"side\": \"left\",\n", - " \"hyper_white\": None,\n", - " },\n", - " {\n", - " \"side\": \"right\",\n", - " \"hyper_white\": None,\n", - " },\n", - " ],\n", - " },\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "ee5e4dcf", - "metadata": {}, - "source": [ - "Next, business and marketing teams are up and register their cars.\n", - "\n", - "A few points to notice below:\n", - "- The person signing up on behalf of marketing does not know the specifics of the car during registration but another team member will be updating this soon before the race.\n", - "- Notice how the `business` and `engineering` teams appear to specify the same property but refer to it as `safety_inspected` and `inspected` respectfully." 
- ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "b532e16c", - "metadata": {}, - "outputs": [], - "source": [ - "Team.insert(\n", - " [\n", - " {\n", - " \"name\": \"marketing\",\n", - " \"car\": None,\n", - " },\n", - " {\n", - " \"name\": \"business\",\n", - " \"car\": {\n", - " \"name\": \"Chaching\",\n", - " \"length\": 100,\n", - " \"safety_inspected\": False,\n", - " \"tire_pressure\": [34, 30, 27, 32],\n", - " \"headlights\": [\n", - " {\n", - " \"side\": \"left\",\n", - " \"hyper_white\": True,\n", - " },\n", - " {\n", - " \"side\": \"right\",\n", - " \"hyper_white\": True,\n", - " },\n", - " ],\n", - " },\n", - " },\n", - " ]\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "57365de7", - "metadata": {}, - "source": [ - "We can preview the table data much like normal but notice how the value of `car` behaves like other BLOB-like attributes." - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "0e3b517c", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " A team within a company\n", - "
\n", - " " - ], - "text/plain": [ - "*name car \n", - "+------------+ +--------+\n", - "marketing =BLOB= \n", - "engineering =BLOB= \n", - "business =BLOB= \n", - " (Total: 3)" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "Team()" - ] - }, - { - "cell_type": "markdown", - "id": "c95cbbee-4ef7-4870-ad42-a60345a3644f", - "metadata": {}, - "source": [ - "## Restriction" - ] - }, - { - "cell_type": "markdown", - "id": "8b454996", - "metadata": {}, - "source": [ - "Now let's see what kinds of queries we can form to demostrate how we can query this pipeline." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "81efda24", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " A team within a company\n", - "
\n", - " " - ], - "text/plain": [ - "*name car \n", - "+----------+ +--------+\n", - "business =BLOB= \n", - " (Total: 1)" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Which team has a `car` equal to 100 inches long?\n", - "Team & {\"car.length\": 100}" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "fd7b855d", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " A team within a company\n", - "
\n", - " " - ], - "text/plain": [ - "*name car \n", - "+------------+ +--------+\n", - "engineering =BLOB= \n", - " (Total: 1)" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Which team has a `car` less than 50 inches long?\n", - "Team & \"car->>'$.length' < 50\"" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "b76ebb75", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " A team within a company\n", - "
\n", - " " - ], - "text/plain": [ - "*name car \n", - "+------------+ +--------+\n", - "engineering =BLOB= \n", - " (Total: 1)" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Any team that has had their car inspected?\n", - "Team & [{\"car.inspected:unsigned\": True}, {\"car.safety_inspected:unsigned\": True}]" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "b787784c", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " A team within a company\n", - "
\n", - " " - ], - "text/plain": [ - "*name car \n", - "+------------+ +--------+\n", - "engineering =BLOB= \n", - "marketing =BLOB= \n", - " (Total: 2)" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Which teams do not have hyper white lights for their first head light?\n", - "Team & {\"car.headlights[0].hyper_white\": None}" - ] - }, - { - "cell_type": "markdown", - "id": "5bcf0b5d", - "metadata": {}, - "source": [ - "Notice that the previous query will satisfy the `None` check if it experiences any of the following scenarious:\n", - "- if entire record missing (`marketing` satisfies this)\n", - "- JSON key is missing\n", - "- JSON value is set to JSON `null` (`engineering` satisfies this)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "bcf1682e-a0c7-4c2f-826b-0aec9052a694", - "metadata": {}, - "source": [ - "## Projection" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "daea110e", - "metadata": {}, - "source": [ - "Projections can be quite useful with the `json` type since we can extract out just what we need. This allows greater query flexibility but more importantly, for us to be able to fetch only what is pertinent." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "8fb8334a", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "*name car_name car_length \n", - "+------------+ +----------+ +------------+\n", - "business Chaching 100 \n", - "engineering Rever 20.5 \n", - "marketing None None \n", - " (Total: 3)" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Only interested in the car names and the length but let the type be inferred\n", - "q_untyped = Team.proj(\n", - " car_name=\"car.name\",\n", - " car_length=\"car.length\",\n", - ")\n", - "q_untyped" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "bb5f0448", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[{'name': 'business', 'car_name': 'Chaching', 'car_length': '100'},\n", - " {'name': 'engineering', 'car_name': 'Rever', 'car_length': '20.5'},\n", - " {'name': 'marketing', 'car_name': None, 'car_length': None}]" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "q_untyped.fetch(as_dict=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "a307dfd7", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " " - ], - "text/plain": [ - "*name car_name car_length \n", - "+------------+ +----------+ +------------+\n", - "business Chaching 100.0 \n", - "engineering Rever 20.5 \n", - "marketing None None \n", - " (Total: 3)" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Nevermind, I'll specify the type explicitly\n", - "q_typed = Team.proj(\n", - " car_name=\"car.name\",\n", - " car_length=\"car.length:float\",\n", - ")\n", - "q_typed" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "8a93dbf9", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[{'name': 'business', 'car_name': 'Chaching', 'car_length': 100.0},\n", - " {'name': 'engineering', 'car_name': 'Rever', 'car_length': 20.5},\n", - " {'name': 'marketing', 'car_name': None, 'car_length': None}]" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "q_typed.fetch(as_dict=True)" - ] - }, - { - "cell_type": "markdown", - "id": "62dd0239-fa70-4369-81eb-3d46c5053fee", - "metadata": {}, - "source": [ - "## Describe" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "73d9df01", - "metadata": {}, - "source": [ - "Lastly, the `.describe()` function on the `Team` table can help us generate the table's definition. This is useful if we are connected directly to the pipeline without the original source." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "0e739932", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "# A team within a company\n", - "name : varchar(40) # team name\n", - "---\n", - "car=null : json # A car belonging to a team (null to allow registering first but specifying car later)\n", - "UNIQUE INDEX ((json_value(`car`, _utf8mb4'$.length' returning decimal(4, 1))))\n", - "\n" - ] - } - ], - "source": [ - "rebuilt_definition = Team.describe()\n", - "print(rebuilt_definition)" - ] - }, - { - "cell_type": "markdown", - "id": "be1070d5-765b-4bc2-92de-8a6ffd885984", - "metadata": {}, - "source": [ - "## Cleanup" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "cb959927", - "metadata": {}, - "source": [ - "Finally, let's clean up what we created in this tutorial." 
- ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "d9cc28a3-3ffd-4126-b7e9-bc6365040b93", - "metadata": {}, - "outputs": [], - "source": [ - "schema.drop()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "68ad4340", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "all_purposes", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.18" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/src/datajoint/codecs.py b/src/datajoint/codecs.py index cf2e2105f..4ddb33d9c 100644 --- a/src/datajoint/codecs.py +++ b/src/datajoint/codecs.py @@ -112,7 +112,7 @@ def __init_subclass__(cls, *, register: bool = True, **kwargs): existing = _codec_registry[cls.name] if type(existing) is not cls: raise DataJointError( - f"Codec <{cls.name}> already registered by " f"{type(existing).__module__}.{type(existing).__name__}" + f"Codec <{cls.name}> already registered by {type(existing).__module__}.{type(existing).__name__}" ) return # Same class, idempotent @@ -301,7 +301,7 @@ def get_codec(name: str) -> Codec: return _codec_registry[type_name] raise DataJointError( - f"Unknown codec: <{type_name}>. " f"Ensure the codec is defined (inherit from dj.Codec with name='{type_name}')." + f"Unknown codec: <{type_name}>. Ensure the codec is defined (inherit from dj.Codec with name='{type_name}')." ) @@ -499,7 +499,7 @@ def lookup_codec(codec_spec: str) -> tuple[Codec, str | None]: if is_codec_registered(type_name): return get_codec(type_name), store_name - raise DataJointError(f"Codec <{type_name}> is not registered. " "Define a Codec subclass with name='{type_name}'.") + raise DataJointError(f"Codec <{type_name}> is not registered. Define a Codec subclass with name='{{type_name}}'.") # ============================================================================= diff --git a/src/datajoint/content_registry.py b/src/datajoint/content_registry.py index f5da65ff5..70b38324a 100644 --- a/src/datajoint/content_registry.py +++ b/src/datajoint/content_registry.py @@ -151,7 +151,7 @@ def get_content(content_hash: str, store_name: str | None = None) -> bytes: # Verify hash (optional but recommended for integrity) actual_hash = compute_content_hash(data) if actual_hash != content_hash: - raise DataJointError(f"Content hash mismatch: expected {content_hash[:16]}..., " f"got {actual_hash[:16]}...") + raise DataJointError(f"Content hash mismatch: expected {content_hash[:16]}..., got {actual_hash[:16]}...") return data diff --git a/src/datajoint/heading.py b/src/datajoint/heading.py index 96e01f985..96383170b 100644 --- a/src/datajoint/heading.py +++ b/src/datajoint/heading.py @@ -41,17 +41,17 @@ def name(self) -> str: def get_dtype(self, is_external: bool) -> str: raise DataJointError( - f"Codec <{self._codec_name}> is not registered. " f"Define a Codec subclass with name='{self._codec_name}'." + f"Codec <{self._codec_name}> is not registered. Define a Codec subclass with name='{self._codec_name}'." ) def encode(self, value, *, key=None, store_name=None): raise DataJointError( - f"Codec <{self._codec_name}> is not registered. " f"Define a Codec subclass with name='{self._codec_name}'." + f"Codec <{self._codec_name}> is not registered. 
Define a Codec subclass with name='{self._codec_name}'." ) def decode(self, stored, *, key=None): raise DataJointError( - f"Codec <{self._codec_name}> is not registered. " f"Define a Codec subclass with name='{self._codec_name}'." + f"Codec <{self._codec_name}> is not registered. Define a Codec subclass with name='{self._codec_name}'." ) diff --git a/src/datajoint/jobs.py b/src/datajoint/jobs.py index 7be80a0e5..70c24f354 100644 --- a/src/datajoint/jobs.py +++ b/src/datajoint/jobs.py @@ -145,7 +145,7 @@ def _generate_definition(self) -> str: if not pk_attrs: raise DataJointError( - f"Cannot create jobs table for {self._target.full_table_name}: " "no FK-derived primary key attributes found." + f"Cannot create jobs table for {self._target.full_table_name}: no FK-derived primary key attributes found." ) pk_lines = "\n ".join(f"{name} : {dtype}" for name, dtype in pk_attrs) diff --git a/src/datajoint/objectref.py b/src/datajoint/objectref.py index 9a049b2cf..5d84fb96c 100644 --- a/src/datajoint/objectref.py +++ b/src/datajoint/objectref.py @@ -128,32 +128,6 @@ def to_json(self) -> dict: data["item_count"] = self.item_count return data - def to_dict(self) -> dict: - """ - Return the raw JSON metadata as a dictionary. - - This is useful for inspecting the stored metadata without triggering - any storage backend operations. The returned dict matches the JSON - structure stored in the database. - - Returns - ------- - dict - Dict containing the object metadata: - - - path: Relative storage path within the store - - url: Full URI (e.g., 's3://bucket/path') (optional) - - store: Store name (optional, None for default store) - - size: File/folder size in bytes (or None) - - hash: Content hash (or None) - - ext: File extension (or None) - - is_dir: True if folder - - timestamp: Upload timestamp - - mime_type: MIME type (files only, optional) - - item_count: Number of files (folders only, optional) - """ - return self.to_json() - def _ensure_backend(self): """Ensure storage backend is available for I/O operations.""" if self._backend is None: diff --git a/src/datajoint/settings.py b/src/datajoint/settings.py index 1c43b1ed2..5812f2257 100644 --- a/src/datajoint/settings.py +++ b/src/datajoint/settings.py @@ -389,7 +389,7 @@ def get_store_spec(self, store: str) -> dict[str, Any]: if protocol not in supported_protocols: raise DataJointError( f'Missing or invalid protocol in config.stores["{store}"]. ' - f'Supported protocols: {", ".join(supported_protocols)}' + f"Supported protocols: {', '.join(supported_protocols)}" ) # Define required and allowed keys by protocol @@ -479,7 +479,7 @@ def get_object_storage_spec(self) -> dict[str, Any]: supported_protocols = ("file", "s3", "gcs", "azure") if protocol not in supported_protocols: raise DataJointError( - f"Invalid object_storage.protocol: {protocol}. " f'Supported protocols: {", ".join(supported_protocols)}' + f"Invalid object_storage.protocol: {protocol}. Supported protocols: {', '.join(supported_protocols)}" ) # Build spec dict @@ -555,8 +555,7 @@ def get_object_store_spec(self, store_name: str | None = None) -> dict[str, Any] supported_protocols = ("file", "s3", "gcs", "azure") if protocol not in supported_protocols: raise DataJointError( - f"Invalid protocol for store '{store_name}': {protocol}. " - f'Supported protocols: {", ".join(supported_protocols)}' + f"Invalid protocol for store '{store_name}': {protocol}. 
Supported protocols: {', '.join(supported_protocols)}" ) # Use project_name from default config if not specified in store diff --git a/src/datajoint/storage.py b/src/datajoint/storage.py index 6dacbd7ec..846228137 100644 --- a/src/datajoint/storage.py +++ b/src/datajoint/storage.py @@ -24,13 +24,13 @@ # Characters safe for use in filenames and URLs TOKEN_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_" -# Supported remote URL protocols for copy insert -REMOTE_PROTOCOLS = ("s3://", "gs://", "gcs://", "az://", "abfs://", "http://", "https://") +# Supported URL protocols +URL_PROTOCOLS = ("file://", "s3://", "gs://", "gcs://", "az://", "abfs://", "http://", "https://") -def is_remote_url(path: str) -> bool: +def is_url(path: str) -> bool: """ - Check if a path is a remote URL. + Check if a path is a URL. Parameters ---------- @@ -40,21 +40,57 @@ def is_remote_url(path: str) -> bool: Returns ------- bool - True if path starts with a supported remote protocol. + True if path starts with a supported URL protocol. """ - if not isinstance(path, str): - return False - return path.lower().startswith(REMOTE_PROTOCOLS) + return path.lower().startswith(URL_PROTOCOLS) -def parse_remote_url(url: str) -> tuple[str, str]: +def normalize_to_url(path: str) -> str: """ - Parse a remote URL into protocol and path. + Normalize a path to URL form. + + Converts local filesystem paths to file:// URLs. URLs are returned unchanged. + + Parameters + ---------- + path : str + Path string (local path or URL). + + Returns + ------- + str + URL form of the path. + + Examples + -------- + >>> normalize_to_url("/data/file.dat") + 'file:///data/file.dat' + >>> normalize_to_url("s3://bucket/key") + 's3://bucket/key' + >>> normalize_to_url("file:///already/url") + 'file:///already/url' + """ + if is_url(path): + return path + # Convert local path to file:// URL + # Ensure absolute path and proper format + abs_path = str(Path(path).resolve()) + # Handle Windows paths (C:\...) vs Unix paths (/...) + if abs_path.startswith("/"): + return f"file://{abs_path}" + else: + # Windows: file:///C:/path + return f"file:///{abs_path.replace(chr(92), '/')}" + + +def parse_url(url: str) -> tuple[str, str]: + """ + Parse a URL into protocol and path. Parameters ---------- url : str - Remote URL (e.g., ``'s3://bucket/path/file.dat'``). + URL (e.g., ``'s3://bucket/path/file.dat'`` or ``'file:///path/to/file'``). Returns ------- @@ -65,11 +101,19 @@ def parse_remote_url(url: str) -> tuple[str, str]: ------ DataJointError If URL protocol is not supported. + + Examples + -------- + >>> parse_url("s3://bucket/key/file.dat") + ('s3', 'bucket/key/file.dat') + >>> parse_url("file:///data/file.dat") + ('file', '/data/file.dat') """ url_lower = url.lower() # Map URL schemes to fsspec protocols protocol_map = { + "file://": "file", "s3://": "s3", "gs://": "gcs", "gcs://": "gcs", @@ -84,7 +128,7 @@ def parse_remote_url(url: str) -> tuple[str, str]: path = url[len(prefix) :] return protocol, path - raise errors.DataJointError(f"Unsupported remote URL protocol: {url}") + raise errors.DataJointError(f"Unsupported URL protocol: {url}") def generate_token(length: int = 8) -> str: @@ -358,6 +402,53 @@ def _full_path(self, path: str | PurePosixPath) -> str: return str(Path(location) / path) return path + def get_url(self, path: str | PurePosixPath) -> str: + """ + Get the full URL for a path in storage. + + Returns a consistent URL representation for any storage backend, + including file:// URLs for local filesystem. 
+ + Parameters + ---------- + path : str or PurePosixPath + Relative path within the storage location. + + Returns + ------- + str + Full URL (e.g., 's3://bucket/path' or 'file:///data/path'). + + Examples + -------- + >>> backend = StorageBackend({"protocol": "file", "location": "/data"}) + >>> backend.get_url("schema/table/file.dat") + 'file:///data/schema/table/file.dat' + + >>> backend = StorageBackend({"protocol": "s3", "bucket": "mybucket", ...}) + >>> backend.get_url("schema/table/file.dat") + 's3://mybucket/schema/table/file.dat' + """ + full_path = self._full_path(path) + + if self.protocol == "file": + # Ensure absolute path for file:// URL + abs_path = str(Path(full_path).resolve()) + if abs_path.startswith("/"): + return f"file://{abs_path}" + else: + # Windows path + return f"file:///{abs_path.replace(chr(92), '/')}" + elif self.protocol == "s3": + return f"s3://{full_path}" + elif self.protocol == "gcs": + return f"gs://{full_path}" + elif self.protocol == "azure": + return f"az://{full_path}" + else: + # Fallback: use protocol prefix + return f"{self.protocol}://{full_path}" + def put_file(self, local_path: str | Path, remote_path: str | PurePosixPath, metadata: dict | None = None) -> None: """ Upload a file from local filesystem to storage. @@ -674,7 +765,7 @@ def copy_from_url(self, source_url: str, dest_path: str | PurePosixPath) -> int: int Size of copied file in bytes. """ - protocol, source_path = parse_remote_url(source_url) + protocol, source_path = parse_url(source_url) full_dest = self._full_path(dest_path) logger.debug(f"copy_from_url: {protocol}://{source_path} -> {self.protocol}:{full_dest}") @@ -774,8 +865,8 @@ def source_is_directory(self, source: str) -> bool: bool True if source is a directory. """ - if is_remote_url(source): - protocol, path = parse_remote_url(source) + if is_url(source): + protocol, path = parse_url(source) source_fs = fsspec.filesystem(protocol) return source_fs.isdir(path) else: @@ -795,8 +886,8 @@ def source_exists(self, source: str) -> bool: bool True if source exists. """ - if is_remote_url(source): - protocol, path = parse_remote_url(source) + if is_url(source): + protocol, path = parse_url(source) source_fs = fsspec.filesystem(protocol) return source_fs.exists(path) else: @@ -817,8 +908,8 @@ def get_source_size(self, source: str) -> int | None: Size in bytes, or None if directory or cannot determine. """ try: - if is_remote_url(source): - protocol, path = parse_remote_url(source) + if is_url(source): + protocol, path = parse_url(source) source_fs = fsspec.filesystem(protocol) if source_fs.isdir(path): return None diff --git a/src/datajoint/table.py b/src/datajoint/table.py index 77611cb59..0040943c5 100644 --- a/src/datajoint/table.py +++ b/src/datajoint/table.py @@ -963,8 +963,7 @@ def cascade(table): transaction = False else: raise DataJointError( - "Delete cannot use a transaction within an ongoing transaction. " - "Set transaction=False or prompt=False." + "Delete cannot use a transaction within an ongoing transaction. Set transaction=False or prompt=False." ) # Cascading delete diff --git a/src/datajoint/user_tables.py b/src/datajoint/user_tables.py index 535276bbd..942179685 100644 --- a/src/datajoint/user_tables.py +++ b/src/datajoint/user_tables.py @@ -252,9 +252,7 @@ def drop(self, part_integrity: str = "enforce"): if part_integrity == "ignore": super().drop() elif part_integrity == "enforce": - raise DataJointError( - "Cannot drop a Part directly. Drop master instead, " "or use part_integrity='ignore' to force." 
- ) + raise DataJointError("Cannot drop a Part directly. Drop master instead, or use part_integrity='ignore' to force.") else: raise ValueError(f"part_integrity for drop must be 'enforce' or 'ignore', got {part_integrity!r}") diff --git a/tests/integration/test_object.py b/tests/integration/test_object.py index 8f44068e1..d4d42a461 100644 --- a/tests/integration/test_object.py +++ b/tests/integration/test_object.py @@ -759,94 +759,3 @@ def test_staged_insert_missing_pk_raises(self, schema_obj, mock_object_storage): with table.staged_insert1 as staged: # Don't set primary key staged.store("data_file", ".dat") - - -class TestRemoteURLSupport: - """Tests for remote URL detection and parsing.""" - - def test_is_remote_url_s3(self): - """Test S3 URL detection.""" - from datajoint.storage import is_remote_url - - assert is_remote_url("s3://bucket/path/file.dat") is True - assert is_remote_url("S3://bucket/path/file.dat") is True - - def test_is_remote_url_gcs(self): - """Test GCS URL detection.""" - from datajoint.storage import is_remote_url - - assert is_remote_url("gs://bucket/path/file.dat") is True - assert is_remote_url("gcs://bucket/path/file.dat") is True - - def test_is_remote_url_azure(self): - """Test Azure URL detection.""" - from datajoint.storage import is_remote_url - - assert is_remote_url("az://container/path/file.dat") is True - assert is_remote_url("abfs://container/path/file.dat") is True - - def test_is_remote_url_http(self): - """Test HTTP/HTTPS URL detection.""" - from datajoint.storage import is_remote_url - - assert is_remote_url("http://example.com/path/file.dat") is True - assert is_remote_url("https://example.com/path/file.dat") is True - - def test_is_remote_url_local_path(self): - """Test local paths are not detected as remote.""" - from datajoint.storage import is_remote_url - - assert is_remote_url("/local/path/file.dat") is False - assert is_remote_url("relative/path/file.dat") is False - assert is_remote_url("C:\\Windows\\path\\file.dat") is False - - def test_is_remote_url_non_string(self): - """Test non-string inputs return False.""" - from datajoint.storage import is_remote_url - - assert is_remote_url(None) is False - assert is_remote_url(123) is False - assert is_remote_url(Path("/local/path")) is False - - def test_parse_remote_url_s3(self): - """Test S3 URL parsing.""" - from datajoint.storage import parse_remote_url - - protocol, path = parse_remote_url("s3://bucket/path/file.dat") - assert protocol == "s3" - assert path == "bucket/path/file.dat" - - def test_parse_remote_url_gcs(self): - """Test GCS URL parsing.""" - from datajoint.storage import parse_remote_url - - protocol, path = parse_remote_url("gs://bucket/path/file.dat") - assert protocol == "gcs" - assert path == "bucket/path/file.dat" - - protocol, path = parse_remote_url("gcs://bucket/path/file.dat") - assert protocol == "gcs" - assert path == "bucket/path/file.dat" - - def test_parse_remote_url_azure(self): - """Test Azure URL parsing.""" - from datajoint.storage import parse_remote_url - - protocol, path = parse_remote_url("az://container/path/file.dat") - assert protocol == "abfs" - assert path == "container/path/file.dat" - - def test_parse_remote_url_http(self): - """Test HTTP URL parsing.""" - from datajoint.storage import parse_remote_url - - protocol, path = parse_remote_url("https://example.com/path/file.dat") - assert protocol == "https" - assert path == "example.com/path/file.dat" - - def test_parse_remote_url_unsupported(self): - """Test unsupported protocol raises error.""" - 
from datajoint.storage import parse_remote_url - - with pytest.raises(dj.DataJointError, match="Unsupported remote URL"): - parse_remote_url("ftp://server/path/file.dat") diff --git a/tests/unit/test_storage_urls.py b/tests/unit/test_storage_urls.py new file mode 100644 index 000000000..649d695b2 --- /dev/null +++ b/tests/unit/test_storage_urls.py @@ -0,0 +1,121 @@ +"""Unit tests for storage URL functions.""" + +import pytest + +from datajoint.storage import ( + URL_PROTOCOLS, + is_url, + normalize_to_url, + parse_url, +) + + +class TestURLProtocols: + """Test URL protocol constants.""" + + def test_url_protocols_includes_file(self): + """URL_PROTOCOLS should include file://.""" + assert "file://" in URL_PROTOCOLS + + def test_url_protocols_includes_s3(self): + """URL_PROTOCOLS should include s3://.""" + assert "s3://" in URL_PROTOCOLS + + def test_url_protocols_includes_cloud_providers(self): + """URL_PROTOCOLS should include major cloud providers.""" + assert "gs://" in URL_PROTOCOLS + assert "az://" in URL_PROTOCOLS + + +class TestIsUrl: + """Test is_url function.""" + + def test_s3_url(self): + assert is_url("s3://bucket/key") + + def test_gs_url(self): + assert is_url("gs://bucket/key") + + def test_file_url(self): + assert is_url("file:///path/to/file") + + def test_http_url(self): + assert is_url("http://example.com/file") + + def test_https_url(self): + assert is_url("https://example.com/file") + + def test_local_path_not_url(self): + assert not is_url("/path/to/file") + + def test_relative_path_not_url(self): + assert not is_url("relative/path/file.dat") + + def test_case_insensitive(self): + assert is_url("S3://bucket/key") + assert is_url("FILE:///path") + + +class TestNormalizeToUrl: + """Test normalize_to_url function.""" + + def test_local_path_to_file_url(self): + url = normalize_to_url("/data/file.dat") + assert url.startswith("file://") + assert "data/file.dat" in url + + def test_s3_url_unchanged(self): + url = "s3://bucket/key/file.dat" + assert normalize_to_url(url) == url + + def test_file_url_unchanged(self): + url = "file:///data/file.dat" + assert normalize_to_url(url) == url + + def test_relative_path_becomes_absolute(self): + url = normalize_to_url("relative/path.dat") + assert url.startswith("file://") + # Should be absolute (contain full path) + assert "/" in url[7:] # After "file://" + + +class TestParseUrl: + """Test parse_url function.""" + + def test_parse_s3(self): + protocol, path = parse_url("s3://bucket/key/file.dat") + assert protocol == "s3" + assert path == "bucket/key/file.dat" + + def test_parse_gs(self): + protocol, path = parse_url("gs://bucket/key") + assert protocol == "gcs" + assert path == "bucket/key" + + def test_parse_gcs(self): + protocol, path = parse_url("gcs://bucket/key") + assert protocol == "gcs" + assert path == "bucket/key" + + def test_parse_file(self): + protocol, path = parse_url("file:///data/file.dat") + assert protocol == "file" + assert path == "/data/file.dat" + + def test_parse_http(self): + protocol, path = parse_url("http://example.com/file") + assert protocol == "http" + assert path == "example.com/file" + + def test_parse_https(self): + protocol, path = parse_url("https://example.com/file") + assert protocol == "https" + assert path == "example.com/file" + + def test_unsupported_protocol_raises(self): + with pytest.raises(Exception, match="Unsupported URL protocol"): + parse_url("ftp://example.com/file") + + def test_local_path_raises(self): + with pytest.raises(Exception, match="Unsupported URL protocol"): + 
parse_url("/local/path")
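
Below is a minimal usage sketch of the renamed URL helpers, mirroring the docstrings and unit tests in this diff. It assumes a POSIX filesystem for the `file://` normalization and is illustrative only, not an additional API.

```python
# Illustrative sketch: behavior mirrors the docstrings and unit tests above.
from datajoint.storage import is_url, normalize_to_url, parse_url

# URL detection now covers file:// in addition to the cloud/http schemes.
assert is_url("s3://bucket/key") and is_url("file:///data/file.dat")
assert not is_url("/data/file.dat")

# Local paths are normalized to file:// URLs; existing URLs pass through unchanged.
assert normalize_to_url("s3://bucket/key") == "s3://bucket/key"
assert normalize_to_url("/data/file.dat") == "file:///data/file.dat"  # assumes a POSIX path

# parse_url splits any supported URL into an fsspec protocol and a path.
assert parse_url("s3://bucket/key/file.dat") == ("s3", "bucket/key/file.dat")
assert parse_url("gs://bucket/key") == ("gcs", "bucket/key")
assert parse_url("file:///data/file.dat") == ("file", "/data/file.dat")
```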