| 1 | +# Intelligent Version Detection System |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The DataSpace platform implements an intelligent version detection system that automatically determines the appropriate version increment (major, minor, or patch) when resources are updated. This document explains the technical implementation, triggering mechanisms, and version classification logic. |
| 6 | + |
| 7 | +## Version Increment Types |
| 8 | + |
| 9 | +The system follows semantic versioning principles (X.Y.Z): |
| 10 | + |
| 11 | +1. **Major Version (X)**: Breaking changes or significant structural modifications |
| 12 | +2. **Minor Version (Y)**: Dataset publication or significant data changes that maintain compatibility |
| 13 | +3. **Patch Version (Z)**: Small fixes, corrections, or minimal data changes |
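
The increment rules above can be sketched as a small helper. This is a minimal illustration, not the project's `_increment_version`: the function name `increment_version` and its signature are assumptions, and it relies on the usual semantic versioning convention that lower-order components reset to zero.

```python
def increment_version(version: str, change_type: str) -> str:
    """Return the next X.Y.Z version for a 'major', 'minor', or 'patch' change.

    Hypothetical helper for illustration only.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change_type == "major":
        return f"{major + 1}.0.0"          # reset minor and patch
    if change_type == "minor":
        return f"{major}.{minor + 1}.0"    # reset patch
    if change_type == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change_type!r}")
```

For example, `increment_version("1.2.3", "minor")` returns `"1.3.0"`.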
| 14 | + |
| 15 | +## Triggering Mechanisms |
| 16 | + |
| 17 | +Version changes are automatically triggered by the following events: |
| 18 | + |
| 19 | +1. **Resource File Updates**: When a resource file is updated through the API |
| 20 | +2. **Dataset Publication**: When a dataset containing resources is published |
| 21 | +3. **Manual Triggers**: Through management commands (e.g., `create_major_version`) |
| 22 | + |
| 23 | +## Technical Implementation |
| 24 | + |
| 25 | +### Components |
| 26 | + |
| 27 | +1. **Signal Handlers**: Detect changes to resources and trigger version detection |
| 28 | +2. **Version Detection Utility**: Analyzes file changes to determine version increment type |
| 29 | +3. **DVC Manager**: Handles version tracking, tagging, and remote storage |
| 30 | + |
| 31 | +### Signal Flow |
| 32 | + |
| 33 | +``` |
| 34 | +ResourceFileDetails update → post_save signal → version_resource_with_dvc |
| 35 | + → detect_version_change_type → _increment_version → Create ResourceVersion |
| 36 | +``` |
| 37 | + |
| 38 | +## Change Detection Logic |
| 39 | + |
| 40 | +The system uses different strategies based on file type: |
| 41 | + |
| 42 | +### CSV/Tabular Files |
| 43 | + |
| 44 | +**Major Version** triggers: |
| 45 | +- Schema changes (columns added/removed) |
| 46 | +- Data type changes in columns |
| 47 | + |
| 48 | +**Minor Version** triggers: |
| 49 | +- Row count changes > 10% |
| 50 | +- Data changes in > 30% of cells |
| 51 | + |
| 52 | +**Patch Version** triggers: |
| 53 | +- Small data changes (< 30% of cells) |
| 54 | +- Minimal corrections |
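
The tabular rules above might be sketched as follows. To stay self-contained, this example uses the standard library `csv` module instead of pandas, so it is not how the platform performs the comparison. The function name `classify_csv_change`, the simple type inference, the order of the checks, and the handling of empty files are all assumptions; only the 10% and 30% thresholds come from the lists above.

```python
import csv
import io

ROW_CHANGE_THRESHOLD = 0.10   # row count change that triggers "minor"
CELL_CHANGE_THRESHOLD = 0.30  # share of changed cells that triggers "minor"


def _infer_type(values):
    """Return 'int', 'float', or 'str' for a column; None if it has no data."""
    kinds = set()
    for value in values:
        if value == "":
            continue
        try:
            int(value)
            kinds.add("int")
        except ValueError:
            try:
                float(value)
                kinds.add("float")
            except ValueError:
                return "str"
    if not kinds:
        return None
    return "float" if "float" in kinds else "int"


def _load(text):
    """Parse CSV text into (header, data rows, column name -> list of cells)."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0] if rows else []
    data = rows[1:]
    columns = {
        name: [row[i] if i < len(row) else "" for row in data]
        for i, name in enumerate(header)
    }
    return header, data, columns


def classify_csv_change(old_text, new_text):
    """Hypothetical classifier returning 'major', 'minor', or 'patch'."""
    old_header, old_rows, old_cols = _load(old_text)
    new_header, new_rows, new_cols = _load(new_text)

    # Major: columns added or removed.
    if set(old_header) != set(new_header):
        return "major"

    # Major: a column's inferred data type changed.
    for name in old_header:
        old_type = _infer_type(old_cols[name])
        new_type = _infer_type(new_cols[name])
        if old_type and new_type and old_type != new_type:
            return "major"

    # Minor: row count changed by more than 10%.
    row_change = abs(len(new_rows) - len(old_rows)) / max(len(old_rows), 1)
    if row_change > ROW_CHANGE_THRESHOLD:
        return "minor"

    # Minor: more than 30% of cells differ in the rows both files share.
    total = min(len(old_rows), len(new_rows)) * len(old_header)
    if total == 0:
        return "patch"
    changed = sum(
        1
        for name in old_header
        for a, b in zip(old_cols[name], new_cols[name])
        if a != b
    )
    return "minor" if changed / total > CELL_CHANGE_THRESHOLD else "patch"
```

Checking the schema first means a structural change is reported as major even when few cell values differ.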
| 55 | + |
| 56 | +### JSON Files |
| 57 | + |
| 58 | +**Major Version** triggers: |
| 59 | +- Structure changes (keys added/removed) |
| 60 | +- Data type changes |
| 61 | + |
| 62 | +**Minor Version** triggers: |
| 63 | +- Significant value changes (> 30% of values) |
| 64 | +- Array item additions/removals |
| 65 | + |
| 66 | +**Patch Version** triggers: |
| 67 | +- Small value changes |
| 68 | +- Formatting changes |
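
As a rough illustration of the JSON rules above, the sketch below walks two parsed documents side by side. It uses only the standard library `json` module rather than DeepDiff, so it is a simplified stand-in for the real comparison. The name `classify_json_change`, the counting of leaf values, and the precedence of the checks are assumptions.

```python
import json

VALUE_CHANGE_THRESHOLD = 0.30  # share of changed leaf values that triggers "minor"


def _walk(old, new, stats):
    """Recursively compare two parsed JSON values and update the counters."""
    if type(old) is not type(new):
        stats["type_changed"] = True
        return
    if isinstance(old, dict):
        if old.keys() != new.keys():
            stats["keys_changed"] = True
        for key in old.keys() & new.keys():
            _walk(old[key], new[key], stats)
    elif isinstance(old, list):
        if len(old) != len(new):
            stats["array_resized"] = True
        for a, b in zip(old, new):
            _walk(a, b, stats)
    else:
        stats["leaves"] += 1
        if old != new:
            stats["changed"] += 1


def classify_json_change(old_text, new_text):
    """Hypothetical classifier returning 'major', 'minor', or 'patch'."""
    stats = {
        "keys_changed": False,
        "type_changed": False,
        "array_resized": False,
        "leaves": 0,
        "changed": 0,
    }
    _walk(json.loads(old_text), json.loads(new_text), stats)

    # Major: keys added or removed, or a value changed its data type.
    if stats["keys_changed"] or stats["type_changed"]:
        return "major"
    # Minor: array items added or removed.
    if stats["array_resized"]:
        return "minor"
    # Minor: more than 30% of the compared leaf values changed.
    if stats["leaves"] and stats["changed"] / stats["leaves"] > VALUE_CHANGE_THRESHOLD:
        return "minor"
    # Patch: small value changes, or formatting-only differences.
    return "patch"
```

Because both inputs are parsed before comparison, whitespace and key-order differences produce no changed values and fall through to "patch".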
| 69 | + |
| 70 | +### XML Files |
| 71 | + |
| 72 | +**Major Version** triggers: |
| 73 | +- Tag structure changes |
| 74 | + |
| 75 | +**Minor Version** triggers: |
| 76 | +- Attribute changes |
| 77 | + |
| 78 | +**Patch Version** triggers: |
| 79 | +- Content changes without structural modifications |
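
One way to express the XML rules above is a recursive comparison with `xml.etree.ElementTree` from the standard library. This is a sketch under assumptions: the name `classify_xml_change` is hypothetical, "tag structure" is interpreted as tag names plus the number of child elements, and identical documents are reported as "patch".

```python
import xml.etree.ElementTree as ET

_RANK = {None: 0, "patch": 1, "minor": 2, "major": 3}


def _compare(old, new):
    """Return the most severe difference between two elements, or None."""
    # Major: a different tag name or a different number of children.
    if old.tag != new.tag or len(old) != len(new):
        return "major"

    result = None
    # Patch: text content differs while the structure is unchanged.
    if (old.text or "").strip() != (new.text or "").strip():
        result = "patch"
    # Minor: attributes differ (outranks a text-only difference).
    if old.attrib != new.attrib:
        result = "minor"

    for old_child, new_child in zip(old, new):
        child_result = _compare(old_child, new_child)
        if _RANK[child_result] > _RANK[result]:
            result = child_result
        if result == "major":
            break  # nothing can outrank a major change
    return result


def classify_xml_change(old_text, new_text):
    """Hypothetical classifier returning 'major', 'minor', or 'patch'."""
    result = _compare(ET.fromstring(old_text), ET.fromstring(new_text))
    return result or "patch"  # assumption: no detected difference counts as patch
```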
| 80 | + |
| 81 | +### Generic Files |
| 82 | + |
| 83 | +For non-structured files, the system uses file size differences: |
| 84 | + |
| 85 | +**Major Version** triggers: |
| 86 | +- Size changes > 50% |
| 87 | + |
| 88 | +**Minor Version** triggers: |
| 89 | +- Size changes between 10% and 50% |
| 90 | + |
| 91 | +**Patch Version** triggers: |
| 92 | +- Size changes < 10% |
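
The size thresholds above translate into a short function. The name `classify_size_change` is hypothetical, and two details are assumptions because the rules leave them open: changes of exactly 10% or 50% are treated as minor, and a previously empty file that gains content is treated as major.

```python
def classify_size_change(old_size: int, new_size: int) -> str:
    """Classify a change by relative file size difference (sizes in bytes)."""
    if old_size == 0:
        # Assumption: relative change is undefined for an empty original.
        return "major" if new_size > 0 else "patch"

    change = abs(new_size - old_size) / old_size
    if change > 0.50:
        return "major"
    if change >= 0.10:
        return "minor"
    return "patch"
```

For example, a file that grows from 1000 to 1200 bytes changes by 20% and is classified as minor.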
| 93 | + |
| 94 | +## Technical Details |
| 95 | + |
| 96 | +### Dependencies |
| 97 | + |
| 98 | +- **pandas**: For efficient tabular data comparison |
| 99 | +- **DeepDiff**: For accurate JSON structure comparison |
| 100 | +- **DVC**: For version tracking and storage |
| 101 | + |
| 102 | +### Performance Considerations |
| 103 | + |
| 104 | +- Large files (>100MB) use chunked processing |
| 105 | +- Tabular comparisons use sampling for very large datasets |
| 106 | +- Early-return logic avoids unnecessary processing |
| 107 | + |
| 108 | +## Example |
| 109 | + |
| 110 | +When a CSV resource is updated: |
| 111 | + |
| 112 | +1. The system detects the change through a Django signal |
| 113 | +2. `detect_version_change_type` loads both versions of the file |
| 114 | +3. The function compares schemas, data types, and values |
| 115 | +4. Based on the extent of changes, it returns "major", "minor", or "patch" |
| 116 | +5. The version is incremented accordingly (e.g., 1.2.3 → 1.3.0 for minor) |
| 117 | +6. DVC tracks the new version with appropriate tags |
| 118 | +7. A ResourceVersion record is created in the database |
| 119 | + |
| 120 | +## Management Commands |
| 121 | + |
| 122 | +The system includes management commands for manual version control: |
| 123 | + |
| 124 | +- `create_major_version`: Force a major version increment for a resource |
| 125 | +- `setup_dvc`: Configure DVC repository and remotes |
| 126 | + |
| 127 | +## Error Handling |
| 128 | + |
| 129 | +The version detection system includes robust error handling to ensure that: |
| 130 | + |
| 131 | +1. Failed comparisons default to "minor" version changes |
| 132 | +2. Temporary files are properly cleaned up |
| 133 | +3. Errors are logged with detailed context |
| 134 | +4. The system continues functioning even if analysis fails |
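
The guarantees above could be met with a wrapper along these lines. It is a sketch, not the project's code: `safe_detect` and its `compare` callback are hypothetical names, and the use of temporary files written from in-memory bytes is an assumption about how the two file versions are made available.

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def safe_detect(old_bytes, new_bytes, compare):
    """Run compare(old_path, new_path); fall back to 'minor' if it raises."""
    paths = []
    try:
        # Write both versions to temporary files for the comparison step.
        for data in (old_bytes, new_bytes):
            handle = tempfile.NamedTemporaryFile(delete=False)
            paths.append(handle.name)
            with handle:
                handle.write(data)
        return compare(paths[0], paths[1])
    except Exception:
        # Log the traceback and default to "minor" so versioning continues.
        logger.exception("Version detection failed; defaulting to 'minor'")
        return "minor"
    finally:
        # Clean up temporary files whether or not the comparison succeeded.
        for path in paths:
            try:
                os.remove(path)
            except OSError:
                logger.warning("Could not remove temporary file %s", path)
```

Putting the cleanup in `finally` keeps temporary files from accumulating when a comparison fails partway through.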
| 135 | + |
| 136 | +## Future Improvements |
| 137 | + |
| 138 | +- Support for more file formats |
| 139 | +- Configurable thresholds for different file types |
| 140 | +- Performance optimizations for very large datasets |