Commit 03a6124

add md for documenting version changes
1 parent 437ef25 commit 03a6124

docs/version_detection_system.md

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# Intelligent Version Detection System

## Overview

The DataSpace platform implements an intelligent version detection system that automatically determines the appropriate version increment (major, minor, or patch) when resources are updated. This document explains the technical implementation, triggering mechanisms, and version classification logic.

## Version Increment Types

The system follows semantic versioning principles (X.Y.Z):

1. **Major Version (X)**: Breaking changes or significant structural modifications
2. **Minor Version (Y)**: Dataset publication, or significant data changes that maintain compatibility
3. **Patch Version (Z)**: Small fixes, corrections, or minimal data changes
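
The increment rules can be illustrated with a minimal sketch. The function name `bump_version` is hypothetical and is not taken from the DataSpace codebase; it only assumes the usual semantic versioning convention that lower-order components reset to zero when a higher-order one is incremented.

```python
def bump_version(version: str, change_type: str) -> str:
    """Return the next X.Y.Z version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change_type == "major":
        # Breaking change: reset minor and patch.
        return f"{major + 1}.0.0"
    if change_type == "minor":
        # Compatible change: reset patch only.
        return f"{major}.{minor + 1}.0"
    if change_type == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change_type!r}")
```

For example, `bump_version("1.2.3", "major")` yields `"2.0.0"`, while a patch change on the same input yields `"1.2.4"`.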

## Triggering Mechanisms

Version changes are automatically triggered by the following events:

1. **Resource File Updates**: When a resource file is updated through the API
2. **Dataset Publication**: When a dataset containing resources is published
3. **Manual Triggers**: Through management commands (e.g., `create_major_version`)

## Technical Implementation

### Components

1. **Signal Handlers**: Detect changes to resources and trigger version detection
2. **Version Detection Utility**: Analyzes file changes to determine version increment type
3. **DVC Manager**: Handles version tracking, tagging, and remote storage

### Signal Flow

```
ResourceFileDetails update → post_save signal → version_resource_with_dvc
→ detect_version_change_type → _increment_version → Create ResourceVersion
```
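
The flow above can be read as a chain of three calls made by the save handler. The sketch below is a framework-free stand-in rather than the project's Django code: `version_resource` and its `detect`, `increment`, and `record` parameters are hypothetical names chosen for illustration, and the real handler is attached to a `post_save` signal instead of being called directly.

```python
from typing import Callable


def version_resource(
    old_path: str,
    new_path: str,
    current_version: str,
    detect: Callable[[str, str], str],
    increment: Callable[[str, str], str],
    record: Callable[[str], None],
) -> str:
    """Run the three steps a save handler would chain together."""
    # 1. Classify the change as "major", "minor", or "patch".
    change_type = detect(old_path, new_path)
    # 2. Compute the next version string from the current one.
    new_version = increment(current_version, change_type)
    # 3. Persist the new version (a ResourceVersion row in the real system).
    record(new_version)
    return new_version
```

Passing the steps in as callables keeps the ordering visible and lets each step be replaced by a stub in tests.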

## Change Detection Logic

The system uses different strategies based on file type:

### CSV/Tabular Files

**Major Version** triggers:
- Schema changes (columns added/removed)
- Data type changes in columns

**Minor Version** triggers:
- Row count changes by more than 10%
- Data changes in more than 30% of cells

**Patch Version** triggers:
- Small data changes (fewer than 30% of cells)
- Minimal corrections
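
The CSV rules can be sketched as an ordered series of checks. This is an illustration under stated assumptions, not the project's implementation: it uses the standard-library `csv` module instead of pandas to stay dependency-free, infers column types by attempting numeric casts, and compares cells position by position. The names `classify_csv_change`, `_load`, and `_infer_type` are hypothetical.

```python
import csv
import io


def _load(text):
    """Parse CSV text into (column names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def _infer_type(values):
    """Return 'int', 'float', or 'str' for a column; None if it has no values."""
    if not values:
        return None
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except (ValueError, TypeError):
            continue
    return "str"


def classify_csv_change(old_text: str, new_text: str) -> str:
    old_cols, old_rows = _load(old_text)
    new_cols, new_rows = _load(new_text)

    # Major: columns added or removed (column order is ignored here).
    if set(old_cols) != set(new_cols):
        return "major"

    # Major: the inferred type of any column changed.
    for col in old_cols:
        old_type = _infer_type([row[col] for row in old_rows])
        new_type = _infer_type([row[col] for row in new_rows])
        if old_type and new_type and old_type != new_type:
            return "major"

    # Minor: row count changed by more than 10%.
    base = max(len(old_rows), 1)
    if abs(len(new_rows) - len(old_rows)) / base > 0.10:
        return "minor"

    # Minor: more than 30% of the cells in the shared rows differ.
    shared = min(len(old_rows), len(new_rows))
    total = shared * len(old_cols)
    changed = sum(
        old_rows[i][col] != new_rows[i][col]
        for i in range(shared)
        for col in old_cols
    )
    if total and changed / total > 0.30:
        return "minor"

    return "patch"
```

Checking the cheapest and most severe conditions first means a schema change is reported without comparing any cell values.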

### JSON Files

**Major Version** triggers:
- Structure changes (keys added/removed)
- Data type changes

**Minor Version** triggers:
- Significant value changes (more than 30% of values)
- Array item additions/removals

**Patch Version** triggers:
- Small value changes
- Formatting changes
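
A recursive comparison is one way to apply the JSON rules. The sketch below uses only the standard-library `json` module rather than DeepDiff, and the names `classify_json_change` and `_walk` are hypothetical. It assumes that Python `int` and `float` count as different types and that "values" means scalar leaves of the document.

```python
import json


def _walk(old, new, stats):
    """Recursively compare two parsed JSON values, updating stats in place."""
    if type(old) is not type(new):
        # Assumption: int vs float (and bool vs int) count as type changes.
        stats["type_changed"] = True
        return
    if isinstance(old, dict):
        if old.keys() != new.keys():
            stats["keys_changed"] = True
        for key in old.keys() & new.keys():
            _walk(old[key], new[key], stats)
    elif isinstance(old, list):
        if len(old) != len(new):
            stats["array_resized"] = True
        for old_item, new_item in zip(old, new):
            _walk(old_item, new_item, stats)
    else:
        # Scalar leaf: count it and note whether its value changed.
        stats["leaves"] += 1
        if old != new:
            stats["leaves_changed"] += 1


def classify_json_change(old_text: str, new_text: str) -> str:
    stats = {
        "type_changed": False,
        "keys_changed": False,
        "array_resized": False,
        "leaves": 0,
        "leaves_changed": 0,
    }
    _walk(json.loads(old_text), json.loads(new_text), stats)

    if stats["type_changed"] or stats["keys_changed"]:
        return "major"
    if stats["array_resized"]:
        return "minor"
    if stats["leaves"] and stats["leaves_changed"] / stats["leaves"] > 0.30:
        return "minor"
    return "patch"
```

Because both inputs are parsed before comparison, whitespace-only formatting changes produce no differences and fall through to "patch".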

### XML Files

**Major Version** triggers:
- Tag structure changes

**Minor Version** triggers:
- Attribute changes

**Patch Version** triggers:
- Content changes without structural modifications
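
The three XML tiers can be sketched with the standard-library `xml.etree.ElementTree` parser. The helper names below are hypothetical, and the sketch assumes that "tag structure" means the nested sequence of tag names and that attributes are compared element by element in document order.

```python
import xml.etree.ElementTree as ET


def _shape(elem):
    """Nested tuple of tag names describing the element tree structure."""
    return (elem.tag, tuple(_shape(child) for child in elem))


def _attributes(elem):
    """Attributes of every element, in document order."""
    return [dict(node.attrib) for node in elem.iter()]


def classify_xml_change(old_text: str, new_text: str) -> str:
    old_root = ET.fromstring(old_text)
    new_root = ET.fromstring(new_text)

    # Major: tags added, removed, renamed, or reordered.
    if _shape(old_root) != _shape(new_root):
        return "major"
    # Minor: same tag structure, but some attribute differs.
    if _attributes(old_root) != _attributes(new_root):
        return "minor"
    # Patch: only text content (or nothing) changed.
    return "patch"
```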

### Generic Files

For non-structured files, the system uses file size differences:

**Major Version** triggers:
- Size changes > 50%

**Minor Version** triggers:
- Size changes between 10% and 50%

**Patch Version** triggers:
- Size changes < 10%
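
The size thresholds translate into a short function. The name `classify_size_change` is hypothetical, and two details are assumptions the rules above leave open: changes of exactly 10% or exactly 50% are treated as minor, and a previously empty file that gains content is treated as major.

```python
def classify_size_change(old_size: int, new_size: int) -> str:
    """Classify a change from file sizes in bytes."""
    if old_size == 0:
        # Assumption: no relative change can be computed from an empty file.
        return "patch" if new_size == 0 else "major"
    ratio = abs(new_size - old_size) / old_size
    if ratio > 0.50:
        return "major"
    if ratio >= 0.10:
        return "minor"
    return "patch"
```

The sizes can be obtained with `os.path.getsize` for the old and new copies of the file.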

## Technical Details

### Dependencies

- **pandas**: For efficient tabular data comparison
- **DeepDiff**: For accurate JSON structure comparison
- **DVC**: For version tracking and storage

### Performance Considerations

- Large files (> 100 MB) are processed in chunks
- Tabular comparisons use sampling for very large datasets
- Early returns avoid unnecessary processing
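
Chunked reading and early returns can be combined in a cheap pre-check. The sketch below is one possible approach and is not taken from the codebase: the hypothetical `files_identical` compares two files in fixed-size chunks and stops at the first difference, so memory use stays bounded regardless of file size.

```python
import os


def files_identical(path_a: str, path_b: str, chunk_size: int = 1024 * 1024) -> bool:
    """Compare two files chunk by chunk, stopping at the first difference."""
    # Early return: files of different sizes cannot be identical.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:
                # Both files reached end-of-file with no difference found.
                return True
```

A caller could use this check to decide whether the more expensive format-specific analysis needs to run at all.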

## Example

When a CSV resource is updated:

1. The system detects the change through a Django signal
2. `detect_version_change_type` loads both versions of the file
3. The function compares schemas, data types, and values
4. Based on the extent of changes, it returns "major", "minor", or "patch"
5. The version is incremented accordingly (e.g., 1.2.3 → 1.3.0 for minor)
6. DVC tracks the new version with appropriate tags
7. A ResourceVersion record is created in the database

## Management Commands

The system includes management commands for manual version control:

- `create_major_version`: Force a major version increment for a resource
- `setup_dvc`: Configure DVC repository and remotes

## Error Handling

The version detection system handles errors so that:

1. Failed comparisons default to "minor" version changes
2. Temporary files are properly cleaned up
3. Errors are logged with detailed context
4. The system continues functioning even if analysis fails
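
These four guarantees map naturally onto a `try`/`except`/`finally` wrapper. The sketch below is illustrative only: `safe_detect` and its `compare` parameter are hypothetical names, and writing both versions to temporary files is an assumption about how the comparison receives its inputs.

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def safe_detect(old_bytes: bytes, new_bytes: bytes, compare) -> str:
    """Run compare(old_path, new_path); fall back to 'minor' on any failure."""
    paths = []
    try:
        # Write both versions to temporary files for the comparison step.
        for data in (old_bytes, new_bytes):
            fd, path = tempfile.mkstemp()
            paths.append(path)
            with os.fdopen(fd, "wb") as handle:
                handle.write(data)
        return compare(paths[0], paths[1])
    except Exception:
        # Log the full traceback, then continue with the documented default.
        logger.exception("Version detection failed; defaulting to 'minor'")
        return "minor"
    finally:
        # Clean up temporary files whether or not the comparison succeeded.
        for path in paths:
            try:
                os.remove(path)
            except OSError:
                logger.warning("Could not remove temporary file %s", path)
```

Placing the cleanup in `finally` ensures temporary files are removed on both the success path and the fallback path.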

## Future Improvements

- Support for more file formats
- Configurable thresholds for different file types
- Performance optimizations for very large datasets