
Commit beca485

KarolinaPomian, Copilot, and MateuszGrabuszynski authored
file blob integrity (#409)
* file blob integrity
* linter fix
* linter fix
* linter fix
* Update tests/validation/common/integrity/blob_integrity.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
  Signed-off-by: KarolinaPomian <108665762+KarolinaPomian@users.noreply.github.com>
* fix
* linter fix
* Apply suggestion from @MateuszGrabuszynski
  Co-authored-by: Mateusz Grabuszyński <mgrabuszynski@gmail.com>
  Signed-off-by: KarolinaPomian <108665762+KarolinaPomian@users.noreply.github.com>
* Update tests/validation/common/integrity/blob_integrity.py
  Co-authored-by: Mateusz Grabuszyński <mgrabuszynski@gmail.com>
  Signed-off-by: KarolinaPomian <108665762+KarolinaPomian@users.noreply.github.com>
* Update tests/validation/common/integrity/blob_integrity.py
  Co-authored-by: Mateusz Grabuszyński <mgrabuszynski@gmail.com>
  Signed-off-by: KarolinaPomian <108665762+KarolinaPomian@users.noreply.github.com>
* Update tests/validation/common/integrity/blob_integrity.py
  Co-authored-by: Mateusz Grabuszyński <mgrabuszynski@gmail.com>
  Signed-off-by: KarolinaPomian <108665762+KarolinaPomian@users.noreply.github.com>
* remove magic number 4096
* linter fix
* "Integritor" to "Integrator"
* Update test_demo.py
  Signed-off-by: KarolinaPomian <108665762+KarolinaPomian@users.noreply.github.com>

---------
Signed-off-by: KarolinaPomian <108665762+KarolinaPomian@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Mateusz Grabuszyński <mgrabuszynski@gmail.com>
1 parent 455d59f commit beca485

6 files changed, +1089 -6 lines changed

Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
# Blob Integrity Checker

This module provides blob (binary data) integrity checking functionality for the Media Communications Mesh project. It's designed to work alongside the existing video integrity checker and integrates with the `integrity_runner.py` framework.

## Overview

The blob integrity checker compares binary files chunk by chunk using MD5 hashes to detect any data corruption or transmission errors. It supports two modes of operation:

1. **File mode**: For single blob files
2. **Stream mode**: For blob data segmented into multiple files
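
The chunk-by-chunk comparison described above can be sketched roughly as follows. This is an illustration only, not the code in `blob_integrity.py`: the helper names `iter_chunk_hashes` and `find_mismatched_chunks` are hypothetical, and the 1048576-byte default simply mirrors the documented `--chunk_size` default.

```python
import hashlib
from itertools import zip_longest


def iter_chunk_hashes(path, chunk_size=1048576):
    """Yield the MD5 hex digest of each chunk_size-byte chunk of a file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield hashlib.md5(chunk).hexdigest()


def find_mismatched_chunks(src, out, chunk_size=1048576):
    """Return the indices of chunks whose hashes differ between src and out.

    A chunk that exists in only one of the two files counts as a mismatch,
    because zip_longest pads the shorter sequence with None.
    """
    pairs = zip_longest(
        iter_chunk_hashes(src, chunk_size),
        iter_chunk_hashes(out, chunk_size),
    )
    return [index for index, (a, b) in enumerate(pairs) if a != b]
```

An empty result means every chunk matched; a non-empty result lists which chunk positions to inspect.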

## Features

- **Chunk-based verification**: Compares files in configurable chunks for detailed error detection
- **Full file hash comparison**: Quick integrity check using complete file MD5 hashes
- **Multi-processing support**: Parallel processing for stream mode with configurable worker count
- **Detailed logging**: Comprehensive logging of integrity check results and errors
- **Flexible configuration**: Configurable chunk sizes, output paths, and processing options
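
As a rough illustration of the full file hash comparison listed above, the sketch below streams a file through a single MD5 object so the whole file never has to sit in memory. The names `file_md5` and `files_match` are hypothetical and are not taken from the module.

```python
import hashlib


def file_md5(path, block_size=1048576):
    """Return the MD5 hex digest of a whole file, read block by block."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def files_match(src, out):
    """Quick check: True when both files have the same full-file MD5."""
    return file_md5(src) == file_md5(out)
```

A quick full-file check like this says only whether the files differ; the chunk-based pass is what narrows down where.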

## Files

- `blob_integrity.py`: Main integrity checking module
- `test_blob_integrity.py`: Test suite and usage examples
- `integrity_runner.py`: Extended with blob integrity runner classes

## Usage

### Command Line Interface

#### Single File Mode

```bash
python blob_integrity.py file source.bin output_name --chunk_size 1048576 --output_path /mnt/ramdisk
```

#### Stream Mode (for segmented files)

```bash
python blob_integrity.py stream source.bin output_prefix --chunk_size 1048576 --output_path /mnt/ramdisk --segment_duration 3 --workers 5
```

### Programmatic Usage with integrity_runner.py

#### File Blob Integrity

```python
from integrity_runner import FileBlobIntegrityRunner

runner = FileBlobIntegrityRunner(
    host=host_connection,
    test_repo_path="/path/to/repo",
    src_url="/path/to/source.bin",
    out_name="output.bin",
    chunk_size=1048576,  # 1MB chunks
    out_path="/mnt/ramdisk",
    delete_file=True
)

runner.setup()
success = runner.run()
```

#### Stream Blob Integrity

```python
from integrity_runner import StreamBlobIntegrityRunner

runner = StreamBlobIntegrityRunner(
    host=host_connection,
    test_repo_path="/path/to/repo",
    src_url="/path/to/source.bin",
    out_name="output_prefix",
    chunk_size=1048576,
    segment_duration=3,
    workers=5
)

runner.setup()
runner.run()
# ... later ...
success = runner.stop_and_verify()
```

### Direct Class Usage

```python
import logging
from blob_integrity import BlobFileIntegrator

logger = logging.getLogger(__name__)

integrator = BlobFileIntegrator(
    logger=logger,
    src_url="/path/to/source.bin",
    out_name="output.bin",
    chunk_size=1048576,
    out_path="/mnt/ramdisk",
    delete_file=False
)

success = integrator.check_blob_integrity()
```

## Generating Test Blob Data

You can generate test blob data using the `dd` command:

```bash
# Generate 100MB of random data
dd if=/dev/random of=test_blob.bin bs=1M count=100

# Generate 1GB of random data
dd if=/dev/random of=large_blob.bin bs=1M count=1024

# For faster generation, use /dev/urandom
dd if=/dev/urandom of=test_blob.bin bs=1M count=100
```
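
Where `dd` is not available, or when a test needs to create its own data, a similar blob can be produced from Python. The sketch below is a suggestion rather than part of the module; `write_random_blob` is a hypothetical helper built on `os.urandom`.

```python
import os


def write_random_blob(path, size_bytes, block_size=1048576):
    """Write size_bytes of random data to path, one block at a time."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(block_size, remaining)
            f.write(os.urandom(n))
            remaining -= n
    return path
```

For example, `write_random_blob("test_blob.bin", 100 * 1048576)` writes the same amount of data as the first `dd` command above.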

## Configuration Options

### Command Line Arguments

- `--chunk_size`: Size of chunks for processing in bytes (default: 1048576 = 1MB)
- `--output_path`: Directory where output files are located (default: /mnt/ramdisk)
- `--delete_file`/`--no_delete_file`: Whether to delete files after processing
- `--segment_duration`: Time to wait between file checks in stream mode (default: 3 seconds)
- `--workers`: Number of worker processes for stream mode (default: 5)
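
To make the argument list concrete, here is one way such a command line could be declared with `argparse`. It is a reconstruction from the flags and defaults listed above, not the parser in `blob_integrity.py`. The positional names (`mode`, `src`, `out_name`), the integer types, and the `delete_file=True` default are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Blob integrity checker (sketch)")
    parser.add_argument("mode", choices=["file", "stream"])
    parser.add_argument("src", help="source blob file")
    parser.add_argument("out_name", help="output name (file) or prefix (stream)")
    parser.add_argument("--chunk_size", type=int, default=1048576)
    parser.add_argument("--output_path", default="/mnt/ramdisk")
    # Paired on/off flags writing to the same destination.
    parser.add_argument("--delete_file", dest="delete_file", action="store_true")
    parser.add_argument("--no_delete_file", dest="delete_file", action="store_false")
    parser.set_defaults(delete_file=True)  # assumption: the README gives no default
    # The next two options only matter in stream mode.
    parser.add_argument("--segment_duration", type=int, default=3)
    parser.add_argument("--workers", type=int, default=5)
    return parser
```

Calling `build_parser().parse_args(["file", "source.bin", "output_name"])` yields a namespace carrying the documented defaults.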

### Class Parameters

- `chunk_size`: Chunk size for processing (bytes)
- `delete_file`: Whether to delete output files after successful verification
- `out_path`: Output directory to monitor
- `segment_duration`: Polling interval for new files in stream mode
- `workers_count`: Number of parallel worker processes
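
The sketch below illustrates how a worker count might be applied to a batch of segment files. It rests on several assumptions that the README does not confirm: the segments are contiguous slices of the source file in order, each segment fits in memory, and the polling loop driven by `segment_duration` is left out. A thread pool stands in for the worker processes to keep the example self-contained. The names `segment_matches` and `verify_segments` are hypothetical.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def segment_matches(job):
    """Compare one output segment with the same byte range of the source."""
    src_path, segment_path, offset = job
    with open(segment_path, "rb") as seg:
        segment = seg.read()
    with open(src_path, "rb") as src:
        src.seek(offset)
        expected = src.read(len(segment))
    return hashlib.md5(segment).digest() == hashlib.md5(expected).digest()


def verify_segments(src_path, segment_paths, workers=5):
    """Check ordered segments in parallel; return one bool per segment."""
    jobs, offset = [], 0
    for path in segment_paths:
        jobs.append((src_path, path, offset))
        offset += os.path.getsize(path)  # assumption: segments are contiguous
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(segment_matches, jobs))
```

Results come back in segment order, so the position of a `False` identifies the failing segment.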

## Testing

Run the test suite to verify functionality:

```bash
python test_blob_integrity.py
```

This will:
1. Generate test blob files
2. Test normal integrity checking
3. Test corruption detection
4. Report results
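
The four steps above might look roughly like this in miniature. This is not the contents of `test_blob_integrity.py`; `md5_of` and `run_corruption_demo` are hypothetical names, and the corruption is simulated by flipping a single byte.

```python
import hashlib
import os
import tempfile


def md5_of(path):
    """Return the MD5 hex digest of a small file read in one go."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def run_corruption_demo(size=65536):
    """Return (clean_ok, corruption_detected) for a throwaway test blob."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "source.bin")
        good = os.path.join(tmp, "good_copy.bin")
        bad = os.path.join(tmp, "bad_copy.bin")

        # 1. Generate a test blob and an identical copy.
        data = os.urandom(size)
        for path in (src, good):
            with open(path, "wb") as f:
                f.write(data)

        # 3. Create a corrupted copy by flipping one byte in the middle.
        corrupted = bytearray(data)
        corrupted[size // 2] ^= 0xFF
        with open(bad, "wb") as f:
            f.write(bytes(corrupted))

        # 2. and 4. Compare hashes and report both outcomes.
        clean_ok = md5_of(src) == md5_of(good)
        corruption_detected = md5_of(src) != md5_of(bad)
        return clean_ok, corruption_detected
```

A healthy checker reports a match for the clean copy and a mismatch for the corrupted one.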

## Integration with Media Communications Mesh

The blob integrity checker is designed to work with the MCM validation framework:

1. **Stream Mode**: For continuous blob data transmission where data is segmented into files
2. **File Mode**: For single blob file transfers
3. **Runner Integration**: Works with the existing `integrity_runner.py` framework
4. **Logging**: Integrates with MCM logging standards
5. **Error Handling**: Follows MCM error reporting patterns

## Error Detection

The checker can detect:
- **Data corruption**: Changes in file content
- **Size mismatches**: Files with different sizes
- **Missing chunks**: Incomplete file transfers
- **Ordering issues**: Files with incorrect chunk sequences
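
One way to tell these four cases apart, given the file sizes and the per-chunk hashes of both files, is sketched below. The function name `classify_mismatch`, its labels, and the order of the checks are illustrative assumptions; the module itself may report errors differently.

```python
from collections import Counter


def classify_mismatch(src_size, out_size, src_hashes, out_hashes):
    """Label how the output differs from the source (hypothetical sketch)."""
    if src_size == out_size and src_hashes == out_hashes:
        return "ok"
    # Output stops early but everything received so far is correct.
    if len(out_hashes) < len(src_hashes) and out_hashes == src_hashes[:len(out_hashes)]:
        return "missing chunks"
    # Same set of chunks, different sequence.
    if Counter(src_hashes) == Counter(out_hashes):
        return "ordering issue"
    if src_size != out_size:
        return "size mismatch"
    return "data corruption"
```

For instance, hashes `["a", "b", "c"]` against `["b", "a", "c"]` at equal sizes are labeled an ordering issue, while `["a", "x", "c"]` is labeled data corruption.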
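
## Performance Considerations

- **Chunk Size**: Larger chunks mean fewer hash computations and less per-chunk overhead, but each read holds more data in memory and a mismatch is localized less precisely. Any change inside a chunk still alters that chunk's hash, whatever the chunk size
- **Worker Count**: More workers speed up processing but increase resource usage
- **Hash Algorithm**: MD5 is used for speed and is adequate for catching accidental corruption; it is not collision-resistant, so consider SHA256 where deliberate tampering is a concern
- **File I/O**: Large files may require tuning of chunk sizes for optimal performance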
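
Two of these trade-offs are easy to make concrete. The sketch below uses `hashlib.new` to make the hash algorithm selectable by name and computes how many chunks a given chunk size produces. Both helpers (`chunk_digest`, `chunk_count`) are hypothetical and only illustrate the idea; the module is documented as using MD5.

```python
import hashlib


def chunk_digest(chunk, algorithm="md5"):
    """Hash one chunk with any algorithm name that hashlib supports."""
    return hashlib.new(algorithm, chunk).hexdigest()


def chunk_count(file_size, chunk_size=1048576):
    """Number of chunks needed to cover file_size bytes (ceiling division)."""
    return -(-file_size // chunk_size)
```

With the default chunk size, a 100 MiB file (`100 * 1048576` bytes) is covered by 100 chunks, so 100 digests are computed per file. Passing `algorithm="sha256"` swaps the hash without touching the rest of the comparison logic.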

## Dependencies

- Python 3.8+
- hashlib (standard library)
- multiprocessing (standard library)
- pathlib (standard library)

No external dependencies required (unlike video integrity which needs OpenCV and Tesseract).
