| 1 | +# Blob Integrity Checker |
| 2 | + |
| 3 | +This module provides blob (binary data) integrity checking functionality for the Media Communications Mesh project. It's designed to work alongside the existing video integrity checker and integrates with the `integrity_runner.py` framework. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The blob integrity checker compares binary files chunk by chunk using MD5 hashes to detect any data corruption or transmission errors. It supports two modes of operation: |
| 8 | + |
| 9 | +1. **File mode**: For single blob files |
| 10 | +2. **Stream mode**: For blob data segmented into multiple files |
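The chunk-by-chunk comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code in `blob_integrity.py`: the helper names `iter_chunk_hashes` and `find_mismatched_chunks` are hypothetical, and the sketch assumes both files are read with the same fixed chunk size.

```python
import hashlib
from itertools import zip_longest


def iter_chunk_hashes(path, chunk_size=1048576):
    """Yield the MD5 hex digest of each chunk_size-byte chunk of a file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield hashlib.md5(chunk).hexdigest()


def find_mismatched_chunks(src_path, out_path, chunk_size=1048576):
    """Return the indices of chunks whose hashes differ (hypothetical helper).

    A chunk that exists in only one of the files (different lengths)
    is paired with None and therefore counts as a mismatch.
    """
    pairs = zip_longest(
        iter_chunk_hashes(src_path, chunk_size),
        iter_chunk_hashes(out_path, chunk_size),
    )
    return [index for index, (src, out) in enumerate(pairs) if src != out]
```

An empty list means every chunk matched; a non-empty list points at the chunk offsets worth inspecting.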
| 11 | + |
| 12 | +## Features |
| 13 | + |
| 14 | +- **Chunk-based verification**: Compares files in configurable chunks for detailed error detection |
| 15 | +- **Full file hash comparison**: Quick integrity check using complete file MD5 hashes |
| 16 | +- **Multi-processing support**: Parallel processing for stream mode with configurable worker count |
| 17 | +- **Detailed logging**: Comprehensive logging of integrity check results and errors |
| 18 | +- **Flexible configuration**: Configurable chunk sizes, output paths, and processing options |
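As a rough illustration of the full-file hash comparison listed above, the following sketch streams each file through a single MD5 digest so the whole file never has to sit in memory. The function names `file_md5` and `files_match` are assumptions made for this example and may not match the module's actual API.

```python
import hashlib


def file_md5(path, buf_size=1048576):
    """Return the MD5 hex digest of a whole file, read in buf_size blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for block in iter(lambda: f.read(buf_size), b""):
            digest.update(block)
    return digest.hexdigest()


def files_match(src_path, out_path):
    """Quick check: True when both files have the same full-file MD5."""
    return file_md5(src_path) == file_md5(out_path)
```

A check like this answers only "same or different"; the chunk-based pass is what would locate the damage.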
| 19 | + |
| 20 | +## Files |
| 21 | + |
| 22 | +- `blob_integrity.py`: Main integrity checking module |
| 23 | +- `test_blob_integrity.py`: Test suite and usage examples |
| 24 | +- `integrity_runner.py`: Extended with blob integrity runner classes |
| 25 | + |
| 26 | +## Usage |
| 27 | + |
| 28 | +### Command Line Interface |
| 29 | + |
| 30 | +#### Single File Mode |
| 31 | + |
| 32 | +```bash |
| 33 | +python blob_integrity.py file source.bin output_name --chunk_size 1048576 --output_path /mnt/ramdisk |
| 34 | +``` |
| 35 | + |
| 36 | +#### Stream Mode (for segmented files) |
| 37 | + |
| 38 | +```bash |
| 39 | +python blob_integrity.py stream source.bin output_prefix --chunk_size 1048576 --output_path /mnt/ramdisk --segment_duration 3 --workers 5 |
| 40 | +``` |
| 41 | + |
| 42 | +### Programmatic Usage with integrity_runner.py |
| 43 | + |
| 44 | +#### File Blob Integrity |
| 45 | + |
| 46 | +```python |
| 47 | +from integrity_runner import FileBlobIntegrityRunner |
| 48 | + |
| 49 | +runner = FileBlobIntegrityRunner( |
| 50 | + host=host_connection, |
| 51 | + test_repo_path="/path/to/repo", |
| 52 | + src_url="/path/to/source.bin", |
| 53 | + out_name="output.bin", |
| 54 | + chunk_size=1048576, # 1MB chunks |
| 55 | + out_path="/mnt/ramdisk", |
| 56 | + delete_file=True |
| 57 | +) |
| 58 | + |
| 59 | +runner.setup() |
| 60 | +success = runner.run() |
| 61 | +``` |
| 62 | + |
| 63 | +#### Stream Blob Integrity |
| 64 | + |
| 65 | +```python |
| 66 | +from integrity_runner import StreamBlobIntegrityRunner |
| 67 | + |
| 68 | +runner = StreamBlobIntegrityRunner( |
| 69 | + host=host_connection, |
| 70 | + test_repo_path="/path/to/repo", |
| 71 | + src_url="/path/to/source.bin", |
| 72 | + out_name="output_prefix", |
| 73 | + chunk_size=1048576, |
| 74 | + segment_duration=3, |
| 75 | + workers=5 |
| 76 | +) |
| 77 | + |
| 78 | +runner.setup() |
| 79 | +runner.run() |
| 80 | +# ... later ... |
| 81 | +success = runner.stop_and_verify() |
| 82 | +``` |
| 83 | + |
| 84 | +### Direct Class Usage |
| 85 | + |
| 86 | +```python |
| 87 | +import logging |
| 88 | +from blob_integrity import BlobFileIntegrator |
| 89 | + |
| 90 | +logger = logging.getLogger(__name__) |
| 91 | + |
| 92 | +integrator = BlobFileIntegrator( |
| 93 | + logger=logger, |
| 94 | + src_url="/path/to/source.bin", |
| 95 | + out_name="output.bin", |
| 96 | + chunk_size=1048576, |
| 97 | + out_path="/mnt/ramdisk", |
| 98 | + delete_file=False |
| 99 | +) |
| 100 | + |
| 101 | +success = integrator.check_blob_integrity() |
| 102 | +``` |
| 103 | + |
| 104 | +## Generating Test Blob Data |
| 105 | + |
| 106 | +You can generate test blob data using the `dd` command: |
| 107 | + |
| 108 | +```bash |
| 109 | +# Generate 100MB of random data |
| 110 | +dd if=/dev/random of=test_blob.bin bs=1M count=100 |
| 111 | + |
| 112 | +# Generate 1GB of random data |
| 113 | +dd if=/dev/random of=large_blob.bin bs=1M count=1024 |
| 114 | + |
| 115 | +# For faster generation, use /dev/urandom |
| 116 | +dd if=/dev/urandom of=test_blob.bin bs=1M count=100 |
| 117 | +``` |
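Where `dd` is not available, a similar test blob can be produced from Python. This is a small sketch using `os.urandom`; the function name `write_random_blob` is hypothetical and is not part of `blob_integrity.py`.

```python
import os


def write_random_blob(path, size_bytes, block_size=1048576):
    """Write size_bytes of random data to path, one block at a time."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            count = min(block_size, remaining)
            f.write(os.urandom(count))
            remaining -= count
    return path
```

For example, `write_random_blob("test_blob.bin", 100 * 1024 * 1024)` would create a file of the same size as the first `dd` command above.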
| 118 | + |
| 119 | +## Configuration Options |
| 120 | + |
| 121 | +### Command Line Arguments |
| 122 | + |
| 123 | +- `--chunk_size`: Size of chunks for processing in bytes (default: 1048576 bytes = 1 MiB) |
| 124 | +- `--output_path`: Directory where output files are located (default: /mnt/ramdisk) |
| 125 | +- `--delete_file`/`--no_delete_file`: Whether to delete files after processing |
| 126 | +- `--segment_duration`: Time to wait between file checks in stream mode (default: 3 seconds) |
| 127 | +- `--workers`: Number of worker processes for stream mode (default: 5) |
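To make the option list above concrete, here is one way such a command line could be declared with `argparse`. It is a sketch only: the positional argument names, the value types, and the default of `delete_file` are assumptions, since the README does not state them, and the real parser in `blob_integrity.py` may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Blob integrity checker (sketch)")
    subparsers = parser.add_subparsers(dest="mode", required=True)

    # Options shared by the "file" and "stream" sub-commands.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("src", help="path to the source blob")
    common.add_argument("out_name", help="output file name or segment prefix")
    common.add_argument("--chunk_size", type=int, default=1048576)
    common.add_argument("--output_path", default="/mnt/ramdisk")
    # Paired flags writing to one destination; default=True is an assumption.
    common.add_argument("--delete_file", dest="delete_file",
                        action="store_true", default=True)
    common.add_argument("--no_delete_file", dest="delete_file",
                        action="store_false", default=True)

    subparsers.add_parser("file", parents=[common])
    stream = subparsers.add_parser("stream", parents=[common])
    stream.add_argument("--segment_duration", type=int, default=3)
    stream.add_argument("--workers", type=int, default=5)
    return parser
```

With this layout, `build_parser().parse_args(["stream", "source.bin", "output_prefix", "--workers", "2"])` yields a namespace with `mode="stream"` and `workers=2`, while the remaining options keep their defaults.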
| 128 | + |
| 129 | +### Class Parameters |
| 130 | + |
| 131 | +- `chunk_size`: Chunk size for processing (bytes) |
| 132 | +- `delete_file`: Whether to delete output files after successful verification |
| 133 | +- `out_path`: Output directory to monitor |
| 134 | +- `segment_duration`: Polling interval for new files in stream mode |
| 135 | +- `workers_count`: Number of parallel worker processes |
| 136 | + |
| 137 | +## Testing |
| 138 | + |
| 139 | +Run the test suite to verify functionality: |
| 140 | + |
| 141 | +```bash |
| 142 | +python test_blob_integrity.py |
| 143 | +``` |
| 144 | + |
| 145 | +This will: |
| 146 | +1. Generate test blob files |
| 147 | +2. Test normal integrity checking |
| 148 | +3. Test corruption detection |
| 149 | +4. Report results |
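The four steps above can be sketched as a single self-contained check. This is not the content of `test_blob_integrity.py`; the helper names `chunk_hashes` and `run_corruption_check` are hypothetical, and the sketch assumes corruption is simulated by flipping one bit in a copy of the blob.

```python
import hashlib
import os
import tempfile


def chunk_hashes(path, chunk_size):
    """Return the list of per-chunk MD5 hex digests for a file."""
    with open(path, "rb") as f:
        return [hashlib.md5(chunk).hexdigest()
                for chunk in iter(lambda: f.read(chunk_size), b"")]


def run_corruption_check(size=64 * 1024, chunk_size=4096):
    """Return (clean_ok, corruption_detected) for a freshly generated blob."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        out = os.path.join(tmp, "out.bin")

        # 1. Generate a test blob and an identical copy.
        data = os.urandom(size)
        for path in (src, out):
            with open(path, "wb") as f:
                f.write(data)

        # 2. An untouched copy should pass the check.
        clean_ok = chunk_hashes(src, chunk_size) == chunk_hashes(out, chunk_size)

        # 3. Flip a single bit in the middle of the copy; the check should fail.
        corrupted = bytearray(data)
        corrupted[size // 2] ^= 0x01
        with open(out, "wb") as f:
            f.write(corrupted)
        detected = chunk_hashes(src, chunk_size) != chunk_hashes(out, chunk_size)

    # 4. Report both outcomes to the caller.
    return clean_ok, detected
```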
| 150 | + |
| 151 | +## Integration with Media Communications Mesh |
| 152 | + |
| 153 | +The blob integrity checker is designed to work with the MCM validation framework: |
| 154 | + |
| 155 | +1. **Stream Mode**: For continuous blob data transmission where data is segmented into files |
| 156 | +2. **File Mode**: For single blob file transfers |
| 157 | +3. **Runner Integration**: Works with the existing `integrity_runner.py` framework |
| 158 | +4. **Logging**: Integrates with MCM logging standards |
| 159 | +5. **Error Handling**: Follows MCM error reporting patterns |
| 160 | + |
| 161 | +## Error Detection |
| 162 | + |
| 163 | +The checker can detect: |
| 164 | +- **Data corruption**: Changes in file content |
| 165 | +- **Size mismatches**: Files with different sizes |
| 166 | +- **Missing chunks**: Incomplete file transfers |
| 167 | +- **Ordering issues**: Files with incorrect chunk sequences |
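One way to tell these four error classes apart is sketched below. The function `diagnose` and its message strings are hypothetical, and the ordering heuristic (the same set of chunk hashes appearing in a different sequence) is an assumption made for this example rather than the checker's documented logic.

```python
import hashlib
import os


def _chunk_hashes(path, chunk_size):
    """Return the list of per-chunk MD5 hex digests for a file."""
    with open(path, "rb") as f:
        return [hashlib.md5(chunk).hexdigest()
                for chunk in iter(lambda: f.read(chunk_size), b"")]


def diagnose(src_path, out_path, chunk_size=1048576):
    """Return a list of findings; an empty list means the files match."""
    findings = []

    src_size = os.path.getsize(src_path)
    out_size = os.path.getsize(out_path)
    if src_size != out_size:
        findings.append(f"size mismatch: source {src_size} B, output {out_size} B")

    src_hashes = _chunk_hashes(src_path, chunk_size)
    out_hashes = _chunk_hashes(out_path, chunk_size)
    if len(out_hashes) < len(src_hashes):
        findings.append(f"missing chunks: {len(src_hashes) - len(out_hashes)}")

    # Compare the chunks both files have in common, position by position.
    bad = [i for i, (s, o) in enumerate(zip(src_hashes, out_hashes)) if s != o]
    if bad:
        if sorted(src_hashes) == sorted(out_hashes):
            # Same chunks, different positions: treated here as reordering.
            findings.append(f"ordering issue: chunks {bad} are out of sequence")
        else:
            findings.append(f"data corruption in chunks {bad}")
    return findings
```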
| 168 | + |
| 169 | +## Performance Considerations |
| 170 | + |
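| 171 | +- **Chunk Size**: Larger chunks mean fewer hash computations but use more memory per read and localize a corruption less precisely; any changed byte still alters the hash of its chunk |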
| 172 | +- **Worker Count**: More workers speed up processing but increase resource usage |
| 173 | +- **Hash Algorithm**: MD5 is used for speed and is adequate for detecting accidental corruption; it is not collision-resistant, so consider SHA-256 where deliberate tampering is a concern |
| 174 | +- **File I/O**: Large files may require tuning of chunk sizes for optimal performance |
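If a stronger hash is needed, the algorithm can be made a parameter instead of being hard-coded. The sketch below uses `hashlib.new` for that; the function name `chunk_digests` and its `algorithm` parameter are hypothetical and are not options documented for `blob_integrity.py`.

```python
import hashlib


def chunk_digests(path, chunk_size=1048576, algorithm="md5"):
    """Return per-chunk hex digests using the named hashlib algorithm."""
    digests = []
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # hashlib.new() accepts any algorithm name hashlib supports,
            # for example "md5" or "sha256".
            digests.append(hashlib.new(algorithm, chunk).hexdigest())
    return digests
```

Both sides of a comparison must use the same algorithm and chunk size, otherwise every chunk will appear to differ.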
| 175 | + |
| 176 | +## Dependencies |
| 177 | + |
| 178 | +- Python 3.8+ |
| 179 | +- hashlib (standard library) |
| 180 | +- multiprocessing (standard library) |
| 181 | +- pathlib (standard library) |
| 182 | + |
| 183 | +No external dependencies are required, unlike the video integrity checker, which needs OpenCV and Tesseract. |