@@ -11,6 +11,7 @@ The retrocast transcription module enables you to transcribe podcast audio files
 - 💾 **Content Deduplication**: SHA256 hashing prevents re-transcribing the same audio
 - 🔍 **Full-Text Search**: Search across all transcribed content using SQLite FTS5
 - 📝 **Multiple Formats**: Export as TXT, JSON, SRT (subtitles), or VTT (WebVTT)
+- ✅ **Schema Validation**: Validate JSON transcription files against Pydantic models
 - 📊 **Rich CLI**: Progress bars, colored output, and detailed status messages

 ## Installation
@@ -237,6 +238,76 @@ retrocast transcription search [OPTIONS] QUERY
 | `--limit` | INT | 10 | Maximum number of results |
 | `--db` | PATH | app_dir/overcast.db | Database file path |

+### `retrocast transcription validate`
+
+Validate all JSON transcription files against the expected schema.
+
+**Usage:**
+```bash
+retrocast transcription validate [OPTIONS]
+```
+
+**Options:**
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `--output-dir` | PATH | app_dir/transcriptions | Directory containing transcription JSON files |
+| `--verbose`, `-v` | FLAG | - | Show detailed validation errors for each file |
+
+**Description:**
+
+The validate command checks all JSON transcription files in the specified directory to ensure they conform to the expected schema. It provides:
+
+- **Real-time progress**: Shows a progress bar with file counts and percentage complete
+- **Comprehensive validation**: Checks for:
+  - JSON parsing errors (malformed JSON)
+  - Schema violations (missing required fields, invalid data types)
+  - Data constraints (negative durations, invalid timestamps)
+- **Summary report**: Displays a table with counts and percentages of valid/invalid/error files
+- **Error details**: Lists problematic files and shows specific validation errors in verbose mode
+- **Proper exit codes**: Returns 0 if all files are valid, 1 if any errors are found
+
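The three categories of checks listed above (parse errors, schema violations, data constraints) can be sketched in plain Python. This is only an illustration and not retrocast's implementation, which the feature list says is built on Pydantic models. The field names `text` and `duration` and the helpers `classify`, `validate_dir`, and `exit_code` are assumptions made up for the example.

```python
from __future__ import annotations

import json
from pathlib import Path


def classify(path: Path) -> tuple[str, list[str]]:
    """Return (status, errors); status is 'valid', 'invalid', or 'parse_error'."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError) as exc:
        # Malformed JSON is reported separately from schema problems.
        return "parse_error", [str(exc)]

    if not isinstance(data, dict):
        return "invalid", ["top-level value must be an object"]

    errors: list[str] = []
    # Hypothetical schema: 'text' is a string, 'duration' is a non-negative number.
    if not isinstance(data.get("text"), str):
        errors.append("text: missing or not a string")
    duration = data.get("duration")
    if isinstance(duration, bool) or not isinstance(duration, (int, float)):
        errors.append("duration: missing or not a number")
    elif duration < 0:
        errors.append("duration: must be >= 0")
    return ("invalid" if errors else "valid"), errors


def validate_dir(root: Path) -> dict[str, int]:
    """Count valid, invalid, and unparseable JSON files under root."""
    counts = {"valid": 0, "invalid": 0, "parse_error": 0}
    for path in sorted(root.rglob("*.json")):
        status, _ = classify(path)
        counts[status] += 1
    return counts


def exit_code(counts: dict[str, int]) -> int:
    """Mirror the documented convention: 0 if every file is valid, else 1."""
    return 0 if counts["invalid"] == 0 and counts["parse_error"] == 0 else 1
```

Keeping parse errors apart from schema errors matches the two separate rows in the summary table of the command's output.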
+**Example Output:**
+
+```
+Validating 42 transcription file(s)...
+
+Validating TestPodcast/episode1.json... ━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:05
+
+═══ Validation Summary ═══
+
+Status          │ Count │ Percentage
+────────────────┼───────┼────────────
+Valid           │ 40    │ 95.2%
+Invalid Schema  │ 1     │ 2.4%
+Parse Errors    │ 1     │ 2.4%
+Total           │ 42    │ 100.0%
+
+Files with validation errors (1):
+• TestPodcast/invalid_episode.json
+```
+
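The percentages in the summary table are each count divided by the total, shown to one decimal place. A small sketch of that arithmetic follows; the function name `summarize` is hypothetical and the rounding rule is inferred from the sample output, not taken from retrocast's source.

```python
def summarize(valid: int, invalid: int, parse_errors: int) -> list[tuple[str, int, str]]:
    """Build (label, count, percentage) rows like the summary table."""
    total = valid + invalid + parse_errors
    rows = [
        ("Valid", valid),
        ("Invalid Schema", invalid),
        ("Parse Errors", parse_errors),
        ("Total", total),
    ]
    # Guard against an empty directory to avoid dividing by zero.
    return [
        (label, count, f"{(100 * count / total if total else 0.0):.1f}%")
        for label, count in rows
    ]


for label, count, pct in summarize(40, 1, 1):
    print(f"{label:<15} │ {count:<5} │ {pct}")
```

With the counts from the example (40 valid, 1 invalid, 1 parse error), this yields 95.2%, 2.4%, 2.4%, and 100.0%.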
+**Examples:**
+
+```bash
+# Validate all transcriptions in default directory
+retrocast transcription validate
+
+# Show detailed errors for each invalid file
+retrocast transcription validate --verbose
+
+# Validate transcriptions in a custom directory
+retrocast transcription validate --output-dir /path/to/transcriptions
+```
+
+**Use Cases:**
+
+- Verify transcription file integrity after processing
+- Detect corrupted or malformed JSON files
+- Ensure schema compliance before sharing or archiving
+- Troubleshoot transcription issues
+- Validate files after manual edits or migrations
+
 ## Usage Examples

 ### Basic Transcription
@@ -369,6 +440,45 @@ Found 3 result(s) for: machine learning
  Let's dive into machine learning algorithms and their applications...
 ```

+### Validating Transcriptions
+
+```bash
+# Validate all transcription files in the default directory
+retrocast transcription validate
+
+# Get detailed error messages for each invalid file
+retrocast transcription validate --verbose
+
+# Validate transcriptions in a custom directory
+retrocast transcription validate --output-dir ~/my-transcripts/
+
+# Use validation in scripts (check exit code)
+if retrocast transcription validate; then
+    echo "All transcriptions are valid!"
+else
+    echo "Some transcriptions have errors"
+fi
+```
+
+**Validation Output Example:**
+```
+Validating 12 transcription file(s)...
+
+✓ TechPodcast/episode1.json
+✓ TechPodcast/episode2.json
+✗ TechPodcast/episode3.json: Validation failed
+  Field: ('duration',), Error: Input should be greater than or equal to 0
+
+═══ Validation Summary ═══
+
+Status          │ Count │ Percentage
+────────────────┼───────┼────────────
+Valid           │ 11    │ 91.7%
+Invalid Schema  │ 1     │ 8.3%
+Parse Errors    │ 0     │ 0.0%
+Total           │ 12    │ 100.0%
+```
+
 ## Workflow Examples

 ### Transcribe a Podcast Series
@@ -701,6 +811,26 @@ retrocast transcription search --podcast "Tech Talk" "machine learning"

 **A:** MP3, M4A, OGG, Opus, WAV, FLAC, and AAC.

+### Q: How do I verify my transcription files are valid?
+
+**A:** Use the `retrocast transcription validate` command to check all JSON transcription files against the expected schema:
+
+```bash
+# Validate all transcriptions
+retrocast transcription validate
+
+# Get detailed error messages
+retrocast transcription validate --verbose
+```
+
+This will identify:
+- Malformed JSON files
+- Missing required fields
+- Invalid data types or values (e.g., negative durations)
+- Schema violations
+
+The command returns exit code 0 if all files are valid, making it useful in scripts and CI/CD pipelines.
+
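For pipelines driven from Python rather than a shell, the same exit-code check might look like the sketch below. It relies only on the documented behavior (exit code 0 when all files are valid). The helper name `transcriptions_are_valid` is hypothetical, and the command is passed in as a parameter so the function can be exercised without retrocast installed.

```python
import subprocess
import sys
from typing import Sequence

# Command as documented above; append --output-dir or --verbose as needed.
VALIDATE_CMD = ["retrocast", "transcription", "validate"]


def transcriptions_are_valid(cmd: Sequence[str] = VALIDATE_CMD) -> bool:
    """Run the validation command and report whether it exited with code 0."""
    # check=False: a non-zero exit is an expected outcome, not an exception.
    result = subprocess.run(list(cmd), check=False)
    return result.returncode == 0


def main() -> int:
    """Entry point for a CI step: propagate success or failure as an exit code."""
    return 0 if transcriptions_are_valid() else 1
```

A CI job could call `sys.exit(main())` so that the step fails whenever validation reports errors.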
 ## Future Enhancements

 The following features are planned for future releases: