
Commit 1e84dcd

Copilot and crossjam committed
Add documentation for transcription validate command
Co-authored-by: crossjam <208062+crossjam@users.noreply.github.com>
1 parent edd2d4b commit 1e84dcd

2 files changed

+131
-0
lines changed

README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -326,6 +326,7 @@ retrocast transcription search --podcast "Tech Talk" "python"
326326
| `retrocast transcription backends list` | List available backends |
327327
| `retrocast transcription backends test BACKEND` | Test a specific backend |
328328
| `retrocast transcription search QUERY` | Search transcribed content |
329+
| `retrocast transcription validate` | Validate JSON transcription files |
329330
| `retrocast transcription summary` | Show transcription statistics |
330331
| `retrocast transcription podcasts list` | List podcasts with transcriptions |
331332
| `retrocast transcription episodes list` | List transcribed episodes |

docs/TRANSCRIPTION.md

Lines changed: 130 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@ The retrocast transcription module enables you to transcribe podcast audio files
1111
- 💾 **Content Deduplication**: SHA256 hashing prevents re-transcribing the same audio
1212
- 🔍 **Full-Text Search**: Search across all transcribed content using SQLite FTS5
1313
- 📝 **Multiple Formats**: Export as TXT, JSON, SRT (subtitles), or VTT (WebVTT)
14+
- **Schema Validation**: Validate JSON transcription files against Pydantic models
1415
- 📊 **Rich CLI**: Progress bars, colored output, and detailed status messages
1516

1617
## Installation
@@ -237,6 +238,76 @@ retrocast transcription search [OPTIONS] QUERY
237238
| `--limit` | INT | 10 | Maximum number of results |
238239
| `--db` | PATH | app_dir/overcast.db | Database file path |
239240

241+
### `retrocast transcription validate`
242+
243+
Validate all JSON transcription files against the expected schema.
244+
245+
**Usage:**
246+
```bash
247+
retrocast transcription validate [OPTIONS]
248+
```
249+
250+
**Options:**
251+
252+
| Option | Type | Default | Description |
253+
|--------|------|---------|-------------|
254+
| `--output-dir` | PATH | app_dir/transcriptions | Directory containing transcription JSON files |
255+
| `--verbose`, `-v` | FLAG | - | Show detailed validation errors for each file |
256+
257+
**Description:**
258+
259+
The validate command checks all JSON transcription files in the specified directory to ensure they conform to the expected schema. It provides:
260+
261+
- **Real-time progress**: Shows a progress bar with file counts and percentage complete
262+
- **Comprehensive validation**: Checks for:
263+
  - JSON parsing errors (malformed JSON)
264+
  - Schema violations (missing required fields, invalid data types)
265+
  - Data constraints (negative durations, invalid timestamps)
266+
- **Summary report**: Displays a table with counts and percentages of valid/invalid/error files
267+
- **Error details**: Lists problematic files and shows specific validation errors in verbose mode
268+
- **Proper exit codes**: Returns 0 if all files are valid, 1 if any errors are found
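
The three result categories listed above (valid, invalid schema, parse error) can be illustrated with a minimal standard-library sketch. This is not the command's implementation: the documentation says validation uses Pydantic models, and the field names `text` and `duration`, the `REQUIRED_FIELDS` table, and the `validate_text` helper are all hypothetical stand-ins chosen for illustration.

```python
import json
from enum import Enum


class Status(Enum):
    VALID = "valid"
    INVALID_SCHEMA = "invalid schema"
    PARSE_ERROR = "parse error"


# Hypothetical schema (field name -> accepted types); the real models may differ.
REQUIRED_FIELDS = {"text": str, "duration": (int, float)}


def check_payload(data):
    """Return a list of schema problems found in parsed JSON (empty if none)."""
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"{name}: field required")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[name], bool) or not isinstance(data[name], expected):
            errors.append(f"{name}: invalid type")
    duration = data.get("duration")
    is_number = isinstance(duration, (int, float)) and not isinstance(duration, bool)
    if is_number and duration < 0:
        errors.append("duration: must be greater than or equal to 0")
    return errors


def validate_text(raw):
    """Classify one file's contents as valid, schema-invalid, or unparseable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return Status.PARSE_ERROR, [str(exc)]
    errors = check_payload(data)
    return (Status.INVALID_SCHEMA if errors else Status.VALID), errors
```

Parse failures are kept separate from schema failures because they usually call for different fixes: a truncated file needs to be regenerated, while a schema violation may only need one field corrected.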
269+
270+
**Example Output:**
271+
272+
```
273+
Validating 42 transcription file(s)...
274+
275+
Validating TestPodcast/episode1.json... ━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:05
276+
277+
═══ Validation Summary ═══
278+
279+
Status │ Count │ Percentage
280+
────────────────┼───────┼────────────
281+
Valid │ 40 │ 95.2%
282+
Invalid Schema │ 1 │ 2.4%
283+
Parse Errors │ 1 │ 2.4%
284+
Total │ 42 │ 100.0%
285+
286+
Files with validation errors (1):
287+
• TestPodcast/invalid_episode.json
288+
```
289+
290+
**Examples:**
291+
292+
```bash
293+
# Validate all transcriptions in default directory
294+
retrocast transcription validate
295+
296+
# Show detailed errors for each invalid file
297+
retrocast transcription validate --verbose
298+
299+
# Validate transcriptions in a custom directory
300+
retrocast transcription validate --output-dir /path/to/transcriptions
301+
```
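
The example output above reports files as `TestPodcast/episode1.json`, which suggests one subdirectory per podcast under the output directory. Assuming that layout, and assuming the search is recursive (neither is confirmed here), file discovery could be sketched as follows. The helper name `find_transcription_files` is hypothetical.

```python
from pathlib import Path


def find_transcription_files(output_dir):
    """Return every .json file under output_dir, sorted, relative to output_dir."""
    root = Path(output_dir)
    return sorted(
        path.relative_to(root)
        for path in root.rglob("*.json")  # recursive search is an assumption
        if path.is_file()
    )
```

Returning relative paths keeps the reported names short and independent of where the transcription directory lives.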
302+
303+
**Use Cases:**
304+
305+
- Verify transcription file integrity after processing
306+
- Detect corrupted or malformed JSON files
307+
- Ensure schema compliance before sharing or archiving
308+
- Troubleshoot transcription issues
309+
- Validate files after manual edits or migrations
310+
240311
## Usage Examples
241312

242313
### Basic Transcription
@@ -369,6 +440,45 @@ Found 3 result(s) for: machine learning
369440
Let's dive into machine learning algorithms and their applications...
370441
```
371442

443+
### Validating Transcriptions
444+
445+
```bash
446+
# Validate all transcription files in the default directory
447+
retrocast transcription validate
448+
449+
# Get detailed error messages for each invalid file
450+
retrocast transcription validate --verbose
451+
452+
# Validate transcriptions in a custom directory
453+
retrocast transcription validate --output-dir ~/my-transcripts/
454+
455+
# Use validation in scripts (check exit code)
456+
if retrocast transcription validate; then
457+
    echo "All transcriptions are valid!"
458+
else
459+
    echo "Some transcriptions have errors"
460+
fi
461+
```
462+
463+
**Validation Output Example:**
464+
```
465+
Validating 12 transcription file(s)...
466+
467+
✓ TechPodcast/episode1.json
468+
✓ TechPodcast/episode2.json
469+
✗ TechPodcast/episode3.json: Validation failed
470+
    Field: ('duration',), Error: Input should be greater than or equal to 0
471+
472+
═══ Validation Summary ═══
473+
474+
Status │ Count │ Percentage
475+
────────────────┼───────┼────────────
476+
Valid │ 11 │ 91.7%
477+
Invalid Schema │ 1 │ 8.3%
478+
Parse Errors │ 0 │ 0.0%
479+
Total │ 12 │ 100.0%
480+
```
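
The percentages in the summary table are each category's share of the total file count, rounded to one decimal place. A small sketch of that arithmetic, assuming the status labels shown in the table (the `summarize` helper is hypothetical, not part of retrocast):

```python
from collections import Counter

LABELS = ("Valid", "Invalid Schema", "Parse Errors")


def summarize(statuses):
    """Build (label, count, percentage) rows from a list of per-file statuses."""
    counts = Counter(statuses)
    total = len(statuses)
    rows = []
    for label in LABELS:
        count = counts.get(label, 0)
        # Guard against division by zero when no files were found.
        pct = 100.0 * count / total if total else 0.0
        rows.append((label, count, f"{pct:.1f}%"))
    rows.append(("Total", total, "100.0%" if total else "0.0%"))
    return rows
```

With 11 valid files and 1 schema failure, this reproduces the 91.7% and 8.3% figures from the example above.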
481+
372482
## Workflow Examples
373483

374484
### Transcribe a Podcast Series
@@ -701,6 +811,26 @@ retrocast transcription search --podcast "Tech Talk" "machine learning"
701811

702812
**A:** MP3, M4A, OGG, Opus, WAV, FLAC, and AAC.
703813

814+
### Q: How do I verify my transcription files are valid?
815+
816+
**A:** Use the `retrocast transcription validate` command to check all JSON transcription files against the expected schema:
817+
818+
```bash
819+
# Validate all transcriptions
820+
retrocast transcription validate
821+
822+
# Get detailed error messages
823+
retrocast transcription validate --verbose
824+
```
825+
826+
This will identify:
827+
- Malformed JSON files
828+
- Missing required fields
829+
- Invalid data types or values (e.g., negative durations)
830+
- Schema violations
831+
832+
The command returns exit code 0 if all files are valid and 1 if any errors are found, making it useful in scripts and CI/CD pipelines.
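
For pipelines driven from Python rather than a shell, the same exit-code check can be made with `subprocess`. This is a sketch that assumes `retrocast` is installed and on `PATH`; the `validation_passed` helper is hypothetical.

```python
import subprocess

# Assumes the `retrocast` executable is installed and on PATH.
DEFAULT_COMMAND = ("retrocast", "transcription", "validate")


def validation_passed(argv=DEFAULT_COMMAND):
    """Run the command and return True only if it exits with status 0."""
    result = subprocess.run(list(argv), check=False)
    return result.returncode == 0
```

A CI step could call `validation_passed()` and fail the job when it returns `False`.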
833+
704834
## Future Enhancements
705835

706836
The following features are planned for future releases:
