| 1 | +# VOBSUB Subtitle Extraction from MKV Files |
| 2 | + |
| 3 | +CCExtractor supports extracting VOBSUB (S_VOBSUB) subtitles from Matroska (MKV) containers. VOBSUB is an image-based subtitle format originally from DVD video. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +VOBSUB subtitles consist of two files: |
| 8 | +- `.idx` - Index file containing metadata, palette, and timestamp/position entries |
| 9 | +- `.sub` - Binary file containing the actual subtitle bitmap data in MPEG Program Stream format |
| 10 | + |
| 11 | +## Basic Usage |
| 12 | + |
| 13 | +```bash |
| 14 | +ccextractor movie.mkv |
| 15 | +``` |
| 16 | + |
| 17 | +This will extract all VOBSUB tracks and create paired `.idx` and `.sub` files: |
| 18 | +- `movie_eng.idx` + `movie_eng.sub` (first English track) |
| 19 | +- `movie_eng_1.idx` + `movie_eng_1.sub` (second English track, if present) |
| 20 | +- etc. |
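Since each track is written as a pair, a quick check after extraction is to confirm that every `.idx` has a non-empty `.sub` next to it. The following is a minimal sketch; the function name `check_vobsub_pairs` is hypothetical and not part of CCExtractor.

```shell
# Report any .idx file in a directory (default: current) that lacks a non-empty .sub partner.
# Returns 0 when every pair is complete, 1 otherwise.
check_vobsub_pairs() {
    dir="${1:-.}"
    missing=0
    for idx in "$dir"/*.idx; do
        [ -e "$idx" ] || continue      # the glob matched nothing
        sub="${idx%.idx}.sub"
        if [ -s "$sub" ]; then
            echo "ok: $idx + $sub"
        else
            echo "missing or empty: $sub"
            missing=1
        fi
    done
    return "$missing"
}
```

Run it as `check_vobsub_pairs .` in the directory where `ccextractor` wrote its output.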
| 21 | + |
| 22 | +## Converting VOBSUB to SRT (Text) |
| 23 | + |
| 24 | +Since VOBSUB subtitles are images, you need OCR (Optical Character Recognition) to convert them to text-based formats like SRT. |
| 25 | + |
| 26 | +### Using subtile-ocr (Recommended) |
| 27 | + |
| 28 | +[subtile-ocr](https://github.com/gwen-lg/subtile-ocr) is an actively maintained Rust tool that uses Tesseract to convert VOBSUB image subtitles to SRT. |
| 29 | + |
| 30 | +#### Option 1: Docker (Easiest) |
| 31 | + |
| 32 | +We provide a Dockerfile that builds subtile-ocr with all dependencies: |
| 33 | + |
| 34 | +```bash |
| 35 | +# Build the Docker image (one-time) |
| 36 | +cd tools/vobsubocr |
| 37 | +docker build -t subtile-ocr . |
| 38 | + |
| 39 | +# Extract VOBSUB from MKV |
| 40 | +ccextractor movie.mkv |
| 41 | + |
| 42 | +# Convert to SRT using OCR |
| 43 | +docker run --rm -v "$(pwd)":/data subtile-ocr -l eng -o /data/movie_eng.srt /data/movie_eng.idx |
| 44 | +``` |
| 45 | + |
| 46 | +#### Option 2: Install subtile-ocr Natively |
| 47 | + |
| 48 | +If you have Rust and Tesseract development libraries installed: |
| 49 | + |
| 50 | +```bash |
| 51 | +# Install dependencies (Ubuntu/Debian) |
| 52 | +sudo apt-get install libleptonica-dev libtesseract-dev tesseract-ocr tesseract-ocr-eng |
| 53 | + |
| 54 | +# Install subtile-ocr |
| 55 | +cargo install --git https://github.com/gwen-lg/subtile-ocr |
| 56 | + |
| 57 | +# Convert |
| 58 | +subtile-ocr -l eng -o movie_eng.srt movie_eng.idx |
| 59 | +``` |
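When an MKV yields several tracks, the conversion step can be looped over every extracted `.idx` file. This is a sketch that assumes all tracks share one language; the function name `ocr_all` is hypothetical, and only the `-l` and `-o` options shown above are used.

```shell
# Run subtile-ocr on every .idx file in a directory, writing a matching .srt next to it.
# $1 = Tesseract language code, $2 = directory (default: current).
ocr_all() {
    lang="$1"
    dir="${2:-.}"
    for idx in "$dir"/*.idx; do
        [ -e "$idx" ] || continue      # the glob matched nothing
        subtile-ocr -l "$lang" -o "${idx%.idx}.srt" "$idx"
    done
}
```

For example, `ocr_all eng .` would produce `movie_eng.srt` and `movie_eng_1.srt` from the files listed under Basic Usage. Tracks in different languages need separate calls with the matching language code.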
| 60 | + |
| 61 | +### subtile-ocr Options |
| 62 | + |
| 63 | +| Option | Description | |
| 64 | +|--------|-------------| |
| 65 | +| `-l, --lang <LANG>` | Tesseract language code (required). Examples: `eng`, `fra`, `deu`, `chi_sim` | |
| 66 | +| `-o, --output <FILE>` | Output SRT file (stdout if not specified) | |
| 67 | +| `-t, --threshold <0.0-1.0>` | Binarization threshold (default: 0.6) | |
| 68 | +| `-d, --dpi <DPI>` | Image DPI for OCR (default: 150) | |
| 69 | +| `--dump` | Save processed subtitle images as PNG files | |
| 70 | + |
| 71 | +### Language Codes |
| 72 | + |
| 73 | +Install additional Tesseract language packs as needed: |
| 74 | + |
| 75 | +```bash |
| 76 | +# Examples |
| 77 | +sudo apt-get install tesseract-ocr-fra # French |
| 78 | +sudo apt-get install tesseract-ocr-deu # German |
| 79 | +sudo apt-get install tesseract-ocr-spa # Spanish |
| 80 | +sudo apt-get install tesseract-ocr-chi-sim # Simplified Chinese |
| 81 | +``` |
| 82 | + |
| 83 | +## Technical Details |
| 84 | + |
| 85 | +### .idx File Format |
| 86 | + |
| 87 | +The index file contains: |
| 88 | +1. Header with metadata (size, palette, alignment settings) |
| 89 | +2. Language identifier line |
| 90 | +3. Timestamp entries with file positions |
| 91 | + |
| 92 | +Example: |
| 93 | +``` |
| 94 | +# VobSub index file, v7 (do not modify this line!) |
| 95 | +size: 720x576 |
| 96 | +palette: 000000, 828282, ... |
| 97 | + |
| 98 | +id: eng, index: 0 |
| 99 | +timestamp: 00:01:12:920, filepos: 000000000 |
| 100 | +timestamp: 00:01:18:640, filepos: 000000800 |
| 101 | +... |
| 102 | +``` |
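To make the entry layout concrete, the sketch below prints each `timestamp` line as milliseconds plus a decimal byte offset. It assumes `filepos` is hexadecimal, as VobSub index files conventionally store it (so `000000800` is 2048). The function name `idx_entries` is hypothetical.

```shell
# Print "<milliseconds> <byte offset>" for every timestamp entry in a VobSub .idx file.
idx_entries() {
    awk '
        /^timestamp:/ {
            # Field 2 is HH:MM:SS:mmm followed by a comma; field 4 is the hex file position.
            ts = $2
            sub(/,$/, "", ts)
            split(ts, t, ":")
            ms = ((t[1] * 60 + t[2]) * 60 + t[3]) * 1000 + t[4]
            printf "%d %d\n", ms, hex2dec($4)
        }
        # Portable hex-to-decimal conversion (no gawk extensions needed).
        function hex2dec(h,    i, n, d) {
            n = 0
            h = tolower(h)
            for (i = 1; i <= length(h); i++) {
                d = index("0123456789abcdef", substr(h, i, 1)) - 1
                n = n * 16 + d
            }
            return n
        }
    ' "$1"
}
```

Applied to the two entries in the example above, `idx_entries movie_eng.idx` would print `72920 0` and `78640 2048`.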
| 103 | + |
| 104 | +### .sub File Format |
| 105 | + |
| 106 | +The binary file contains MPEG Program Stream packets: |
| 107 | +- Each subtitle is wrapped in a PS Pack header (14 bytes) + PES header (15 bytes) |
| 108 | +- Subtitles are aligned to 2048-byte boundaries |
| 109 | +- Contains raw SPU (SubPicture Unit) bitmap data |
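Two of these properties can be checked from the shell. The sketch below tests whether a `.sub` file begins with the MPEG Program Stream pack start code (`00 00 01 BA`) and whether a given byte offset falls on a 2048-byte boundary. Both function names are hypothetical helpers, not CCExtractor commands.

```shell
# Succeeds if the file starts with the MPEG-PS pack start code 00 00 01 BA.
sub_has_pack_header() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$magic" = "000001ba" ]
}

# Succeeds if a decimal byte offset sits on a 2048-byte boundary.
is_sector_aligned() {
    [ $(( $1 % 2048 )) -eq 0 ]
}
```

For example, `sub_has_pack_header movie_eng.sub && echo "looks like MPEG-PS"` gives a quick sanity check before running OCR.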
| 110 | + |
| 111 | +## Troubleshooting |
| 112 | + |
| 113 | +### Empty output files |
| 114 | +- Ensure the MKV file actually contains VOBSUB tracks (check with `mediainfo` or `ffprobe`) |
| 115 | +- CCExtractor will report "No VOBSUB subtitles to write" if the track is empty |
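One way to run the `ffprobe` check is to list the subtitle streams and count those that FFmpeg labels `dvd_subtitle`, the codec name it uses for VOBSUB. This is a sketch; the helper name `count_vobsub_tracks` is hypothetical.

```shell
# Count VOBSUB tracks in ffprobe CSV output read from stdin.
# Prints 0 (and still exits successfully) when none are found.
count_vobsub_tracks() {
    grep -c 'dvd_subtitle' || true
}

# Usage (requires ffprobe):
# ffprobe -v error -select_streams s \
#     -show_entries stream=index,codec_name -of csv=p=0 movie.mkv | count_vobsub_tracks
```

A result of `0` means the file has no VOBSUB tracks, so there is nothing for CCExtractor to write.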
| 116 | + |
| 117 | +### OCR quality issues |
| 118 | +- Try adjusting the `-t` threshold parameter |
| 119 | +- Ensure the correct language pack is installed |
| 120 | +- Use `--dump` to inspect the processed images |
| 121 | + |
| 122 | +### Docker permission issues |
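| 123 | +- Output files written by the container may be owned by root; run `sudo chown "$(id -u):$(id -g)" movie_eng.srt` to take ownership |
| 124 | +- Alternatively, run the container as your own user by adding `--user "$(id -u):$(id -g)"` to the `docker run` command |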
| 125 | + |
| 126 | +## See Also |
| 127 | + |
| 128 | +- [OCR.md](OCR.md) - General OCR support in CCExtractor |
| 129 | +- [subtile-ocr GitHub](https://github.com/gwen-lg/subtile-ocr) - OCR tool documentation |