Commit 6f2a73d

cfsmp3 and claude committed
docs: Add VOBSUB extraction documentation and subtile-ocr Dockerfile
- Add docs/VOBSUB.md explaining the VOBSUB extraction workflow
- Add tools/vobsubocr/Dockerfile for building subtile-ocr OCR tool
- Document how to convert VOBSUB (.idx/.sub) to SRT using OCR

The Dockerfile uses subtile-ocr (https://github.com/gwen-lg/subtile-ocr), an actively maintained fork of vobsubocr with better accuracy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
1 parent 1fccb78 commit 6f2a73d

2 files changed: +164 -0 lines changed

docs/VOBSUB.md

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
# VOBSUB Subtitle Extraction from MKV Files

CCExtractor supports extracting VOBSUB (S_VOBSUB) subtitles from Matroska (MKV) containers. VOBSUB is an image-based subtitle format that originated on DVD-Video.

## Overview

VOBSUB subtitles consist of two files:
- `.idx` - Index file containing metadata, palette, and timestamp/position entries
- `.sub` - Binary file containing the actual subtitle bitmap data in MPEG Program Stream format

## Basic Usage

```bash
ccextractor movie.mkv
```

This extracts every VOBSUB track and writes a paired `.idx` and `.sub` file for each:
- `movie_eng.idx` + `movie_eng.sub` (first English track)
- `movie_eng_1.idx` + `movie_eng_1.sub` (second English track, if present)
- etc.
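
As a quick sanity check after extraction, the pairing described above can be verified with a small shell helper. This is only a sketch: `missing_sub` is a hypothetical name, not part of CCExtractor, and it assumes the `.idx` and `.sub` files sit side by side in one directory.

```bash
# Print every .idx file in directory $1 that has no non-empty .sub partner.
missing_sub() {
    for idx in "$1"/*.idx; do
        # Skip the unexpanded pattern when the directory holds no .idx files.
        [ -e "$idx" ] || continue
        sub="${idx%.idx}.sub"
        # -s is true only if the file exists and is larger than zero bytes.
        [ -s "$sub" ] || printf '%s\n' "$idx"
    done
}
```

Running `missing_sub .` prints nothing when every index file has its bitmap file.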

## Converting VOBSUB to SRT (Text)

Since VOBSUB subtitles are images, you need OCR (Optical Character Recognition) to convert them to text-based formats like SRT.

### Using subtile-ocr (Recommended)

[subtile-ocr](https://github.com/gwen-lg/subtile-ocr) is an actively maintained Rust tool that provides accurate OCR conversion.

#### Option 1: Docker (Easiest)

The repository provides a Dockerfile (`tools/vobsubocr/Dockerfile`) that builds subtile-ocr with all dependencies:

```bash
# Build the Docker image (one-time)
cd tools/vobsubocr
docker build -t subtile-ocr .

# Extract VOBSUB from MKV
ccextractor movie.mkv

# Convert to SRT using OCR (the quotes keep paths with spaces intact)
docker run --rm -v "$(pwd)":/data subtile-ocr -l eng -o /data/movie_eng.srt /data/movie_eng.idx
```

#### Option 2: Install subtile-ocr Natively

A native install needs a Rust toolchain plus the Tesseract and Leptonica development libraries:

```bash
# Install dependencies (Ubuntu/Debian)
sudo apt-get install libleptonica-dev libtesseract-dev tesseract-ocr tesseract-ocr-eng

# Install subtile-ocr
cargo install --git https://github.com/gwen-lg/subtile-ocr

# Convert
subtile-ocr -l eng -o movie_eng.srt movie_eng.idx
```
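
When one file yields several tracks, the conversion can be looped over every extracted index. The sketch below assumes the `<name>_<lang>[_<n>].idx` naming shown under Basic Usage and that the language part of the file name is also a valid Tesseract code. `lang_from_idx` and `convert_all` are hypothetical helper names, and only the `-l` and `-o` flags documented here are used.

```bash
# Derive the language code from a name such as movie_eng.idx or movie_eng_1.idx.
lang_from_idx() {
    # Drop the directory and extension, then a trailing "_<n>" track counter,
    # then everything up to the last remaining underscore.
    basename "$1" .idx | sed 's/_[0-9][0-9]*$//; s/.*_//'
}

# Run subtile-ocr on every .idx file in directory $1 (assumes it is on PATH).
convert_all() {
    for idx in "$1"/*.idx; do
        [ -e "$idx" ] || continue
        lang=$(lang_from_idx "$idx")
        subtile-ocr -l "$lang" -o "${idx%.idx}.srt" "$idx"
    done
}
```

For example, `lang_from_idx movie_eng_1.idx` prints `eng`, so `convert_all .` would write `movie_eng_1.srt` next to the index file.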

### subtile-ocr Options

| Option | Description |
|--------|-------------|
| `-l, --lang <LANG>` | Tesseract language code (required). Examples: `eng`, `fra`, `deu`, `chi_sim` |
| `-o, --output <FILE>` | Output SRT file (stdout if not specified) |
| `-t, --threshold <0.0-1.0>` | Binarization threshold (default: 0.6) |
| `-d, --dpi <DPI>` | Image DPI for OCR (default: 150) |
| `--dump` | Save processed subtitle images as PNG files |

### Language Codes

Install additional Tesseract language packs as needed. Debian/Ubuntu package names use hyphens, while the code passed to `-l` uses underscores (`tesseract-ocr-chi-sim` provides `chi_sim`). The provided Docker image installs only the English pack; add further `tesseract-ocr-*` packages to the Dockerfile for other languages.

```bash
# Examples
sudo apt-get install tesseract-ocr-fra      # French
sudo apt-get install tesseract-ocr-deu      # German
sudo apt-get install tesseract-ocr-spa      # Spanish
sudo apt-get install tesseract-ocr-chi-sim  # Simplified Chinese
```

## Technical Details

### .idx File Format

The index file contains:
1. Header with metadata (size, palette, alignment settings)
2. Language identifier line
3. Timestamp entries with file positions

Example:
```
# VobSub index file, v7 (do not modify this line!)
size: 720x576
palette: 000000, 828282, ...

id: eng, index: 0
timestamp: 00:01:12:920, filepos: 000000000
timestamp: 00:01:18:640, filepos: 000000800
...
```
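
To illustrate how the entries are read, this sketch turns each `timestamp` line into a start time in milliseconds and a byte offset into the `.sub` file. It assumes the `filepos` value is hexadecimal (with that reading, `000000800` is byte 2048), and the helper name `idx_entries` is made up for the example.

```bash
# Print "<start in ms> <byte offset>" for every timestamp entry of .idx file $1.
idx_entries() {
    # Split on spaces, colons and commas, so a line such as
    #   timestamp: 00:01:12:920, filepos: 000000000
    # yields $2=hours $3=minutes $4=seconds $5=milliseconds $7=filepos.
    awk -F'[ :,]+' '/^timestamp:/ {
        sub(/\r$/, "", $7)   # tolerate CRLF line endings
        ms = (($2 * 60 + $3) * 60 + $4) * 1000 + $5
        print ms, $7
    }' "$1" |
    while read -r ms hexpos; do
        # Assumption: filepos is hexadecimal, so prefix it with 0x for printf.
        printf '%d %d\n' "$ms" "0x$hexpos"
    done
}
```

For the two entries in the example above this prints `72920 0` and `78640 2048`.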

### .sub File Format

The binary file contains MPEG Program Stream packets:
- Each subtitle is wrapped in a PS Pack header (14 bytes) + PES header (15 bytes)
- Each subtitle starts on a 2048-byte boundary
- The payload is raw SPU (SubPicture Unit) bitmap data
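
One way to spot-check a `.sub` file against this layout is to look for the MPEG Program Stream pack start code, the byte sequence `00 00 01 BA`, at an offset where a subtitle is expected to begin. The helper below is a sketch with a hypothetical name, `has_pack_start`; it inspects only those four bytes and does not validate the rest of the pack or PES header.

```bash
# Succeed if the four bytes at byte offset $2 of file $1 are 00 00 01 BA.
has_pack_start() {
    # dd copies 4 bytes starting at the offset; od renders them as hex pairs.
    code=$(dd if="$1" bs=1 skip="$2" count=4 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$code" = "000001ba" ]
}
```

A typical call is `has_pack_start movie_eng.sub 0 && echo "pack header found"`.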

## Troubleshooting

### Empty output files
- Ensure the MKV file actually contains VOBSUB tracks (check with `mediainfo` or `ffprobe`)
- CCExtractor will report "No VOBSUB subtitles to write" if the track is empty
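
The `ffprobe` check can be scripted. FFmpeg reports VOBSUB tracks under the codec name `dvd_subtitle`, so the sketch below lists the subtitle streams as CSV rows and keeps only those. The function names `vobsub_rows` and `list_vobsub_tracks` are hypothetical, and `ffprobe` is assumed to be on the `PATH`.

```bash
# Keep only CSV rows (index,codec_name,language) whose codec is dvd_subtitle.
vobsub_rows() {
    awk -F, '$2 == "dvd_subtitle"'
}

# List the VOBSUB subtitle streams of media file $1, one CSV row per stream.
list_vobsub_tracks() {
    ffprobe -v error -select_streams s \
        -show_entries stream=index,codec_name:stream_tags=language \
        -of csv=p=0 "$1" | vobsub_rows
}
```

If `list_vobsub_tracks movie.mkv` prints nothing, the file has no VOBSUB tracks for CCExtractor to extract.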

### OCR quality issues
- Try adjusting the `-t` threshold parameter (default: 0.6)
- Ensure the correct language pack is installed
- Use `--dump` to inspect the processed images

### Docker permission issues
- The output files may be owned by root; use `sudo chown` to fix ownership
- Or run Docker with `--user "$(id -u):$(id -g)"` so the files are created under your own user

## See Also

- [OCR.md](OCR.md) - General OCR support in CCExtractor
- [subtile-ocr GitHub](https://github.com/gwen-lg/subtile-ocr) - OCR tool documentation

tools/vobsubocr/Dockerfile

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# Dockerfile for subtile-ocr - VOBSUB to SRT converter
# Uses subtile-ocr, an actively maintained fork of vobsubocr
# https://github.com/gwen-lg/subtile-ocr

FROM ubuntu:22.04

# Prevent interactive prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive

# Install build dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    clang \
    pkg-config \
    libleptonica-dev \
    libtesseract-dev \
    tesseract-ocr \
    tesseract-ocr-eng \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"

# Install subtile-ocr from git
RUN cargo install --git https://github.com/gwen-lg/subtile-ocr

# Create working directory
WORKDIR /data

# Default command shows help
ENTRYPOINT ["subtile-ocr"]
CMD ["--help"]
