Commit 173db88

feat(matroska): Add VOBSUB subtitle extraction support for MKV files
2 parents ec30a79 + 9d14766 commit 173db88

4 files changed: +401 -9 lines changed

docs/VOBSUB.md

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
# VOBSUB Subtitle Extraction from MKV Files

CCExtractor supports extracting VOBSUB (S_VOBSUB) subtitles from Matroska (MKV) containers. VOBSUB is an image-based subtitle format originally from DVD video.

## Overview

VOBSUB subtitles consist of two files:
- `.idx` - Index file containing metadata, palette, and timestamp/position entries
- `.sub` - Binary file containing the actual subtitle bitmap data in MPEG Program Stream format

## Basic Usage

```bash
ccextractor movie.mkv
```

This will extract all VOBSUB tracks and create paired `.idx` and `.sub` files:
- `movie_eng.idx` + `movie_eng.sub` (first English track)
- `movie_eng_1.idx` + `movie_eng_1.sub` (second English track, if present)
- etc.
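
The naming scheme above can be sketched in a few lines of C. This is only an illustration of the pattern, not the code CCExtractor uses: the helper name `vobsub_output_name` and its parameters are assumptions made for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of CCExtractor): builds "<base>_<lang><ext>"
 * for the first track of a language and "<base>_<lang>_<n><ext>" for any
 * later track of the same language. Returns the snprintf() result. */
static int vobsub_output_name(char *out, size_t outsize, const char *base,
                              const char *lang, long long lang_index,
                              const char *ext)
{
	if (lang_index == 0)
		return snprintf(out, outsize, "%s_%s%s", base, lang, ext);
	return snprintf(out, outsize, "%s_%s_%lld%s", base, lang, lang_index, ext);
}
```

Calling it with `("movie", "eng", 1, ".sub")` yields `movie_eng_1.sub`, matching the second entry in the list.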

## Converting VOBSUB to SRT (Text)

Since VOBSUB subtitles are images, you need OCR (Optical Character Recognition) to convert them to text-based formats like SRT.

### Using subtile-ocr (Recommended)

[subtile-ocr](https://github.com/gwen-lg/subtile-ocr) is an actively maintained Rust tool that provides accurate OCR conversion.

#### Option 1: Docker (Easiest)

We provide a Dockerfile that builds subtile-ocr with all dependencies:

```bash
# Build the Docker image (one-time)
cd tools/vobsubocr
docker build -t subtile-ocr .

# Extract VOBSUB from MKV
ccextractor movie.mkv

# Convert to SRT using OCR (the quotes keep paths with spaces intact)
docker run --rm -v "$(pwd)":/data subtile-ocr -l eng -o /data/movie_eng.srt /data/movie_eng.idx
```

#### Option 2: Install subtile-ocr Natively

If you have Rust and Tesseract development libraries installed:

```bash
# Install dependencies (Ubuntu/Debian)
sudo apt-get install libleptonica-dev libtesseract-dev tesseract-ocr tesseract-ocr-eng

# Install subtile-ocr
cargo install --git https://github.com/gwen-lg/subtile-ocr

# Convert
subtile-ocr -l eng -o movie_eng.srt movie_eng.idx
```

### subtile-ocr Options

| Option | Description |
|--------|-------------|
| `-l, --lang <LANG>` | Tesseract language code (required). Examples: `eng`, `fra`, `deu`, `chi_sim` |
| `-o, --output <FILE>` | Output SRT file (stdout if not specified) |
| `-t, --threshold <0.0-1.0>` | Binarization threshold (default: 0.6) |
| `-d, --dpi <DPI>` | Image DPI for OCR (default: 150) |
| `--dump` | Save processed subtitle images as PNG files |

### Language Codes

Install additional Tesseract language packs as needed:

```bash
# Examples
sudo apt-get install tesseract-ocr-fra # French
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-spa # Spanish
sudo apt-get install tesseract-ocr-chi-sim # Simplified Chinese
```

## Technical Details

### .idx File Format

The index file contains:
1. Header with metadata (size, palette, alignment settings)
2. Language identifier line
3. Timestamp entries (`HH:MM:SS:mmm`), each with a hexadecimal byte offset (`filepos`) into the `.sub` file

Example:
```
# VobSub index file, v7 (do not modify this line!)
size: 720x576
palette: 000000, 828282, ...

id: eng, index: 0
timestamp: 00:01:12:920, filepos: 000000000
timestamp: 00:01:18:640, filepos: 000000800
...
```
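
To make the entry layout concrete, here is a minimal sketch of reading one timestamp line back into milliseconds and a byte offset. It assumes the exact `timestamp: HH:MM:SS:mmm, filepos: <hex>` shape shown above; the function name `parse_idx_entry` is hypothetical and is not part of CCExtractor.

```c
#include <stdio.h>

/* Hypothetical parser for one .idx entry such as
 *   "timestamp: 00:01:18:640, filepos: 000000800"
 * Stores the time in milliseconds and the byte offset (filepos is hex).
 * Returns 0 on success, -1 if the line does not match the expected shape. */
static int parse_idx_entry(const char *line, unsigned long long *ms,
                           unsigned long long *filepos)
{
	unsigned int h, m, s, frac;
	unsigned long long pos;

	if (sscanf(line, "timestamp: %u:%u:%u:%u, filepos: %llx",
	           &h, &m, &s, &frac, &pos) != 5)
		return -1;

	*ms = ((h * 60ULL + m) * 60ULL + s) * 1000ULL + frac;
	*filepos = pos;
	return 0;
}
```

For the second entry in the example, this gives 78640 ms and offset 0x800 (2048 bytes), which is exactly one block after the first subtitle.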

### .sub File Format

The binary file contains MPEG Program Stream packets:
- Each subtitle is wrapped in a PS Pack header (14 bytes) + PES header (15 bytes)
- Subtitles are aligned to 2048-byte boundaries
- Contains raw SPU (SubPicture Unit) bitmap data
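
The layout rules above determine how far `filepos` advances from one subtitle to the next. The following sketch computes the padded size of a single entry under those rules; the function name `vobsub_entry_size` and the two header-size macros are names chosen for this illustration.

```c
#include <stddef.h>

#define PS_PACK_HEADER_SIZE 14 /* bytes, as listed above */
#define PES_HEADER_SIZE 15     /* bytes, as listed above */
#define VOBSUB_BLOCK_SIZE 2048 /* alignment boundary */

/* Illustrative helper: number of bytes one subtitle occupies in the .sub
 * file, i.e. both headers plus the SPU payload, rounded up to the next
 * 2048-byte boundary. */
static size_t vobsub_entry_size(size_t spu_size)
{
	size_t raw = PS_PACK_HEADER_SIZE + PES_HEADER_SIZE + spu_size;
	size_t rem = raw % VOBSUB_BLOCK_SIZE;

	return rem == 0 ? raw : raw + (VOBSUB_BLOCK_SIZE - rem);
}
```

A 100-byte SPU therefore occupies one full 2048-byte block, while a 2020-byte SPU (2049 bytes with headers) spills into a second block and occupies 4096 bytes.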

## Troubleshooting

### Empty output files
- Ensure the MKV file actually contains VOBSUB tracks (check with `mediainfo` or `ffprobe`)
- CCExtractor will report "No VOBSUB subtitles to write" if the track is empty

### OCR quality issues
- Try adjusting the `-t` threshold parameter
- Ensure the correct language pack is installed
- Use `--dump` to inspect the processed images

### Docker permission issues
- The output files may be owned by root; use `sudo chown` to fix ownership
- Or run Docker with `--user $(id -u):$(id -g)`

## See Also

- [OCR.md](OCR.md) - General OCR support in CCExtractor
- [subtile-ocr GitHub](https://github.com/gwen-lg/subtile-ocr) - OCR tool documentation

src/lib_ccx/matroska.c

Lines changed: 232 additions & 9 deletions
@@ -1334,11 +1334,243 @@ char *ass_ssa_sentence_erase_read_order(char *text)
 	return buf;
 }
 
+/* VOBSUB support: Generate PS Pack header
+ * The PS Pack header is 14 bytes:
+ * - 4 bytes: start code (00 00 01 ba)
+ * - 6 bytes: SCR (System Clock Reference) in MPEG-2 format
+ * - 3 bytes: mux rate
+ * - 1 byte: stuffing length (0)
+ */
+static void generate_ps_pack_header(unsigned char *buf, ULLONG pts_90khz)
+{
+	// PS Pack start code
+	buf[0] = 0x00;
+	buf[1] = 0x00;
+	buf[2] = 0x01;
+	buf[3] = 0xBA;
+
+	// SCR (System Clock Reference) - use PTS as SCR base, SCR extension = 0
+	// MPEG-2 format: 01 SCR[32:30] 1 SCR[29:15] 1 SCR[14:0] 1 SCR_ext[8:0] 1
+	ULLONG scr = pts_90khz;
+	ULLONG scr_base = scr;
+	int scr_ext = 0;
+
+	buf[4] = 0x44 | ((scr_base >> 27) & 0x38) | ((scr_base >> 28) & 0x03);
+	buf[5] = (scr_base >> 20) & 0xFF;
+	buf[6] = 0x04 | ((scr_base >> 12) & 0xF8) | ((scr_base >> 13) & 0x03);
+	buf[7] = (scr_base >> 5) & 0xFF;
+	buf[8] = 0x04 | ((scr_base << 3) & 0xF8) | ((scr_ext >> 7) & 0x03);
+	buf[9] = ((scr_ext << 1) & 0xFE) | 0x01;
+
+	// Mux rate (10080 = standard DVD rate)
+	int mux_rate = 10080;
+	buf[10] = (mux_rate >> 14) & 0xFF;
+	buf[11] = (mux_rate >> 6) & 0xFF;
+	buf[12] = ((mux_rate << 2) & 0xFC) | 0x03;
+
+	// Stuffing length = 0, with marker bits
+	buf[13] = 0xF8;
+}
+
+/* VOBSUB support: Generate PES header for private stream 1
+ * Returns the total header size (always 15 bytes here: 9 fixed bytes,
+ * 5 PTS bytes and 1 substream ID byte)
+ */
+static int generate_pes_header(unsigned char *buf, ULLONG pts_90khz, int payload_size, int stream_id)
+{
+	// PES start code for private stream 1
+	buf[0] = 0x00;
+	buf[1] = 0x00;
+	buf[2] = 0x01;
+	buf[3] = 0xBD; // Private stream 1
+
+	// PES packet length = header data (3 + 5 for PTS) + 1 (substream ID) + payload
+	int pes_header_data_len = 5; // PTS only
+	int pes_packet_len = 3 + pes_header_data_len + 1 + payload_size;
+	buf[4] = (pes_packet_len >> 8) & 0xFF;
+	buf[5] = pes_packet_len & 0xFF;
+
+	// PES flags: MPEG-2, original
+	buf[6] = 0x81;
+	// PTS_DTS_flags = 10 (PTS only)
+	buf[7] = 0x80;
+	// PES header data length
+	buf[8] = pes_header_data_len;
+
+	// PTS (5 bytes): '0010' | PTS[32:30] | '1' | PTS[29:15] | '1' | PTS[14:0] | '1'
+	buf[9] = 0x21 | ((pts_90khz >> 29) & 0x0E);
+	buf[10] = (pts_90khz >> 22) & 0xFF;
+	buf[11] = 0x01 | ((pts_90khz >> 14) & 0xFE);
+	buf[12] = (pts_90khz >> 7) & 0xFF;
+	buf[13] = 0x01 | ((pts_90khz << 1) & 0xFE);
+
+	// Substream ID (0x20 = first VOBSUB stream)
+	buf[14] = 0x20 + stream_id;
+
+	return 15; // Total PES header size
+}
+
+/* VOBSUB support: Generate timestamp string for .idx file
+ * Format: HH:MM:SS:mmm (where mmm is milliseconds)
+ */
+static void generate_vobsub_timestamp(char *buf, size_t bufsize, ULLONG milliseconds)
+{
+	ULLONG ms = milliseconds % 1000;
+	milliseconds /= 1000;
+	ULLONG seconds = milliseconds % 60;
+	milliseconds /= 60;
+	ULLONG minutes = milliseconds % 60;
+	milliseconds /= 60;
+	ULLONG hours = milliseconds;
+
+	snprintf(buf, bufsize, "%02" LLU_M ":%02" LLU_M ":%02" LLU_M ":%03" LLU_M,
+		 hours, minutes, seconds, ms);
+}
+
+/* VOBSUB support: Save VOBSUB track to .idx and .sub files */
+#define VOBSUB_BLOCK_SIZE 2048
+static void save_vobsub_track(struct matroska_ctx *mkv_ctx, struct matroska_sub_track *track)
+{
+	if (track->sentence_count == 0)
+	{
+		mprint("\nNo VOBSUB subtitles to write");
+		return;
+	}
+
+	// Generate base filename (without extension)
+	const char *lang_to_use = track->lang_ietf ? track->lang_ietf : track->lang;
+	const char *basename = get_basename(mkv_ctx->filename);
+	size_t needed = strlen(basename) + strlen(lang_to_use) + 32;
+	char *base_filename = malloc(needed);
+	if (base_filename == NULL)
+		fatal(EXIT_NOT_ENOUGH_MEMORY, "In save_vobsub_track: Out of memory.");
+
+	if (track->lang_index == 0)
+		snprintf(base_filename, needed, "%s_%s", basename, lang_to_use);
+	else
+		snprintf(base_filename, needed, "%s_%s_" LLD, basename, lang_to_use, track->lang_index);
+
+	// Create .sub filename
+	char *sub_filename = malloc(needed + 5);
+	if (sub_filename == NULL)
+		fatal(EXIT_NOT_ENOUGH_MEMORY, "In save_vobsub_track: Out of memory.");
+	snprintf(sub_filename, needed + 5, "%s.sub", base_filename);
+
+	// Create .idx filename
+	char *idx_filename = malloc(needed + 5);
+	if (idx_filename == NULL)
+		fatal(EXIT_NOT_ENOUGH_MEMORY, "In save_vobsub_track: Out of memory.");
+	snprintf(idx_filename, needed + 5, "%s.idx", base_filename);
+
+	mprint("\nOutput files: %s, %s", idx_filename, sub_filename);
+
+	// Open .sub file
+	int sub_desc;
+#ifdef WIN32
+	sub_desc = open(sub_filename, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, S_IREAD | S_IWRITE);
+#else
+	sub_desc = open(sub_filename, O_WRONLY | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
+#endif
+	if (sub_desc < 0)
+	{
+		mprint("\nError: Cannot create .sub file");
+		free(base_filename);
+		free(sub_filename);
+		free(idx_filename);
+		return;
+	}
+
+	// Open .idx file
+	int idx_desc;
+#ifdef WIN32
+	idx_desc = open(idx_filename, O_WRONLY | O_CREAT | O_TRUNC, S_IREAD | S_IWRITE);
+#else
+	idx_desc = open(idx_filename, O_WRONLY | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
+#endif
+	if (idx_desc < 0)
+	{
+		mprint("\nError: Cannot create .idx file");
+		close(sub_desc);
+		free(base_filename);
+		free(sub_filename);
+		free(idx_filename);
+		return;
+	}
+
+	// Write .idx header (from CodecPrivate)
+	if (track->header != NULL)
+		write_wrapped(idx_desc, track->header, strlen(track->header));
+
+	// Add language identifier line
+	char lang_line[128];
+	snprintf(lang_line, sizeof(lang_line), "\nid: %s, index: 0\n", lang_to_use);
+	write_wrapped(idx_desc, lang_line, strlen(lang_line));
+
+	// Buffer for PS/PES headers and padding
+	unsigned char header_buf[32];
+	unsigned char zero_buf[VOBSUB_BLOCK_SIZE];
+	memset(zero_buf, 0, VOBSUB_BLOCK_SIZE);
+
+	ULLONG file_pos = 0;
+
+	// Write each subtitle
+	for (int i = 0; i < track->sentence_count; i++)
+	{
+		struct matroska_sub_sentence *sentence = track->sentences[i];
+		mkv_ctx->sentence_count++;
+
+		// Convert timestamp to 90kHz PTS
+		ULLONG pts_90khz = sentence->time_start * 90;
+
+		// Write timestamp entry to .idx
+		char timestamp[32];
+		generate_vobsub_timestamp(timestamp, sizeof(timestamp), sentence->time_start);
+		char idx_entry[128];
+		snprintf(idx_entry, sizeof(idx_entry), "timestamp: %s, filepos: %09" LLX_M "\n",
+			 timestamp, file_pos);
+		write_wrapped(idx_desc, idx_entry, strlen(idx_entry));
+
+		// Generate PS Pack header (14 bytes)
+		generate_ps_pack_header(header_buf, pts_90khz);
+		write_wrapped(sub_desc, (char *)header_buf, 14);
+
+		// Generate PES header (15 bytes)
+		int pes_header_len = generate_pes_header(header_buf, pts_90khz, sentence->text_size, 0);
+		write_wrapped(sub_desc, (char *)header_buf, pes_header_len);
+
+		// Write SPU data
+		write_wrapped(sub_desc, sentence->text, sentence->text_size);
+
+		// Calculate bytes written and pad to block boundary
+		ULLONG bytes_written = 14 + pes_header_len + sentence->text_size;
+		ULLONG padding_needed = VOBSUB_BLOCK_SIZE - (bytes_written % VOBSUB_BLOCK_SIZE);
+		if (padding_needed < VOBSUB_BLOCK_SIZE)
+		{
+			write_wrapped(sub_desc, (char *)zero_buf, padding_needed);
+			bytes_written += padding_needed;
+		}
+
+		file_pos += bytes_written;
+	}
+
+	close(sub_desc);
+	close(idx_desc);
+	free(base_filename);
+	free(sub_filename);
+	free(idx_filename);
+}
+
 void save_sub_track(struct matroska_ctx *mkv_ctx, struct matroska_sub_track *track)
 {
 	char *filename;
 	int desc;
 
+	// VOBSUB tracks need special handling - separate .idx and .sub files
+	if (track->codec_id == MATROSKA_TRACK_SUBTITLE_CODEC_ID_VOBSUB)
+	{
+		save_vobsub_track(mkv_ctx, track);
+		return;
+	}
+
 	if (mkv_ctx->ctx->cc_to_stdout == CCX_TRUE)
 	{
 		desc = 1; // file descriptor of stdout
@@ -1358,11 +1590,6 @@ void save_sub_track(struct matroska_ctx *mkv_ctx, struct matroska_sub_track *tra
 	if (track->header != NULL)
 		write_wrapped(desc, track->header, strlen(track->header));
 
-	if (track->codec_id == MATROSKA_TRACK_SUBTITLE_CODEC_ID_VOBSUB)
-	{
-		mprint("\nError: VOBSUB not supported");
-	}
-
 	for (int i = 0; i < track->sentence_count; i++)
 	{
 		struct matroska_sub_sentence *sentence = track->sentences[i];
@@ -1497,10 +1724,6 @@ void save_sub_track(struct matroska_ctx *mkv_ctx, struct matroska_sub_track *tra
 			free(timestamp_start);
 			free(timestamp_end);
 		}
-		else if (track->codec_id == MATROSKA_TRACK_SUBTITLE_CODEC_ID_VOBSUB)
-		{
-			// TODO: Add support for VOBSUB
-		}
 	}
}
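
One way to sanity-check the 5-byte PTS layout written by `generate_pes_header` is to encode a value and decode it again. The sketch below copies the bit layout from the hunk above; the names `pts_encode` and `pts_decode` are hypothetical, and the decoder is an assumption about how a reader would reverse the layout, not code from this commit.

```c
#include <stdint.h>

/* Encode a 33-bit PTS using the same layout as the commit:
 * '0010' | PTS[32:30] | '1' | PTS[29:15] | '1' | PTS[14:0] | '1' */
static void pts_encode(uint8_t out[5], uint64_t pts)
{
	out[0] = (uint8_t)(0x21 | ((pts >> 29) & 0x0E));
	out[1] = (uint8_t)((pts >> 22) & 0xFF);
	out[2] = (uint8_t)(0x01 | ((pts >> 14) & 0xFE));
	out[3] = (uint8_t)((pts >> 7) & 0xFF);
	out[4] = (uint8_t)(0x01 | ((pts << 1) & 0xFE));
}

/* Hypothetical inverse: drop the prefix and marker bits, reassemble 33 bits */
static uint64_t pts_decode(const uint8_t in[5])
{
	return ((uint64_t)(in[0] & 0x0E) << 29) |
	       ((uint64_t)in[1] << 22) |
	       ((uint64_t)(in[2] & 0xFE) << 14) |
	       ((uint64_t)in[3] << 7) |
	       ((uint64_t)in[4] >> 1);
}
```

A subtitle starting at 72920 ms has a PTS of 72920 * 90 = 6562800, which should survive the round trip unchanged.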