Commit 94980a6

Merge pull request #6 from NebularNerd/dev
Adds .ssa/.ass support and Multifile batch processing
2 parents 58fdb20 + 110be45 commit 94980a6

3 files changed: 133 additions and 32 deletions

.github/workflows/black.yml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
1+
name: Black Formatting and Linting
2+
3+
on: [push, pull_request]
4+
5+
jobs:
6+
lint:
7+
runs-on: ubuntu-latest
8+
steps:
9+
- uses: actions/checkout@v4
10+
- uses: psf/black@stable
11+

README.md

Lines changed: 15 additions & 17 deletions
@@ -1,5 +1,5 @@
11
# subtotxt
2-
Quickly convert a [SubRip](https://en.wikipedia.org/wiki/SubRip) .srt or [WEBVTT](https://en.wikipedia.org/wiki/WebVTT) .vtt subtitle file to plain text. Removes timestamps and .srt/.vtt subtitle line numbers.
2+
Quickly convert a [SubRip](https://en.wikipedia.org/wiki/SubRip) .srt, [SubStation Alpha](https://wiki.multimedia.cx/index.php?title=SubStation_Alpha) .ssa/.ass or [WEBVTT](https://en.wikipedia.org/wiki/WebVTT) .vtt subtitle file to plain text. Removes timestamps and .srt/.vtt subtitle line numbers.
33
This was a quick project thrown together for my girlfriend, she's still learning English and wanted to be able to read subtitles more like a transcript for some trickier language issues (and to understand the jokes in Friends by discussing them with me).
44

55
With a spot of feature creep and some encoding detection needs, it evolved into being able to detect character encoding, along with being able to understand both .srt and .vtt formats to save some pre-processing work.
@@ -11,20 +11,16 @@ or
1111
The script will check which format the subtitle file is (in case of incorrect file extensions), detect the character encoding used, then write out a .txt file with the same name as your input. If the output file already exists it will ask for permission to delete it and create a new one.
1212
## Advanced Usage:
1313
The script has more advanced arguments you can pass:
14-
- *--utf8* or *-8*
15-
Forces the output file to use [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding. This may eliminate character encoding issues if you cannot view the output file. In practice, if you can read the contents of the input subtitle file successfully the output should work without the need to change the encoding.
16-
- *--pause* or *-p*
17-
Pause the script at the sanity check stage to let you check some stats before continuing, handy if the output is not working.
18-
- *--screen* or *-s*
19-
Prints the output to the console while writing to the file, may help with debugging failed outputs.
20-
- *--copy* or *-c*
21-
Copies input to output without change, appends *-copy* to filename *e.g.: subtitle-copy.srt*, handy to use with *--utf8* to quickly change encoding. Might be useful if your video player app cannot understand your original subtitle file encoding.
22-
- *--overwrite* or *-o*
23-
Skips asking `Output file already exists, delete and make a new one? [y/n]` and simply deletes the existing output file to create a new one. Ideal for batch processing.
24-
- *--oneliners* or *-1*
25-
Writes all sentences in one line, even if the original file divides some sentences into many lines or subtitles.
26-
- *--help* or *-h*
27-
Shows above information.
14+
- **--dir** or **-d**: Multiple file mode, use this **instead** of `-f` and point it at a folder containing your subtitles. It will run through and process them all, the files must have `.srt`, `.vtt`, `.ssa` or `.ass` extensions. Path can be a full path e.g. `C:\mysubs` or a relative path `.\`.
15+
- **--nonames** or **-nn**: For SubStation Alpha this prevents prepending the subtitle line with the character name given in the file, if present. A line with a character might appear as `Blackadder: Your name is Bob?`. I highly recommend this setting if using `--oneliners` below. For other formats we attempt to remove `NAME:` from the beginning of the subtitle line.
16+
- **--nosort** or **-ns**: Specifically for SubStation Alpha files. Subtitles in these files can be stored in any order, because each line's timecode determines when it appears. I imagine the main reason for this is that you could keep the dialogue in one block and the labels for signs, books, etc. in another. By default we sort lines by start timecode; most examples I've seen have everything in one large block anyway.
17+
- **--utf8** or **-8**: Forces the output file to use [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding. This may eliminate character encoding issues if you cannot view the output file. In practice, if you can read the contents of the input subtitle file successfully the output should work without the need to change the encoding.
18+
- **--pause** or **-p**: Pause the script at the sanity check stage to let you check some stats before continuing, handy if the output is not working.
19+
- **--screen** or **-s**: Prints the output to the console while writing to the file, may help with debugging failed outputs.
20+
- **--copy** or **-c**: Copies input to output without change, appends *-copy* to filename *e.g.: subtitle-copy.srt*, handy to use with *--utf8* to quickly change encoding. Might be useful if your video player app cannot understand your original subtitle file encoding.
21+
- **--overwrite** or **-o**: Skips asking `Output file already exists, delete and make a new one? [y/n]` and simply deletes the existing output file to create a new one. Ideal for batch processing.
22+
- **--oneliners** or **-1**: Writes all sentences in one line, even if the original file divides some sentences into many lines or subtitles.
23+
- **--help** or **-h**: Shows above information.
2824
## Required External Modules:
2925
- [Send2Trash](https://pypi.org/project/Send2Trash/) Python module to safely delete the old output file on both Win and \*nix based systems.
3026
- ~~[cchardet](https://pypi.org/project/cchardet/) Python module to detect your subtitle file encoding~~ (Removed for v2.0+ release due to issues with Python 3.10.x installs, still used in v1.0 and will work on Python 3.9.x installs).
@@ -33,15 +29,17 @@ Shows above information.
3329
If your system does not have these installed, it will auto install them on first use (or if you install a new version of Python later). If you prefer you can install them either manually, or by using the `requirements.txt`
3430
## Features:
3531
- Fast (aside from initial missing modules install on slow net connections)
32+
- Process a single file or point at a folder to process all supported files.
3633
- Input files character encoding formats are autodetected (if supported by [cchardet](https://pypi.org/project/cchardet/) [v1.0] or [charset_normalizer](https://github.com/Ousret/charset_normalizer) [v2.0+]). For most languages it should be fine, for Chinese and near neighbour languages it can be tricky, a subtitle may contain valid characters for Mandarin or Cantonese (or other dialects) and be in potentially the wrong encoding. This can result in some wonky detection but it should not affect the overall output.
3734
- Output files are written in the same encoding as the input or can be forced to UTF-8
3835
- Should be cross platform friendly thanks to PathLib and Send2Trash
3936
- Handles UNC style ```\\myserver\myshare\mysub.srt``` paths thanks to PathLib
4037
- Handles SRT to TXT or WEBVTT to TXT
4138
- Handles multi line subtitles and subtitle lines with just numbers (does not confuse them with SRT line numbers)
42-
- Strips formatting tags, and rogue `{\an8}` tags you sometimes find in poorly converted subtitles
39+
- Strips formatting tags, and rogue `{\an8}` tags you sometimes find in poorly converted subtitles
4340
- WEBVTT: Removes 'WEBVTT', headers, metadata, notes, styles and timestamps from output
4441
- SRT: Removes subtitle line #'s and Timestamps, will not work if first subtitle is not 1 or if duplicated line numbers are present (rare cases but possible), use [SubtitleEdit](https://github.com/SubtitleEdit/subtitleedit) to renumber lines for now if this happens.
42+
- SSA/ASS: Removes all non-dialogue lines, detects script version, removes positional {xxx} tags from text.
4543
## Examples:
4644
WEBVTT Input:
4745
```
@@ -152,7 +150,7 @@ Output:
152150
Fue estupendo.
153151
```
154152
## Future plans:
155-
- Possibly handle more formats (.ssa Sub Station Alpha would be the other major one I could think of), for now you can use something like [SubtitleEdit](https://github.com/SubtitleEdit/subtitleedit) to convert most other formats to .srt or .vtt. If you have a format you would like to convert to txt, contact me or raise an issue to see if I can add support.
153+
- Possibly handle more formats, for now you can use something like [SubtitleEdit](https://github.com/SubtitleEdit/subtitleedit) to convert most other formats to .srt or .vtt. If you have a format you would like to convert to txt, contact me or raise an issue to see if I can add support.
156154
- GUI option for simple drag and drop usage.
157155
- Figure out a checking method for misnumbered or duplicate numbered SRT line numbers.
158156
## License:
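
The README says the script works out the subtitle format from the file contents rather than trusting the extension. A minimal sketch of that idea follows. The helper name `sniff_format`, the `TIMECODE` pattern and the exact marker list are assumptions made for illustration, not the script's own code.

```python
import re

# Hypothetical helper: guess a subtitle format from its contents.
TIMECODE = re.compile(r"\d+:\d+:\d+.*-->.*\d+:\d+")
SSA_MARKERS = ("ScriptType:", "Style:", "Dialogue:", "Comment:", "Timer:")


def sniff_format(text):
    """Return "vtt", "srt", "ass" or None for the given file contents."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        # A UTF-8 byte order mark may precede the WEBVTT signature.
        if line.lstrip("\ufeff").startswith("WEBVTT"):
            return "vtt"
        # SubRip: cue number 1 followed by a "start --> end" timecode line.
        if line == "1" and i + 1 < len(lines) and TIMECODE.search(lines[i + 1]):
            return "srt"
        # SubStation Alpha: any of the typical section keys.
        if any(marker in line for marker in SSA_MARKERS):
            return "ass"
    return None
```

Returning `None` for unrecognised input leaves it to the caller to decide whether to raise an error or skip the file, which may be convenient in folder mode.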

subtotxt.py

Lines changed: 107 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
# cSpell:disable
22
# SRT or WEBVTT to plain Text
33
# Author: NebularNerd
4-
# Version: 2025-01-31
4+
# Version: 2025-02-03
55
# https://github.com/NebularNerd/subtotxt
66
import sys
77
import os
@@ -10,6 +10,8 @@
1010
import re
1111
from pathlib import Path
1212

13+
version = "2025-02-03"
14+
1315

1416
def missing_modules_installer(required_modules):
1517
import platform
@@ -99,11 +101,16 @@ def testsub(self):
99101
return "vtt"
100102
if line.strip("\n") == "1" and re.search("(.*:.*:.*-->.*:.*:.*)", next(ts)):
101103
return "srt"
104+
if any(s in line for s in ["!:", "Timer:", "Style:", "Comment:", "Dialogue:", "ScriptType:"]):
105+
return "ass"
102106

103107
def junklist(self):
104108
# This list will grow
105109
# Escaping and r(raw) tag needed for special characters
106-
return ["<.*?>", r"\{\\an8\}", r"^-\s", r"\[.*\]", r"\(.*\)", "^.*?:"]
110+
j = ["<.*?>", r"\{.*?\}", r"\[.*\]", r"\(.*\)", r"^-\s"]
111+
if args.nonames:
112+
j.append("^.*?:")
113+
return j
107114

108115

109116
def cls(): # Clear screen win/*nix friendly
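
The revised `junklist` only adds the `^.*?:` name pattern when `--nonames` is set. The sketch below copies the patterns from the diff and applies them one after another with `re.sub`; the helper `strip_junk` and the way the patterns are applied are assumptions, since the script's own line processing is not part of this diff.

```python
import re

# Patterns copied from junklist() in the diff; the helper is hypothetical.
BASE_JUNK = ["<.*?>", r"\{.*?\}", r"\[.*\]", r"\(.*\)", r"^-\s"]
NAME_PREFIX = "^.*?:"


def strip_junk(line, remove_names=False):
    """Remove tags, bracketed notes and (optionally) a leading NAME: prefix."""
    patterns = BASE_JUNK + ([NAME_PREFIX] if remove_names else [])
    for pattern in patterns:
        line = re.sub(pattern, "", line)
    return line.strip()


print(strip_junk("{\\an8}<i>Hello</i> [sighs]"))
print(strip_junk("BOB: It is 10:30", remove_names=True))
# The name pattern removes everything up to the first colon, so a line
# without a speaker but with a time in it loses most of its text:
print(strip_junk("See you at 10:30", remove_names=True))
```

The last call shows why making the name pattern opt-in is the safer default: any line containing a colon is affected, not only lines that start with a speaker name.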
@@ -125,11 +132,23 @@ def yn(yn): # Simple Y/N selector, use yn(text_for_choice)
125132
def arguments():
126133
parser = argparse.ArgumentParser(
127134
formatter_class=argparse.RawDescriptionHelpFormatter,
128-
description="Quickly convert SRT or WEBVTT subtitles into plain text file.",
135+
description="Quickly convert SRT, SSA or WEBVTT subtitles into plain text file.",
129136
epilog="Visit https://github.com/NebularNerd/subtotxt for more information.",
130137
)
131-
parser.add_argument(
132-
"--file", "-f", type=str, required=True, help="Path to .srt or .vtt file, enclose in quotes if path has spaces"
138+
group = parser.add_mutually_exclusive_group(required=True)
139+
group.add_argument(
140+
"--file",
141+
"-f",
142+
type=str,
143+
required=False,
144+
help="Path to .srt/.vtt/.ass/.ssa file, enclose in quotes if path has spaces",
145+
)
146+
group.add_argument(
147+
"--dir",
148+
"-d",
149+
type=str,
150+
required=False,
151+
help="Path to folder containing subtitle files, process all files in folder",
133152
)
134153
parser.add_argument(
135154
"--utf8",
@@ -179,6 +198,22 @@ def arguments():
179198
required=False,
180199
help="Write all sentences in one line, even if the original divides it into many lines or subtitles.",
181200
)
201+
parser.add_argument(
202+
"--nonames",
203+
"-nn",
204+
default=False,
205+
action="store_true",
206+
required=False,
207+
help="Removes character names if present (.ssa/.ass), attempts this for other formats.",
208+
)
209+
parser.add_argument(
210+
"--nosort",
211+
"-ns",
212+
default=False,
213+
action="store_true",
214+
required=False,
215+
help="For SubStation Alpha (.ssa/.ass), do not sort by timecode.",
216+
)
182217
return parser.parse_args()
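
The `--file` and `--dir` options now sit in a required, mutually exclusive group. A cut-down sketch of how `argparse` enforces that is shown here; `build_parser` is a hypothetical name and only a few of the script's flags are reproduced.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of the --file/--dir choice.")
    # Exactly one of the options in this group must be supplied.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--file", "-f", type=str, help="Single subtitle file")
    group.add_argument("--dir", "-d", type=str, help="Folder of subtitle files")
    parser.add_argument("--nonames", "-nn", action="store_true")
    parser.add_argument("--nosort", "-ns", action="store_true")
    return parser


args = build_parser().parse_args(["-d", "subs", "-nn"])
print(args.dir, args.file, args.nonames, args.nosort)
```

Passing both `-f` and `-d`, or neither, makes `argparse` print a usage error and exit, so the rest of the script can assume exactly one of `args.file` and `args.dir` is set.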
183218

184219

@@ -241,6 +276,7 @@ def do_srt():
241276
# SubRip subtitle file .srt
242277
# https://en.wikipedia.org/wiki/SubRip
243278
# Format has a line number followed by a timecode on the next line, then text.
279+
print("Processing file as SubRip subtitles [.srt]")
244280
with open(file.i, "r", encoding=enc.enc) as original:
245281
subnum = 1
246282
for line in original: # Ignore SRT Subtitle # and Timecode lines
@@ -258,6 +294,7 @@ def do_vtt():
258294
# This format has a few differing 'standards', you have:
259295
# Metadata, notes, styles, timecodes with optional hours, and optional line numbers,
260296
# almost none of which are actually used it seems. But we need to handle them
297+
print("Processing file as WebVTT (Web Video Text Tracks) [.vtt]")
261298
with open(file.i, "r", encoding=enc.enc) as original:
262299
subnum = 1
263300
head = 1 # Try and skip over everything until we reach the subtitles.
@@ -274,6 +311,44 @@ def do_vtt():
274311
write_to_file()
275312

276313

314+
def do_ass():
315+
# SubStation Alpha subtitle file .ssa/.ass
316+
# https://wiki.multimedia.cx/index.php?title=SubStation_Alpha
317+
# http://www.tcax.org/docs/ass-specs.htm Browser may complain as not https site.
318+
# This format has different versions; later ones include more metadata and sections.
319+
# This should not be a big problem as the text is always on a `Dialogue:` line.
320+
# Two key issues are: lines may not be in timecode order, and
321+
# text may be for labelling things and not part of the script.
322+
print("Processing file as SubStation Alpha subtitle [.ssa/.ass]")
323+
with open(file.i, "r", encoding=enc.enc) as original:
324+
# Try and get version
325+
fv = ""
326+
for line in original:
327+
if "ScriptType:" in line:
328+
fv = line.split(": ")[1].strip()
329+
print(f"SSA Version: {fv}" if fv != "" else "No version found, assuming v1.0")
330+
original.seek(0)
331+
d = {}
332+
for line in original:
333+
# Example Dialogue line v1.0:
334+
# Dialogue: Marked=0,0:01:16.0,0:01:23.4,White Text,Usagi,0000,0000,0000,Pretty Soldier Sailor Moon
335+
# Example Dialogue line v3+:
336+
# Dialogue: Marked=0,0:01:38.95,0:01:41.75,owari,Lupin,0000,0000,0000,,Yeah, love is wonderful.
337+
if "Dialogue:" in line:
338+
if fv == "":
339+
x = re.findall(r"Dialogue:.*?,(.*?\.\d*),.*?\.\d*,.*?,(.*?),.*?,.*?,.*?,(.*)", line) # v1.0
340+
else:
341+
x = re.findall(r"Dialogue:.*?,(.*?\.\d*),.*?\.\d*,.*?,(.*?),.*?,.*?,.*?,.*?,(.*)", line)  # v3.0+: skip Style, capture Name
342+
stc = x[0][0] # Start timecode
343+
nom = x[0][1] # Character speaking
344+
txt = x[0][2] # Text
345+
text = txt if (args.nonames or nom == "") else f"{nom}: {txt}"
346+
d.update({stc: {"dialog": text}})
347+
for t in [v["dialog"] for k, v in sorted(d.items())] if not args.nosort else [v["dialog"] for v in d.values()]:
348+
process_line(t.replace(r"\n", " ").replace(r"\N", " ")) # Fixes odd newline in .ass
349+
write_to_file()
350+
351+
277352
def write_to_file():
278353
with open(file.o, "w", encoding=enc.out) as new:
279354
# We check for junk again because it can get split over two lines and we can't find it until now.
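
In `do_ass` the lines are stored in a dict keyed by the start timecode string, so two cues that start at the same moment would overwrite each other, and sorting compares timecodes as text. A possible alternative is sketched below under stated assumptions: the field order is taken from the two example lines in the code comments (with an extra Effect field before the text in v3+), and `parse_dialogue` and `ordered_text` are hypothetical names, not functions from the script.

```python
def parse_dialogue(line, has_effect=True):
    """Return (start_seconds, name, text) for one 'Dialogue:' line."""
    body = line.split(":", 1)[1].strip()
    # The text is the last field and may contain commas, so cap the split count.
    parts = body.split(",", 9 if has_effect else 8)
    start, name, text = parts[1], parts[4], parts[-1]
    hours, minutes, seconds = start.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds), name.strip(), text


def ordered_text(lines, has_effect=True, sort=True, names=True):
    """Collect dialogue text, optionally sorted by start time and prefixed by name."""
    rows = [parse_dialogue(ln, has_effect) for ln in lines if ln.startswith("Dialogue:")]
    if sort:
        rows.sort(key=lambda row: row[0])  # stable sort: ties keep file order
    return [f"{name}: {text}" if names and name else text for _, name, text in rows]
```

Keeping the rows in a list rather than a dict preserves cues that share a start time, and converting the timecode to seconds avoids relying on string ordering.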
@@ -288,6 +363,8 @@ def do_work():
288363
do_srt()
289364
elif sub.format == "vtt":
290365
do_vtt()
366+
elif sub.format == "ass":
367+
do_ass()
291368
else:
292369
raise Exception("Unable to determine Subtitle format.")
293370

@@ -296,16 +373,31 @@ def do_work():
296373
args = arguments()
297374
cls()
298375
try:
299-
print(f"SUB to TXT v2025-01-31\n{'-' * 22}")
300-
file = file_handler(Path(args.file))
301-
enc = encoding(file.i)
302-
if args.pause and not yn("Ready to start?"):
303-
raise Exception("User exited at pause before start")
304-
if args.copy:
305-
copy()
306-
else:
307-
sub = subtitle()
308-
do_work()
376+
print(f"SUB to TXT v{version}\n{'-' * 22}")
377+
if args.file:
378+
file = file_handler(Path(args.file))
379+
enc = encoding(file.i)
380+
if args.pause and not yn("Ready to start?"):
381+
raise Exception("User exited at pause before start")
382+
if args.copy:
383+
copy()
384+
else:
385+
sub = subtitle()
386+
do_work()
387+
if args.dir:
388+
files = list(filter(lambda p: p.suffix in {".srt", ".vtt", ".ssa", ".ass"}, Path(args.dir).glob("*")))
389+
how_many = len(files)
390+
c = 0
391+
print(f"Multi file mode. Found {how_many} files.")
392+
print("-" * 22)
393+
for file in files:
394+
file = file_handler(Path(file))
395+
enc = encoding(file.i)
396+
sub = subtitle()
397+
do_work()
398+
print("-" * 22)
399+
c += 1
400+
print(f"Processed {c}/{how_many} files.")
309401
print("\nFinished!\n")
310402
except Exception as error:
311403
print(f"Script execution stopped because:\n{error}")
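
The multi-file branch selects files by comparing `Path.suffix` against a fixed set, which is case-sensitive, so a file named `MOVIE.SRT` would be skipped. A small sketch of a case-insensitive variant follows; `find_subtitles` is a hypothetical helper, and lower-casing the suffix is an assumption about the desired behaviour rather than something the commit does.

```python
from pathlib import Path

SUPPORTED = {".srt", ".vtt", ".ssa", ".ass"}


def find_subtitles(folder):
    """Return the subtitle files directly inside folder, in sorted order."""
    # Lower-casing the suffix also accepts names such as MOVIE.SRT
    # (an assumption; the diff compares suffixes as written).
    return sorted(
        p for p in Path(folder).glob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```

Sorting the result gives a predictable processing order, which makes the `Processed c/how_many` counter easier to follow across runs.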
