Commit 58fdb20

Merge pull request #5 from NebularNerd/dev
Major refactor
2 parents 65cb7b8 + dc44ff7 commit 58fdb20

5 files changed, +320 -169 lines changed

.flake8

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+[flake8]
+max-line-length = 120

README.md

Lines changed: 9 additions & 9 deletions
@@ -1,5 +1,5 @@
 # subtotxt
-Quickly convert a [SubRip](https://en.wikipedia.org/wiki/SubRip) .srt or [WEBVTT](https://en.wikipedia.org/wiki/WebVTT) .vtt subtitle file to plain text. Removes timestamps and .srt subtitle line numbers.
+Quickly convert a [SubRip](https://en.wikipedia.org/wiki/SubRip) .srt or [WEBVTT](https://en.wikipedia.org/wiki/WebVTT) .vtt subtitle file to plain text. Removes timestamps and .srt/.vtt subtitle line numbers.
 This was a quick project thrown together for my girlfriend, she's still learning English and wanted to be able to read subtitles more like a transcript for some trickier language issues (and to understand the jokes in Friends by discussing them with me).

 With a spot of feature creep and some encoding detection needs, it evolved into being able to detect character encoding, along with being able to understand both .srt and .vtt formats to save some pre-processing work.
@@ -10,7 +10,7 @@ or
 ```python C:\Python\subtotxt.py -f subtitle.vtt```
 The script will check which format the subtitle file is (incase of incorrect file extensions), detect the character encoding used then write out a .txt file with the same name as your input. If the output file already exists it will ask for permission to delete and create a new one.
 ## Advanced Usage:
-The script has six more arguments you can parse:
+The script has more advanced arguments you can parse:
 - *--utf8* or *-8*
 Forces the output file to use [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding. This may eliminate character encoding issues if you cannot view the output file. In practice, if you can read the contents of the input subtitle file successfully the output should work without the need to change the encoding.
 - *--pause* or *-p*
@@ -20,26 +20,27 @@ Prints the output to the console while writing to the file, may help with debugg
 - *--copy* or *-c*
 Copies input to output without change, appends *-copy* to filename *e.g.: subtitle-copy.srt*, handy to use with *--utf8* to quickly change encoding. Might be useful if your video player app cannot understand your original subtitle file encoding.
 - *--overwrite* or *-o*
-Skips asking ```Output file already exists, delete and make a new one? [y/n]``` and simply deletes the existing output file to create a new one. Ideal for batch processing.
+Skips asking `Output file already exists, delete and make a new one? [y/n]` and simply deletes the existing output file to create a new one. Ideal for batch processing.
 - *--oneliners* or *-1*
 Writes all sentences in one line, even if the original file divides some sentences into many lines or subtitles.
 - *--help* or *-h*
 Shows above information.
 ## Required External Modules:
 - [Send2Trash](https://pypi.org/project/Send2Trash/) Python module to safely delete the old output file on both Win and \*nix based systems.
-- ~~[cchardet](https://pypi.org/project/cchardet/) Python module to detect your subtitle file encoding~~ (Removed for v2.0 release due to issues with Python 3.10.x installs, still used in v1.0 and will work on Python 3.9.x installs).
-- [charset_normalizer](https://github.com/Ousret/charset_normalizer) Python module to detect your subtitle file encoding (v2.0+ supports Python 3.9.x and 3.10.x).
+- ~~[cchardet](https://pypi.org/project/cchardet/) Python module to detect your subtitle file encoding~~ (Removed for v2.0+ release due to issues with Python 3.10.x installs, still used in v1.0 and will work on Python 3.9.x installs).
+- [charset_normalizer](https://github.com/Ousret/charset_normalizer) Python module to detect your subtitle file encoding (v2.0 and YYYY-MM-DD versions, supports Python 3.9.x and above).

-If your system does not these installed, it will auto install them on first use.
+If your system does not these installed, it will auto install them on first use (or if you install a new version of Python later). If you prefer you can install them either manually, or by using the `requirements.txt`
 ## Features:
 - Fast (aside from initial missing modules install on slow net connections)
-- Input files character encoding formats are autodetected (if supported by [cchardet](https://pypi.org/project/cchardet/) [v1.0] or [charset_normalizer](https://github.com/Ousret/charset_normalizer) [v2.0+])
+- Input files character encoding formats are autodetected (if supported by [cchardet](https://pypi.org/project/cchardet/) [v1.0] or [charset_normalizer](https://github.com/Ousret/charset_normalizer) [v2.0+]). For most languages it should be fine, for Chinese and near neighbour languages it can be tricky, a subtitle may contain valid characters for Mandarin or Cantonese (or other dialects) and be in potentially the wrong encoding. This can result in some wonky detection but it should not affect the overall output.
 - Output files are wrote in the same encoding as the input or can be forced to UTF8
 - Should be cross platform friendly thanks to PathLib and Send2Trash
 - Handles UNC style ```\\myserver\myshare\mysub.srt``` paths thanks to PathLib
 - Handles SRT to TXT or WEBVTT to TXT
 - Handles multi line subtitles and subtitle lines with just numbers (does not confuse them with SRT line numbers)
-- WEBVTT: Removes 'WEBVTT', 'Kind: xxxx', 'Language: xxx' headers and Timestamps from output
+- Strips formatting tags, and rogue `{\an8}` tags you sometimes find in poorly converted subtitles
+- WEBVTT: Removes 'WEBVTT', headers, metadata, notes, styles and timestamps from output
 - SRT: Removes subtitle line #'s and Timestamps, will not work if first subtitle is not 1 or if duplicated line numbers are present (rare cases but possible), use [SubtitleEdit](https://github.com/SubtitleEdit/subtitleedit) to renumber lines for now if this happens.
 ## Examples:
 WEBVTT Input:
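
Reviewer note: the Features list in this hunk describes stripping formatting tags (including rogue `{\an8}` tags) and removing SRT line numbers and timestamps without confusing numeric subtitle text with counters. The changed `subtotxt.py` is not shown on this page, so the following is only a minimal sketch of how that behaviour could work. The names `strip_tags` and `srt_to_text` and all three regular expressions are assumptions made for illustration, not the script's actual code.

```python
import re

# Hypothetical patterns, not taken from subtotxt.py.
TAG_RE = re.compile(r"</?[^>]+>")    # markup such as <i>, </b>, <font color="...">
ASS_RE = re.compile(r"\{\\[^}]*\}")  # override tags such as {\an8}
TIMESTAMP_RE = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def strip_tags(line):
    """Remove angle-bracket markup and {\\...} override tags from one line."""
    return ASS_RE.sub("", TAG_RE.sub("", line)).strip()


def srt_to_text(srt):
    """Return the subtitle text lines of an SRT document as a list of strings.

    A digits-only line is treated as a counter only when it equals the next
    expected counter AND the following line is a timestamp, so subtitle text
    that happens to be a number is kept.
    """
    lines = srt.splitlines()
    expected = 1
    out = []
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line or TIMESTAMP_RE.match(line):
            continue
        next_is_timestamp = (
            i + 1 < len(lines) and TIMESTAMP_RE.match(lines[i + 1].strip())
        )
        if line == str(expected) and next_is_timestamp:
            expected += 1
            continue
        text = strip_tags(line)
        if text:
            out.append(text)
    return out
```

Because the counter check relies on sequential numbering starting at 1, this sketch shares the limitation the README states for SRT files: misnumbered or duplicated counters would leak into the output. The simple `TAG_RE` would also swallow dialogue that contains literal `<` and `>` characters.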
@@ -154,6 +155,5 @@ Output:
 - Possibly handle more formats (.ssa Sub Station Alpha would be the other major one I could think of), for now you can use something like [SubtitleEdit](https://github.com/SubtitleEdit/subtitleedit) to convert most other formats to .srt or .vtt. If you have a format you would like to convert to txt, contact me or raise an issue to see if I can add support.
 - GUI option for simple drag and drop usage.
 - Figure out a checking method for misnumbered or duplicate numbered SRT line numbers.
-- Handle stripping out SRT formatting tags for bold, italic etc...
 ## License:
 Released as CC0, use it how you wish. If you do use it elsewhere, please be awesome and tag me as the original author. 🙂

pyproject.toml

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+[tool.black]
+line-length = 120
+target-version = [
+'py38',
+'py39',
+'py310',
+'py311',
+'py312',
+'py313',
+]
+exclude = '''
+/(
+\.eggs
+| \.git
+| \.idea
+| \.pytest_cache
+| \.github
+| _build
+| build
+| dist
+| venv
+| test/resources
+)/
+'''

requirements.txt

262 Bytes
Binary file not shown.
