Commit fd5f561

[tiktok] extract subtitles and all cover types (#8805)
* Make sure that `img_id`, `audio_id` and `cover_id` fields are always available.
  The values are set '' where they are not applicable. Having `img_id` is necessary for the default `archive_fmt`, the other fields are handled for consistency.
* Allow downloading more than one cover.
  The previous behavior is kept as-is, but setting the "covers" option to "all" now grabs all available covers.
* Add support for downloading subtitles
  Allows filtering subtitles by source type (ASR, MT) and language.
* Ensure archive uniqueness for covers and subtitles.
* Update the URL test pattern to include the `image` extension.
  Although Tiktok may serve the covers with jpeg content, the file ending can be `.image`. The test before 0c14b16 failed because the asserted URL did not match all cover types, but the now used pattern needs the mentioned file ending.
* Add support for "creator_caption" subtitles in "LC" format.
  These subtitles have the keys "Format" set to "creator_caption" and "Source" to "LC".
* Add "LC" (Local Captions) as a subtitle source type in the documentation
* Code deduplication and renaming subtitle metadata
  Changed the item type from singular `subtitle` to `subtitles`. Removed the wrong descriptor `cover` from the subtitles fallback title.
* Refactor subtitle filtering
  The filter is now prepared in `_init` to prevent parsing the same config parameter for every item. The `_extract_subtitles` function will still extract if either filter (source or language) matches.
* Generate a `file_id` for subtitles
  Subtitles have multiple fields that determine the unique file, so these are simply concatenated. This is similar to the cover types, only with more variations.
* Added tests for subtitles
* fix docs entries
* fix '"covers": "all"'
* simplify some code
* Fix fallback title for subtitles
  Added the missing "f" to the f-string and added "subtitle" to the title. The resulting title will look like "TikTok video subtitle #1234567"
1 parent 2d01fef commit fd5f561

4 files changed: +225 -33 lines changed

docs/configuration.rst

Lines changed: 55 additions & 1 deletion
@@ -5914,12 +5914,25 @@ Description
 extractor.tiktok.covers
 -----------------------
 Type
-    ``bool``
+    * ``bool``
+    * ``string``
 Default
     ``false``
 Description
     Download video covers.

+    ``true``
+        Download the first cover found in the following order:
+
+        * ``thumbnail``
+        * ``cover``
+        * ``originCover``
+        * ``dynamicCover``
+    ``false``
+        Do not download covers
+    ``"all"``
+        Download all available covers
+

 extractor.tiktok.photos
 -----------------------
@@ -5931,6 +5944,47 @@ Description
     Download photos.


+extractor.tiktok.subtitles
+--------------------------
+Type
+    * ``bool``
+    * ``string``
+Default
+    ``false``
+Example
+    * ``"all"``
+    * ``"ASR,MT,LC"``
+    * ``"ASR,eng-US"``
+Description
+    Download video subtitles.
+    The subtitles can be filtered by source or language.
+    The following source types can be filtered:
+
+    * ``ASR`` - Automatic Speech Recognition
+    * ``MT`` - Machine Translation
+    * ``LC`` - Local Captions / Creator Captions
+
+    If both source types and language codes are provided,
+    only subtitles matching both are downloaded.
+
+    ``true``
+        Download all subtitles tagged ``ASR``
+    ``false``
+        Do not download subtitles
+    ``"all"``
+        Download all available subtitles.
+    ``"ASR,MT,eng-US,cmn-Hans-CN"``
+        Download english and simplified chinese subtitles
+        that are either automatically recognized or machine translated.
+
+    The source types and languages can be listed in any order.
+Note
+    It is not possible to filter all subtitles of a specific source type,
+    while also filtering for additional languages of another source type.
+    (e.g. any ASR subtitle + fra-FR of any source type)
+    For this, refer to `extractor.*.image-filter`_.
+
+
 extractor.tiktok.videos
 -----------------------
 Type
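
The way a `subtitles` value is split into source and language filters can be sketched as a standalone function. This is a minimal illustration modeled on the `_init` changes in this commit, not code from gallery-dl itself: the name `parse_subtitle_option` and the constant `KNOWN_SOURCES` are hypothetical.

```python
KNOWN_SOURCES = {"ASR", "MT", "LC"}


def parse_subtitle_option(value):
    """Split a "subtitles" option value into (sources, langs) filters.

    A filter that is None is disabled. Hypothetical helper for
    illustration only; it is not part of the gallery-dl API.
    """
    if not value or value == "all":
        # "all" disables both filters. A falsy value also yields no
        # filters here; the extractor skips subtitles entirely then.
        return None, None
    if not isinstance(value, str):
        # True (or any other non-string truthy value) falls back to "ASR"
        value = "ASR"

    filters = set(value.split(","))
    sources = (KNOWN_SOURCES & filters) or None
    langs = (filters - KNOWN_SOURCES) or None
    return sources, langs


sources, langs = parse_subtitle_option("ASR,MT,eng-US,cmn-Hans-CN")
```

Any entry that is not one of the three known source types is treated as a language code, which is why the entries can be listed in any order.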

docs/gallery-dl.conf

Lines changed: 5 additions & 4 deletions
@@ -825,10 +825,11 @@
         },
         "tiktok":
         {
-            "audio" : true,
-            "covers": false,
-            "photos": true,
-            "videos": true,
+            "audio"    : true,
+            "covers"   : false,
+            "photos"   : true,
+            "subtitles": false,
+            "videos"   : true,
             "tiktok-range": "",

             "posts": {
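
How the three accepted `covers` values (`false`, `true`, `"all"`) differ can be sketched as follows. This is a simplified illustration based on the cover loop in this commit; the function `iter_covers`, the constant `COVER_KEYS`, and the sample `media` dictionary with its placeholder URLs are made up for the example.

```python
COVER_KEYS = ("thumbnail", "cover", "originCover", "dynamicCover")


def iter_covers(media, covers):
    """Yield (cover_id, url) pairs according to the "covers" option."""
    if not covers:
        return  # false: do not download covers
    for cover_id in COVER_KEYS:
        url = media.get(cover_id)
        if not url:
            continue  # this cover type is missing or empty
        yield cover_id, url
        if covers != "all":
            break  # true: stop after the first cover found


# hypothetical media metadata; the URLs are placeholders
media = {
    "cover": "https://example.org/a.jpg",
    "originCover": "https://example.org/b.jpg",
    "dynamicCover": "",
}
first = list(iter_covers(media, True))
every = list(iter_covers(media, "all"))
```

With this sample data, `first` holds only the `cover` entry, while `every` holds both `cover` and `originCover`.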

gallery_dl/extractor/tiktok.py

Lines changed: 112 additions & 26 deletions
@@ -36,10 +36,25 @@ def _init(self):
         self.audio = self.config("audio", True)
         self.video = self.config("videos", True)
         self.cover = self.config("covers", False)
+        self.subtitles = self.config("subtitles", False)

         self.range = self.config("tiktok-range") or ""
         self.range_predicate = util.predicate_range_parse(self.range)

+        # If one of these fields is None, the filter for it is disabled.
+        # Therefore, if both fields are none, all subtitles are extracted.
+        self.subtitle_sources = None
+        self.subtitle_langs = None
+
+        if self.subtitles and self.subtitles != "all":
+            if self.subtitles is True or not isinstance(self.subtitles, str):
+                self.subtitles = "ASR"
+
+            known_sources = {"ASR", "MT", "LC"}
+            filters = set(self.subtitles.split(","))
+            self.subtitle_sources = known_sources.intersection(filters) or None
+            self.subtitle_langs = filters.difference(known_sources) or None
+
     def items(self):
         for tiktok_url in self.posts():
             tiktok_url = self._sanitize_url(tiktok_url)
@@ -73,13 +88,13 @@ def items(self):
                         url = img["imageURL"]["urlList"][0]
                         text.nameext_from_url(url, post)
                         post.update({
-                            "type"  : "image",
-                            "image" : img,
-                            "title" : title,
-                            "num"   : i,
+                            "type"   : "image",
+                            "image"  : img,
+                            "title"  : title,
+                            "num"    : i,
                             "file_id": post["filename"].partition("~")[0],
-                            "width" : img["imageWidth"],
-                            "height": img["imageHeight"],
+                            "width"  : img["imageWidth"],
+                            "height" : img["imageHeight"],
                         })
                         yield Message.Url, url, post

@@ -95,9 +110,23 @@ def items(self):
                 elif self.video and (url := self._extract_video(post)):
                     yield Message.Url, url, post
                     del post["_fallback"]
-                if self.cover and (url := self._extract_cover(post, "video")):
-                    yield Message.Url, url, post

+                if self.cover:
+                    for url in self._extract_covers(post, "video"):
+                        yield Message.Url, url, post
+                        if self.cover != "all":
+                            break
+
+                if self.subtitles:
+                    for url in self._extract_subtitles(post, "video"):
+                        yield Message.Url, url, post
+
+                    # remove the subtitle related fields for the next item
+                    post.pop("subtitle_lang_id", None)
+                    post.pop("subtitle_lang_codename", None)
+                    post.pop("subtitle_format", None)
+                    post.pop("subtitle_version", None)
+                    post.pop("subtitle_source", None)
             else:
                 self.log.info("%s: Skipping post", tiktok_url)

@@ -277,7 +306,7 @@ def _extract_video(self, post):
             "title"    : post["desc"] or f"TikTok video #{post['id']}",
             "duration" : video.get("duration"),
             "num"      : 0,
-            "file_id"  : video.get("id"),
+            "file_id"  : "",
             "width"    : video.get("width"),
             "height"   : video.get("height"),
         })
@@ -334,28 +363,85 @@ def _extract_audio(self, post):
         post["extension"] = "mp3"
         return url

-    def _extract_cover(self, post, type):
+    def _extract_covers(self, post, type):
         media = post[type]

         for cover_id in ("thumbnail", "cover", "originCover", "dynamicCover"):
             if url := media.get(cover_id):
-                break
-        else:
-            return
+                text.nameext_from_url(url, post)
+                post.update({
+                    "type"     : "cover",
+                    "extension": "jpg",
+                    "image"    : url,
+                    "title"    : post["desc"] or
+                                 f"TikTok {type} cover #{post['id']}",
+                    "duration" : media.get("duration"),
+                    "num"      : 0,
+                    "file_id"  : cover_id,
+                    "width"    : 0,
+                    "height"   : 0,
+                })
+                yield url

-        text.nameext_from_url(url, post)
-        post.update({
-            "type"     : "cover",
-            "extension": "jpg",
-            "image"    : url,
-            "title"    : post["desc"] or f"TikTok {type} cover #{post['id']}",
-            "duration" : media.get("duration"),
-            "num"      : 0,
-            "file_id"  : cover_id,
-            "width"    : 0,
-            "height"   : 0,
-        })
-        return url
+    def _extract_subtitles(self, post, type):
+        media = post[type]
+        sources_filtered = self.subtitle_sources is not None
+        langs_filtered = self.subtitle_langs is not None
+
+        for subtitle in media.get("subtitleInfos", ()):
+            sub_lang_id = subtitle.get("LanguageID")
+            sub_lang_codename = subtitle.get("LanguageCodeName")
+            sub_format = subtitle.get("Format")
+            sub_version = subtitle.get("Version")
+            sub_source = subtitle.get("Source")
+
+            # guard the iterable access
+            sources_match = sources_filtered and \
+                sub_source in self.subtitle_sources
+            langs_match = langs_filtered and \
+                sub_lang_codename in self.subtitle_langs
+
+            # Subtitles will be extracted when either filter matches.
+            if not sources_match and not langs_match and \
+                    (sources_filtered or langs_filtered):
+                continue
+
+            if url := subtitle.get("Url"):
+                text.nameext_from_url(url, post)
+
+                # subtitle urls may not specify a filename,
+                # so the metadata can be used to build one.
+                if not post["filename"]:
+                    post["filename"] = (f"{post['id']}_{sub_lang_codename}_"
+                                        f"{sub_version}_{sub_source}")
+                    post["extension"] = sub_format.lower()
+
+                # replace extensions for known formats
+                if post["extension"] == "webvtt":
+                    post["extension"] = "vtt"
+                elif post["extension"] == "creator_caption":
+                    post["extension"] = "json"
+
+                post.update({
+                    "type"                  : "subtitle",
+                    "image"                 : None,
+                    "title"                 :
+                        post["desc"] or
+                        f"TikTok {type} subtitle #{post['id']}",
+                    "duration"              : media.get("duration"),
+                    "num"                   : 0,
+                    "file_id"               :
+                        f"{sub_lang_id}_{sub_lang_codename}_{sub_source}_"
+                        f"{sub_version}_{sub_format}",
+                    "subtitle_lang_id"      : sub_lang_id,
+                    "subtitle_lang_codename": sub_lang_codename,
+                    "subtitle_format"       : sub_format,
+                    "subtitle_version"      : sub_version,
+                    "subtitle_source"       : sub_source,
+                    "width"                 : 0,
+                    "height"                : 0,
+                })
+                yield url

     def _check_status_code(self, detail, url, type_of_url):
         status = detail.get("statusCode")
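
The filter check and the `file_id` construction in `_extract_subtitles` can be isolated into two small functions. The sketch below restates the logic of the diff outside the extractor class: `subtitle_matches`, `subtitle_file_id`, and the sample entries in `subs` are hypothetical, and the field values are invented for illustration.

```python
def subtitle_matches(subtitle, sources, langs):
    """Return True if a subtitle entry passes the filters.

    `sources` and `langs` are sets or None (None = filter disabled).
    """
    if sources is None and langs is None:
        return True  # no filter configured: keep everything
    source_ok = sources is not None and subtitle.get("Source") in sources
    lang_ok = (langs is not None
               and subtitle.get("LanguageCodeName") in langs)
    # as in the diff, one matching filter is enough
    return source_ok or lang_ok


def subtitle_file_id(subtitle):
    """Concatenate the fields that identify one subtitle file."""
    keys = ("LanguageID", "LanguageCodeName", "Source", "Version", "Format")
    return "_".join(str(subtitle.get(key)) for key in keys)


# invented sample entries using the key names read by the extractor
subs = [
    {"LanguageID": "2", "LanguageCodeName": "eng-US",
     "Source": "ASR", "Version": "1", "Format": "webvtt"},
    {"LanguageID": "20", "LanguageCodeName": "deu-DE",
     "Source": "MT", "Version": "4", "Format": "webvtt"},
    {"LanguageID": "9", "LanguageCodeName": "fra-FR",
     "Source": "MT", "Version": "4", "Format": "webvtt"},
]
kept = [s["LanguageCodeName"] for s in subs
        if subtitle_matches(s, {"ASR"}, {"deu-DE"})]
file_id = subtitle_file_id(subs[0])
```

With the filter `"ASR,deu-DE"`, this sketch keeps the ASR entry and the German entry regardless of its source, because the check passes when either the source or the language matches.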

test/results/tiktok.py

Lines changed: 53 additions & 2 deletions
@@ -6,12 +6,13 @@

 from gallery_dl.extractor import tiktok

-PATTERN = r"https://p1[69]-[^/?#.]+\.tiktokcdn[^/?#.]*\.com/[^/?#]+/\w+~.*\.jpe?g"
+PATTERN = r"https://p1[69]-[^/?#.]+\.tiktokcdn[^/?#.]*\.com/[^/?#]+/\w+~.*\.(jpe?g|image)"
 PATTERN_WITH_AUDIO = r"(?:" + PATTERN + r"|https://v\d+m?\.tiktokcdn[^/?#.]*\.com/[^?#]+\?[^/?#]+)"
 VIDEO_PATTERN = r"https://v1[69]-webapp-prime.tiktok.com/video/tos/[^?#]+\?[^/?#]+"
 OLD_VIDEO_PATTERN = r"https://www.tiktok.com/aweme/v1/play/\?[^/?#]+"
 COMBINED_VIDEO_PATTERN = r"(?:" + VIDEO_PATTERN + r")|(?:" + OLD_VIDEO_PATTERN + r")"
 USER_PATTERN = r"(https://www.tiktok.com/@([\w_.-]+)/video/(\d+)|" + PATTERN + r")"
+SUBTITLE_PATTERN = r"https://v1[69]-[^/?#.]+\.tiktokcdn[^/?#.]*\.com/[^/?#]+/.*"


 __tests__ = (
@@ -127,10 +128,22 @@
     "#url"     : "https://www.tiktok.com/@memezar/video/7449708266168274208",
     "#comment" : "video post cover image",
     "#class"   : tiktok.TiktokPostExtractor,
-    "#pattern" : r"https://p19-common-sign-useastred.tiktokcdn-eu.com/tos-useast2a-p-0037-euttp/o4rVzhI1bABhooAaEqtCAYGi6nijIsDib8NGfC~tplv-tiktokx-origin.image\?dr=10395&x-expires=\d+&x-signature=.+",
+    "#pattern" : PATTERN,
+    "#count"   : 1,
     "#options" : {"videos": False, "covers": True},


+},
+
+{
+    "#url"     : "https://www.tiktok.com/@memezar/video/7449708266168274208",
+    "#comment" : "all video post cover images",
+    "#class"   : tiktok.TiktokPostExtractor,
+    "#pattern" : PATTERN,
+    "#count"   : 3,
+    "#options" : {"videos": False, "covers": "all"},
+
+
 },

 {
@@ -211,6 +224,44 @@
     "#options" : {"videos": "ytdl"},
 },

+{
+    "#url"     : "https://www.tiktok.com/@memezar/video/7588916452304997635",
+    "#comment" : "default subtitles",
+    "#class"   : tiktok.TiktokPostExtractor,
+    "#pattern" : SUBTITLE_PATTERN,
+    "#count"   : 1,
+    "#options" : {"videos": False, "covers": False, "subtitles": True}
+},
+
+{
+    "#url"     : "https://www.tiktok.com/@memezar/video/7588916452304997635",
+    "#comment" : "english subtitles",
+    "#class"   : tiktok.TiktokPostExtractor,
+    "#pattern" : SUBTITLE_PATTERN,
+    "#count"   : 1,
+    "#options" : {"videos": False, "covers": False, "subtitles": "eng-US"}
+},
+
+# This test is prone to break when more translation agents are added!
+{
+    "#url"     : "https://www.tiktok.com/@memezar/video/7588916452304997635",
+    "#comment" : "combined subtitle filter",
+    "#class"   : tiktok.TiktokPostExtractor,
+    "#pattern" : SUBTITLE_PATTERN,
+    "#count"   : 6,
+    "#options" : {"videos": False, "covers": False, "subtitles": "ASR,deu-DE"}
+},
+
+# This test is prone to break when new languages or more translation agents are added!
+{
+    "#url"     : "https://www.tiktok.com/@memezar/video/7588916452304997635",
+    "#comment" : "all subtitles",
+    "#class"   : tiktok.TiktokPostExtractor,
+    "#pattern" : SUBTITLE_PATTERN,
+    "#count"   : 64,
+    "#options" : {"videos": False, "covers": False, "subtitles": "all"}
+},
+
 {
     "#url"     : "https://vm.tiktok.com/ZGdh4WUhr/",
     "#comment" : "vm.tiktok.com link: many photos",
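
The effect of the `PATTERN` change can be checked in isolation. The snippet below is a small sketch that compares the old and new expressions against a cover URL shaped like the one in the removed `"#pattern"` line; the `x-expires` and `x-signature` values are placeholders, and matching from the start of the string with `re.match` is an assumption about how the pattern is applied.

```python
import re

OLD_PATTERN = (r"https://p1[69]-[^/?#.]+\.tiktokcdn[^/?#.]*\.com"
               r"/[^/?#]+/\w+~.*\.jpe?g")
NEW_PATTERN = (r"https://p1[69]-[^/?#.]+\.tiktokcdn[^/?#.]*\.com"
               r"/[^/?#]+/\w+~.*\.(jpe?g|image)")

# host and path taken from the removed test pattern;
# the query values are placeholders
url = ("https://p19-common-sign-useastred.tiktokcdn-eu.com"
       "/tos-useast2a-p-0037-euttp/o4rVzhI1bABhooAaEqtCAYGi6nijIsDib8NGfC"
       "~tplv-tiktokx-origin.image?dr=10395&x-expires=0&x-signature=abc")

old_ok = re.match(OLD_PATTERN, url) is not None
new_ok = re.match(NEW_PATTERN, url) is not None
```

Here `old_ok` is false and `new_ok` is true, since the only file ending in the URL is `.image`.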
