Commit 2685654

[ie/youtube] Add a PO Token Provider Framework (yt-dlp#12840)
https://github.com/yt-dlp/yt-dlp/tree/master/yt_dlp/extractor/youtube/pot/README.md

Authored by: coletdjnz
1 parent abf58dc commit 2685654

28 files changed, +4134 -28 lines changed
README.md

Lines changed: 6 additions & 1 deletion
@@ -1795,6 +1795,7 @@ The following extractors use this feature:
 * `player_client`: Clients to extract video data from. The currently available clients are `web`, `web_safari`, `web_embedded`, `web_music`, `web_creator`, `mweb`, `ios`, `android`, `android_vr`, `tv` and `tv_embedded`. By default, `tv,ios,web` is used, or `tv,web` is used when authenticating with cookies. The `web_music` client is added for `music.youtube.com` URLs when logged-in cookies are used. The `web_embedded` client is added for age-restricted videos but only works if the video is embeddable. The `tv_embedded` and `web_creator` clients are added for age-restricted videos if account age-verification is required. Some clients, such as `web` and `web_music`, require a `po_token` for their formats to be downloadable. Some clients, such as `web_creator`, will only work with authentication. Not all clients support authentication via cookies. You can use `default` for the default clients, or you can use `all` for all clients (not recommended). You can prefix a client with `-` to exclude it, e.g. `youtube:player_client=default,-ios`
 * `player_skip`: Skip some network requests that are generally needed for robust extraction. One or more of `configs` (skip client configs), `webpage` (skip initial webpage), `js` (skip js player), `initial_data` (skip initial data/next ep request). While these options can help reduce the number of requests needed or avoid some rate-limiting, they could cause issues such as missing formats or metadata. See [#860](https://github.com/yt-dlp/yt-dlp/pull/860) and [#12826](https://github.com/yt-dlp/yt-dlp/issues/12826) for more details
 * `player_params`: YouTube player parameters to use for player requests. Will overwrite any default ones set by yt-dlp.
+* `player_js_variant`: The player javascript variant to use for signature and nsig deciphering. The known variants are: `main`, `tce`, `tv`, `tv_es6`, `phone`, `tablet`. Only `main` is recommended as a possible workaround; the others are for debugging purposes. The default is to use what is prescribed by the site, and can be selected with `actual`
 * `comment_sort`: `top` or `new` (default) - choose comment sorting mode (on YouTube's side)
 * `max_comments`: Limit the amount of comments to gather. Comma-separated list of integers representing `max-comments,max-parents,max-replies,max-replies-per-thread`. Default is `all,all,all,all`
     * E.g. `all,all,1000,10` will get a maximum of 1000 replies total, with up to 10 replies per thread. `1000,all,100` will get a maximum of 1000 comments, with a maximum of 100 replies total
@@ -1805,7 +1806,11 @@ The following extractors use this feature:
 * `data_sync_id`: Overrides the account Data Sync ID used in Innertube API requests. This may be needed if you are using an account with `youtube:player_skip=webpage,configs` or `youtubetab:skip=webpage`
 * `visitor_data`: Overrides the Visitor Data used in Innertube API requests. This should be used with `player_skip=webpage,configs` and without cookies. Note: this may have adverse effects if used improperly. If a session from a browser is wanted, you should pass cookies instead (which contain the Visitor ID)
 * `po_token`: Proof of Origin (PO) Token(s) to use. Comma-separated list of PO Tokens in the format `CLIENT.CONTEXT+PO_TOKEN`, e.g. `youtube:po_token=web.gvs+XXX,web.player+XXX,web_safari.gvs+YYY`. Context can be either `gvs` (Google Video Server URLs) or `player` (Innertube player request)
-* `player_js_variant`: The player javascript variant to use for signature and nsig deciphering. The known variants are: `main`, `tce`, `tv`, `tv_es6`, `phone`, `tablet`. Only `main` is recommended as a possible workaround; the others are for debugging purposes. The default is to use what is prescribed by the site, and can be selected with `actual`
+* `pot_trace`: Enable debug logging for PO Token fetching. Either `true` or `false` (default)
+* `fetch_pot`: Policy to use for fetching a PO Token from providers. One of `always` (always try to fetch a PO Token, regardless of whether the client requires one for the given context), `never` (never fetch a PO Token), or `auto` (default; only fetch a PO Token if the client requires one for the given context)
+
+#### youtubepot-webpo
+* `bind_to_visitor_id`: Whether to use the Visitor ID instead of Visitor Data for caching WebPO tokens. Either `true` (default) or `false`
 
 #### youtubetab (YouTube playlists, channels, feeds, etc.)
 * `skip`: One or more of `webpage` (skip initial webpage download), `authcheck` (allow the download of playlists requiring authentication when no initial webpage is downloaded. This may cause unwanted behavior, see [#1122](https://github.com/yt-dlp/yt-dlp/pull/1122) for more details)
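
To make the `po_token` format and the `fetch_pot` policy described in this hunk concrete, here is a minimal sketch. It is not yt-dlp's own parser: `PoTokenSpec`, `parse_po_tokens` and `should_fetch_pot` are hypothetical names, and the strict validation (every entry must contain both `.` and `+`) is an assumption made for illustration.

```python
from typing import NamedTuple


class PoTokenSpec(NamedTuple):
    """Hypothetical container for one CLIENT.CONTEXT+PO_TOKEN entry."""
    client: str
    context: str
    token: str


VALID_CONTEXTS = {'gvs', 'player'}  # the two contexts named in the README text


def parse_po_tokens(value: str) -> list[PoTokenSpec]:
    """Split a comma-separated po_token argument into its parts (illustrative only)."""
    specs = []
    for item in filter(None, (part.strip() for part in value.split(','))):
        prefix, plus, token = item.partition('+')
        client, dot, context = prefix.partition('.')
        if not (plus and dot and client and token and context in VALID_CONTEXTS):
            raise ValueError(f'Malformed PO Token entry: {item!r}')
        specs.append(PoTokenSpec(client, context, token))
    return specs


def should_fetch_pot(policy: str, required: bool) -> bool:
    """Apply the always / never / auto policy as the README describes it."""
    if policy == 'always':
        return True
    if policy == 'never':
        return False
    if policy == 'auto':
        return required  # only fetch when the client needs a token for this context
    raise ValueError(f'Unknown fetch_pot policy: {policy!r}')
```

In this sketch an entry without a `+` separator raises `ValueError` rather than being silently ignored; whether yt-dlp is equally strict is not stated in the diff.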

test/test_YoutubeDL.py

Lines changed: 21 additions & 0 deletions
@@ -1435,6 +1435,27 @@ def test_load_plugins_compat(self):
         FakeYDL().close()
         assert all_plugins_loaded.value
 
+    def test_close_hooks(self):
+        # Should call all registered close hooks on close
+        close_hook_called = False
+        close_hook_two_called = False
+
+        def close_hook():
+            nonlocal close_hook_called
+            close_hook_called = True
+
+        def close_hook_two():
+            nonlocal close_hook_two_called
+            close_hook_two_called = True
+
+        ydl = FakeYDL()
+        ydl.add_close_hook(close_hook)
+        ydl.add_close_hook(close_hook_two)
+
+        ydl.close()
+        self.assertTrue(close_hook_called, 'Close hook was not called')
+        self.assertTrue(close_hook_two_called, 'Close hook two was not called')
+
 
 if __name__ == '__main__':
     unittest.main()
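
The behaviour that `test_close_hooks` checks (every registered hook runs when the object is closed) can be sketched in isolation. The `Resource` class below is a hypothetical stand-in, not the `YoutubeDL` implementation; running hooks in registration order and draining the list so that a second `close()` is a no-op are assumptions of this sketch.

```python
from typing import Callable


class Resource:
    """Hypothetical object that owns cleanup callbacks."""

    def __init__(self) -> None:
        self._close_hooks: list[Callable[[], None]] = []

    def add_close_hook(self, hook: Callable[[], None]) -> None:
        self._close_hooks.append(hook)

    def close(self) -> None:
        # Run hooks in registration order, removing each one before calling it
        # so that calling close() twice does not run a hook twice
        while self._close_hooks:
            self._close_hooks.pop(0)()


calls = []
res = Resource()
res.add_close_hook(lambda: calls.append('one'))
res.add_close_hook(lambda: calls.append('two'))
res.close()
```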

test/test_networking_utils.py

Lines changed: 1 addition & 2 deletions
@@ -20,15 +20,14 @@
     add_accept_encoding_header,
     get_redirect_method,
     make_socks_proxy_opts,
-    select_proxy,
     ssl_load_certs,
 )
 from yt_dlp.networking.exceptions import (
     HTTPError,
     IncompleteRead,
 )
 from yt_dlp.socks import ProxyType
-from yt_dlp.utils.networking import HTTPHeaderDict
+from yt_dlp.utils.networking import HTTPHeaderDict, select_proxy
 
 TEST_DIR = os.path.dirname(os.path.abspath(__file__))
 
test/test_pot/conftest.py

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+import collections
+
+import pytest
+
+from yt_dlp import YoutubeDL
+from yt_dlp.cookies import YoutubeDLCookieJar
+from yt_dlp.extractor.common import InfoExtractor
+from yt_dlp.extractor.youtube.pot._provider import IEContentProviderLogger
+from yt_dlp.extractor.youtube.pot.provider import PoTokenRequest, PoTokenContext
+from yt_dlp.utils.networking import HTTPHeaderDict
+
+
+class MockLogger(IEContentProviderLogger):
+
+    log_level = IEContentProviderLogger.LogLevel.TRACE
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.messages = collections.defaultdict(list)
+
+    def trace(self, message: str):
+        self.messages['trace'].append(message)
+
+    def debug(self, message: str):
+        self.messages['debug'].append(message)
+
+    def info(self, message: str):
+        self.messages['info'].append(message)
+
+    def warning(self, message: str, *, once=False):
+        self.messages['warning'].append(message)
+
+    def error(self, message: str):
+        self.messages['error'].append(message)
+
+
+@pytest.fixture
+def ie() -> InfoExtractor:
+    ydl = YoutubeDL()
+    return ydl.get_info_extractor('Youtube')
+
+
+@pytest.fixture
+def logger() -> MockLogger:
+    return MockLogger()
+
+
+@pytest.fixture()
+def pot_request() -> PoTokenRequest:
+    return PoTokenRequest(
+        context=PoTokenContext.GVS,
+        innertube_context={'client': {'clientName': 'WEB'}},
+        innertube_host='youtube.com',
+        session_index=None,
+        player_url=None,
+        is_authenticated=False,
+        video_webpage=None,
+
+        visitor_data='example-visitor-data',
+        data_sync_id='example-data-sync-id',
+        video_id='example-video-id',
+
+        request_cookiejar=YoutubeDLCookieJar(),
+        request_proxy=None,
+        request_headers=HTTPHeaderDict(),
+        request_timeout=None,
+        request_source_address=None,
+        request_verify_tls=True,
+
+        bypass_cache=False,
+    )
Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
+import threading
+import time
+from collections import OrderedDict
+import pytest
+from yt_dlp.extractor.youtube.pot._provider import IEContentProvider, BuiltinIEContentProvider
+from yt_dlp.utils import bug_reports_message
+from yt_dlp.extractor.youtube.pot._builtin.memory_cache import MemoryLRUPCP, memorylru_preference, initialize_global_cache
+from yt_dlp.version import __version__
+from yt_dlp.extractor.youtube.pot._registry import _pot_cache_providers, _pot_memory_cache
+
+
+class TestMemoryLRUPCS:
+
+    def test_base_type(self):
+        assert issubclass(MemoryLRUPCP, IEContentProvider)
+        assert issubclass(MemoryLRUPCP, BuiltinIEContentProvider)
+
+    @pytest.fixture
+    def pcp(self, ie, logger) -> MemoryLRUPCP:
+        return MemoryLRUPCP(ie, logger, {}, initialize_cache=lambda max_size: (OrderedDict(), threading.Lock(), max_size))
+
+    def test_is_registered(self):
+        assert _pot_cache_providers.value.get('MemoryLRU') == MemoryLRUPCP
+
+    def test_initialization(self, pcp):
+        assert pcp.PROVIDER_NAME == 'memory'
+        assert pcp.PROVIDER_VERSION == __version__
+        assert pcp.BUG_REPORT_MESSAGE == bug_reports_message(before='')
+        assert pcp.is_available()
+
+    def test_store_and_get(self, pcp):
+        pcp.store('key1', 'value1', int(time.time()) + 60)
+        assert pcp.get('key1') == 'value1'
+        assert len(pcp.cache) == 1
+
+    def test_store_ignore_expired(self, pcp):
+        pcp.store('key1', 'value1', int(time.time()) - 1)
+        assert len(pcp.cache) == 0
+        assert pcp.get('key1') is None
+        assert len(pcp.cache) == 0
+
+    def test_store_override_existing_key(self, ie, logger):
+        MAX_SIZE = 2
+        pcp = MemoryLRUPCP(ie, logger, {}, initialize_cache=lambda max_size: (OrderedDict(), threading.Lock(), MAX_SIZE))
+        pcp.store('key1', 'value1', int(time.time()) + 60)
+        pcp.store('key2', 'value2', int(time.time()) + 60)
+        assert len(pcp.cache) == 2
+        pcp.store('key1', 'value2', int(time.time()) + 60)
+        # Ensure that the override key gets added to the end of the cache instead of in the same position
+        pcp.store('key3', 'value3', int(time.time()) + 60)
+        assert pcp.get('key1') == 'value2'
+
+    def test_store_ignore_expired_existing_key(self, pcp):
+        pcp.store('key1', 'value2', int(time.time()) + 60)
+        pcp.store('key1', 'value1', int(time.time()) - 1)
+        assert len(pcp.cache) == 1
+        assert pcp.get('key1') == 'value2'
+        assert len(pcp.cache) == 1
+
+    def test_get_key_expired(self, pcp):
+        pcp.store('key1', 'value1', int(time.time()) + 60)
+        assert pcp.get('key1') == 'value1'
+        assert len(pcp.cache) == 1
+        pcp.cache['key1'] = ('value1', int(time.time()) - 1)
+        assert pcp.get('key1') is None
+        assert len(pcp.cache) == 0
+
+    def test_lru_eviction(self, ie, logger):
+        MAX_SIZE = 2
+        provider = MemoryLRUPCP(ie, logger, {}, initialize_cache=lambda max_size: (OrderedDict(), threading.Lock(), MAX_SIZE))
+        provider.store('key1', 'value1', int(time.time()) + 5)
+        provider.store('key2', 'value2', int(time.time()) + 5)
+        assert len(provider.cache) == 2
+
+        assert provider.get('key1') == 'value1'
+
+        provider.store('key3', 'value3', int(time.time()) + 5)
+        assert len(provider.cache) == 2
+
+        assert provider.get('key2') is None
+
+        provider.store('key4', 'value4', int(time.time()) + 5)
+        assert len(provider.cache) == 2
+
+        assert provider.get('key1') is None
+        assert provider.get('key3') == 'value3'
+        assert provider.get('key4') == 'value4'
+
+    def test_delete(self, pcp):
+        pcp.store('key1', 'value1', int(time.time()) + 5)
+        assert len(pcp.cache) == 1
+        assert pcp.get('key1') == 'value1'
+        pcp.delete('key1')
+        assert len(pcp.cache) == 0
+        assert pcp.get('key1') is None
+
+    def test_use_global_cache_default(self, ie, logger):
+        pcp = MemoryLRUPCP(ie, logger, {})
+        assert pcp.max_size == _pot_memory_cache.value['max_size'] == 25
+        assert pcp.cache is _pot_memory_cache.value['cache']
+        assert pcp.lock is _pot_memory_cache.value['lock']
+
+        pcp2 = MemoryLRUPCP(ie, logger, {})
+        assert pcp.max_size == pcp2.max_size == _pot_memory_cache.value['max_size'] == 25
+        assert pcp.cache is pcp2.cache is _pot_memory_cache.value['cache']
+        assert pcp.lock is pcp2.lock is _pot_memory_cache.value['lock']
+
+    def test_fail_max_size_change_global(self, ie, logger):
+        pcp = MemoryLRUPCP(ie, logger, {})
+        assert pcp.max_size == _pot_memory_cache.value['max_size'] == 25
+        with pytest.raises(ValueError, match='Cannot change max_size of initialized global memory cache'):
+            initialize_global_cache(50)
+
+        assert pcp.max_size == _pot_memory_cache.value['max_size'] == 25
+
+    def test_memory_lru_preference(self, pcp, ie, pot_request):
+        assert memorylru_preference(pcp, pot_request) == 10000
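
The cache behaviour these tests pin down (expired entries are ignored on store and dropped on read, overwriting a key moves it to the newest position, and the least recently used key is evicted first) can be sketched with an `OrderedDict` and a lock. `ExpiringLRUCache` is a hypothetical class written for illustration; it mirrors the tested behaviour but is not the `MemoryLRUPCP` source, and the default size of 25 is simply copied from the value the tests assert.

```python
import threading
import time
from collections import OrderedDict


class ExpiringLRUCache:
    """Hypothetical LRU cache whose entries carry an absolute expiry timestamp."""

    def __init__(self, max_size=25):
        self.max_size = max_size
        self.cache = OrderedDict()  # key -> (value, expires_at)
        self.lock = threading.Lock()

    def store(self, key, value, expires_at):
        with self.lock:
            if expires_at <= int(time.time()):
                return  # already expired: keep whatever is stored under this key
            self.cache.pop(key, None)  # re-insert so the key becomes the newest entry
            self.cache[key] = (value, expires_at)
            while len(self.cache) > self.max_size:
                self.cache.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        with self.lock:
            entry = self.cache.get(key)
            if entry is None:
                return None
            value, expires_at = entry
            if expires_at <= int(time.time()):
                del self.cache[key]  # lazily drop entries that expired after storing
                return None
            self.cache.move_to_end(key)  # a read counts as a use
            return value

    def delete(self, key):
        with self.lock:
            self.cache.pop(key, None)
```

Holding a single lock around every operation keeps the sketch simple; a shared global cache, as the tests suggest the real provider uses, needs the same protection because several provider instances touch one dictionary.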
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+import pytest
+from yt_dlp.extractor.youtube.pot.provider import (
+    PoTokenContext,
+
+)
+
+from yt_dlp.extractor.youtube.pot.utils import get_webpo_content_binding, ContentBindingType
+
+
+class TestGetWebPoContentBinding:
+
+    @pytest.mark.parametrize('client_name, context, is_authenticated, expected', [
+        *[(client, context, is_authenticated, expected) for client in [
+            'WEB', 'MWEB', 'TVHTML5', 'WEB_EMBEDDED_PLAYER', 'WEB_CREATOR', 'TVHTML5_SIMPLY_EMBEDDED_PLAYER']
+          for context, is_authenticated, expected in [
+            (PoTokenContext.GVS, False, ('example-visitor-data', ContentBindingType.VISITOR_DATA)),
+            (PoTokenContext.PLAYER, False, ('example-video-id', ContentBindingType.VIDEO_ID)),
+            (PoTokenContext.GVS, True, ('example-data-sync-id', ContentBindingType.DATASYNC_ID)),
+        ]],
+        ('WEB_REMIX', PoTokenContext.GVS, False, ('example-visitor-data', ContentBindingType.VISITOR_DATA)),
+        ('WEB_REMIX', PoTokenContext.PLAYER, False, ('example-visitor-data', ContentBindingType.VISITOR_DATA)),
+        ('ANDROID', PoTokenContext.GVS, False, (None, None)),
+        ('IOS', PoTokenContext.GVS, False, (None, None)),
+    ])
+    def test_get_webpo_content_binding(self, pot_request, client_name, context, is_authenticated, expected):
+        pot_request.innertube_context['client']['clientName'] = client_name
+        pot_request.context = context
+        pot_request.is_authenticated = is_authenticated
+        assert get_webpo_content_binding(pot_request) == expected
+
+    def test_extract_visitor_id(self, pot_request):
+        pot_request.visitor_data = 'CgsxMjNhYmNYWVpfLSiA4s%2DqBg%3D%3D'
+        assert get_webpo_content_binding(pot_request, bind_to_visitor_id=True) == ('123abcXYZ_-', ContentBindingType.VISITOR_ID)
+
+    def test_invalid_visitor_id(self, pot_request):
+        # visitor id not alphanumeric (i.e. protobuf extraction failed)
+        pot_request.visitor_data = 'CggxMjM0NTY3OCiA4s-qBg%3D%3D'
+        assert get_webpo_content_binding(pot_request, bind_to_visitor_id=True) == (pot_request.visitor_data, ContentBindingType.VISITOR_DATA)
+
+    def test_no_visitor_id(self, pot_request):
+        pot_request.visitor_data = 'KIDiz6oG'
+        assert get_webpo_content_binding(pot_request, bind_to_visitor_id=True) == (pot_request.visitor_data, ContentBindingType.VISITOR_DATA)
+
+    def test_invalid_base64(self, pot_request):
+        pot_request.visitor_data = 'invalid-base64'
+        assert get_webpo_content_binding(pot_request, bind_to_visitor_id=True) == (pot_request.visitor_data, ContentBindingType.VISITOR_DATA)
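
The selection rules these test cases imply can be sketched without yt-dlp. Everything below is hypothetical (`choose_binding`, `extract_visitor_id`, the enum names) and is not the `get_webpo_content_binding` source. Three points are assumptions rather than facts from the diff: authenticated requests that the tests do not cover fall back to the Data Sync ID, the visitor ID sits in protobuf field 1 of the decoded Visitor Data, and a valid visitor ID is exactly 11 URL-safe characters.

```python
import base64
import binascii
import re
import urllib.parse
from enum import Enum


class Context(Enum):
    GVS = 'gvs'
    PLAYER = 'player'


class BindingType(Enum):
    VISITOR_DATA = 'visitor_data'
    VISITOR_ID = 'visitor_id'
    DATASYNC_ID = 'datasync_id'
    VIDEO_ID = 'video_id'


# Client names taken from the parametrized test cases above
WEBPO_CLIENTS = {
    'WEB', 'MWEB', 'TVHTML5', 'WEB_EMBEDDED_PLAYER', 'WEB_CREATOR',
    'TVHTML5_SIMPLY_EMBEDDED_PLAYER', 'WEB_REMIX',
}
VISITOR_ID_RE = re.compile(r'[A-Za-z0-9_-]{11}')  # assumed shape of a visitor ID


def _read_varint(buf, pos):
    """Decode one protobuf varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError('truncated varint')
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def extract_visitor_id(visitor_data):
    """Return the visitor ID embedded in Visitor Data, or None if it cannot be read."""
    try:
        raw = base64.urlsafe_b64decode(urllib.parse.unquote(visitor_data))
        tag, pos = _read_varint(raw, 0)
        if tag != 0x0A:  # expect field 1 with the length-delimited wire type
            return None
        length, pos = _read_varint(raw, pos)
        candidate = raw[pos:pos + length].decode('ascii')
    except (binascii.Error, ValueError):  # bad base64, truncated data, non-ASCII bytes
        return None
    return candidate if VISITOR_ID_RE.fullmatch(candidate) else None


def choose_binding(client_name, context, is_authenticated, *, visitor_data,
                   data_sync_id, video_id, bind_to_visitor_id=False):
    """Pick the (value, type) pair a WebPO token would be bound to."""
    if client_name not in WEBPO_CLIENTS:
        return None, None
    if context is Context.PLAYER and client_name != 'WEB_REMIX':
        return video_id, BindingType.VIDEO_ID
    if is_authenticated:  # assumption for combinations the tests do not exercise
        return data_sync_id, BindingType.DATASYNC_ID
    if bind_to_visitor_id:
        visitor_id = extract_visitor_id(visitor_data)
        if visitor_id:
            return visitor_id, BindingType.VISITOR_ID
    return visitor_data, BindingType.VISITOR_DATA
```

Falling back to the raw Visitor Data whenever extraction fails matches the three failure tests, which all expect the original string with the `VISITOR_DATA` type.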
