99
1010### This library uses an undocumented YouTube API, so it may stop working at any time. Use at your own risk.
1111
12- > ** Note:** If you want to use this library on Android platform, refer to
12+ > ** Note:** If you want to use this library on an Android platform, refer to
1313> [ Android compatibility] ( #-android-compatibility ) .
1414
1515## 📖 Introduction
1616
1717Java library which allows you to retrieve subtitles/transcripts for a YouTube video.
1818It supports manual and automatically generated subtitles, bulk transcript retrieval for all videos in the playlist or
19- on the channel and does not use headless browser for scraping.
19+ on the channel and does not use a headless browser for scraping.
2020Inspired by [ Python library] ( https://github.com/jdepoix/youtube-transcript-api ) .
2121
2222## ☑️ Features
@@ -60,6 +60,15 @@ implementation 'io.github.thoroldvix:youtube-transcript-api:0.3.6'
6060implementation("io.github.thoroldvix:youtube-transcript-api:0.3.6")
6161```
6262
63+ ## ❗ IMPORTANT ❗
64+
65+ YouTube has started blocking most IPs that belong to cloud providers (such as AWS, Google Cloud Platform, and Azure),
66+ so you will most likely get access errors when deploying to any cloud solution. YouTube may also block you
67+ when you run the library locally if you make too many requests, mainly when
68+ using [bulk transcript retrieval](#bulk-transcript-retrieval).
69+ To avoid this, you will need to use rotating proxies like [Webshare](https://www.webshare.io/?referral_code=g0ylrg6pzy7f) (referral link) or similar solutions.
70+ To make the library use your proxy, see [YoutubeClient Customization and Proxy](#youtubeclient-customization-and-proxy).
71+
6372## 🔰 Getting Started
6473
6574To start using YouTube Transcript API, you need to create an instance of ` YoutubeTranscriptApi ` by
@@ -81,15 +90,15 @@ for [finding specific transcripts](#find-transcripts) by language or by type (ma
8190``` java
8291TranscriptList transcriptList = youtubeTranscriptApi.listTranscripts("videoId");
8392
84- // Iterate over transcript list
85- for (Transcript transcript : transcriptList) {
86- System . out. println(transcript);
93+ // Iterate over a transcript list
94+ for (Transcript transcript : transcriptList) {
95+     System.out.println(transcript);
8796}
8897
8998// Find transcript in specific language
9099Transcript transcript = transcriptList.findTranscript("en");
91100
92- // Find manually created transcript
101+ // Find a manually created transcript
93102Transcript manuallyCreatedTranscript = transcriptList.findManualTranscript("en");
94103
95104// Find automatically generated transcript
@@ -138,18 +147,19 @@ TranscriptContent transcriptContent = youtubeTranscriptApi.listTranscripts("vide
138147Given that English is the most common language, you can omit the language code, and it will default to English:
139148
140149``` java
141- // Retrieve transcript content in english
150+ // Retrieve transcript content in English
142151TranscriptContent transcriptContent = youtubeTranscriptApi.listTranscripts("videoId")
143- // no language code defaults to english
144- .findTranscript()
145- .fetch();
152+ // no language code defaults to English
153+ .findTranscript()
154+ .fetch();
146155// Or
147156TranscriptContent transcriptContent = youtubeTranscriptApi.getTranscript("videoId");
148157```
149158
150159For bulk transcript retrieval see [ Bulk Transcript Retrieval] ( #bulk-transcript-retrieval ) .
151160
152161## 🤖 Android compatibility
162+
153163This library uses the Java 11 HttpClient for YouTube requests by default, which keeps its third-party
154164dependencies to a minimum. Since the Android SDK doesn't include the Java 11 HttpClient, you will have to implement
155165your own `YoutubeClient` for it to work.
@@ -160,7 +170,8 @@ You can check how to do it in [YoutubeClient Customization and Proxy](#youtubecl
160170
161171### Use fallback language
162172
163- In case if desired language is not available, instead of getting an exception you can pass some other languages that
173+ If the desired language is not available, instead of getting an exception, you can pass other languages
174+ that
164175will be used as a fallback.
165176
166177For example:
@@ -260,15 +271,14 @@ By default, `YoutubeTranscriptApi` uses Java 11 HttpClient for making requests t
260271different client or use a proxy,
261272you can create your own YouTube client by implementing the `YoutubeClient ` interface.
262273
263- Here is example implementation using OkHttp :
274+ Here is an example implementation using OkHttp :
264275
265276```java
266277public class OkHttpYoutubeClient implements YoutubeClient {
267-
268278 private final OkHttpClient client;
269279
270280 public OkHttpYoutubeClient () {
271- this . client = new OkHttpClient ();
281+ this . client = new OkHttpClient ();
272282 }
273283
274284 @Override
@@ -278,67 +288,61 @@ public class OkHttpYoutubeClient implements YoutubeClient {
278288 .url(url)
279289 .build();
280290
281- return sendGetRequest (request);
291+ return executeRequest (request);
282292 }
283293
284294 @Override
285- public String get (YtApiV3Endpoint endpoint , Map<String , String > params ) throws TranscriptRetrievalException {
295+ public String post (String url , String json ) throws TranscriptRetrievalException {
296+ RequestBody requestBody = RequestBody . create(json, MediaType . parse(" application/json; charset=utf-8" ));
297+
286298 Request request = new Request .Builder ()
287- .url(endpoint. url(params))
299+ .url(url)
300+ .post(requestBody)
288301 .build();
289302
290- return sendGetRequest (request);
303+ return executeRequest (request);
291304 }
292305
293- private String sendGetRequest (Request request ) throws TranscriptRetrievalException {
306+ private String executeRequest (Request request ) throws TranscriptRetrievalException {
294307 try (Response response = client. newCall(request). execute()) {
295308 if (response. isSuccessful()) {
296- ResponseBody body = response. body();
297- if (body == null ) {
309+ ResponseBody responseBody = response. body();
310+ if (responseBody == null ) {
298311 throw new TranscriptRetrievalException (" Response body is null" );
299312 }
300- return body . string();
313+ return responseBody . string();
301314 }
302315 } catch (IOException e) {
303- throw new TranscriptRetrievalException (" Failed to retrieve data from YouTube " , e);
316+ throw new TranscriptRetrievalException (" HTTP request failed " , e);
304317 }
305- throw new TranscriptRetrievalException (" Failed to retrieve data from YouTube" );
318+
319+ throw new TranscriptRetrievalException (" HTTP request failed with non-successful response" );
306320 }
307321}
308322```
309- After implementing your custom ` YouTubeClient ` you will need to pass it to ` TranscriptApiFactory ` ` createWithClient ` method.
323+
324+ After implementing your custom `YoutubeClient`, you will need to pass it to the `createWithClient` method of
325+ `TranscriptApiFactory`.
310326
311327``` java
312328YoutubeClient okHttpClient = new OkHttpYoutubeClient();
313329YoutubeTranscriptApi youtubeTranscriptApi = TranscriptApiFactory.createWithClient(okHttpClient);
314330```
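
For the proxy case, the HTTP client underneath your `YoutubeClient` is what needs to be configured. The following is a minimal sketch using the Java 11 `HttpClient`, assuming a plain HTTP proxy. `ProxiedHttpClientSketch` and `create` are hypothetical names that are not part of this library, the host and port are placeholders, and wiring the resulting client into a `YoutubeClient` implementation is not shown.

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.http.HttpClient;
import java.time.Duration;

// Hypothetical helper, not part of youtube-transcript-api.
final class ProxiedHttpClientSketch {

    // Builds an HttpClient that routes HTTP and HTTPS requests through the given proxy.
    static HttpClient create(String proxyHost, int proxyPort) {
        return HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress(proxyHost, proxyPort)))
                // The timeout value is an arbitrary choice for illustration.
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }
}
```

A custom `YoutubeClient` could hold a client created this way, for example `ProxiedHttpClientSketch.create("proxy.example.com", 8080)`, and use it for its requests. Rotating proxy services usually expose a single host and port like this, but check your provider's documentation for authentication requirements.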
315331
316332### Cookies
317-
318- Some videos may be age-restricted, requiring authentication to access the transcript.
319- To achieve this, obtain access to the desired video in a browser and download the cookies in Netscape format, storing
320- them as a TXT file.
321- You can use extensions
322- like [ Get cookies.txt LOCALLY] ( https://chromewebstore.google.com/detail/get-cookiestxt-locally/cclelndahbckbenkjhflpdbgdldlbecc )
323- for Chrome or [ cookies.txt] ( https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/ ) for Firefox to do this.
324- ` YoutubeTranscriptApi ` contains ` listTranscriptsWithCookies ` and ` getTranscriptWithCookies ` which accept a path to the
325- cookies.txt file.
326-
327- ``` java
328- // Retrieve transcript list
329- TranscriptList transcriptList = youtubeTranscriptApi. listTranscriptsWithCookies(" videoId" , " path/to/cookies.txt" );
330-
331- // Get transcript content
332- TranscriptContent transcriptContent = youtubeTranscriptApi. getTranscriptWithCookies(" videoId" , " path/to/cookies.txt" , " en" );
333- ```
333+ Some videos are age-restricted, so this library won't be able to access those videos without some sort of authentication.
334+ Unfortunately, some recent changes to the YouTube API have broken the current implementation of cookie-based
335+ authentication, so this feature is currently not available.
334336
335337### Bulk Transcript Retrieval
336338
337- There are a few methods for bulk transcript retrieval in ` YoutubeTranscriptApi `
339+ #### ❗You will most likely get [ IP blocked] ( #-important- ) by YouTube if you use this❗
340+
341+ There are a few methods for bulk transcript retrieval in ` YoutubeTranscriptApi `
338342
339- Playlists and channels information is retrieved from
343+ Playlist and channel information is retrieved from
340344the [ YouTube V3 API] ( https://developers.google.com/youtube/v3/docs/ ) ,
341- so you will need to provide API key for all methods.
345+ so you will need to provide an API key for all methods.
342346
343347All methods take a ` TranscriptRequest ` object as a parameter,
344348which contains the following fields:
@@ -348,8 +352,6 @@ which contains the following fields:
348352 fail fast by throwing an error if one of the transcripts could not be retrieved,
349353 otherwise it will ignore failed transcripts.
350354
351- - ` cookies ` (optional) - Path to [ cookies.txt] ( #cookies ) file.
352-
353355All methods return a map which contains the video ID as a key and the corresponding result as a value.
354356
355357``` java
@@ -426,10 +428,28 @@ undocumented API URL embedded within its HTML. This JSON looks like this:
426428}
427429```
428430
429- This library works by making a single GET request to the YouTube page of the specified video, extracting the JSON data
430- from the HTML, and parsing it to obtain a list of all available transcripts. To fetch the transcript content, it then
431- sends a GET request to the API URL extracted from the JSON. The YouTube API returns the transcript content in XML
432- format, like this:
431+ Previously, you could extract this JSON directly from the video page HTML and call the API URL it contains, but YouTube
432+ no longer allows
433+ requests to the URL embedded in this JSON.
434+ There is a workaround: each video page also contains an INNERTUBE_API_KEY field, which can be used to access
435+ the internal YouTube API. With this key, you can make a POST request to
436+ `https://www.youtube.com/youtubei/v1/player?key=INNERTUBE_API_KEY` with a body like this:
437+
438+ ``` json
439+ {
440+ "context" : {
441+ "client" : {
442+ "clientName" : " ANDROID" ,
443+ "clientVersion" : " 20.10.38"
444+ }
445+ },
446+ "videoId" : " dQw4w9WgXcQ"
447+ }
448+ ```
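
As a rough illustration only, a request of this shape could be assembled with the Java 11 `HttpClient` API as sketched here. `InnertubeRequestSketch`, `buildBody`, and `buildRequest` are hypothetical names that are not part of this library, and the sketch assumes the API key has already been extracted from the video page.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of youtube-transcript-api.
final class InnertubeRequestSketch {

    // Builds the JSON body shown above. The video ID is inserted without escaping,
    // on the assumption that it contains only letters, digits, '-' and '_'.
    static String buildBody(String videoId) {
        return "{\"context\":{\"client\":{"
                + "\"clientName\":\"ANDROID\","
                + "\"clientVersion\":\"20.10.38\"}},"
                + "\"videoId\":\"" + videoId + "\"}";
    }

    // Builds (but does not send) the POST request to the player endpoint.
    static HttpRequest buildRequest(String innertubeApiKey, String videoId) {
        return HttpRequest.newBuilder()
                .uri(URI.create("https://www.youtube.com/youtubei/v1/player?key=" + innertubeApiKey))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(buildBody(videoId)))
                .build();
    }
}
```

Sending the request and parsing the response are left out, since the response shape is undocumented and may change.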
449+
450+ The response contains JSON similar to the JSON embedded in the video page HTML. The API URL extracted from it is then
451+ called to retrieve the content of the transcript,
452+ which is in XML format and looks like this:
433453
434454``` xml
435455<?xml version =" 1.0" encoding =" utf-8" ?>