Merge pull request #437 from apify/puppeteer-playwright-use-cases

mstephen19 · web-flow · commit 75fb4c647bc0 · 2022-09-20T13:21:22.000+02:00
docs(academy): Puppeteer/Playwright use cases migrations
diff --git a/content/academy/anti_scraping/mitigation/generating_fingerprints.md b/content/academy/anti_scraping/mitigation/generating_fingerprints.md
@@ -30,7 +30,7 @@ const fingerprintGenerator = new FingerprintGenerator({
 });
 
 // Grab a fingerprint from the fingerprint generator
-const { fingerprint } = fingerprintGenerator.getFingerprint({
+const generated = fingerprintGenerator.getFingerprint({
   locales: ["en-US", "en"]
 });
 ```
@@ -65,20 +65,20 @@ const fingerprintGenerator = new FingerprintGenerator({
 });
 
 // Grab a fingerprint
-const { fingerprint } = fingerprintGenerator.getFingerprint({
+const generated = fingerprintGenerator.getFingerprint({
   locales: ["en-US", "en"]
 });
 
 // Create a new browser context, plugging in
 // some values from the fingerprint
 const context = await browser.newContext({
-  userAgent: fingerprint.userAgent,
-  locale: fingerprint.navigator.language,
+  userAgent: generated.fingerprint.userAgent,
+  locale: generated.fingerprint.navigator.language,
 });
 
 // Attach the fingerprint to the newly created
 // browser context
-await fingerprintInjector.attachFingerprintToPlaywright(context, fingerprint);
+await fingerprintInjector.attachFingerprintToPlaywright(context, generated);
 
 // Create a new page and go to Google
 const page = await context.newPage();
diff --git a/content/academy/puppeteer_playwright/common_use_cases/downloading_files.md b/content/academy/puppeteer_playwright/common_use_cases/downloading_files.md
@@ -0,0 +1,98 @@
+---
+title: Downloading files
+description: Learn how to automate the downloading and saving of files to the disk using Puppeteer or Playwright.
+menuWeight: 3
+paths:
+    - puppeteer-playwright/common-use-cases/downloading-files
+---
+
+# Downloading files
+
+Downloading a file using Puppeteer can be tricky. On some systems, there can be issues with the usual file saving process that prevent you from doing it the easy way. However, there are different techniques that work (most of the time).
+
+These techniques are only necessary when we don't have a direct file link, which is usually the case when the file being downloaded is based on more complicated data export.
+
+## [](#setting-up-a-download-path) Setting up a download path
+
+Let's start with the easiest technique. This method tells the browser which folder to save a file into after we click its download button in Puppeteer.
+
+```JavaScript
+const client = await page.target().createCDPSession();
+await client.send('Page.setDownloadBehavior', { behavior: 'allow', downloadPath: './my-downloads' });
+```
+
+We open a Chrome DevTools Protocol (CDP) session, which gives us access to all the functions of the underlying protocol and extends Puppeteer's functionality. Older tutorials reach the same session through the private `page._client` property, which is not part of the public API. Then we can download the file by clicking on the button.
+
+```JavaScript
+await page.click('.export-button');
+```
+
+For simplicity, let's wait for one minute. In a real use case, you would check the state of the file in the file system instead of waiting for a fixed time.
+
+```JavaScript
+await new Promise((resolve) => setTimeout(resolve, 60000));
+```
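+
+Rather than sleeping for a fixed time, we can poll the download folder until a finished file shows up. The following is a minimal sketch, not part of Puppeteer: `waitForDownload` and its options are hypothetical names, and it assumes Chrome's convention of keeping a `.crdownload` suffix on files that are still being downloaded.
+
+```JavaScript
+import { readdir } from 'fs/promises';
+
+// Hypothetical helper: resolves with the name of the first finished file in `dir`
+async function waitForDownload(dir, { timeoutMs = 60000, intervalMs = 500 } = {}) {
+    const deadline = Date.now() + timeoutMs;
+    while (Date.now() < deadline) {
+        // The folder may not exist yet, so treat a failed read as "no files"
+        const fileNames = await readdir(dir).catch(() => []);
+        // Assumption: unfinished Chrome downloads end with ".crdownload"
+        const finished = fileNames.filter((name) => !name.endsWith('.crdownload'));
+        if (finished.length > 0) return finished[0];
+        await new Promise((resolve) => setTimeout(resolve, intervalMs));
+    }
+    throw new Error(`No finished download appeared in ${dir} within ${timeoutMs} ms`);
+}
+```
+
+It could then stand in for the fixed wait, for example `const fileName = await waitForDownload('./my-downloads');`.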
+
+To extract the file from the file system into memory, we have to first find its name, and then we can read it.
+
+```JavaScript
+import fs from 'fs';
+
+const fileNames = fs.readdirSync('./my-downloads');
+
+// Let's pick the first one
+const fileData = fs.readFileSync(`./my-downloads/${fileNames[0]}`);
+```
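+
+In Playwright, the same result usually doesn't need the protocol-level setup, because a page emits a `download` event. Below is a hedged sketch of that approach; `downloadByClick` is a hypothetical helper name, and `page` is assumed to be a Playwright `Page`.
+
+```JavaScript
+// Sketch only: `downloadByClick` is a hypothetical helper, not a Playwright API
+async function downloadByClick(page, selector, targetPath) {
+    // Start waiting for the download before clicking, so the event isn't missed
+    const [download] = await Promise.all([
+        page.waitForEvent('download'),
+        page.click(selector),
+    ]);
+    // Copy the finished download to a location of our choosing
+    await download.saveAs(targetPath);
+    return download.suggestedFilename();
+}
+```
+
+A call might look like `await downloadByClick(page, '.export-button', './my-downloads/export.csv');`, where the target path is only an example.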
+
+## [](#intercepting-a-file-download-request) Intercepting and replicating a file download request
+
+For this second option, we can trigger the file download, intercept the request going out, and then replicate it to get the actual data. First, we need to enable request interception. This is done using the following line of code:
+
+```JavaScript
+await page.setRequestInterception(true);
+```
+
+Next, we need to trigger the actual file export. We might need to fill in some form, select an exported file type, etc. In the end, it will look something like this:
+
+```JavaScript
+page.click('.export-button');
+```
+
+We don't await this promise here, since we'll be waiting for the result of this action anyway (the triggered request). Awaiting it could also let the request fire before our listener is attached.
+
+The crucial part is intercepting the request that would result in downloading the file. Since the interception is already enabled, we just need to wait for the request to be sent.
+
+```JavaScript
+const xRequest = await new Promise(resolve => {
+    page.on('request', interceptedRequest => {
+        interceptedRequest.abort(); // abort the request, we will replicate it ourselves
+        resolve(interceptedRequest);
+    });
+});
+```
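+
+The listener above resolves with the first request the page sends, whatever it is, and aborts every later one. A slightly more selective variant is sketched below. `captureRequest` and the predicate argument are hypothetical names; the sketch relies only on the `url()`, `continue()` and `abort()` methods of intercepted requests and on `page.on()`/`page.off()`.
+
+```JavaScript
+// Sketch only: resolves with the first request accepted by `isDownloadRequest`
+function captureRequest(page, isDownloadRequest) {
+    return new Promise((resolve) => {
+        const handler = (interceptedRequest) => {
+            if (!isDownloadRequest(interceptedRequest)) {
+                // Let unrelated requests (styles, images, ...) through
+                interceptedRequest.continue();
+                return;
+            }
+            // Stop listening, abort the request and hand it back for replication
+            page.off('request', handler);
+            interceptedRequest.abort();
+            resolve(interceptedRequest);
+        };
+        page.on('request', handler);
+    });
+}
+```
+
+It could be used as `const xRequest = await captureRequest(page, (req) => req.url().includes('/export'));`, where the `/export` fragment is made up for illustration. Once the request is captured, nothing handles intercepted requests anymore, so call `await page.setRequestInterception(false)` if the page should keep loading normally.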
+
+The last thing is to convert the intercepted Puppeteer request into a request-promise options object. We need to have the `request-promise` package installed.
+
+```JavaScript
+import request from 'request-promise';
+```
+
+Since the headers of the intercepted request do not include cookies, we need to add them ourselves.
+
+```JavaScript
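+const options = {
+    encoding: null, // return the body as a Buffer instead of a string
+    method: xRequest.method(),
+    uri: xRequest.url(),
+    body: xRequest.postData(),
+    headers: { ...xRequest.headers() }, // copy, so we can add the Cookie header
+};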
+
+// Add the cookies
+const cookies = await page.cookies();
+options.headers.Cookie = cookies.map((ck) => `${ck.name}=${ck.value}`).join('; ');
+
+// Resend the request
+const response = await request(options);
+```
+
+Now, the response contains the binary data of the downloaded file. It can be saved to the disk, uploaded somewhere, or [submitted with another form]({{@link puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md}}).
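+
+The `request` family of packages, including `request-promise`, has been deprecated since 2020. As an alternative, the replay step can be sketched with the `fetch()` function built into Node.js 18 and newer. `replayRequest` is a hypothetical helper name; it takes the intercepted Puppeteer request and the array returned by `page.cookies()`.
+
+```JavaScript
+// Sketch only: assumes Node.js 18+ for the global fetch()
+async function replayRequest(interceptedRequest, cookies) {
+    const headers = {
+        ...interceptedRequest.headers(),
+        // The intercepted headers carry no cookies, so add them here
+        cookie: cookies.map((ck) => `${ck.name}=${ck.value}`).join('; '),
+    };
+    const response = await fetch(interceptedRequest.url(), {
+        method: interceptedRequest.method(),
+        headers,
+        body: interceptedRequest.postData(), // undefined for requests without a body
+    });
+    if (!response.ok) {
+        throw new Error(`Replayed request failed with status ${response.status}`);
+    }
+    // Collect the binary body into a Buffer
+    return Buffer.from(await response.arrayBuffer());
+}
+```
+
+With the variables used earlier, this would be called as `const fileData = await replayRequest(xRequest, await page.cookies());`.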
diff --git a/content/academy/puppeteer_playwright/common_use_cases/logging_into_a_website.md b/content/academy/puppeteer_playwright/common_use_cases/logging_into_a_website.md
@@ -41,11 +41,11 @@ await page.click('a:has-text("Sign in")');
 await page.waitForLoadState('load');
 
 // Type in the username and continue forward
-await page.type('input[name="username"]', 'academy_playwright_login');
+await page.type('input[name="username"]', 'YOUR-LOGIN-HERE');
 await page.click('input[name="signin"]');
 
 // Type in the password and continue forward
-await page.type('input[name="password"]', 'AcademyIsGreat88');
+await page.type('input[name="password"]', 'YOUR-PASSWORD-HERE');
 await page.click('button[name="verifyPassword"]');
 await page.waitForLoadState('load');
 
@@ -67,11 +67,11 @@ await Promise.all([page.waitForSelector('a[data-ylk*="sign-in"]'), page.click('b
 await Promise.all([page.waitForNavigation(), page.click('a[data-ylk*="sign-in"]')]);
 
 // Type in the username and continue forward
-await page.type('input[name="username"]', 'academy_playwright_login');
+await page.type('input[name="username"]', 'YOUR-LOGIN-HERE');
 await Promise.all([page.waitForNavigation(), page.click('input[name="signin"]')]);
 
 // Type in the password and continue forward
-await page.type('input[name="password"]', 'AcademyIsGreat88');
+await page.type('input[name="password"]', 'YOUR-PASSWORD-HERE');
 await Promise.all([page.waitForNavigation(), page.click('button[name="verifyPassword"]')]);
 
 // Wait for 10 seconds so we can see that we have in fact
@@ -80,7 +80,7 @@ await page.waitForTimeout(10000)
 </marked-tab>
 ```
 
-Great! If you're following along and nothing is wrong with the credentials, you should see that on the final navigated page, you're logged into the **Academy** Yahoo account.
+Great! If you're following along and you've replaced the placeholder credentials with your own, you should see that on the final navigated page, you're logged into your Yahoo account.
 
 ![Successfully logged into Yahoo]({{@asset puppeteer_playwright/common_use_cases/images/logged-in.webp}})
 
@@ -289,10 +289,10 @@ await page.waitForSelector('a:has-text("Sign in")');
 await page.click('a:has-text("Sign in")');
 await page.waitForLoadState('load');
 
-await page.type('input[name="username"]', 'academy_playwright_login');
+await page.type('input[name="username"]', 'YOUR-LOGIN-HERE');
 await page.click('input[name="signin"]');
 
-await page.type('input[name="password"]', 'AcademyIsGreat88');
+await page.type('input[name="password"]', 'YOUR-PASSWORD-HERE');
 await page.click('button[name="verifyPassword"]');
 await page.waitForLoadState('load');
 
@@ -355,10 +355,10 @@ await page.goto('https://www.yahoo.com/');
 await Promise.all([page.waitForSelector('a[data-ylk*="sign-in"]'), page.click('button[name="agree"]')]);
 await Promise.all([page.waitForNavigation(), page.click('a[data-ylk*="sign-in"]')]);
 
-await page.type('input[name="username"]', 'academy_playwright_login');
+await page.type('input[name="username"]', 'YOUR-LOGIN-HERE');
 await Promise.all([page.waitForNavigation(), page.click('input[name="signin"]')]);
 
-await page.type('input[name="password"]', 'AcademyIsGreat88');
+await page.type('input[name="password"]', 'YOUR-PASSWORD-HERE');
 await Promise.all([page.waitForNavigation(), page.click('button[name="verifyPassword"]')]);
 
 const cookies = await page.cookies();
diff --git a/content/academy/puppeteer_playwright/common_use_cases/scraping_iframes.md b/content/academy/puppeteer_playwright/common_use_cases/scraping_iframes.md
@@ -0,0 +1,70 @@
+---
+title: Scraping iFrames
+description: Learn how to scrape information from iFrames using Puppeteer or Playwright.
+menuWeight: 5
+paths:
+    - puppeteer-playwright/common-use-cases/scraping-iframes
+---
+
+# Scraping iFrames
+
+Getting information from inside iFrames is a known pain, especially for new developers. After spending some time on Stack Overflow, you usually find answers like jQuery's `contents()` method or the native `contentDocument` property, which can guide you to the insides of an iframe. But still, getting the right identifiers and holding on to that new context is a little annoying. Fortunately, you can make everything simpler and more straightforward by scraping iFrames with Puppeteer.
+
+## Finding the right `<iframe>`
+
+If you are using basic methods of the page object like `page.evaluate()`, you are actually already working with frames. Behind the scenes, Puppeteer will call `page.mainFrame().evaluate()`, so most of the methods you use with the page object can be used the same way with a frame object. To access frames, you need to loop over the main frame's child frames and identify the one you want to use.
+
+As a simple demonstration, we'll scrape the Twitter widget iFrame from [IMDB](https://www.imdb.com/).
+
+```JavaScript
+import puppeteer from 'puppeteer';
+
+const browser = await puppeteer.launch();
+
+const page = await browser.newPage();
+
+await page.goto('https://www.imdb.com');
+// Give the Twitter widget time to load
+await new Promise((resolve) => setTimeout(resolve, 5000));
+
+let twitterFrame; // this will be populated later by our identified frame
+
+for (const frame of page.mainFrame().childFrames()) {
+    // Here you can use a few identifying methods like url(), name() or title()
+    if (frame.url().includes('twitter')) {
+        console.log('We found the Twitter iframe');
+        // Keep a reference to this frame so we can use it later
+        twitterFrame = frame;
+    }
+}
+
+await browser.close();
+```
+
+If it is hard to identify the iframe you want to access, don't worry. You can already use any Puppeteer method on the frame object to help you identify it, scrape it or manipulate it. You can also go through any nested frames.
+
+```JavaScript
+let twitterFrame;
+
+for (const frame of page.mainFrame().childFrames()) {
+    if (frame.url().includes('twitter')) {
+        for (const nestedFrame of frame.childFrames()) {
+            const tweetList = await nestedFrame.$('.timeline-TweetList');
+            if (tweetList) {
+                console.log('We found the frame with tweet list');
+                twitterFrame = nestedFrame;
+            }
+        }
+    }
+}
+
+Here we used some more advanced techniques to find a nested `<iframe>`. Now that we have it assigned to our `twitterFrame` variable, the hard work is over and we can start working with it (almost) like with a regular page object.
+
+```JavaScript
+const textFeed = await twitterFrame.$$eval('.timeline-Tweet-text', pElements => pElements.map((elem) => elem.textContent));
+
+for (const text of textFeed){
+    console.log(text)
+    console.log('**********')
+}
+```
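+
+The nested loops used earlier to locate the frame only reach two levels deep. When the depth isn't known in advance, a recursive search is one option. The sketch below uses a hypothetical `findFrame` helper and relies only on `childFrames()`; the predicate may be synchronous or asynchronous.
+
+```JavaScript
+// Sketch only: depth-first search through a frame and all of its descendants
+async function findFrame(frame, predicate) {
+    if (await predicate(frame)) return frame;
+    for (const child of frame.childFrames()) {
+        const match = await findFrame(child, predicate);
+        if (match) return match;
+    }
+    // No frame in this subtree satisfied the predicate
+    return null;
+}
+```
+
+For the Twitter widget, the call could be `await findFrame(page.mainFrame(), async (frame) => (await frame.$('.timeline-TweetList')) !== null);`. When a flat list is enough, `page.frames()` also returns every frame attached to the page.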
+
+With a little more effort, we could also follow different links from the feed or even play a video, but that is not within the scope of this article. For all references about page and frame objects (and Puppeteer generally), you should study [the documentation](https://pptr.dev/api/puppeteer.frame). New versions are released quite often, so checking the docs regularly can help you to stay on top of web scraping and automation.
diff --git a/content/academy/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md b/content/academy/puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md
@@ -0,0 +1,74 @@
+---
+title: Submitting a form with a file attachment
+description: Understand how to download a file, attach it to a form using a headless browser in Playwright or Puppeteer, then submit the form.
+menuWeight: 4
+paths:
+    - puppeteer-playwright/common-use-cases/submitting-a-form-with-a-file-attachment
+---
+
+# Submitting a form with a file attachment
+
+We can use Puppeteer or Playwright to simulate submitting a form the same way a human-operated browser would.
+
+## [](#downloading-the-file) Downloading the file
+
+The first thing necessary is to download the file, which can be done using the `request-promise` module. We will also be using the `fs/promises` module to save it to the disk, so make sure they are included.
+
+```JavaScript
+import * as fs from 'fs/promises';
+import request from 'request-promise';
+```
+
+The actual downloading is slightly different for text and binary files. For a text file, it can simply be done like this:
+
+```JavaScript
+const fileData = await request('https://some-site.com/file.txt');
+```
+
+For a binary file, we need to provide an additional parameter so that the response is not interpreted as text:
+
+```JavaScript
+const fileData = await request({
+    uri: 'https://some-site.com/file.pdf',
+    encoding: null
+});
+```
+
+In this case, `fileData` will be a `Buffer` instead of a string.
+
+To use the file in Puppeteer/Playwright, we need to save it to the disk. This can be done using the `fs/promises` module.
+
+```JavaScript
+await fs.writeFile('./file.pdf', fileData);
+```
+
+## [](#submitting-the-form) Submitting the form
+
+The first step necessary is to open the form page in Puppeteer. This can be done as follows:
+
+```JavaScript
+import puppeteer from 'puppeteer';
+
+const browser = await puppeteer.launch();
+const page = await browser.newPage();
+await page.goto('https://some-site.com/file-upload.php');
+```
+
+To fill in any necessary form inputs, we can use the `page.type()` function. This works even in cases when `elem.value = 'value'` is not usable.
+
+```JavaScript
+await page.type('input[name=firstName]', 'John');
+await page.type('input[name=surname]', 'Doe');
+await page.type('input[name=email]', 'john.doe@mail.com');
+```
+
+To add the file to the appropriate input, we first need to find it and then use the [`uploadFile()`](https://pptr.dev/next/api/puppeteer.elementhandle.uploadfile) function.
+
+```JavaScript
+const fileInput = await page.$('input[type=file]');
+await fileInput.uploadFile('./file.pdf');
+```
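+
+`uploadFile()` is specific to Puppeteer. In Playwright, the corresponding call is `page.setInputFiles()`. The sketch below wraps it in a hypothetical `attachFile` helper that first checks the file was actually saved; `page` is assumed to be a Playwright `Page`.
+
+```JavaScript
+import { access } from 'fs/promises';
+
+// Sketch only: `attachFile` is a hypothetical helper, not a Playwright API
+async function attachFile(page, selector, filePath) {
+    // Fail early with a clear error if the file is missing
+    await access(filePath);
+    // Playwright sets the file on the <input type="file"> matching the selector
+    await page.setInputFiles(selector, filePath);
+}
+```
+
+With the file saved earlier, the call would be `await attachFile(page, 'input[type=file]', './file.pdf');`.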
+
+Now we can finally submit the form.
+
+```JavaScript
+await page.click('input[type=submit]');
+```