Commit 75fb4c6

Merge pull request #437 from apify/puppeteer-playwright-use-cases
docs(academy): Puppeteer/Playwright use cases migrations
2 parents ce6dc84 + 060149d commit 75fb4c6

5 files changed, +256 -14 lines changed

content/academy/anti_scraping/mitigation/generating_fingerprints.md

Lines changed: 5 additions & 5 deletions
@@ -30,7 +30,7 @@ const fingerprintGenerator = new FingerprintGenerator({
 });
 
 // Grab a fingerprint from the fingerprint generator
-const { fingerprint } = fingerprintGenerator.getFingerprint({
+const generated = fingerprintGenerator.getFingerprint({
     locales: ["en-US", "en"]
 });
 ```
@@ -65,20 +65,20 @@ const fingerprintGenerator = new FingerprintGenerator({
 });
 
 // Grab a fingerprint
-const { fingerprint } = fingerprintGenerator.getFingerprint({
+const generated = fingerprintGenerator.getFingerprint({
     locales: ["en-US", "en"]
 });
 
 // Create a new browser context, plugging in
 // some values from the fingerprint
 const context = await browser.newContext({
-    userAgent: fingerprint.userAgent,
-    locale: fingerprint.navigator.language,
+    userAgent: generated.fingerprint.userAgent,
+    locale: generated.fingerprint.navigator.language,
 });
 
 // Attach the fingerprint to the newly created
 // browser context
-await fingerprintInjector.attachFingerprintToPlaywright(context, fingerprint);
+await fingerprintInjector.attachFingerprintToPlaywright(context, generated);
 
 // Create a new page and go to Google
 const page = await context.newPage();
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
---
title: Downloading files
description: Learn how to automate the downloading and saving of files to the disk using Puppeteer or Playwright.
menuWeight: 3
paths:
    - puppeteer-playwright/common-use-cases/downloading-files
---

# Downloading files

Downloading a file using Puppeteer can be tricky. On some systems, there can be issues with the usual file saving process that prevent you from doing it the easy way. However, there are different techniques that work (most of the time).

These techniques are only necessary when we don't have a direct file link, which is usually the case when the file being downloaded is based on a more complicated data export.

## [](#setting-up-a-download-path) Setting up a download path

Let's start with the easiest technique. This method tells the browser which folder to save a file into when a click in Puppeteer triggers a download.

```JavaScript
await page._client.send('Page.setDownloadBehavior', { behavior: 'allow', downloadPath: './my-downloads' });
```

We use the mysterious `_client` API, which gives us access to all the functions of the underlying Chrome DevTools Protocol. Basically, it extends Puppeteer's functionality. Then we can download the file by clicking on the button.

```JavaScript
await page.click('.export-button');
```
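
Because `_client` is an underscore-prefixed internal property, it may change between Puppeteer releases. As a minimal sketch, assuming a Puppeteer version where `page.target().createCDPSession()` is available, the same download-path command can be sent through a regular CDP session. The `allowDownloads` name is hypothetical and not part of Puppeteer.

```JavaScript
// Hypothetical helper: sends the same download-path command as above, but
// through a CDP session created with a public Puppeteer call.
async function allowDownloads(page, downloadPath) {
    const client = await page.target().createCDPSession();
    await client.send('Page.setDownloadBehavior', { behavior: 'allow', downloadPath });
    return client;
}

// Usage, before clicking the export button:
// await allowDownloads(page, './my-downloads');
```

The helper returns the session so that it can be reused for other protocol commands.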

Let's wait for one minute. In a real use case, you would instead check the state of the file in the file system.

```JavaScript
// Pause for 60 seconds to give the download time to finish
await new Promise((resolve) => setTimeout(resolve, 60000));
```
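
Checking the state of the file in the file system could look roughly like the sketch below. It polls the download folder until a finished file shows up. The `waitForDownload` name and its options are hypothetical, and the sketch assumes that Chromium marks unfinished downloads with a `.crdownload` suffix.

```JavaScript
import { readdir } from 'fs/promises';

// Hypothetical helper: resolves with the name of the first finished file
// in `dir`, or throws once `timeoutMs` has elapsed.
async function waitForDownload(dir, { timeoutMs = 60000, intervalMs = 500 } = {}) {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
        // The folder may not exist until the download starts
        const fileNames = await readdir(dir).catch(() => []);
        // Assumption: Chromium names unfinished downloads "*.crdownload"
        const finished = fileNames.find((name) => !name.endsWith('.crdownload'));
        if (finished) return finished;
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    throw new Error(`No finished download in ${dir} after ${timeoutMs} ms`);
}

// Usage:
// const fileName = await waitForDownload('./my-downloads');
```

Polling returns as soon as the file is ready instead of always pausing for the full minute.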

To extract the file from the file system into memory, we have to first find its name, and then we can read it.

```JavaScript
import fs from 'fs';

const fileNames = fs.readdirSync('./my-downloads');

// Let's pick the first one
const fileData = fs.readFileSync(`./my-downloads/${fileNames[0]}`);
```

## [](#intercepting-a-file-download-request) Intercepting and replicating a file download request

For this second option, we can trigger the file download, intercept the request going out, and then replicate it to get the actual data. First, we need to enable request interception. This is done using the following line of code:

```JavaScript
await page.setRequestInterception(true);
```

Next, we need to trigger the actual file export. We might need to fill in some form, select an exported file type, etc. In the end, it will look something like this:

```JavaScript
// Deliberately not awaited, so the request listener can be registered right away
page.click('.export-button');
```

We don't need to await this promise since we'll be waiting for the result of this action anyway (the triggered request).

The crucial part is intercepting the request that would result in downloading the file. Since the interception is already enabled, we just need to wait for the request to be sent.

```JavaScript
const xRequest = await new Promise((resolve) => {
    page.on('request', (interceptedRequest) => {
        // Cancel the browser's own request; we will replay it ourselves
        interceptedRequest.abort();
        resolve(interceptedRequest);
    });
});
```
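
The listener above treats the first request it sees as the download. If the page sends other requests in the meantime, a filter helps. The sketch below is one possible variant, not the article's method: `interceptFirstRequest` and its predicate argument are hypothetical names, while `continue()`, `abort()`, `page.on()` and `page.off()` are regular Puppeteer calls.

```JavaScript
// Hypothetical helper: resolves with the first intercepted request that
// satisfies `isDownloadRequest` and lets every other request through.
function interceptFirstRequest(page, isDownloadRequest) {
    return new Promise((resolve) => {
        const onRequest = (interceptedRequest) => {
            if (!isDownloadRequest(interceptedRequest)) {
                interceptedRequest.continue(); // unrelated traffic proceeds untouched
                return;
            }
            page.off('request', onRequest); // stop listening after the first match
            interceptedRequest.abort(); // cancel the browser's own download
            resolve(interceptedRequest);
        };
        page.on('request', onRequest);
    });
}

// Usage (the '/export' path is a made-up example):
// const pending = interceptFirstRequest(page, (req) => req.url().includes('/export'));
// page.click('.export-button');
// const xRequest = await pending;
// await page.setRequestInterception(false);
```

Once the listener is removed, interception should be switched off again as in the last usage line. Otherwise later requests would wait for a handler that no longer exists.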

The last thing is to convert the intercepted Puppeteer request into a `request-promise` options object. We need to have the `request-promise` package installed.

```JavaScript
import request from 'request-promise';
```

The intercepted request does not include cookies, so we need to add them ourselves.

```JavaScript
const options = {
    encoding: null,
    method: xRequest._method,
    uri: xRequest._url,
    body: xRequest._postData,
    headers: xRequest._headers,
};

// Add the cookies
const cookies = await page.cookies();
options.headers.Cookie = cookies.map((ck) => `${ck.name}=${ck.value}`).join('; ');

// Resend the request
const response = await request(options);
```
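
The underscore-prefixed fields used above (`_method`, `_url`, `_postData`, `_headers`) are internal and may not exist in every Puppeteer version. A minimal sketch of the same conversion through the request's public accessors follows. The `buildRequestOptions` name is hypothetical.

```JavaScript
// Hypothetical helper: builds the options object from a Puppeteer request
// through its public accessors and attaches the page's cookies.
function buildRequestOptions(interceptedRequest, cookies) {
    return {
        encoding: null, // keep the response body as a Buffer
        method: interceptedRequest.method(),
        uri: interceptedRequest.url(),
        body: interceptedRequest.postData(),
        headers: {
            ...interceptedRequest.headers(),
            // Cookie pairs are separated by "; " in the Cookie header
            Cookie: cookies.map(({ name, value }) => `${name}=${value}`).join('; '),
        },
    };
}

// Usage:
// const options = buildRequestOptions(xRequest, await page.cookies());
```

Copying the headers into a new object also leaves the intercepted request's own headers unmodified.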

Now, the response contains the binary data of the downloaded file. It can be saved to the disk, uploaded somewhere, or [submitted with another form]({{@link puppeteer_playwright/common_use_cases/submitting_a_form_with_a_file_attachment.md}}).

content/academy/puppeteer_playwright/common_use_cases/logging_into_a_website.md

Lines changed: 9 additions & 9 deletions
@@ -41,11 +41,11 @@ await page.click('a:has-text("Sign in")');
 await page.waitForLoadState('load');
 
 // Type in the username and continue forward
-await page.type('input[name="username"]', 'academy_playwright_login');
+await page.type('input[name="username"]', 'YOUR-LOGIN-HERE');
 await page.click('input[name="signin"]');
 
 // Type in the password and continue forward
-await page.type('input[name="password"]', 'AcademyIsGreat88');
+await page.type('input[name="password"]', 'YOUR-PASSWORD-HERE');
 await page.click('button[name="verifyPassword"]');
 await page.waitForLoadState('load');
 
@@ -67,11 +67,11 @@ await Promise.all([page.waitForSelector('a[data-ylk*="sign-in"]'), page.click('b
 await Promise.all([page.waitForNavigation(), page.click('a[data-ylk*="sign-in"]')]);
 
 // Type in the username and continue forward
-await page.type('input[name="username"]', 'academy_playwright_login');
+await page.type('input[name="username"]', 'YOUR-LOGIN-HERE');
 await Promise.all([page.waitForNavigation(), page.click('input[name="signin"]')]);
 
 // Type in the password and continue forward
-await page.type('input[name="password"]', 'AcademyIsGreat88');
+await page.type('input[name="password"]', 'YOUR-PASSWORD-HERE');
 await Promise.all([page.waitForNavigation(), page.click('button[name="verifyPassword"]')]);
 
 // Wait for 10 seconds so we can see that we have in fact
@@ -80,7 +80,7 @@ await page.waitForTimeout(10000)
 </marked-tab>
 ```
 
-Great! If you're following along and nothing is wrong with the credentials, you should see that on the final navigated page, you're logged into the **Academy** Yahoo account.
+Great! If you're following along and you've replaced the placeholder credentials with your own, you should see that on the final navigated page, you're logged into your Yahoo account.
 
 ![Successfully logged into Yahoo]({{@asset puppeteer_playwright/common_use_cases/images/logged-in.webp}})
 
@@ -289,10 +289,10 @@ await page.waitForSelector('a:has-text("Sign in")');
 await page.click('a:has-text("Sign in")');
 await page.waitForLoadState('load');
 
-await page.type('input[name="username"]', 'academy_playwright_login');
+await page.type('input[name="username"]', 'YOUR-LOGIN-HERE');
 await page.click('input[name="signin"]');
 
-await page.type('input[name="password"]', 'AcademyIsGreat88');
+await page.type('input[name="password"]', 'YOUR-PASSWORD-HERE');
 await page.click('button[name="verifyPassword"]');
 await page.waitForLoadState('load');
 
@@ -355,10 +355,10 @@ await page.goto('https://www.yahoo.com/');
 await Promise.all([page.waitForSelector('a[data-ylk*="sign-in"]'), page.click('button[name="agree"]')]);
 await Promise.all([page.waitForNavigation(), page.click('a[data-ylk*="sign-in"]')]);
 
-await page.type('input[name="username"]', 'academy_playwright_login');
+await page.type('input[name="username"]', 'YOUR-LOGIN-HERE');
 await Promise.all([page.waitForNavigation(), page.click('input[name="signin"]')]);
 
-await page.type('input[name="password"]', 'AcademyIsGreat88');
+await page.type('input[name="password"]', 'YOUR-PASSWORD-HERE');
 await Promise.all([page.waitForNavigation(), page.click('button[name="verifyPassword"]')]);
 
 const cookies = await page.cookies();
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
---
title: Scraping iFrames
description: Learn how to scrape information from iFrames using Puppeteer or Playwright.
menuWeight: 5
paths:
    - puppeteer-playwright/common-use-cases/scraping-iframes
---

# Scraping iFrames

Getting information from inside iFrames is a known pain, especially for new developers. After spending some time on Stack Overflow, you usually find answers like jQuery's `contents()` method or the native `contentDocument` property, which can guide you to the insides of an iframe. But still, getting the right identifiers and holding that new context is a little annoying. Fortunately, you can make everything simpler and more straightforward by scraping iFrames with Puppeteer.

## Finding the right `<iframe>`

If you are using basic methods of the page object like `page.evaluate()`, you are actually already working with frames. Behind the scenes, Puppeteer will call `page.mainFrame().evaluate()`, so most of the methods you use with the page object can be used the same way with a frame object. To access frames, you need to loop over the main frame's child frames and identify the one you want to use.

As a simple demonstration, we'll scrape the Twitter widget iFrame from [IMDB](https://www.imdb.com/).

```JavaScript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();

const page = await browser.newPage();

await page.goto('https://www.imdb.com');
// We need to wait for the Twitter widget to load
await new Promise((resolve) => setTimeout(resolve, 5000));

let twitterFrame; // this will be populated later by our identified frame

for (const frame of page.mainFrame().childFrames()) {
    // Here you can use a few identifying methods like url(), name(), title()
    if (frame.url().includes('twitter')) {
        console.log('we found the Twitter iframe');
        // We assign this frame to twitterFrame to use it later
        twitterFrame = frame;
    }
}

await browser.close();
```

If it is hard to identify the iframe you want to access, don't worry. You can use any Puppeteer method on the frame object to help you identify it, scrape it, or manipulate it. You can also go through any nested frames.

```JavaScript
let twitterFrame;

for (const frame of page.mainFrame().childFrames()) {
    if (frame.url().includes('twitter')) {
        for (const nestedFrame of frame.childFrames()) {
            const tweetList = await nestedFrame.$('.timeline-TweetList');
            if (tweetList) {
                console.log('We found the frame with tweet list');
                twitterFrame = nestedFrame;
            }
        }
    }
}
```

Here we used some more advanced techniques to find a nested `<iframe>`. Now that it is assigned to our `twitterFrame` variable, the hard work is over and we can start working with it (almost) like with a regular page object.

```JavaScript
const textFeed = await twitterFrame.$$eval('.timeline-Tweet-text', (pElements) => pElements.map((elem) => elem.textContent));

for (const text of textFeed) {
    console.log(text);
    console.log('**********');
}
```
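
The nested loops shown earlier only reach two levels deep. As a rough sketch, the same search can be written recursively so that it works at any depth. The `findFrame` name is hypothetical. It relies only on `childFrames()`, which the examples above already use.

```JavaScript
// Hypothetical helper: depth-first search through a frame and all of its
// descendants; resolves with the first frame that matches, otherwise null.
async function findFrame(frame, matches) {
    if (await matches(frame)) return frame;
    for (const child of frame.childFrames()) {
        const found = await findFrame(child, matches);
        if (found) return found;
    }
    return null;
}

// Usage:
// const twitterFrame = await findFrame(
//     page.mainFrame(),
//     async (frame) => Boolean(await frame.$('.timeline-TweetList')),
// );
```

The predicate may be synchronous or asynchronous, so a URL check and a selector lookup both fit.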

With a little more effort, we could also follow different links from the feed or even play a video, but that is not within the scope of this article. For all references about page and frame objects (and Puppeteer generally), you should study [the documentation](https://pub.dev/documentation/puppeteer/latest/puppeteer/Frame-class.html). New versions are released quite often, so checking the docs regularly can help you to stay on top of web scraping and automation.
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
---
title: Submitting a form with a file attachment
description: Understand how to download a file, attach it to a form using a headless browser in Playwright or Puppeteer, then submit the form.
menuWeight: 4
paths:
    - puppeteer-playwright/common-use-cases/submitting-a-form-with-a-file-attachment
---

# Submitting a form with a file attachment

We can use Puppeteer or Playwright to simulate submitting a form the same way a human-operated browser would.

## [](#downloading-the-file) Downloading the file

The first thing necessary is to download the file, which can be done using the `request-promise` module. We will also be using the `fs/promises` module to save it to the disk, so make sure both are imported.

```JavaScript
import * as fs from 'fs/promises';
import request from 'request-promise';
```

The actual downloading is slightly different for text and binary files. For a text file, it can simply be done like this:

```JavaScript
const fileData = await request('https://some-site.com/file.txt');
```

For a binary data file, we need to provide an additional parameter so that the response is not interpreted as text:

```JavaScript
const fileData = await request({
    uri: 'https://some-site.com/file.pdf',
    encoding: null,
});
```

In this case, `fileData` will be a `Buffer` instead of a string.
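
The `request-promise` package has been deprecated, so here is a hedged alternative rather than the article's method: the same binary download sketched with the global `fetch` that ships with Node.js 18 and later. The `downloadFile` name and its optional `fetchImpl` parameter are hypothetical.

```JavaScript
// Hypothetical helper: downloads `url` and resolves with its body as a Buffer.
// `fetchImpl` defaults to the global fetch available in Node.js 18+.
async function downloadFile(url, fetchImpl = fetch) {
    const response = await fetchImpl(url);
    if (!response.ok) {
        throw new Error(`Download failed with status ${response.status}`);
    }
    // arrayBuffer() keeps the raw bytes, so binary files are not decoded as text
    return Buffer.from(await response.arrayBuffer());
}

// Usage:
// const fileData = await downloadFile('https://some-site.com/file.pdf');
```

The result is a `Buffer`, so it can be passed to `fs.writeFile()` in the same way as the `request-promise` response.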

To use the file in Puppeteer/Playwright, we need to save it to the disk. This can be done using the `fs/promises` module.

```JavaScript
await fs.writeFile('./file.pdf', fileData);
```

## [](#submitting-the-form) Submitting the form

The first step is to open the form page in Puppeteer. This can be done as follows:

```JavaScript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://some-site.com/file-upload.php');
```

To fill in any necessary form inputs, we can use the `page.type()` function. This works even in cases when `elem.value = 'value'` is not usable.

```JavaScript
await page.type('input[name=firstName]', 'John');
await page.type('input[name=surname]', 'Doe');
await page.type('input[name=email]', '[email protected]');
```

To add the file to the appropriate input, we first need to find it and then use the [`uploadFile()`](https://pptr.dev/next/api/puppeteer.elementhandle.uploadfile) function.

```JavaScript
const fileInput = await page.$('input[type=file]');
await fileInput.uploadFile('./file.pdf');
```
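
The upload step above is specific to Puppeteer. In Playwright, the corresponding call is `page.setInputFiles()`. As a minimal sketch, a small wrapper can pick whichever call the given `page` supports. The `attachFile` name is hypothetical.

```JavaScript
// Hypothetical helper: attaches a file with whichever library created `page`.
async function attachFile(page, selector, filePath) {
    if (typeof page.setInputFiles === 'function') {
        // Playwright: the page sets the files on the matching input directly
        await page.setInputFiles(selector, filePath);
        return;
    }
    // Puppeteer: find the input first, then upload through the element handle
    const fileInput = await page.$(selector);
    if (!fileInput) throw new Error(`No element matches ${selector}`);
    await fileInput.uploadFile(filePath);
}

// Usage:
// await attachFile(page, 'input[type=file]', './file.pdf');
```

Checking for the method at runtime keeps the rest of the script identical for both libraries.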

Now we can finally submit the form.

```JavaScript
await page.click('input[type=submit]');
```
