Commit 6902cd1

feat(scraping-iframes)
1 parent f050f40 commit 6902cd1

1 file changed, with 70 additions and 0 deletions.

---
title: Scraping iFrames
description: Learn how to scrape information from iFrames using Puppeteer or Playwright.
menuWeight: 4
paths:
    - puppeteer-playwright/common-use-cases/scraping-iframes
---

# Scraping iFrames

Getting information from inside iFrames is a known pain, especially for new developers. After spending some time on Stack Overflow, you usually find answers like jQuery's `contents()` method or the native `contentDocument` property, which can guide you to the insides of an iframe. Still, getting the right identifiers and holding on to that new context is a little annoying. Fortunately, you can make everything simpler and more straightforward by scraping iFrames with Puppeteer.

## Finding the right `<iframe>`

If you are using basic methods of page objects like `page.evaluate()`, you are actually already working with frames. Behind the scenes, Puppeteer calls `page.mainFrame().evaluate()`, so most of the methods you use with a page object can be used the same way with a frame object. To access frames, you need to loop over the main frame's child frames and identify the one you want to use.
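
Before picking a frame, it can help to print every frame the page contains. The following is a minimal sketch rather than part of the original example: `listFrames` and `printFrameTree` are hypothetical helpers, and they rely only on the `url()`, `name()` and `childFrames()` methods of a frame object.

```JavaScript
// Hypothetical helper: walks a frame and all of its descendants and
// returns one entry per frame with its nesting depth, URL and name.
const listFrames = (frame, depth = 0) => {
    const entry = { depth, url: frame.url(), name: frame.name() };
    const nested = frame.childFrames().flatMap((child) => listFrames(child, depth + 1));
    return [entry, ...nested];
};

// Hypothetical helper: prints the tree, indenting two spaces per level.
const printFrameTree = (rootFrame) => {
    for (const { depth, url, name } of listFrames(rootFrame)) {
        const label = name ? ` (name: ${name})` : '';
        console.log(`${'  '.repeat(depth)}${url}${label}`);
    }
};
```

With a loaded page, this would be called as `printFrameTree(page.mainFrame())`.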

As a simple demonstration, we'll scrape the Twitter widget iFrame from [IMDb](https://www.imdb.com/).

```JavaScript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();

const page = await browser.newPage();

await page.goto('https://www.imdb.com');
// Give the Twitter widget time to load. page.waitFor() is deprecated or
// removed in newer Puppeteer versions, so we use a plain timer instead.
await new Promise((resolve) => setTimeout(resolve, 5000));

let twitterFrame; // this will be populated later by our identified frame

for (const frame of page.mainFrame().childFrames()) {
    // Here you can use a few identifying methods like url(), name(), title()
    if (frame.url().includes('twitter')) {
        console.log('we found the Twitter iframe');
        // we assign this frame to twitterFrame to use it later
        twitterFrame = frame;
    }
}

await browser.close();
```
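
A fixed five-second pause is either wasted time or too short on a slow connection. One alternative is to poll until the frame shows up. This is a sketch under assumptions: `waitForFrameByUrl` is a hypothetical helper, not a Puppeteer API, and the timeout and interval defaults are arbitrary. It uses only `page.frames()`, which returns every frame currently attached to the page (nested ones included), and `frame.url()`.

```JavaScript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: polls page.frames() until a frame whose URL contains
// `urlPart` appears, and throws once `timeout` milliseconds have passed.
const waitForFrameByUrl = async (page, urlPart, { timeout = 10000, interval = 250 } = {}) => {
    const deadline = Date.now() + timeout;
    while (true) {
        const match = page.frames().find((frame) => frame.url().includes(urlPart));
        if (match) return match;
        if (Date.now() >= deadline) {
            throw new Error(`No frame with "${urlPart}" in its URL after ${timeout} ms`);
        }
        await sleep(interval);
    }
};
```

In the example above, `const twitterFrame = await waitForFrameByUrl(page, 'twitter');` could stand in for both the pause and the loop.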

If it is hard to identify the iframe you want to access, don't worry. You can use most Puppeteer page methods on the frame object to help you identify it, scrape it or manipulate it. You can also go through any nested frames.

```JavaScript
let twitterFrame;

for (const frame of page.mainFrame().childFrames()) {
    if (frame.url().includes('twitter')) {
        // Check each frame nested inside the Twitter iframe
        for (const nestedFrame of frame.childFrames()) {
            const tweetList = await nestedFrame.$('.timeline-TweetList');
            if (tweetList) {
                console.log('We found the frame with tweet list');
                twitterFrame = nestedFrame;
            }
        }
    }
}
```
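
The two nested loops only reach frames exactly two levels below the main frame. When the depth is not known in advance, the same idea can be written recursively. A minimal sketch, assuming only `frame.$()` and `frame.childFrames()`; `findFrameWithSelector` is a hypothetical name:

```JavaScript
// Hypothetical helper: depth-first search for the first frame (the starting
// frame included) that contains an element matching `selector`.
const findFrameWithSelector = async (frame, selector) => {
    let element = null;
    try {
        // frame.$() resolves to null when nothing matches.
        element = await frame.$(selector);
    } catch {
        // A frame that detaches mid-search may throw; treat it as "no match".
        element = null;
    }
    if (element) return frame;

    for (const child of frame.childFrames()) {
        const found = await findFrameWithSelector(child, selector);
        if (found) return found;
    }
    return null;
};
```

Called as `await findFrameWithSelector(page.mainFrame(), '.timeline-TweetList')`, it resolves to the matching frame or to `null` when no frame contains the selector.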

The nested loop over `frame.childFrames()` is a more advanced technique for finding a nested `<iframe>`. Now that the frame is assigned to `twitterFrame`, the hard work is over and we can start working with it (almost) like a regular page object.

```JavaScript
// Collect the text of every tweet inside the frame
const textFeed = await twitterFrame.$$eval('.timeline-Tweet-text', (pElements) => pElements.map((elem) => elem.textContent));

for (const text of textFeed) {
    console.log(text);
    console.log('**********');
}
```
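
If no frame was matched earlier, `twitterFrame` is still `undefined` and the `$$eval()` call above throws a `TypeError`. A small guard avoids that. This is only a sketch: `getTexts` is a hypothetical helper, and trimming the text and dropping empty strings is an added assumption about what a clean feed should look like.

```JavaScript
// Hypothetical helper: returns the trimmed text of every element matching
// `selector` inside `frame`, or an empty array when no frame was found.
const getTexts = async (frame, selector) => {
    if (!frame) return [];
    const texts = await frame.$$eval(selector, (elements) => elements.map((elem) => elem.textContent || ''));
    // Clean up on the Node.js side: trim whitespace and drop empty strings.
    return texts.map((text) => text.trim()).filter((text) => text.length > 0);
};
```

With this helper, the extraction step becomes `const textFeed = await getTexts(twitterFrame, '.timeline-Tweet-text');`.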

With a little more effort, we could also follow links from the feed or even play a video, but that is outside the scope of this article. For reference material on page and frame objects (and Puppeteer in general), study [the documentation](https://pub.dev/documentation/puppeteer/latest/puppeteer/Frame-class.html). New versions are released quite often, so checking the docs regularly helps you stay on top of web scraping and automation.
