Commit b98170e

Merge pull request #436 from apify/migrate-tutorials
feat(scraping-shadow-doms)
2 parents 1471dd4 + 5bfc2e4 commit b98170e

2 files changed: +70 -0 lines changed

@@ -0,0 +1,70 @@
---
title: Scraping sites with a shadow DOM
description: The shadow DOM enables the isolation of web components, but causes problems for those building web scrapers. Here's an easy workaround.
menuWeight: 4
paths:
    - tutorials/scraping-shadow-doms
---

# [](#scraping-shadow-doms) Scraping sites with a shadow DOM

Every web page is represented by an HTML DOM, a tree-like structure consisting of HTML elements (e.g. paragraphs, images, videos) and text. [Shadow DOM](https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM) allows separate DOM trees to be attached to the main DOM while remaining isolated in terms of CSS inheritance and JavaScript DOM manipulation. The CSS and JavaScript code of separate shadow DOM components does not clash, but the downside is that you can't easily access the content from outside.

Let's take a look at this page: [alodokter.com](https://www.alodokter.com/). If you click on the menu and open Chrome DevTools, you will see that the menu tree is attached to the main DOM as a shadow DOM under the element `<top-navbar-view id="top-navbar-view">`.

![Shadow root of the top-navbar-view custom element]({{@asset tutorials/images/shadow.webp}})

The rest of the content is rendered the same way. This makes it hard to scrape, because `document.body.innerText`, `document.getElementsByTagName('a')`, and similar calls on the main document all return empty results.

The content of the menu can be accessed only via the [`shadowRoot`](https://developer.mozilla.org/en-US/docs/Web/API/ShadowRoot) property. If you use jQuery, you can do the following:

```JavaScript
// Find the element that hosts the menu and take its shadow root.
const shadowRoot = document.getElementById('top-navbar-view').shadowRoot;

// Create a copy of its HTML and use jQuery to find the links.
const links = $(shadowRoot.innerHTML).find('a');

// Get URLs from the link elements. jQuery's map() passes (index, element)
// and returns a jQuery object, so get() converts it to a plain array.
const urls = links.map((i, el) => el.href).get();
```
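
If jQuery is not available on the page, the same lookup can be sketched with plain DOM calls, because a `ShadowRoot` supports `querySelectorAll()` directly. The helper name `getShadowLinks` is hypothetical and not part of the original tutorial. The sketch assumes the shadow root is open (for a closed one, `shadowRoot` is `null`) and it covers only one level of shadow DOM.

```JavaScript
// Hypothetical helper: returns the href of every link inside the
// shadow root attached to `host`, or an empty array if there is none.
const getShadowLinks = (host) => {
    // Plain elements and closed shadow roots expose `shadowRoot` as null.
    if (!host || !host.shadowRoot) return [];
    // Query the shadow root itself, so no HTML copy is needed.
    return Array.from(host.shadowRoot.querySelectorAll('a'), (a) => a.href);
};

// Only runs in a browser, where `document` exists.
if (typeof document !== 'undefined') {
    console.log(getShadowLinks(document.getElementById('top-navbar-view')));
}
```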

However, this isn't very convenient, because you have to find the root element of each component you want to work with, and you can't easily take advantage of all the scripts and tools you already have.

Instead, we can replace the content of each element containing a shadow DOM with the HTML of that shadow DOM.

```JavaScript
// Iterate over all elements in the main DOM.
for (const el of document.getElementsByTagName('*')) {
    // If the element contains a shadow root, replace its
    // content with the HTML of the shadow DOM.
    if (el.shadowRoot) el.innerHTML = el.shadowRoot.innerHTML;
}
```

After you run this, you can access all the elements and content easily using jQuery or plain JavaScript. The downside is that it breaks all the interactive components, because you create a new copy of the shadow DOM HTML content without the JavaScript code and CSS attached. This must therefore be done only after all the content has been rendered.
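
One way to wait for rendering is to poll until something inside the shadow root exists, and only then flatten the page. This is only a sketch under stated assumptions: `waitFor` is a hypothetical helper, the timeout values are arbitrary, and the presence of a link inside `top-navbar-view` is an assumed readiness signal for this particular page.

```JavaScript
// Hypothetical helper: resolves once `isReady()` returns true,
// or rejects after roughly `timeoutMs` milliseconds.
const waitFor = (isReady, { timeoutMs = 10000, intervalMs = 100 } = {}) => {
    return new Promise((resolve, reject) => {
        const startedAt = Date.now();
        const check = () => {
            if (isReady()) return resolve();
            if (Date.now() - startedAt >= timeoutMs) {
                return reject(new Error('Timed out waiting for content'));
            }
            setTimeout(check, intervalMs);
        };
        check();
    });
};

// Only runs in a browser, where `document` exists.
if (typeof document !== 'undefined') {
    // Assumed readiness signal: the menu's shadow root contains a link.
    const menuIsRendered = () => {
        const host = document.getElementById('top-navbar-view');
        return Boolean(host && host.shadowRoot && host.shadowRoot.querySelector('a'));
    };

    waitFor(menuIsRendered)
        .then(() => {
            for (const el of document.getElementsByTagName('*')) {
                if (el.shadowRoot) el.innerHTML = el.shadowRoot.innerHTML;
            }
        })
        .catch((err) => console.error(err.message));
}
```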

Some websites contain shadow DOMs nested inside other shadow DOMs. In these cases, we must replace them with HTML recursively:

```JavaScript
// Returns the HTML of the given shadow DOM.
const getShadowDomHtml = (shadowRoot) => {
    let shadowHTML = '';
    for (const el of shadowRoot.childNodes) {
        // Text nodes carry a nodeValue, elements carry outerHTML.
        // The empty-string fallback avoids appending "undefined".
        shadowHTML += el.nodeValue || el.outerHTML || '';
    }
    return shadowHTML;
};

// Recursively replaces shadow DOMs with their HTML.
const replaceShadowDomsWithHtml = (rootElement) => {
    for (const el of rootElement.querySelectorAll('*')) {
        if (el.shadowRoot) {
            // Flatten nested shadow DOMs first, then append the result
            // to the host element's own content.
            replaceShadowDomsWithHtml(el.shadowRoot);
            el.innerHTML += getShadowDomHtml(el.shadowRoot);
        }
    }
};

replaceShadowDomsWithHtml(document.body);
```
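
When the page has to stay interactive, a non-destructive alternative is to leave the DOM untouched and search through the shadow roots instead. The following is a minimal sketch of that idea and is not part of the original tutorial: `querySelectorAllDeep` is a hypothetical helper, and it can only reach open shadow roots.

```JavaScript
// Hypothetical helper: collects elements matching `selector` in `root`
// and in every open shadow root below it, without modifying the page.
const querySelectorAllDeep = (root, selector) => {
    // Matches in the current tree (document, element or shadow root).
    const matches = Array.from(root.querySelectorAll(selector));
    // Descend into each shadow root hosted by an element of this tree.
    for (const el of root.querySelectorAll('*')) {
        if (el.shadowRoot) {
            matches.push(...querySelectorAllDeep(el.shadowRoot, selector));
        }
    }
    return matches;
};

// Only runs in a browser, where `document` exists.
if (typeof document !== 'undefined') {
    const urls = querySelectorAllDeep(document.body, 'a').map((a) => a.href);
    console.log(urls);
}
```

Because nothing is copied, the returned elements are the live ones, so event listeners and component styles keep working. The trade-off is that existing scripts relying on `document.querySelectorAll()` or jQuery selectors have to call the helper instead.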
