diff --git a/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md b/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md
index 571eac685..ac399e1c0 100644
--- a/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md
+++ b/sources/academy/webscraping/anti_scraping/mitigation/using_proxies.md
@@ -11,99 +11,110 @@ slug: /anti-scraping/mitigation/using-proxies
 ---

-In the [**Web scraping for beginners**](../../scraping_basics_javascript/crawling/pro_scraping.md) course, we learned about the power of Crawlee, and how it can streamline the development process of web crawlers. You've already seen how powerful the `crawlee` package is; however, what you've been exposed to thus far is only the tip of the iceberg.
+In the [**Web scraping for beginners**](../../scraping_basics_javascript/index.md) course, we learned about the power of Crawlee, and how it can streamline the development process of web crawlers. You've already seen how powerful the `crawlee` package is; however, what you've been exposed to thus far is only the tip of the iceberg.

 Because proxies are so widely used in the scraping world, Crawlee has been equipped with features which make it easy to implement them in an effective way. One of the main functionalities that comes baked into Crawlee is proxy rotation, which is when each request is sent through a different proxy from a proxy pool.

-## Implementing proxies in a scraper {#implementing-proxies}
+## Implementing proxies {#implementing-proxies}

-Let's borrow some scraper code from the end of the [pro-scraping](../../scraping_basics_javascript/crawling/pro_scraping.md) lesson in our **Web Scraping for Beginners** course and paste it into a new file called **proxies.js**. This code enqueues all of the product links on [demo-webstore.apify.org](https://demo-webstore.apify.org)'s on-sale page, then makes a request to each product page and scrapes data about each one:
+Let's build on top of the code which appears at the end of the [Professional scraping](../../scraping_basics_javascript/crawling/pro_scraping.md) lesson of the **Web Scraping for Beginners** course.

-```js
-// crawlee.js
+Let's paste the same code into a new file, `proxies.js`, and make some changes. The code crawls the [Sales](https://warehouse-theme-metal.myshopify.com/collections/sales) page of a sample e-commerce website. It goes through all of the product links, enqueues a request for each product detail page, and scrapes data about all of the products:
+
+```js title=proxies.js
 import { CheerioCrawler, Dataset } from 'crawlee';

 const crawler = new CheerioCrawler({
     requestHandler: async ({ $, request, enqueueLinks }) => {
-        if (request.label === 'START') {
+        console.log(`Fetching URL: ${request.url}`);
+
+        if (request.label === 'start-url') {
             await enqueueLinks({
-                selector: 'a[href*="/product/"]',
+                selector: 'a.product-item__title',
             });
-
-            // When on the START page, we don't want to
-            // extract any data after we extract the links.
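+            // We're on the start URL, so we only enqueue the product detail
+            // pages here; the extraction code below runs for those pages.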
             return;
         }

-        // We copied and pasted the extraction code
-        // from the previous lesson
-        const title = $('h3').text().trim();
-        const price = $('h3 + div').text().trim();
-        const description = $('div[class*="Text_body"]').text().trim();
+        const title = $('h1').text().trim();
+        const vendor = $('a.product-meta__vendor').text().trim();
+        const price = $('span.price').contents()[2].nodeValue;
+        const reviewCount = parseInt($('span.rating__caption').text(), 10);
+        const description = $('div[class*="description"] div.rte').text().trim();

-        // Instead of saving the data to a variable,
-        // we immediately save everything to a file.
         await Dataset.pushData({
             title,
-            description,
+            vendor,
             price,
+            reviewCount,
+            description,
         });
     },
 });

 await crawler.addRequests([{
-    url: 'https://demo-webstore.apify.org/search/on-sale',
-    // By labeling the Request, we can very easily
-    // identify it later in the requestHandler.
-    label: 'START',
+    url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
+    label: 'start-url',
 }]);

 await crawler.run();
 ```

-In order to implement a proxy pool, we will first need some proxies. We'll quickly use the free [proxy scraper](https://apify.com/mstephen190/proxy-scraper) on the Apify platform to get our hands on some quality proxies. Next, we'll need to set up a [`ProxyConfiguration`](https://crawlee.dev/api/core/class/ProxyConfiguration) and configure it with our custom proxies, like so:
+We'll want all the requests to go through proxies. For that, we obviously need proxies! To get some, we can use Matthias Stephens' [free proxy scraper](https://apify.com/mstephen190/proxy-scraper). It can find tens of reliable proxies out of the thousands it scrapes.
+
+Once we have a list of proxies, we can add a [`ProxyConfiguration`](https://crawlee.dev/api/core/class/ProxyConfiguration) and pass it to our crawler.
+
+Proxy pools usually consist of many proxy URLs, but for the sake of simplicity, we'll list just three in this lesson. By the time you read this, they most probably won't work anymore, so be sure to use your own values.

 ```js
-import { ProxyConfiguration } from 'crawlee';
+import { CheerioCrawler, Dataset, ProxyConfiguration } from 'crawlee';

 const proxyConfiguration = new ProxyConfiguration({
     proxyUrls: ['http://45.42.177.37:3128', 'http://43.128.166.24:59394', 'http://51.79.49.178:3128'],
 });
-```

-Awesome, so there's our proxy pool! Usually, a proxy pool is much larger than this; however, a three proxies pool is totally fine for tutorial purposes. Finally, we can pass the `proxyConfiguration` into our crawler's options:
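+// Passing proxyConfiguration to the crawler below tells Crawlee to route
+// every request through one of the proxies from the pool above.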
-
-```js
 const crawler = new CheerioCrawler({
     proxyConfiguration,
     requestHandler: async ({ $, request, enqueueLinks }) => {
-        if (request.label === 'START') {
+        console.log(`Fetching URL: ${request.url}`);
+
+        if (request.label === 'start-url') {
             await enqueueLinks({
-                selector: 'a[href*="/product/"]',
+                selector: 'a.product-item__title',
             });
             return;
         }

-        const title = $('h3').text().trim();
-        const price = $('h3 + div').text().trim();
-        const description = $('div[class*="Text_body"]').text().trim();
+        const title = $('h1').text().trim();
+        const vendor = $('a.product-meta__vendor').text().trim();
+        const price = $('span.price').contents()[2].nodeValue;
+        const reviewCount = parseInt($('span.rating__caption').text(), 10);
+        const description = $('div[class*="description"] div.rte').text().trim();

         await Dataset.pushData({
             title,
-            description,
+            vendor,
             price,
+            reviewCount,
+            description,
         });
     },
 });
+
+await crawler.addRequests([{
+    url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
+    label: 'start-url',
+}]);
+
+await crawler.run();
 ```

-> Note that if you run this code, it may not work, as the proxies could potentially be down/non-operating at the time you are going through this course.
-
-That's it! The crawler will now automatically rotate through the proxies we provided in the `proxyUrls` option.
+The crawler will now automatically rotate through the proxies we provided in the `proxyUrls` array.

-## A bit about debugging proxies {#debugging-proxies}
+## Debugging proxies {#debugging-proxies}

-At the time of writing, our above scraper utilizing our custom proxy pool is working just fine. But how can we check that the scraper is for sure using the proxies we provided it, and more importantly, how can we debug proxies within our scraper? Luckily, within the same `context` object we've been destructuring `$` and `request` out of, there is a `proxyInfo` key as well. `proxyInfo` is an object which includes useful data about the proxy which was used to make the request.
+To check that we're scraping through the proxies, we can get `proxyInfo` from the handler's context, which includes useful data about the proxy used to make the request.
+
+In the code example, we already destructure the context object into `$` and `request`, so we can add `proxyInfo` to the destructured properties as well.

 ```js
 const crawler = new CheerioCrawler({
@@ -118,15 +129,21 @@ const crawler = new CheerioCrawler({
 });
 ```

-After modifying your code to log `proxyInfo` to the console and running the scraper, you're going to see some logs which look like this:
+After modifying the code to log `proxyInfo` and running the scraper, we can see proxy details for each request:

-![proxyInfo being logged by the scraper](./images/proxy-info-logs.png)
+![Sample logs of proxyInfo](./images/proxy-info-logs.png)

-These logs confirm that our proxies are being used and rotated successfully by Crawlee, and can also be used to debug slow or broken proxies.
+These logs confirm that Crawlee uses and automatically rotates the proxies. Such logs can also be useful for debugging slow or broken proxies.

-## Higher level proxy scraping {#higher-level-proxy-scraping}
+## Carefree proxy scraping {#higher-level-proxy-scraping}

+If scraping and managing proxies on your own feels tedious, there are services which do that for you. One of them is [Apify Proxy](https://apify.com/proxy), which provides proxies with both residential and datacenter IP addresses. The integration with Crawlee is seamless, but first you need to install the Apify SDK:
+
+```shell
+npm install apify
+```

-Though we will discuss it more in-depth in future courses, it is still important to mention that Crawlee has integrated support for the Apify SDK, which supports [Apify Proxy](https://apify.com/proxy) - a service that provides access to pools of both residential and datacenter IP addresses. A `proxyConfiguration` using Apify Proxy might look something like this:
+Then you can create the `proxyConfiguration` like this:

 ```js
 import { Actor } from 'apify';
@@ -136,7 +153,7 @@ const proxyConfiguration = await Actor.createProxyConfiguration({
 });
 ```

-Notice that we didn't provide it a list of proxy URLs. This is because the `SHADER` group already serves as our proxy pool (courtesy of Apify Proxy).
+For more information about the integration, refer to the [Apify SDK documentation](https://docs.apify.com/sdk/js/docs/guides/proxy-management).

 ## Next up {#next}

diff --git a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md
index 0aa66f27c..15778d3af 100644
--- a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md
+++ b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md
@@ -14,7 +14,7 @@ import TabItem from '@theme/TabItem';

 ---

-Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../scraping_basics_javascript/data_extraction/index.md). In this lesson, we'll be scraping all the on-sale products from our [Fakestore](https://demo-webstore.apify.org/search/on-sale) website.
+Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../scraping_basics_javascript/data_extraction/index.md). In this lesson, we'll be scraping all the on-sale products from [warehouse-theme-metal.myshopify.com](https://warehouse-theme-metal.myshopify.com/), a sample Shopify website.

 > Most web data extraction cases involve looping through a list of items of some sort.
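+
+To make that pattern concrete before we wire it into Playwright or Puppeteer, here's a rough sketch of what such a loop can look like when run directly in the browser console on the [Sales](https://warehouse-theme-metal.myshopify.com/collections/sales) page. The `.product-item`, `.product-item__title`, and `.price` selectors are the same ones we'll use later in this lesson:
+
+```js
+// Grab every product card currently on the page...
+const productCards = Array.from(document.querySelectorAll('.product-item'));
+
+// ...and map each card to the pieces of data we care about.
+const products = productCards.map((card) => {
+    const name = card.querySelector('.product-item__title').textContent;
+    const price = card.querySelector('.price').lastChild.textContent;
+    return { name, price };
+});
+
+console.log(products);
+```
+
+The rest of this lesson shows how to run this same logic from Playwright and Puppeteer instead of typing it into the console by hand.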
@@ -36,7 +36,7 @@ import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); -await page.goto('https://demo-webstore.apify.org/search/on-sale'); +await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales'); // code will go here @@ -54,7 +54,7 @@ import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); -await page.goto('https://demo-webstore.apify.org/search/on-sale'); +await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales'); // code will go here @@ -82,16 +82,12 @@ We'll be returning a bunch of product objects from this function, which will be ```js const products = await page.evaluate(() => { - const productCards = Array.from(document.querySelectorAll('a[class*="ProductCard_root"]')); + const productCards = Array.from(document.querySelectorAll('.product-item')); return productCards.map((element) => { - const name = element.querySelector('h3[class*="ProductCard_name"]').textContent; - const price = element.querySelector('div[class*="ProductCard_price"]').textContent; - - return { - name, - price, - }; + const name = element.querySelector('.product-item__title').textContent; + const price = element.querySelector('.price').lastChild.textContent; + return { name, price }; }); }); @@ -100,7 +96,20 @@ console.log(products); When we run this code, we see this logged to our console: -![Products logged to the console](./images/log-products.png) +```text +$ node index.js +[ + { + name: 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', + price: '$74.95' + }, + { + name: 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', + price: 'From $1,398.00' + }, + ... +] +``` ## Using jQuery {#using-jquery} @@ -118,19 +127,13 @@ Now, since we're able to use jQuery, let's translate our vanilla JavaScript code await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.6.0.min.js' }); const products = await page.evaluate(() => { - const productCards = Array.from($('a[class*="ProductCard_root"]')); - - return productCards.map((element) => { - const card = $(element); - - const name = card.find('h3[class*="ProductCard_name"]').text(); - const price = card.find('div[class*="ProductCard_price"]').text(); - - return { - name, - price, - }; - }); + const productCards = $('.product-item'); + return productCards.map(function () { + const card = $(this); + const name = card.find('.product-item__title').text(); + const price = card.find('.price').contents().last().text(); + return { name, price }; + }).get(); }); console.log(products); @@ -178,7 +181,7 @@ import { load } from 'cheerio'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); -await page.goto('https://demo-webstore.apify.org/search/on-sale'); +await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales'); const $ = load(await page.content()); @@ -197,7 +200,7 @@ import { load } from 'cheerio'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); -await page.goto('https://demo-webstore.apify.org/search/on-sale'); +await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales'); const $ = load(await page.content()); @@ -214,19 +217,13 @@ Now, to loop through all of the products, we'll make use of the `$` object and l ```js const $ = load(await page.content()); -const productCards = Array.from($('a[class*="ProductCard_root"]')); - -const 
products = productCards.map((element) => { - const card = $(element); - - const name = card.find('h3[class*="ProductCard_name"]').text(); - const price = card.find('div[class*="ProductCard_price"]').text(); - - return { - name, - price, - }; -}); +const productCards = $('.product-item'); +const products = productCards.map(function () { + const card = $(this); + const name = card.find('.product-item__title').text(); + const price = card.find('.price').contents().last().text(); + return { name, price }; +}).get(); console.log(products); ``` @@ -245,23 +242,17 @@ import { load } from 'cheerio'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); -await page.goto('https://demo-webstore.apify.org/search/on-sale'); +await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales'); const $ = load(await page.content()); -const productCards = Array.from($('a[class*="ProductCard_root"]')); - -const products = productCards.map((element) => { - const card = $(element); - - const name = card.find('h3[class*="ProductCard_name"]').text(); - const price = card.find('div[class*="ProductCard_price"]').text(); - - return { - name, - price, - }; -}); +const productCards = $('.product-item'); +const products = productCards.map(function () { + const card = $(this); + const name = card.find('.product-item__title').text(); + const price = card.find('.price').contents().last().text(); + return { name, price }; +}).get(); console.log(products); @@ -278,23 +269,17 @@ import { load } from 'cheerio'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); -await page.goto('https://demo-webstore.apify.org/search/on-sale'); +await page.goto('https://warehouse-theme-metal.myshopify.com/collections/sales'); const $ = load(await page.content()); -const productCards = Array.from($('a[class*="ProductCard_root"]')); - -const products = productCards.map((element) => { - const card = $(element); - - const name = card.find('h3[class*="ProductCard_name"]').text(); - const price = card.find('div[class*="ProductCard_price"]').text(); - - return { - name, - price, - }; -}); +const productCards = $('.product-item'); +const products = productCards.map(function () { + const card = $(this); + const name = card.find('.product-item__title').text(); + const price = card.find('.price').contents().last().text(); + return { name, price }; +}).get(); console.log(products); diff --git a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/images/log-products.png b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/images/log-products.png deleted file mode 100644 index 42acd3c2b..000000000 Binary files a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/images/log-products.png and /dev/null differ