diff --git a/sources/academy/webscraping/scraping_basics_javascript2/_exercises.mdx b/sources/academy/webscraping/scraping_basics/_exercises.mdx
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_javascript2/_exercises.mdx
rename to sources/academy/webscraping/scraping_basics/_exercises.mdx
diff --git a/sources/academy/webscraping/scraping_basics_python/images/actor-file-structure.webp b/sources/academy/webscraping/scraping_basics/images/actor-file-structure.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/actor-file-structure.webp
rename to sources/academy/webscraping/scraping_basics/images/actor-file-structure.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/actor-input-proxies.webp b/sources/academy/webscraping/scraping_basics/images/actor-input-proxies.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/actor-input-proxies.webp
rename to sources/academy/webscraping/scraping_basics/images/actor-input-proxies.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/actor-input.webp b/sources/academy/webscraping/scraping_basics/images/actor-input.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/actor-input.webp
rename to sources/academy/webscraping/scraping_basics/images/actor-input.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/actor-output.webp b/sources/academy/webscraping/scraping_basics/images/actor-output.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/actor-output.webp
rename to sources/academy/webscraping/scraping_basics/images/actor-output.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/actor-schedule.webp b/sources/academy/webscraping/scraping_basics/images/actor-schedule.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/actor-schedule.webp
rename to sources/academy/webscraping/scraping_basics/images/actor-schedule.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/child-elements.png b/sources/academy/webscraping/scraping_basics/images/child-elements.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/child-elements.png
rename to sources/academy/webscraping/scraping_basics/images/child-elements.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/csv-example.png b/sources/academy/webscraping/scraping_basics/images/csv-example.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/csv-example.png
rename to sources/academy/webscraping/scraping_basics/images/csv-example.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/csv-sheets.png b/sources/academy/webscraping/scraping_basics/images/csv-sheets.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/csv-sheets.png
rename to sources/academy/webscraping/scraping_basics/images/csv-sheets.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/csv.png b/sources/academy/webscraping/scraping_basics/images/csv.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/csv.png
rename to sources/academy/webscraping/scraping_basics/images/csv.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/dataset-item.png b/sources/academy/webscraping/scraping_basics/images/dataset-item.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/dataset-item.png
rename to sources/academy/webscraping/scraping_basics/images/dataset-item.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-console-textcontent.png b/sources/academy/webscraping/scraping_basics/images/devtools-console-textcontent.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-console-textcontent.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-console-textcontent.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-console-variable.png b/sources/academy/webscraping/scraping_basics/images/devtools-console-variable.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-console-variable.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-console-variable.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-console.png b/sources/academy/webscraping/scraping_basics/images/devtools-console.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-console.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-console.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-element-selection.png b/sources/academy/webscraping/scraping_basics/images/devtools-element-selection.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-element-selection.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-element-selection.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-elements-tab.png b/sources/academy/webscraping/scraping_basics/images/devtools-elements-tab.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-elements-tab.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-elements-tab.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-cnn.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-cnn.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-cnn.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-cnn.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-fandom.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-fandom.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-fandom.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-fandom.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-fifa.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-fifa.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-fifa.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-fifa.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-guardian1.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-guardian1.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-guardian1.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-guardian1.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-guardian2.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-guardian2.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-guardian2.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-guardian2.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-shein.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-shein.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-shein.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-shein.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-wikipedia.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-wikipedia.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-wikipedia.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-wikipedia.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-extracting-price.png b/sources/academy/webscraping/scraping_basics/images/devtools-extracting-price.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-extracting-price.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-extracting-price.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-extracting-text.png b/sources/academy/webscraping/scraping_basics/images/devtools-extracting-text.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-extracting-text.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-extracting-text.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-extracting-title.png b/sources/academy/webscraping/scraping_basics/images/devtools-extracting-title.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-extracting-title.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-extracting-title.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-hover-product.png b/sources/academy/webscraping/scraping_basics/images/devtools-hover-product.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-hover-product.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-hover-product.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-hover-queryselector.png b/sources/academy/webscraping/scraping_basics/images/devtools-hover-queryselector.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-hover-queryselector.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-hover-queryselector.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-hover-queryselectorall.png b/sources/academy/webscraping/scraping_basics/images/devtools-hover-queryselectorall.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-hover-queryselectorall.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-hover-queryselectorall.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-hover.png b/sources/academy/webscraping/scraping_basics/images/devtools-hover.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-hover.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-hover.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-product-details.png b/sources/academy/webscraping/scraping_basics/images/devtools-product-details.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-product-details.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-product-details.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-product-list.png b/sources/academy/webscraping/scraping_basics/images/devtools-product-list.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-product-list.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-product-list.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-product-title.png b/sources/academy/webscraping/scraping_basics/images/devtools-product-title.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-product-title.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-product-title.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-queryselector.webp b/sources/academy/webscraping/scraping_basics/images/devtools-queryselector.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-queryselector.webp
rename to sources/academy/webscraping/scraping_basics/images/devtools-queryselector.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-warehouse.png b/sources/academy/webscraping/scraping_basics/images/devtools-warehouse.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-warehouse.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-warehouse.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-wikipedia.png b/sources/academy/webscraping/scraping_basics/images/devtools-wikipedia.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-wikipedia.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-wikipedia.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/h1.png b/sources/academy/webscraping/scraping_basics/images/h1.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/h1.png
rename to sources/academy/webscraping/scraping_basics/images/h1.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/pdp.png b/sources/academy/webscraping/scraping_basics/images/pdp.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/pdp.png
rename to sources/academy/webscraping/scraping_basics/images/pdp.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/product-item.png b/sources/academy/webscraping/scraping_basics/images/product-item.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/product-item.png
rename to sources/academy/webscraping/scraping_basics/images/product-item.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/refactoring.gif b/sources/academy/webscraping/scraping_basics/images/refactoring.gif
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/refactoring.gif
rename to sources/academy/webscraping/scraping_basics/images/refactoring.gif
diff --git a/sources/academy/webscraping/scraping_basics_python/images/scraping.webp b/sources/academy/webscraping/scraping_basics/images/scraping.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/scraping.webp
rename to sources/academy/webscraping/scraping_basics/images/scraping.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/variants-js.gif b/sources/academy/webscraping/scraping_basics/images/variants-js.gif
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/variants-js.gif
rename to sources/academy/webscraping/scraping_basics/images/variants-js.gif
diff --git a/sources/academy/webscraping/scraping_basics_python/images/variants.png b/sources/academy/webscraping/scraping_basics/images/variants.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/variants.png
rename to sources/academy/webscraping/scraping_basics/images/variants.png
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/01_devtools_inspecting.md b/sources/academy/webscraping/scraping_basics_javascript2/01_devtools_inspecting.md
index 2540bfd21b..3df10ee4a2 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/01_devtools_inspecting.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/01_devtools_inspecting.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/devtools-inspecting
 unlisted: true
 ---

-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';

 **In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website.**

@@ -28,11 +28,11 @@ Google Chrome is currently the most popular browser, and many others use the sam

 Now let's peek behind the scenes of a real-world website—say, Wikipedia. We'll open Google Chrome and visit [wikipedia.org](https://www.wikipedia.org/). Then, let's press **F12**, or right-click anywhere on the page and select **Inspect**.

-![Wikipedia with Chrome DevTools open](./images/devtools-wikipedia.png)
+![Wikipedia with Chrome DevTools open](../scraping_basics/images/devtools-wikipedia.png)

 Websites are built with three main technologies: HTML, CSS, and JavaScript. In the **Elements** tab, DevTools shows the HTML and CSS of the current page:

-![Elements tab in Chrome DevTools](./images/devtools-elements-tab.png)
+![Elements tab in Chrome DevTools](../scraping_basics/images/devtools-elements-tab.png)

 :::warning Screen adaptations

@@ -62,17 +62,17 @@ While HTML and CSS describe what the browser should display, JavaScript adds int
 If you don't see it, press ESC to toggle the Console. Running commands in the Console lets us manipulate the loaded page—we’ll try this shortly.

-![Console in Chrome DevTools](./images/devtools-console.png)
+![Console in Chrome DevTools](../scraping_basics/images/devtools-console.png)

 ## Selecting an element

 In the top-left corner of DevTools, let's find the icon with an arrow pointing to a square.

-![Chrome DevTools element selection tool](./images/devtools-element-selection.png)
+![Chrome DevTools element selection tool](../scraping_basics/images/devtools-element-selection.png)

 We'll click the icon and hover your cursor over Wikipedia's subtitle, **The Free Encyclopedia**. As we move our cursor, DevTools will display information about the HTML element under it. We'll click on the subtitle. In the **Elements** tab, DevTools will highlight the HTML element that represents the subtitle.

-![Chrome DevTools element hover](./images/devtools-hover.png)
+![Chrome DevTools element hover](../scraping_basics/images/devtools-hover.png)

 The highlighted section should look something like this:

@@ -108,7 +108,7 @@ We won't be creating Node.js scrapers just yet. Let's first get familiar with wh
 In the **Elements** tab, with the subtitle element highlighted, let's right-click the element to open the context menu. There, we'll choose **Store as global variable**. The **Console** should appear, with a `temp1` variable ready.

-![Global variable in Chrome DevTools Console](./images/devtools-console-variable.png)
+![Global variable in Chrome DevTools Console](../scraping_basics/images/devtools-console-variable.png)

 The Console allows us to run code in the context of the loaded page. We can use it to play around with elements.

@@ -132,7 +132,7 @@ temp1.textContent = 'Hello World!';

 When we change elements in the Console, those changes reflect immediately on the page!

-![Changing textContent in Chrome DevTools Console](./images/devtools-console-textcontent.png)
+![Changing textContent in Chrome DevTools Console](../scraping_basics/images/devtools-console-textcontent.png)

 But don't worry—we haven't hacked Wikipedia. The change only happens in our browser. If we reload the page, the change will disappear. This, however, is an easy way to craft a screenshot with fake content. That's why screenshots shouldn't be trusted as evidence.

@@ -161,7 +161,7 @@ You're looking for an [`img`](https://developer.mozilla.org/en-US/docs/Web/HTML/
 1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
 1. In the console, type `temp1.src` and hit **Enter**.

-   ![DevTools exercise result](./images/devtools-exercise-fifa.png)
+   ![DevTools exercise result](../scraping_basics/images/devtools-exercise-fifa.png)

 </details>

@@ -178,6 +178,6 @@ Open a news website, such as [CNN](https://cnn.com). Use the Console to change t
 1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
 1. In the console, type `temp1.textContent = 'Something something'` and hit **Enter**.

-   ![DevTools exercise result](./images/devtools-exercise-cnn.png)
+   ![DevTools exercise result](../scraping_basics/images/devtools-exercise-cnn.png)

 </details>
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/02_devtools_locating_elements.md b/sources/academy/webscraping/scraping_basics_javascript2/02_devtools_locating_elements.md
index 0796418c9e..f8a86f124d 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/02_devtools_locating_elements.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/02_devtools_locating_elements.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/devtools-locating-elements
 unlisted: true
 ---

-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';

 **In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website.**

@@ -30,17 +30,17 @@ That said, we designed all the additional exercises to work with live websites.
 As mentioned in the previous lesson, before building a scraper, we need to understand structure of the target page and identify the specific elements our program should extract. Let's figure out how to select details for each product on the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales).

-![Warehouse store with DevTools open](./images/devtools-warehouse.png)
+![Warehouse store with DevTools open](../scraping_basics/images/devtools-warehouse.png)

 The page displays a grid of product cards, each showing a product's title and picture. Let's open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. We'll highlight it in the **Elements** tab by clicking on it.

-![Selecting an element with DevTools](./images/devtools-product-title.png)
+![Selecting an element with DevTools](../scraping_basics/images/devtools-product-title.png)

 Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more.

 In the **Elements** tab, we'll move our cursor up from the `a` element containing the subwoofer's title. On the way, we'll hover over each element until we highlight the entire product card. Alternatively, we can use the arrow-up key. The `div` element we land on is the **parent element**, and all nested elements are its **child elements**.

-![Selecting an element with hover](./images/devtools-hover-product.png)
+![Selecting an element with hover](../scraping_basics/images/devtools-hover-product.png)

 At this stage, we could use the **Store as global variable** option to send the element to the **Console**. While helpful for manual inspection, this isn't something a program can do.

@@ -64,7 +64,7 @@ document.querySelector('.product-item');

 It will return the HTML element for the first product card in the listing:

-![Using querySelector() in DevTools Console](./images/devtools-queryselector.webp)
+![Using querySelector() in DevTools Console](../scraping_basics/images/devtools-queryselector.webp)

 CSS selectors can get quite complex, but the basics are enough to scrape most of the Warehouse store. Let's cover two simple types and how they can combine.

@@ -114,13 +114,13 @@ The product card has four classes: `product-item`, `product-item--vertical`, `1/
 This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements.

 In the **Elements** tab, we can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after.

-![Overview of all the product cards in DevTools](./images/devtools-product-list.png)
+![Overview of all the product cards in DevTools](../scraping_basics/images/devtools-product-list.png)

 ## Locating all product cards

 In the **Console**, hovering our cursor over objects representing HTML elements highlights the corresponding elements on the page. This way we can verify that when we query `.product-item`, the result represents the JBL Flip speaker—the first product card in the list.

-![Highlighting a querySelector() result](./images/devtools-hover-queryselector.png)
+![Highlighting a querySelector() result](../scraping_basics/images/devtools-hover-queryselector.png)

 But what if we want to scrape details about the Sony subwoofer we inspected earlier? For that, we need a method that selects more than just the first match: [`querySelectorAll()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll). As the name suggests, it takes a CSS selector string and returns all matching HTML elements. Let's type this into the **Console**:

@@ -132,7 +132,7 @@ The returned value is a [`NodeList`](https://developer.mozilla.org/en-US/docs/We
 We'll expand the result by clicking the small arrow, then hover our cursor over the third element in the list. Indexing starts at 0, so the third element is at index 2. There it is—the product card for the subwoofer!

-![Highlighting a querySelectorAll() result](./images/devtools-hover-queryselectorall.png)
+![Highlighting a querySelectorAll() result](../scraping_basics/images/devtools-hover-queryselectorall.png)

 To save the subwoofer in a variable for further inspection, we can use index access with brackets, just like with regular JavaScript arrays:

@@ -151,7 +151,7 @@ Even though we're just playing in the browser's **Console**, we're inching close

 On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use CSS selectors in the **Console** to list the HTML elements representing headings of the colored boxes (including the grey ones).

-![Wikipedia's Main Page headings](./images/devtools-exercise-wikipedia.png)
+![Wikipedia's Main Page headings](../scraping_basics/images/devtools-exercise-wikipedia.png)

 <details>
   <summary>Solution</summary>

@@ -169,7 +169,7 @@ On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use

 Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) category. In the **Console**, use CSS selectors to list all HTML elements representing the products.

-![Products in Shein's Jewelry & Accessories category](./images/devtools-exercise-shein.png)
+![Products in Shein's Jewelry & Accessories category](../scraping_basics/images/devtools-exercise-shein.png)

 <details>
   <summary>Solution</summary>

@@ -194,7 +194,7 @@ Learn about the [descendant combinator](https://developer.mozilla.org/en-US/docs

 :::

-![Articles on Guardian's page about F1](./images/devtools-exercise-guardian1.png)
+![Articles on Guardian's page about F1](../scraping_basics/images/devtools-exercise-guardian1.png)

 <details>
   <summary>Solution</summary>

diff --git a/sources/academy/webscraping/scraping_basics_javascript2/03_devtools_extracting_data.md b/sources/academy/webscraping/scraping_basics_javascript2/03_devtools_extracting_data.md
index aeb6fc7ed6..4acffaf356 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/03_devtools_extracting_data.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/03_devtools_extracting_data.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/devtools-extracting-data
 unlisted: true
 ---

-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';

 **In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.**

@@ -31,7 +31,7 @@ subwoofer.textContent;

 That indeed outputs all the text, but in a form which would be hard to break down to relevant pieces.

-![Printing text content of the parent element](./images/devtools-extracting-text.png)
+![Printing text content of the parent element](../scraping_basics/images/devtools-extracting-text.png)

 We'll need to first locate relevant child elements and extract the data from each of them individually.

@@ -39,7 +39,7 @@ We'll need to first locate relevant child elements and extract the data from eac

 We'll use the **Elements** tab of DevTools to inspect all child elements of the product card for the Sony subwoofer. We can see that the title of the product is inside an `a` element with several classes. From those the `product-item__title` seems like a great choice to locate the element.

-![Finding child elements](./images/devtools-product-details.png)
+![Finding child elements](../scraping_basics/images/devtools-product-details.png)

 Browser JavaScript represents HTML elements as [Element](https://developer.mozilla.org/en-US/docs/Web/API/Element) objects. Among properties we've already played with, such as `textContent` or `outerHTML`, it also has the [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Element/querySelector) method. Here the method looks for matches only within children of the element:

@@ -50,13 +50,13 @@ title.textContent;

 Notice we're calling `querySelector()` on the `subwoofer` variable, not `document`. And just like this, we've scraped our first piece of data! We've extracted the product title:

-![Extracting product title](./images/devtools-extracting-title.png)
+![Extracting product title](../scraping_basics/images/devtools-extracting-title.png)

 ## Extracting price

 To figure out how to get the price, we'll use the **Elements** tab of DevTools again. We notice there are two prices, a regular price and a sale price. For the purposes of watching prices we'll need the sale price. Both are `span` elements with the `price` class.

-![Finding child elements](./images/devtools-product-details.png)
+![Finding child elements](../scraping_basics/images/devtools-product-details.png)

 We could either rely on the fact that the sale price is likely to be always the one which is highlighted, or that it's always the first price. For now we'll rely on the later and we'll let `querySelector()` to simply return the first result:

@@ -67,7 +67,7 @@ price.textContent;

 It works, but the price isn't alone in the result. Before we'd use such data, we'd need to do some **data cleaning**:

-![Extracting product price](./images/devtools-extracting-price.png)
+![Extracting product price](../scraping_basics/images/devtools-extracting-price.png)

 But for now that's okay. We're just testing the waters now, so that we have an idea about what our scraper will need to do. Once we'll get to extracting prices in Node.js, we'll figure out how to get the values as numbers.

@@ -100,7 +100,7 @@ At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/a

 On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selectors and HTML element manipulation in the **Console** to extract the name of the top wiki. Use the [`trim()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim) method to remove white space around the name.

-![Fandom's Movies page](./images/devtools-exercise-fandom.png)
+![Fandom's Movies page](../scraping_basics/images/devtools-exercise-fandom.png)

 <details>
   <summary>Solution</summary>

@@ -119,7 +119,7 @@ On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selecto

 On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and URL of the associated photo.

-![F1 news page](./images/devtools-exercise-guardian2.png)
+![F1 news page](../scraping_basics/images/devtools-exercise-guardian2.png)
Solution diff --git a/sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md b/sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md index f5ff62a6c8..df9130eca4 100644 --- a/sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md +++ b/sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md @@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/downloading-html unlisted: true --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson we'll start building a Node.js application for watching prices. As a first step, we'll use the Fetch API to download HTML code of a product listing page.** diff --git a/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md b/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md index f641e263df..7a3c6d9f79 100644 --- a/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md +++ b/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md @@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/parsing-html unlisted: true --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson we'll look for products in the downloaded HTML. We'll use Cheerio to turn the HTML into objects which we can work with in our Node.js program.** @@ -14,7 +14,7 @@ import Exercises from './_exercises.mdx'; From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`. -![Products have the ‘product-item’ class](./images/product-item.png) +![Products have the ‘product-item’ class](../scraping_basics/images/product-item.png) As a first step, let's try counting how many products are on the listing page. @@ -50,7 +50,7 @@ Being comfortable around installing Node.js packages is a prerequisite of this c Now let's import the package and use it for parsing the HTML. The `cheerio` module allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `

` element, which represents the main heading of the page. -![Element of the main heading](./images/h1.png) +![Element of the main heading](../scraping_basics/images/h1.png) We'll update our code to the following: diff --git a/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md b/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md index 09101ee358..6201df7b4f 100644 --- a/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md +++ b/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md @@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/locating-elements unlisted: true --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson we'll locate product data in the downloaded HTML. We'll use Cheerio to find those HTML elements which contain details about each product, such as title or price.** @@ -64,7 +64,7 @@ To get details about each product in a structured way, we'll need a different ap As in the browser DevTools lessons, we need to change the code so that it locates child elements for each product card. -![Product card's child elements](./images/child-elements.png) +![Product card's child elements](../scraping_basics/images/child-elements.png) We should be looking for elements which have the `product-item__title` and `price` classes. We already know how that translates to CSS selectors: diff --git a/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md b/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md index e7b81e9450..5c9fb95543 100644 --- a/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md +++ b/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md @@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/extracting-data unlisted: true --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson we'll finish extracting product data from the downloaded HTML. With help of basic string manipulation we'll focus on cleaning and correctly representing the product price.** diff --git a/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md b/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md index f3801457d8..f57252bebc 100644 --- a/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md +++ b/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md @@ -178,7 +178,7 @@ await writeFile("products.csv", csvData); The program should now also produce a `data.csv` file. When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have. -![CSV preview](images/csv.png) +![CSV preview](../scraping_basics/images/csv.png) In the CSV format, if a value contains commas, we should enclose it in quotes. If it contains quotes, we should double them. When we open the file in a text editor of our choice, we can see that the library automatically handled this: @@ -232,6 +232,6 @@ Open the `products.csv` file we created in the lesson using a spreadsheet applic 1. Select the header row. Go to **Data > Create filter**. 1. Use the filter icon that appears next to `minPrice`. 
Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data. - ![CSV in Google Sheets](images/csv-sheets.png) + ![CSV in Google Sheets](../scraping_basics/images/csv-sheets.png)

diff --git a/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md b/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md index 6e3be25049..b36e9a7dab 100644 --- a/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md +++ b/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md @@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/getting-links unlisted: true --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson, we'll locate and extract links to individual product pages. We'll use Cheerio to find the relevant bits of HTML.** @@ -205,7 +205,7 @@ The program is much easier to read now. With the `parseProduct()` function handy We turned the whole program upside down, and at the same time, we didn't make any actual changes! This is [refactoring](https://en.wikipedia.org/wiki/Code_refactoring): improving the structure of existing code without changing its behavior. -![Refactoring](images/refactoring.gif) +![Refactoring](../scraping_basics/images/refactoring.gif) ::: @@ -213,7 +213,7 @@ We turned the whole program upside down, and at the same time, we didn't make an With everything in place, we can now start working on a scraper that also scrapes the product pages. For that, we'll need the links to those pages. Let's open the browser DevTools and remind ourselves of the structure of a single product item: -![Product card's child elements](./images/child-elements.png) +![Product card's child elements](../scraping_basics/images/child-elements.png) Several methods exist for transitioning from one page to another, but the most common is a link element, which looks like this: diff --git a/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md b/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md index 85ad4acad2..ff121f4d05 100644 --- a/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md +++ b/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md @@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/crawling unlisted: true --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson, we'll follow links to individual product pages. We'll use the Fetch API to download them and Cheerio to process them.** @@ -82,7 +82,7 @@ await writeFile('products.csv', await exportCSV(data)); Each product URL points to a so-called _product detail page_, or PDP. If we open one of the product URLs in the browser, e.g. the one about [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), we can see that it contains a vendor name, [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit), number of reviews, product images, product variants, stock availability, description, and perhaps more. -![Product detail page](./images/pdp.png) +![Product detail page](../scraping_basics/images/pdp.png) Depending on what's valuable for our use case, we can now use the same techniques as in previous lessons to extract any of the above. As a demonstration, let's scrape the vendor name. 
In browser DevTools, we can see that the HTML around the vendor name has the following structure: @@ -197,7 +197,7 @@ Scraping the vendor's name is nice, but the main reason we started checking the Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs… -![Morpheus revealing the existence of product variants](images/variants.png) +![Morpheus revealing the existence of product variants](../scraping_basics/images/variants.png) In the next lesson, we'll scrape the product detail pages so that each product variant is represented as a separate item in our dataset. diff --git a/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md b/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md index 04d340119a..03e25269ed 100644 --- a/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md +++ b/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md @@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/scraping-variants unlisted: true --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson, we'll scrape the product detail pages to represent each product variant as a separate item in our dataset.** @@ -39,7 +39,7 @@ First, let's extract information about the variants. If we go to [Sony XBR-950G Nice! We can extract the variant names, but we also need to extract the price for each variant. Switching the variants using the buttons shows us that the HTML changes dynamically. This means the page uses JavaScript to display this information. -![Switching variants](images/variants-js.gif) +![Switching variants](../scraping_basics/images/variants-js.gif) If we can't find a workaround, we'd need our scraper to run browser JavaScript. That's not impossible. Scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Cheerio as much as possible. diff --git a/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md b/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md index bc43ea0508..98022814de 100644 --- a/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md +++ b/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md @@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/framework unlisted: true --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson, we'll rework our application for watching prices so that it builds on top of a scraping framework. We'll use Crawlee to make the program simpler, faster, and more robust.** @@ -273,7 +273,7 @@ const crawler = new CheerioCrawler({ That's it! If we run the program now, there should be a `storage` directory alongside the `index.js` file. Crawlee uses it to store its internal state. If we go to the `storage/datasets/default` subdirectory, we'll see over 30 JSON files, each representing a single item. -![Single dataset item](images/dataset-item.png) +![Single dataset item](../scraping_basics/images/dataset-item.png) We can also export all the items to a single file of our choice. 
We'll do it at the end of the program, after the crawler has finished scraping: diff --git a/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md b/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md index cc1fc3b7d1..eca6ebaac6 100644 --- a/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md +++ b/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md @@ -172,11 +172,11 @@ Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0. After opening the link in our browser, assuming we're logged in, we should see the **Source** screen on the Actor's detail page. We'll go to the **Input** tab of that screen. We won't change anything—just hit **Start**, and we should see logs similar to what we see locally, but this time our scraper will be running in the cloud. -![Actor's detail page, screen Source, tab Input](./images/actor-input.webp) +![Actor's detail page, screen Source, tab Input](../scraping_basics/images/actor-input.webp) When the run finishes, the interface will turn green. On the **Output** tab, we can preview the results as a table or JSON. We can even export the data to formats like CSV, XML, Excel, RSS, and more. -![Actor's detail page, screen Source, tab Output](./images/actor-output.webp) +![Actor's detail page, screen Source, tab Output](../scraping_basics/images/actor-output.webp) :::info Accessing data @@ -190,7 +190,7 @@ Now that our scraper is deployed, let's automate its execution. In the Apify web From now on, the Actor will execute daily. We can inspect each run, view logs, check collected data, [monitor stats and charts](https://docs.apify.com/platform/monitoring), and even set up alerts. -![Schedule detail page](./images/actor-schedule.webp) +![Schedule detail page](../scraping_basics/images/actor-schedule.webp) ## Adding support for proxies @@ -298,7 +298,7 @@ Run: Building Actor warehouse-watchdog Back in the Apify console, we'll go to the **Source** screen and switch to the **Input** tab. We should see the new **Proxy config** option, which defaults to **Datacenter - Automatic**. -![Actor's detail page, screen Source, tab Input with proxies](./images/actor-input-proxies.webp) +![Actor's detail page, screen Source, tab Input with proxies](../scraping_basics/images/actor-input-proxies.webp) We'll leave it as is and click **Start**. This time, the logs should show `Using proxy: yes`, as the scraper uses proxies provided by the platform: diff --git a/sources/academy/webscraping/scraping_basics_javascript2/images b/sources/academy/webscraping/scraping_basics_javascript2/images deleted file mode 120000 index 535a050e4a..0000000000 --- a/sources/academy/webscraping/scraping_basics_javascript2/images +++ /dev/null @@ -1 +0,0 @@ -../scraping_basics_python/images \ No newline at end of file diff --git a/sources/academy/webscraping/scraping_basics_javascript2/index.md b/sources/academy/webscraping/scraping_basics_javascript2/index.md index 3751f05efb..b7465a9aaf 100644 --- a/sources/academy/webscraping/scraping_basics_javascript2/index.md +++ b/sources/academy/webscraping/scraping_basics_javascript2/index.md @@ -15,7 +15,7 @@ import DocCardList from '@theme/DocCardList'; In this course we'll use JavaScript to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such program would be useful for seeing trends in price changes, detecting discounts, etc. 
-![E-commerce listing on the left, JSON with data on the right](./images/scraping.webp) +![E-commerce listing on the left, JSON with data on the right](../scraping_basics/images/scraping.webp) ## What we'll do diff --git a/sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md b/sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md index 0332766a62..9a06641d28 100644 --- a/sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md +++ b/sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md @@ -5,7 +5,7 @@ description: Lesson about using the browser tools for developers to inspect and slug: /scraping-basics-python/devtools-inspecting --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website.** @@ -27,11 +27,11 @@ Google Chrome is currently the most popular browser, and many others use the sam Now let's peek behind the scenes of a real-world website—say, Wikipedia. We'll open Google Chrome and visit [wikipedia.org](https://www.wikipedia.org/). Then, let's press **F12**, or right-click anywhere on the page and select **Inspect**. -![Wikipedia with Chrome DevTools open](./images/devtools-wikipedia.png) +![Wikipedia with Chrome DevTools open](../scraping_basics/images/devtools-wikipedia.png) Websites are built with three main technologies: HTML, CSS, and JavaScript. In the **Elements** tab, DevTools shows the HTML and CSS of the current page: -![Elements tab in Chrome DevTools](./images/devtools-elements-tab.png) +![Elements tab in Chrome DevTools](../scraping_basics/images/devtools-elements-tab.png) :::warning Screen adaptations @@ -61,17 +61,17 @@ While HTML and CSS describe what the browser should display, [JavaScript](https: In DevTools, the **Console** tab allows ad-hoc experimenting with JavaScript. If you don't see it, press ESC to toggle the Console. Running commands in the Console lets us manipulate the loaded page—we’ll try this shortly. -![Console in Chrome DevTools](./images/devtools-console.png) +![Console in Chrome DevTools](../scraping_basics/images/devtools-console.png) ## Selecting an element In the top-left corner of DevTools, let's find the icon with an arrow pointing to a square. -![Chrome DevTools element selection tool](./images/devtools-element-selection.png) +![Chrome DevTools element selection tool](../scraping_basics/images/devtools-element-selection.png) We'll click the icon and hover your cursor over Wikipedia's subtitle, **The Free Encyclopedia**. As we move our cursor, DevTools will display information about the HTML element under it. We'll click on the subtitle. In the **Elements** tab, DevTools will highlight the HTML element that represents the subtitle. -![Chrome DevTools element hover](./images/devtools-hover.png) +![Chrome DevTools element hover](../scraping_basics/images/devtools-hover.png) The highlighted section should look something like this: @@ -107,7 +107,7 @@ We won't be creating Python scrapers just yet. Let's first get familiar with wha In the **Elements** tab, with the subtitle element highlighted, let's right-click the element to open the context menu. There, we'll choose **Store as global variable**. The **Console** should appear, with a `temp1` variable ready. 
-![Global variable in Chrome DevTools Console](./images/devtools-console-variable.png) +![Global variable in Chrome DevTools Console](../scraping_basics/images/devtools-console-variable.png) The Console allows us to run JavaScript in the context of the loaded page, similar to Python's [interactive REPL](https://realpython.com/interacting-with-python/). We can use it to play around with elements. @@ -131,7 +131,7 @@ temp1.textContent = 'Hello World!'; When we change elements in the Console, those changes reflect immediately on the page! -![Changing textContent in Chrome DevTools Console](./images/devtools-console-textcontent.png) +![Changing textContent in Chrome DevTools Console](../scraping_basics/images/devtools-console-textcontent.png) But don't worry—we haven't hacked Wikipedia. The change only happens in our browser. If we reload the page, the change will disappear. This, however, is an easy way to craft a screenshot with fake content. That's why screenshots shouldn't be trusted as evidence. @@ -160,7 +160,7 @@ You're looking for an [`img`](https://developer.mozilla.org/en-US/docs/Web/HTML/ 1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu. 1. In the console, type `temp1.src` and hit **Enter**. - ![DevTools exercise result](./images/devtools-exercise-fifa.png) + ![DevTools exercise result](../scraping_basics/images/devtools-exercise-fifa.png)
@@ -177,6 +177,6 @@ Open a news website, such as [CNN](https://cnn.com). Use the Console to change t 1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu. 1. In the console, type `temp1.textContent = 'Something something'` and hit **Enter**. - ![DevTools exercise result](./images/devtools-exercise-cnn.png) + ![DevTools exercise result](../scraping_basics/images/devtools-exercise-cnn.png)
diff --git a/sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md b/sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md index 154c7d1a19..515cf1f5e1 100644 --- a/sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md +++ b/sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md @@ -5,7 +5,7 @@ description: Lesson about using the browser tools for developers to manually fin slug: /scraping-basics-python/devtools-locating-elements --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website.** @@ -29,17 +29,17 @@ That said, we designed all the additional exercises to work with live websites. As mentioned in the previous lesson, before building a scraper, we need to understand structure of the target page and identify the specific elements our program should extract. Let's figure out how to select details for each product on the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales). -![Warehouse store with DevTools open](./images/devtools-warehouse.png) +![Warehouse store with DevTools open](../scraping_basics/images/devtools-warehouse.png) The page displays a grid of product cards, each showing a product's title and picture. Let's open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. We'll highlight it in the **Elements** tab by clicking on it. -![Selecting an element with DevTools](./images/devtools-product-title.png) +![Selecting an element with DevTools](../scraping_basics/images/devtools-product-title.png) Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more. In the **Elements** tab, we'll move our cursor up from the `a` element containing the subwoofer's title. On the way, we'll hover over each element until we highlight the entire product card. Alternatively, we can use the arrow-up key. The `div` element we land on is the **parent element**, and all nested elements are its **child elements**. -![Selecting an element with hover](./images/devtools-hover-product.png) +![Selecting an element with hover](../scraping_basics/images/devtools-hover-product.png) At this stage, we could use the **Store as global variable** option to send the element to the **Console**. While helpful for manual inspection, this isn't something a program can do. @@ -65,7 +65,7 @@ document.querySelector('.product-item'); It will return the HTML element for the first product card in the listing: -![Using querySelector() in DevTools Console](./images/devtools-queryselector.webp) +![Using querySelector() in DevTools Console](../scraping_basics/images/devtools-queryselector.webp) CSS selectors can get quite complex, but the basics are enough to scrape most of the Warehouse store. Let's cover two simple types and how they can combine. @@ -115,13 +115,13 @@ The product card has four classes: `product-item`, `product-item--vertical`, `1/ This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, we can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after. 
-![Overview of all the product cards in DevTools](./images/devtools-product-list.png) +![Overview of all the product cards in DevTools](../scraping_basics/images/devtools-product-list.png) ## Locating all product cards In the **Console**, hovering our cursor over objects representing HTML elements highlights the corresponding elements on the page. This way we can verify that when we query `.product-item`, the result represents the JBL Flip speaker—the first product card in the list. -![Highlighting a querySelector() result](./images/devtools-hover-queryselector.png) +![Highlighting a querySelector() result](../scraping_basics/images/devtools-hover-queryselector.png) But what if we want to scrape details about the Sony subwoofer we inspected earlier? For that, we need a method that selects more than just the first match: [`querySelectorAll()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll). As the name suggests, it takes a CSS selector string and returns all matching HTML elements. Let's type this into the **Console**: @@ -133,7 +133,7 @@ The returned value is a [`NodeList`](https://developer.mozilla.org/en-US/docs/We We'll expand the result by clicking the small arrow, then hover our cursor over the third element in the list. Indexing starts at 0, so the third element is at index 2. There it is—the product card for the subwoofer! -![Highlighting a querySelectorAll() result](./images/devtools-hover-queryselectorall.png) +![Highlighting a querySelectorAll() result](../scraping_basics/images/devtools-hover-queryselectorall.png) To save the subwoofer in a variable for further inspection, we can use index access with brackets, just like with Python lists (or JavaScript arrays): @@ -152,7 +152,7 @@ Even though we're just playing with JavaScript in the browser's **Console**, we' On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use CSS selectors in the **Console** to list the HTML elements representing headings of the colored boxes (including the grey ones). -![Wikipedia's Main Page headings](./images/devtools-exercise-wikipedia.png) +![Wikipedia's Main Page headings](../scraping_basics/images/devtools-exercise-wikipedia.png)
Solution @@ -170,7 +170,7 @@ On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) category. In the **Console**, use CSS selectors to list all HTML elements representing the products. -![Products in Shein's Jewelry & Accessories category](./images/devtools-exercise-shein.png) +![Products in Shein's Jewelry & Accessories category](../scraping_basics/images/devtools-exercise-shein.png)
Solution @@ -195,7 +195,7 @@ Learn about the [descendant combinator](https://developer.mozilla.org/en-US/docs ::: -![Articles on Guardian's page about F1](./images/devtools-exercise-guardian1.png) +![Articles on Guardian's page about F1](../scraping_basics/images/devtools-exercise-guardian1.png)
Solution diff --git a/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md b/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md index 43fb6264f3..f864362f8a 100644 --- a/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md +++ b/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md @@ -5,7 +5,7 @@ description: Lesson about using the browser tools for developers to manually ext slug: /scraping-basics-python/devtools-extracting-data --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.** @@ -30,7 +30,7 @@ subwoofer.textContent; That indeed outputs all the text, but in a form which would be hard to break down into relevant pieces. -![Printing text content of the parent element](./images/devtools-extracting-text.png) +![Printing text content of the parent element](../scraping_basics/images/devtools-extracting-text.png) We'll first need to locate the relevant child elements and extract the data from each of them individually. @@ -38,7 +38,7 @@ We'll use the **Elements** tab of DevTools to inspect all child elements of the product card for the Sony subwoofer. We can see that the title of the product is inside an `a` element with several classes. Of those, `product-item__title` seems like a great choice for locating the element. -![Finding child elements](./images/devtools-product-details.png) +![Finding child elements](../scraping_basics/images/devtools-product-details.png) JavaScript represents HTML elements as [Element](https://developer.mozilla.org/en-US/docs/Web/API/Element) objects. Among properties we've already played with, such as `textContent` or `outerHTML`, it also has the [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Element/querySelector) method. Here the method looks for matches only within children of the element: @@ -49,13 +49,13 @@ title.textContent; Notice we're calling `querySelector()` on the `subwoofer` variable, not `document`. And just like this, we've scraped our first piece of data! We've extracted the product title: -![Extracting product title](./images/devtools-extracting-title.png) +![Extracting product title](../scraping_basics/images/devtools-extracting-title.png) ## Extracting price To figure out how to get the price, we'll use the **Elements** tab of DevTools again. We notice there are two prices, a regular price and a sale price. For the purposes of watching prices, we'll need the sale price. Both are `span` elements with the `price` class. -![Finding child elements](./images/devtools-product-details.png) +![Finding child elements](../scraping_basics/images/devtools-product-details.png) We could rely either on the fact that the sale price is likely always the one highlighted, or on the fact that it's always the first price. For now, we'll rely on the latter and let `querySelector()` simply return the first result: @@ -66,7 +66,7 @@ price.textContent; It works, but the price isn't alone in the result. Before we'd use such data, we'd need to do some **data cleaning**: -![Extracting product price](./images/devtools-extracting-price.png) +![Extracting product price](../scraping_basics/images/devtools-extracting-price.png) But for now that's okay.
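As a preview of that data-cleaning step, here's a minimal Python sketch of one way such a value could later be turned into a number. The raw string below is a hypothetical example of what `textContent` might return for the price element:

```python
from decimal import Decimal

# Hypothetical raw text scraped from the price element.
price_text = "Sale price$158.00"

# Strip the label and currency symbol, then parse as a precise decimal number.
cleaned = price_text.removeprefix("Sale price").strip().replace("$", "")
price = Decimal(cleaned)
print(price)  # 158.00
```

Using `Decimal` instead of `float` avoids rounding surprises when working with money values.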
We're just testing the waters now, so that we have an idea about what our scraper will need to do. Once we get to extracting prices in Python, we'll figure out how to get the values as numbers. @@ -99,7 +99,7 @@ At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/a On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selectors and HTML element manipulation in the **Console** to extract the name of the top wiki. Use JavaScript's [`trim()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim) method to remove white space around the name. -![Fandom's Movies page](./images/devtools-exercise-fandom.png) +![Fandom's Movies page](../scraping_basics/images/devtools-exercise-fandom.png)
Solution @@ -118,7 +118,7 @@ On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selecto On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and URL of the associated photo. -![F1 news page](./images/devtools-exercise-guardian2.png) +![F1 news page](../scraping_basics/images/devtools-exercise-guardian2.png)
Solution diff --git a/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md b/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md index e0f2304e61..e3866cfcb2 100644 --- a/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md +++ b/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md @@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi slug: /scraping-basics-python/downloading-html --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson we'll start building a Python application for watching prices. As a first step, we'll use the HTTPX library to download HTML code of a product listing page.** diff --git a/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md b/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md index 80a4974f79..dbfa52cb9a 100644 --- a/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md +++ b/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md @@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi slug: /scraping-basics-python/parsing-html --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson we'll look for products in the downloaded HTML. We'll use BeautifulSoup to turn the HTML into objects which we can work with in our Python program.** @@ -13,7 +13,7 @@ import Exercises from './_exercises.mdx'; From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`. -![Products have the ‘product-item’ class](./images/product-item.png) +![Products have the ‘product-item’ class](../scraping_basics/images/product-item.png) As a first step, let's try counting how many products are on the listing page. @@ -42,7 +42,7 @@ Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0 Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `
<h1>` element, which represents the main heading of the page. -![Element of the main heading](./images/h1.png) +![Element of the main heading](../scraping_basics/images/h1.png) We'll update our code to the following: diff --git a/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md b/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md index fa8a38fc6d..0708dc071e 100644 --- a/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md +++ b/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md @@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi slug: /scraping-basics-python/locating-elements --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson we'll locate product data in the downloaded HTML. We'll use BeautifulSoup to find those HTML elements which contain details about each product, such as title or price.** @@ -60,7 +60,7 @@ To get details about each product in a structured way, we'll need a different ap As in the browser DevTools lessons, we need to change the code so that it locates child elements for each product card. -![Product card's child elements](./images/child-elements.png) +![Product card's child elements](../scraping_basics/images/child-elements.png) We should be looking for elements which have the `product-item__title` and `price` classes. We already know how that translates to CSS selectors: diff --git a/sources/academy/webscraping/scraping_basics_python/07_extracting_data.md b/sources/academy/webscraping/scraping_basics_python/07_extracting_data.md index 01814edde9..eb49b7ce69 100644 --- a/sources/academy/webscraping/scraping_basics_python/07_extracting_data.md +++ b/sources/academy/webscraping/scraping_basics_python/07_extracting_data.md @@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi slug: /scraping-basics-python/extracting-data --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson we'll finish extracting product data from the downloaded HTML. With the help of basic string manipulation, we'll focus on cleaning and correctly representing the product price.** diff --git a/sources/academy/webscraping/scraping_basics_python/08_saving_data.md b/sources/academy/webscraping/scraping_basics_python/08_saving_data.md index 8c3ddedc31..a0d6d94743 100644 --- a/sources/academy/webscraping/scraping_basics_python/08_saving_data.md +++ b/sources/academy/webscraping/scraping_basics_python/08_saving_data.md @@ -147,7 +147,7 @@ In the CSV format, if a value contains commas, we should enclose it in quotes. W When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have. -![CSV example preview](images/csv-example.png) +![CSV example preview](../scraping_basics/images/csv-example.png) Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this!
First, let's add `csv` to our imports: @@ -174,7 +174,7 @@ with open("products.csv", "w") as file: The program should now also produce a CSV file with the following content: -![CSV preview](images/csv.png) +![CSV preview](../scraping_basics/images/csv.png) We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages. @@ -218,6 +218,6 @@ Open the `products.csv` file we created in the lesson using a spreadsheet applic 1. Select the header row. Go to **Data > Create filter**. 1. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data. - ![CSV in Google Sheets](images/csv-sheets.png) + ![CSV in Google Sheets](../scraping_basics/images/csv-sheets.png)

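To make the CSV-writing steps above concrete, here's a minimal, self-contained sketch of the kind of code the lesson builds toward. The two product rows are hypothetical placeholders; in the lesson, the data comes from the scraper:

```python
import csv

# Hypothetical rows; in the lesson these come from scraping the listing.
products = [
    {"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95"},
    {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00"},
]

with open("products.csv", "w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["title", "min_price"])
    writer.writeheader()
    # Values containing commas or quotes are escaped automatically.
    writer.writerows(products)
```

The standard library's `csv` module handles the quoting rule mentioned above for us, which is why the lesson reaches for it instead of assembling the file by hand.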
diff --git a/sources/academy/webscraping/scraping_basics_python/09_getting_links.md b/sources/academy/webscraping/scraping_basics_python/09_getting_links.md index 6da32e836d..883ba050f3 100644 --- a/sources/academy/webscraping/scraping_basics_python/09_getting_links.md +++ b/sources/academy/webscraping/scraping_basics_python/09_getting_links.md @@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi slug: /scraping-basics-python/getting-links --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson, we'll locate and extract links to individual product pages. We'll use BeautifulSoup to find the relevant bits of HTML.** @@ -204,7 +204,7 @@ The program is much easier to read now. With the `parse_product()` function hand We turned the whole program upside down, and at the same time, we didn't make any actual changes! This is [refactoring](https://en.wikipedia.org/wiki/Code_refactoring): improving the structure of existing code without changing its behavior. -![Refactoring](images/refactoring.gif) +![Refactoring](../scraping_basics/images/refactoring.gif) ::: @@ -212,7 +212,7 @@ We turned the whole program upside down, and at the same time, we didn't make an With everything in place, we can now start working on a scraper that also scrapes the product pages. For that, we'll need the links to those pages. Let's open the browser DevTools and remind ourselves of the structure of a single product item: -![Product card's child elements](./images/child-elements.png) +![Product card's child elements](../scraping_basics/images/child-elements.png) Several methods exist for transitioning from one page to another, but the most common is a link element, which looks like this: diff --git a/sources/academy/webscraping/scraping_basics_python/10_crawling.md b/sources/academy/webscraping/scraping_basics_python/10_crawling.md index a18ee39632..836dadad3a 100644 --- a/sources/academy/webscraping/scraping_basics_python/10_crawling.md +++ b/sources/academy/webscraping/scraping_basics_python/10_crawling.md @@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi slug: /scraping-basics-python/crawling --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson, we'll follow links to individual product pages. We'll use HTTPX to download them and BeautifulSoup to process them.** @@ -81,7 +81,7 @@ with open("products.csv", "w") as file: Each product URL points to a so-called _product detail page_, or PDP. If we open one of the product URLs in the browser, e.g. the one about [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), we can see that it contains a vendor name, [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit), number of reviews, product images, product variants, stock availability, description, and perhaps more. -![Product detail page](./images/pdp.png) +![Product detail page](../scraping_basics/images/pdp.png) Depending on what's valuable for our use case, we can now use the same techniques as in previous lessons to extract any of the above. As a demonstration, let's scrape the vendor name. 
In browser DevTools, we can see that the HTML around the vendor name has the following structure: @@ -172,7 +172,7 @@ Scraping the vendor's name is nice, but the main reason we started checking the Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs… -![Morpheus revealing the existence of product variants](images/variants.png) +![Morpheus revealing the existence of product variants](../scraping_basics/images/variants.png) In the next lesson, we'll scrape the product detail pages so that each product variant is represented as a separate item in our dataset. diff --git a/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md b/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md index cdd3496af6..e47affbaec 100644 --- a/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md +++ b/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md @@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi slug: /scraping-basics-python/scraping-variants --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson, we'll scrape the product detail pages to represent each product variant as a separate item in our dataset.** @@ -38,7 +38,7 @@ First, let's extract information about the variants. If we go to [Sony XBR-950G Nice! We can extract the variant names, but we also need to extract the price for each variant. Switching the variants using the buttons shows us that the HTML changes dynamically. This means the page uses JavaScript to display this information. -![Switching variants](images/variants-js.gif) +![Switching variants](../scraping_basics/images/variants-js.gif) If we can't find a workaround, we'd need our scraper to run JavaScript. That's not impossible. Scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Beautiful Soup as much as possible. diff --git a/sources/academy/webscraping/scraping_basics_python/12_framework.md b/sources/academy/webscraping/scraping_basics_python/12_framework.md index 8f64594f92..6f8861785d 100644 --- a/sources/academy/webscraping/scraping_basics_python/12_framework.md +++ b/sources/academy/webscraping/scraping_basics_python/12_framework.md @@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi slug: /scraping-basics-python/framework --- -import Exercises from './_exercises.mdx'; +import Exercises from '../scraping_basics/_exercises.mdx'; **In this lesson, we'll rework our application for watching prices so that it builds on top of a scraping framework. We'll use Crawlee to make the program simpler, faster, and more robust.** @@ -321,7 +321,7 @@ async def main(): That's it! If we run the program now, there should be a `storage` directory alongside the `main.py` file. Crawlee uses it to store its internal state. If we go to the `storage/datasets/default` subdirectory, we'll see over 30 JSON files, each representing a single item. 
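For orientation, a stripped-down sketch of a Crawlee scraper along the lines the framework lesson describes might look as follows. The import paths and handler wiring are assumptions based on Crawlee for Python's documented API, not necessarily the exact code from the lesson:

```python
import asyncio

# Assumed import path for Crawlee's BeautifulSoup-based crawler; check the
# lesson's final code for the exact form used in the course.
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    # The default handler runs for every request; context.soup is the parsed page.
    @crawler.router.default_handler
    async def handle_listing(context: BeautifulSoupCrawlingContext) -> None:
        for card in context.soup.select(".product-item"):
            title = card.select_one(".product-item__title")
            # Each pushed dict becomes one item in the crawler's default dataset.
            await context.push_data({"title": title.text.strip()})

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])


if __name__ == "__main__":
    asyncio.run(main())
```

The framework takes care of queuing, retries, and storage, which is what lets the course's `main.py` shrink to little more than handlers like the one above.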
-![Single dataset item](images/dataset-item.png) +![Single dataset item](../scraping_basics/images/dataset-item.png) We can also export all the items to a single file of our choice. We'll do it at the end of the `main()` function, after the crawler has finished scraping: diff --git a/sources/academy/webscraping/scraping_basics_python/13_platform.md b/sources/academy/webscraping/scraping_basics_python/13_platform.md index 23f042a048..7496a63661 100644 --- a/sources/academy/webscraping/scraping_basics_python/13_platform.md +++ b/sources/academy/webscraping/scraping_basics_python/13_platform.md @@ -84,7 +84,7 @@ The file contains a single asynchronous function, `main()`. At the beginning, it Every program that runs on the Apify platform first needs to be packaged as a so-called [Actor](https://docs.apify.com/platform/actors)—a standardized container with designated places for input and output. Crawlee scrapers automatically connect their default dataset to the Actor output, but input must be handled explicitly in the code. -![The expected file structure](./images/actor-file-structure.webp) +![The expected file structure](../scraping_basics/images/actor-file-structure.webp) We'll now adjust the template so that it runs our program for watching prices. As the first step, we'll create a new empty file, `crawler.py`, inside the `warehouse-watchdog/src` directory. Then, we'll fill this file with final, unchanged code from the previous lesson: @@ -258,11 +258,11 @@ Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0. After opening the link in our browser, assuming we're logged in, we should see the **Source** screen on the Actor's detail page. We'll go to the **Input** tab of that screen. We won't change anything—just hit **Start**, and we should see logs similar to what we see locally, but this time our scraper will be running in the cloud. -![Actor's detail page, screen Source, tab Input](./images/actor-input.webp) +![Actor's detail page, screen Source, tab Input](../scraping_basics/images/actor-input.webp) When the run finishes, the interface will turn green. On the **Output** tab, we can preview the results as a table or JSON. We can even export the data to formats like CSV, XML, Excel, RSS, and more. -![Actor's detail page, screen Source, tab Output](./images/actor-output.webp) +![Actor's detail page, screen Source, tab Output](../scraping_basics/images/actor-output.webp) :::info Accessing data @@ -276,7 +276,7 @@ Now that our scraper is deployed, let's automate its execution. In the Apify web From now on, the Actor will execute daily. We can inspect each run, view logs, check collected data, [monitor stats and charts](https://docs.apify.com/platform/monitoring), and even set up alerts. -![Schedule detail page](./images/actor-schedule.webp) +![Schedule detail page](../scraping_basics/images/actor-schedule.webp) ## Adding support for proxies @@ -391,7 +391,7 @@ Run: Building Actor warehouse-watchdog Back in the Apify console, we'll go to the **Source** screen and switch to the **Input** tab. We should see the new **Proxy config** option, which defaults to **Datacenter - Automatic**. -![Actor's detail page, screen Source, tab Input with proxies](./images/actor-input-proxies.webp) +![Actor's detail page, screen Source, tab Input with proxies](../scraping_basics/images/actor-input-proxies.webp) We'll leave it as is and click **Start**. 
This time, the logs should show `Using proxy: yes`, as the scraper uses proxies provided by the platform: diff --git a/sources/academy/webscraping/scraping_basics_python/_exercises.mdx b/sources/academy/webscraping/scraping_basics_python/_exercises.mdx deleted file mode 100644 index ba254f4022..0000000000 --- a/sources/academy/webscraping/scraping_basics_python/_exercises.mdx +++ /dev/null @@ -1,10 +0,0 @@ - -## Exercises - -These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! - -:::caution Real world - -You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). - -::: diff --git a/sources/academy/webscraping/scraping_basics_python/index.md b/sources/academy/webscraping/scraping_basics_python/index.md index 4de160a3a3..6ef1e6d78d 100644 --- a/sources/academy/webscraping/scraping_basics_python/index.md +++ b/sources/academy/webscraping/scraping_basics_python/index.md @@ -14,7 +14,7 @@ import DocCardList from '@theme/DocCardList'; In this course we'll use Python to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, etc. -![E-commerce listing on the left, JSON with data on the right](./images/scraping.webp) +![E-commerce listing on the left, JSON with data on the right](../scraping_basics/images/scraping.webp) ## What we'll do