diff --git a/sources/academy/webscraping/scraping_basics_javascript2/_exercises.mdx b/sources/academy/webscraping/scraping_basics/_exercises.mdx
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_javascript2/_exercises.mdx
rename to sources/academy/webscraping/scraping_basics/_exercises.mdx
diff --git a/sources/academy/webscraping/scraping_basics_python/images/actor-file-structure.webp b/sources/academy/webscraping/scraping_basics/images/actor-file-structure.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/actor-file-structure.webp
rename to sources/academy/webscraping/scraping_basics/images/actor-file-structure.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/actor-input-proxies.webp b/sources/academy/webscraping/scraping_basics/images/actor-input-proxies.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/actor-input-proxies.webp
rename to sources/academy/webscraping/scraping_basics/images/actor-input-proxies.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/actor-input.webp b/sources/academy/webscraping/scraping_basics/images/actor-input.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/actor-input.webp
rename to sources/academy/webscraping/scraping_basics/images/actor-input.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/actor-output.webp b/sources/academy/webscraping/scraping_basics/images/actor-output.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/actor-output.webp
rename to sources/academy/webscraping/scraping_basics/images/actor-output.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/actor-schedule.webp b/sources/academy/webscraping/scraping_basics/images/actor-schedule.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/actor-schedule.webp
rename to sources/academy/webscraping/scraping_basics/images/actor-schedule.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/child-elements.png b/sources/academy/webscraping/scraping_basics/images/child-elements.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/child-elements.png
rename to sources/academy/webscraping/scraping_basics/images/child-elements.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/csv-example.png b/sources/academy/webscraping/scraping_basics/images/csv-example.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/csv-example.png
rename to sources/academy/webscraping/scraping_basics/images/csv-example.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/csv-sheets.png b/sources/academy/webscraping/scraping_basics/images/csv-sheets.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/csv-sheets.png
rename to sources/academy/webscraping/scraping_basics/images/csv-sheets.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/csv.png b/sources/academy/webscraping/scraping_basics/images/csv.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/csv.png
rename to sources/academy/webscraping/scraping_basics/images/csv.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/dataset-item.png b/sources/academy/webscraping/scraping_basics/images/dataset-item.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/dataset-item.png
rename to sources/academy/webscraping/scraping_basics/images/dataset-item.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-console-textcontent.png b/sources/academy/webscraping/scraping_basics/images/devtools-console-textcontent.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-console-textcontent.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-console-textcontent.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-console-variable.png b/sources/academy/webscraping/scraping_basics/images/devtools-console-variable.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-console-variable.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-console-variable.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-console.png b/sources/academy/webscraping/scraping_basics/images/devtools-console.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-console.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-console.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-element-selection.png b/sources/academy/webscraping/scraping_basics/images/devtools-element-selection.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-element-selection.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-element-selection.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-elements-tab.png b/sources/academy/webscraping/scraping_basics/images/devtools-elements-tab.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-elements-tab.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-elements-tab.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-cnn.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-cnn.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-cnn.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-cnn.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-fandom.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-fandom.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-fandom.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-fandom.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-fifa.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-fifa.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-fifa.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-fifa.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-guardian1.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-guardian1.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-guardian1.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-guardian1.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-guardian2.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-guardian2.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-guardian2.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-guardian2.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-shein.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-shein.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-shein.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-shein.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-wikipedia.png b/sources/academy/webscraping/scraping_basics/images/devtools-exercise-wikipedia.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-exercise-wikipedia.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-exercise-wikipedia.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-extracting-price.png b/sources/academy/webscraping/scraping_basics/images/devtools-extracting-price.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-extracting-price.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-extracting-price.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-extracting-text.png b/sources/academy/webscraping/scraping_basics/images/devtools-extracting-text.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-extracting-text.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-extracting-text.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-extracting-title.png b/sources/academy/webscraping/scraping_basics/images/devtools-extracting-title.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-extracting-title.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-extracting-title.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-hover-product.png b/sources/academy/webscraping/scraping_basics/images/devtools-hover-product.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-hover-product.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-hover-product.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-hover-queryselector.png b/sources/academy/webscraping/scraping_basics/images/devtools-hover-queryselector.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-hover-queryselector.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-hover-queryselector.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-hover-queryselectorall.png b/sources/academy/webscraping/scraping_basics/images/devtools-hover-queryselectorall.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-hover-queryselectorall.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-hover-queryselectorall.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-hover.png b/sources/academy/webscraping/scraping_basics/images/devtools-hover.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-hover.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-hover.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-product-details.png b/sources/academy/webscraping/scraping_basics/images/devtools-product-details.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-product-details.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-product-details.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-product-list.png b/sources/academy/webscraping/scraping_basics/images/devtools-product-list.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-product-list.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-product-list.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-product-title.png b/sources/academy/webscraping/scraping_basics/images/devtools-product-title.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-product-title.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-product-title.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-queryselector.webp b/sources/academy/webscraping/scraping_basics/images/devtools-queryselector.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-queryselector.webp
rename to sources/academy/webscraping/scraping_basics/images/devtools-queryselector.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-warehouse.png b/sources/academy/webscraping/scraping_basics/images/devtools-warehouse.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-warehouse.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-warehouse.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/devtools-wikipedia.png b/sources/academy/webscraping/scraping_basics/images/devtools-wikipedia.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/devtools-wikipedia.png
rename to sources/academy/webscraping/scraping_basics/images/devtools-wikipedia.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/h1.png b/sources/academy/webscraping/scraping_basics/images/h1.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/h1.png
rename to sources/academy/webscraping/scraping_basics/images/h1.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/pdp.png b/sources/academy/webscraping/scraping_basics/images/pdp.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/pdp.png
rename to sources/academy/webscraping/scraping_basics/images/pdp.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/product-item.png b/sources/academy/webscraping/scraping_basics/images/product-item.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/product-item.png
rename to sources/academy/webscraping/scraping_basics/images/product-item.png
diff --git a/sources/academy/webscraping/scraping_basics_python/images/refactoring.gif b/sources/academy/webscraping/scraping_basics/images/refactoring.gif
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/refactoring.gif
rename to sources/academy/webscraping/scraping_basics/images/refactoring.gif
diff --git a/sources/academy/webscraping/scraping_basics_python/images/scraping.webp b/sources/academy/webscraping/scraping_basics/images/scraping.webp
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/scraping.webp
rename to sources/academy/webscraping/scraping_basics/images/scraping.webp
diff --git a/sources/academy/webscraping/scraping_basics_python/images/variants-js.gif b/sources/academy/webscraping/scraping_basics/images/variants-js.gif
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/variants-js.gif
rename to sources/academy/webscraping/scraping_basics/images/variants-js.gif
diff --git a/sources/academy/webscraping/scraping_basics_python/images/variants.png b/sources/academy/webscraping/scraping_basics/images/variants.png
similarity index 100%
rename from sources/academy/webscraping/scraping_basics_python/images/variants.png
rename to sources/academy/webscraping/scraping_basics/images/variants.png
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/01_devtools_inspecting.md b/sources/academy/webscraping/scraping_basics_javascript2/01_devtools_inspecting.md
index 2540bfd21b..3df10ee4a2 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/01_devtools_inspecting.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/01_devtools_inspecting.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/devtools-inspecting
unlisted: true
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website.**
@@ -28,11 +28,11 @@ Google Chrome is currently the most popular browser, and many others use the sam
Now let's peek behind the scenes of a real-world website—say, Wikipedia. We'll open Google Chrome and visit [wikipedia.org](https://www.wikipedia.org/). Then, let's press **F12**, or right-click anywhere on the page and select **Inspect**.
-
+
Websites are built with three main technologies: HTML, CSS, and JavaScript. In the **Elements** tab, DevTools shows the HTML and CSS of the current page:
-
+
:::warning Screen adaptations
@@ -62,17 +62,17 @@ While HTML and CSS describe what the browser should display, JavaScript adds int
If you don't see it, press ESC to toggle the Console. Running commands in the Console lets us manipulate the loaded page—we'll try this shortly.
-
+
## Selecting an element
In the top-left corner of DevTools, let's find the icon with an arrow pointing to a square.
-
+
We'll click the icon and hover our cursor over Wikipedia's subtitle, **The Free Encyclopedia**. As we move our cursor, DevTools will display information about the HTML element under it. We'll click on the subtitle. In the **Elements** tab, DevTools will highlight the HTML element that represents the subtitle.
-
+
The highlighted section should look something like this:
@@ -108,7 +108,7 @@ We won't be creating Node.js scrapers just yet. Let's first get familiar with wh
In the **Elements** tab, with the subtitle element highlighted, let's right-click the element to open the context menu. There, we'll choose **Store as global variable**. The **Console** should appear, with a `temp1` variable ready.
-
+
The Console allows us to run code in the context of the loaded page. We can use it to play around with elements.
@@ -132,7 +132,7 @@ temp1.textContent = 'Hello World!';
When we change elements in the Console, those changes reflect immediately on the page!
-
+
But don't worry—we haven't hacked Wikipedia. The change only happens in our browser. If we reload the page, the change will disappear. This, however, is an easy way to craft a screenshot with fake content. That's why screenshots shouldn't be trusted as evidence.
@@ -161,7 +161,7 @@ You're looking for an [`img`](https://developer.mozilla.org/en-US/docs/Web/HTML/
1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
1. In the console, type `temp1.src` and hit **Enter**.
- 
+ 
@@ -178,6 +178,6 @@ Open a news website, such as [CNN](https://cnn.com). Use the Console to change t
1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
1. In the console, type `temp1.textContent = 'Something something'` and hit **Enter**.
- 
+ 
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/02_devtools_locating_elements.md b/sources/academy/webscraping/scraping_basics_javascript2/02_devtools_locating_elements.md
index 0796418c9e..f8a86f124d 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/02_devtools_locating_elements.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/02_devtools_locating_elements.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/devtools-locating-elements
unlisted: true
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website.**
@@ -30,17 +30,17 @@ That said, we designed all the additional exercises to work with live websites.
As mentioned in the previous lesson, before building a scraper, we need to understand the structure of the target page and identify the specific elements our program should extract. Let's figure out how to select details for each product on the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales).
-
+
The page displays a grid of product cards, each showing a product's title and picture. Let's open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. We'll highlight it in the **Elements** tab by clicking on it.
-
+
Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more.
In the **Elements** tab, we'll move our cursor up from the `a` element containing the subwoofer's title. On the way, we'll hover over each element until we highlight the entire product card. Alternatively, we can use the arrow-up key. The `div` element we land on is the **parent element**, and all nested elements are its **child elements**.
-
+
At this stage, we could use the **Store as global variable** option to send the element to the **Console**. While helpful for manual inspection, this isn't something a program can do.
@@ -64,7 +64,7 @@ document.querySelector('.product-item');
It will return the HTML element for the first product card in the listing:
-
+
CSS selectors can get quite complex, but the basics are enough to scrape most of the Warehouse store. Let's cover two simple types and how they can combine.
@@ -114,13 +114,13 @@ The product card has four classes: `product-item`, `product-item--vertical`, `1/
This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, we can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after.
-
+
## Locating all product cards
In the **Console**, hovering our cursor over objects representing HTML elements highlights the corresponding elements on the page. This way we can verify that when we query `.product-item`, the result represents the JBL Flip speaker—the first product card in the list.
-
+
But what if we want to scrape details about the Sony subwoofer we inspected earlier? For that, we need a method that selects more than just the first match: [`querySelectorAll()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll). As the name suggests, it takes a CSS selector string and returns all matching HTML elements. Let's type this into the **Console**:
@@ -132,7 +132,7 @@ The returned value is a [`NodeList`](https://developer.mozilla.org/en-US/docs/We
We'll expand the result by clicking the small arrow, then hover our cursor over the third element in the list. Indexing starts at 0, so the third element is at index 2. There it is—the product card for the subwoofer!
-
+
To save the subwoofer in a variable for further inspection, we can use index access with brackets, just like with regular JavaScript arrays:
@@ -151,7 +151,7 @@ Even though we're just playing in the browser's **Console**, we're inching close
On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use CSS selectors in the **Console** to list the HTML elements representing headings of the colored boxes (including the grey ones).
-
+
Solution
@@ -169,7 +169,7 @@ On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use
Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) category. In the **Console**, use CSS selectors to list all HTML elements representing the products.
-
+
Solution
@@ -194,7 +194,7 @@ Learn about the [descendant combinator](https://developer.mozilla.org/en-US/docs
:::
-
+
Solution
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/03_devtools_extracting_data.md b/sources/academy/webscraping/scraping_basics_javascript2/03_devtools_extracting_data.md
index aeb6fc7ed6..4acffaf356 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/03_devtools_extracting_data.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/03_devtools_extracting_data.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/devtools-extracting-data
unlisted: true
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.**
@@ -31,7 +31,7 @@ subwoofer.textContent;
That indeed outputs all the text, but in a form which would be hard to break down into relevant pieces.
-
+
We'll need to first locate relevant child elements and extract the data from each of them individually.
@@ -39,7 +39,7 @@ We'll need to first locate relevant child elements and extract the data from eac
We'll use the **Elements** tab of DevTools to inspect all child elements of the product card for the Sony subwoofer. We can see that the title of the product is inside an `a` element with several classes. From those, the `product-item__title` class seems like a great choice to locate the element.
-
+
Browser JavaScript represents HTML elements as [Element](https://developer.mozilla.org/en-US/docs/Web/API/Element) objects. In addition to properties we've already played with, such as `textContent` or `outerHTML`, it also has the [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Element/querySelector) method. Here the method looks for matches only within children of the element:
@@ -50,13 +50,13 @@ title.textContent;
Notice we're calling `querySelector()` on the `subwoofer` variable, not `document`. And just like this, we've scraped our first piece of data! We've extracted the product title:
-
+
## Extracting price
To figure out how to get the price, we'll use the **Elements** tab of DevTools again. We notice there are two prices, a regular price and a sale price. For the purposes of watching prices we'll need the sale price. Both are `span` elements with the `price` class.
-
+
We could rely either on the fact that the sale price is likely always the one which is highlighted, or that it's always the first price. For now we'll rely on the latter and let `querySelector()` simply return the first result:
@@ -67,7 +67,7 @@ price.textContent;
It works, but the price isn't alone in the result. Before using such data, we'd need to do some **data cleaning**:
-
+
But for now that's okay. We're just testing the waters, so that we have an idea about what our scraper will need to do. Once we get to extracting prices in Node.js, we'll figure out how to get the values as numbers.
@@ -100,7 +100,7 @@ At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/a
On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selectors and HTML element manipulation in the **Console** to extract the name of the top wiki. Use the [`trim()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim) method to remove white space around the name.
-
+
Solution
@@ -119,7 +119,7 @@ On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selecto
On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and the URL of the associated photo.
-
+
Solution
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md b/sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md
index f5ff62a6c8..df9130eca4 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/04_downloading_html.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/downloading-html
unlisted: true
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll start building a Node.js application for watching prices. As a first step, we'll use the Fetch API to download HTML code of a product listing page.**
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md b/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md
index f641e263df..7a3c6d9f79 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/05_parsing_html.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/parsing-html
unlisted: true
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll look for products in the downloaded HTML. We'll use Cheerio to turn the HTML into objects which we can work with in our Node.js program.**
@@ -14,7 +14,7 @@ import Exercises from './_exercises.mdx';
From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`.
-
+
As a first step, let's try counting how many products are on the listing page.
@@ -50,7 +50,7 @@ Being comfortable around installing Node.js packages is a prerequisite of this c
Now let's import the package and use it for parsing the HTML. The `cheerio` module allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
-
+
We'll update our code to the following:
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md b/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md
index 09101ee358..6201df7b4f 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/06_locating_elements.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/locating-elements
unlisted: true
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll locate product data in the downloaded HTML. We'll use Cheerio to find those HTML elements which contain details about each product, such as title or price.**
@@ -64,7 +64,7 @@ To get details about each product in a structured way, we'll need a different ap
As in the browser DevTools lessons, we need to change the code so that it locates child elements for each product card.
-
+
We should be looking for elements which have the `product-item__title` and `price` classes. We already know how that translates to CSS selectors:
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md b/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md
index e7b81e9450..5c9fb95543 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/07_extracting_data.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/extracting-data
unlisted: true
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll finish extracting product data from the downloaded HTML. With help of basic string manipulation we'll focus on cleaning and correctly representing the product price.**
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md b/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md
index f3801457d8..f57252bebc 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/08_saving_data.md
@@ -178,7 +178,7 @@ await writeFile("products.csv", csvData);
The program should now also produce a `products.csv` file. When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have.
-
+
In the CSV format, if a value contains commas, we should enclose it in quotes. If it contains quotes, we should double them. When we open the file in a text editor of our choice, we can see that the library automatically handled this:
@@ -232,6 +232,6 @@ Open the `products.csv` file we created in the lesson using a spreadsheet applic
1. Select the header row. Go to **Data > Create filter**.
1. Use the filter icon that appears next to `minPrice`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
- 
+ 
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md b/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md
index 6e3be25049..b36e9a7dab 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/09_getting_links.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/getting-links
unlisted: true
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson, we'll locate and extract links to individual product pages. We'll use Cheerio to find the relevant bits of HTML.**
@@ -205,7 +205,7 @@ The program is much easier to read now. With the `parseProduct()` function handy
We turned the whole program upside down, and at the same time, we didn't make any actual changes! This is [refactoring](https://en.wikipedia.org/wiki/Code_refactoring): improving the structure of existing code without changing its behavior.
-
+
:::
@@ -213,7 +213,7 @@ We turned the whole program upside down, and at the same time, we didn't make an
With everything in place, we can now start working on a scraper that also scrapes the product pages. For that, we'll need the links to those pages. Let's open the browser DevTools and remind ourselves of the structure of a single product item:
-
+
Several methods exist for transitioning from one page to another, but the most common is a link element, which looks like this:
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md b/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md
index 85ad4acad2..ff121f4d05 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/10_crawling.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/crawling
unlisted: true
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson, we'll follow links to individual product pages. We'll use the Fetch API to download them and Cheerio to process them.**
@@ -82,7 +82,7 @@ await writeFile('products.csv', await exportCSV(data));
Each product URL points to a so-called _product detail page_, or PDP. If we open one of the product URLs in the browser, e.g. the one about [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), we can see that it contains a vendor name, [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit), number of reviews, product images, product variants, stock availability, description, and perhaps more.
-
+
Depending on what's valuable for our use case, we can now use the same techniques as in previous lessons to extract any of the above. As a demonstration, let's scrape the vendor name. In browser DevTools, we can see that the HTML around the vendor name has the following structure:
@@ -197,7 +197,7 @@ Scraping the vendor's name is nice, but the main reason we started checking the
Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs…
-
+
In the next lesson, we'll scrape the product detail pages so that each product variant is represented as a separate item in our dataset.
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md b/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md
index 04d340119a..03e25269ed 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/scraping-variants
unlisted: true
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson, we'll scrape the product detail pages to represent each product variant as a separate item in our dataset.**
@@ -39,7 +39,7 @@ First, let's extract information about the variants. If we go to [Sony XBR-950G
Nice! We can extract the variant names, but we also need to extract the price for each variant. Switching the variants using the buttons shows us that the HTML changes dynamically. This means the page uses JavaScript to display this information.
-
+
If we can't find a workaround, we'd need our scraper to run browser JavaScript. That's not impossible. Scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Cheerio as much as possible.
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md b/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md
index bc43ea0508..98022814de 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/12_framework.md
@@ -6,7 +6,7 @@ slug: /scraping-basics-javascript2/framework
unlisted: true
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson, we'll rework our application for watching prices so that it builds on top of a scraping framework. We'll use Crawlee to make the program simpler, faster, and more robust.**
@@ -273,7 +273,7 @@ const crawler = new CheerioCrawler({
That's it! If we run the program now, there should be a `storage` directory alongside the `index.js` file. Crawlee uses it to store its internal state. If we go to the `storage/datasets/default` subdirectory, we'll see over 30 JSON files, each representing a single item.
-
+
We can also export all the items to a single file of our choice. We'll do it at the end of the program, after the crawler has finished scraping:
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md b/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md
index cc1fc3b7d1..eca6ebaac6 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/13_platform.md
@@ -172,11 +172,11 @@ Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0.
After opening the link in our browser, assuming we're logged in, we should see the **Source** screen on the Actor's detail page. We'll go to the **Input** tab of that screen. We won't change anything—just hit **Start**, and we should see logs similar to what we see locally, but this time our scraper will be running in the cloud.
-
+
When the run finishes, the interface will turn green. On the **Output** tab, we can preview the results as a table or JSON. We can even export the data to formats like CSV, XML, Excel, RSS, and more.
-
+
:::info Accessing data
@@ -190,7 +190,7 @@ Now that our scraper is deployed, let's automate its execution. In the Apify web
From now on, the Actor will execute daily. We can inspect each run, view logs, check collected data, [monitor stats and charts](https://docs.apify.com/platform/monitoring), and even set up alerts.
-
+
## Adding support for proxies
@@ -298,7 +298,7 @@ Run: Building Actor warehouse-watchdog
Back in the Apify console, we'll go to the **Source** screen and switch to the **Input** tab. We should see the new **Proxy config** option, which defaults to **Datacenter - Automatic**.
-
+
We'll leave it as is and click **Start**. This time, the logs should show `Using proxy: yes`, as the scraper uses proxies provided by the platform:
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/images b/sources/academy/webscraping/scraping_basics_javascript2/images
deleted file mode 120000
index 535a050e4a..0000000000
--- a/sources/academy/webscraping/scraping_basics_javascript2/images
+++ /dev/null
@@ -1 +0,0 @@
-../scraping_basics_python/images
\ No newline at end of file
diff --git a/sources/academy/webscraping/scraping_basics_javascript2/index.md b/sources/academy/webscraping/scraping_basics_javascript2/index.md
index 3751f05efb..b7465a9aaf 100644
--- a/sources/academy/webscraping/scraping_basics_javascript2/index.md
+++ b/sources/academy/webscraping/scraping_basics_javascript2/index.md
@@ -15,7 +15,7 @@ import DocCardList from '@theme/DocCardList';
In this course we'll use JavaScript to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, etc.
-
+
## What we'll do
diff --git a/sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md b/sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md
index 0332766a62..9a06641d28 100644
--- a/sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md
+++ b/sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md
@@ -5,7 +5,7 @@ description: Lesson about using the browser tools for developers to inspect and
slug: /scraping-basics-python/devtools-inspecting
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website.**
@@ -27,11 +27,11 @@ Google Chrome is currently the most popular browser, and many others use the sam
Now let's peek behind the scenes of a real-world website—say, Wikipedia. We'll open Google Chrome and visit [wikipedia.org](https://www.wikipedia.org/). Then, let's press **F12**, or right-click anywhere on the page and select **Inspect**.
-
+
Websites are built with three main technologies: HTML, CSS, and JavaScript. In the **Elements** tab, DevTools shows the HTML and CSS of the current page:
-
+
:::warning Screen adaptations
@@ -61,17 +61,17 @@ While HTML and CSS describe what the browser should display, [JavaScript](https:
In DevTools, the **Console** tab allows ad-hoc experimenting with JavaScript. If you don't see it, press ESC to toggle the Console. Running commands in the Console lets us manipulate the loaded page—we'll try this shortly.
-
+
## Selecting an element
In the top-left corner of DevTools, let's find the icon with an arrow pointing to a square.
-
+
We'll click the icon and hover our cursor over Wikipedia's subtitle, **The Free Encyclopedia**. As we move our cursor, DevTools will display information about the HTML element under it. We'll click on the subtitle. In the **Elements** tab, DevTools will highlight the HTML element that represents the subtitle.
-
+
The highlighted section should look something like this:
@@ -107,7 +107,7 @@ We won't be creating Python scrapers just yet. Let's first get familiar with wha
In the **Elements** tab, with the subtitle element highlighted, let's right-click the element to open the context menu. There, we'll choose **Store as global variable**. The **Console** should appear, with a `temp1` variable ready.
-
+
The Console allows us to run JavaScript in the context of the loaded page, similar to Python's [interactive REPL](https://realpython.com/interacting-with-python/). We can use it to play around with elements.
@@ -131,7 +131,7 @@ temp1.textContent = 'Hello World!';
When we change elements in the Console, those changes reflect immediately on the page!
-
+
But don't worry—we haven't hacked Wikipedia. The change only happens in our browser. If we reload the page, the change will disappear. This, however, is an easy way to craft a screenshot with fake content. That's why screenshots shouldn't be trusted as evidence.
@@ -160,7 +160,7 @@ You're looking for an [`img`](https://developer.mozilla.org/en-US/docs/Web/HTML/
1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
1. In the console, type `temp1.src` and hit **Enter**.
- 
+ 
@@ -177,6 +177,6 @@ Open a news website, such as [CNN](https://cnn.com). Use the Console to change t
1. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
1. In the console, type `temp1.textContent = 'Something something'` and hit **Enter**.
- 
+ 
diff --git a/sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md b/sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md
index 154c7d1a19..515cf1f5e1 100644
--- a/sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md
+++ b/sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md
@@ -5,7 +5,7 @@ description: Lesson about using the browser tools for developers to manually fin
slug: /scraping-basics-python/devtools-locating-elements
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website.**
@@ -29,17 +29,17 @@ That said, we designed all the additional exercises to work with live websites.
As mentioned in the previous lesson, before building a scraper, we need to understand the structure of the target page and identify the specific elements our program should extract. Let's figure out how to select details for each product on the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales).
-
+
The page displays a grid of product cards, each showing a product's title and picture. Let's open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. We'll highlight it in the **Elements** tab by clicking on it.
-
+
Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more.
In the **Elements** tab, we'll move our cursor up from the `a` element containing the subwoofer's title. On the way, we'll hover over each element until we highlight the entire product card. Alternatively, we can use the arrow-up key. The `div` element we land on is the **parent element**, and all nested elements are its **child elements**.
-
+
At this stage, we could use the **Store as global variable** option to send the element to the **Console**. While helpful for manual inspection, this isn't something a program can do.
@@ -65,7 +65,7 @@ document.querySelector('.product-item');
It will return the HTML element for the first product card in the listing:
-
+
CSS selectors can get quite complex, but the basics are enough to scrape most of the Warehouse store. Let's cover two simple types and how they can combine.
@@ -115,13 +115,13 @@ The product card has four classes: `product-item`, `product-item--vertical`, `1/
This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, we can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after.
-
+
## Locating all product cards
In the **Console**, hovering our cursor over objects representing HTML elements highlights the corresponding elements on the page. This way we can verify that when we query `.product-item`, the result represents the JBL Flip speaker—the first product card in the list.
-
+
But what if we want to scrape details about the Sony subwoofer we inspected earlier? For that, we need a method that selects more than just the first match: [`querySelectorAll()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll). As the name suggests, it takes a CSS selector string and returns all matching HTML elements. Let's type this into the **Console**:
@@ -133,7 +133,7 @@ The returned value is a [`NodeList`](https://developer.mozilla.org/en-US/docs/We
We'll expand the result by clicking the small arrow, then hover our cursor over the third element in the list. Indexing starts at 0, so the third element is at index 2. There it is—the product card for the subwoofer!
-
+
To save the subwoofer in a variable for further inspection, we can use index access with brackets, just like with Python lists (or JavaScript arrays):
@@ -152,7 +152,7 @@ Even though we're just playing with JavaScript in the browser's **Console**, we'
On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use CSS selectors in the **Console** to list the HTML elements representing headings of the colored boxes (including the grey ones).
-
+
Solution
@@ -170,7 +170,7 @@ On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use
Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) category. In the **Console**, use CSS selectors to list all HTML elements representing the products.
-
+
Solution
@@ -195,7 +195,7 @@ Learn about the [descendant combinator](https://developer.mozilla.org/en-US/docs
:::
-
+
Solution
diff --git a/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md b/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md
index 43fb6264f3..f864362f8a 100644
--- a/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md
+++ b/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md
@@ -5,7 +5,7 @@ description: Lesson about using the browser tools for developers to manually ext
slug: /scraping-basics-python/devtools-extracting-data
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.**
@@ -30,7 +30,7 @@ subwoofer.textContent;
That indeed outputs all the text, but in a form which would be hard to break down into relevant pieces.
-
+
We'll need to first locate relevant child elements and extract the data from each of them individually.
@@ -38,7 +38,7 @@ We'll need to first locate relevant child elements and extract the data from eac
We'll use the **Elements** tab of DevTools to inspect all child elements of the product card for the Sony subwoofer. We can see that the title of the product is inside an `a` element with several classes. From those, the `product-item__title` class seems like a great choice to locate the element.
-
+
JavaScript represents HTML elements as [Element](https://developer.mozilla.org/en-US/docs/Web/API/Element) objects. In addition to properties we've already played with, such as `textContent` or `outerHTML`, it also has the [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Element/querySelector) method. Here the method looks for matches only within children of the element:
@@ -49,13 +49,13 @@ title.textContent;
Notice we're calling `querySelector()` on the `subwoofer` variable, not `document`. And just like this, we've scraped our first piece of data! We've extracted the product title:
-
+
## Extracting price
To figure out how to get the price, we'll use the **Elements** tab of DevTools again. We notice there are two prices, a regular price and a sale price. For the purposes of watching prices we'll need the sale price. Both are `span` elements with the `price` class.
-
+
We could rely either on the fact that the sale price is likely always the one which is highlighted, or that it's always the first price. For now we'll rely on the latter and let `querySelector()` simply return the first result:
@@ -66,7 +66,7 @@ price.textContent;
It works, but the price isn't alone in the result. Before using such data, we'd need to do some **data cleaning**:
-
+
But for now that's okay. We're just testing the waters, so that we have an idea about what our scraper will need to do. Once we get to extracting prices in Python, we'll figure out how to get the values as numbers.
@@ -99,7 +99,7 @@ At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/a
On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selectors and HTML element manipulation in the **Console** to extract the name of the top wiki. Use JavaScript's [`trim()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim) method to remove white space around the name.
-
+
Solution
@@ -118,7 +118,7 @@ On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selecto
On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and the URL of the associated photo.
-
+
Solution
diff --git a/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md b/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md
index e0f2304e61..e3866cfcb2 100644
--- a/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md
+++ b/sources/academy/webscraping/scraping_basics_python/04_downloading_html.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/downloading-html
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll start building a Python application for watching prices. As a first step, we'll use the HTTPX library to download HTML code of a product listing page.**
diff --git a/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md b/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md
index 80a4974f79..dbfa52cb9a 100644
--- a/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md
+++ b/sources/academy/webscraping/scraping_basics_python/05_parsing_html.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/parsing-html
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll look for products in the downloaded HTML. We'll use BeautifulSoup to turn the HTML into objects which we can work with in our Python program.**
@@ -13,7 +13,7 @@ import Exercises from './_exercises.mdx';
From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`.
-
+
As a first step, let's try counting how many products are on the listing page.
@@ -42,7 +42,7 @@ Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.
-
+
We'll update our code to the following:
diff --git a/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md b/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md
index fa8a38fc6d..0708dc071e 100644
--- a/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md
+++ b/sources/academy/webscraping/scraping_basics_python/06_locating_elements.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/locating-elements
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll locate product data in the downloaded HTML. We'll use BeautifulSoup to find those HTML elements which contain details about each product, such as title or price.**
@@ -60,7 +60,7 @@ To get details about each product in a structured way, we'll need a different ap
As in the browser DevTools lessons, we need to change the code so that it locates child elements for each product card.
-
+
We should be looking for elements which have the `product-item__title` and `price` classes. We already know how that translates to CSS selectors:
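Putting those selectors to work for each product card might look like this sketch:

```py
for product in soup.select(".product-item"):
    # Locate the child elements within this particular card only.
    title = product.select_one(".product-item__title")
    price = product.select_one(".price")
    print(title.text.strip(), price.text.strip())
```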
diff --git a/sources/academy/webscraping/scraping_basics_python/07_extracting_data.md b/sources/academy/webscraping/scraping_basics_python/07_extracting_data.md
index 01814edde9..eb49b7ce69 100644
--- a/sources/academy/webscraping/scraping_basics_python/07_extracting_data.md
+++ b/sources/academy/webscraping/scraping_basics_python/07_extracting_data.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/extracting-data
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson we'll finish extracting product data from the downloaded HTML. With the help of basic string manipulation, we'll focus on cleaning and correctly representing the product price.**
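For example, turning a price string into a number could go roughly like this (the exact label text on the page is an assumption):

```py
from decimal import Decimal

def parse_price(text):
    # Strip the label, the currency sign, and thousands separators.
    # The 'Sale price' and 'From' prefixes are assumptions about the page text.
    cleaned = (
        text
        .removeprefix("Sale price")
        .removeprefix("From")
        .strip()
        .removeprefix("$")
        .replace(",", "")
    )
    return Decimal(cleaned)

print(parse_price("Sale price$1,398.00"))  # 1398.00
```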
diff --git a/sources/academy/webscraping/scraping_basics_python/08_saving_data.md b/sources/academy/webscraping/scraping_basics_python/08_saving_data.md
index 8c3ddedc31..a0d6d94743 100644
--- a/sources/academy/webscraping/scraping_basics_python/08_saving_data.md
+++ b/sources/academy/webscraping/scraping_basics_python/08_saving_data.md
@@ -147,7 +147,7 @@ In the CSV format, if a value contains commas, we should enclose it in quotes. W
When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have.
-
+
Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:
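The writing itself then fits in a few lines with `csv.DictWriter`; a sketch, assuming `data` is a list of dicts:

```py
import csv

# 'data' is assumed to be a list of dicts, e.g.
# [{"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00"}, ...]
with open("products.csv", "w") as file:
    writer = csv.DictWriter(file, fieldnames=["title", "min_price"])
    writer.writeheader()
    for row in data:
        writer.writerow(row)
```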
@@ -174,7 +174,7 @@ with open("products.csv", "w") as file:
The program should now also produce a CSV file with the following content:
-
+
We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.
@@ -218,6 +218,6 @@ Open the `products.csv` file we created in the lesson using a spreadsheet applic
1. Select the header row. Go to **Data > Create filter**.
1. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.
- 
+ 
diff --git a/sources/academy/webscraping/scraping_basics_python/09_getting_links.md b/sources/academy/webscraping/scraping_basics_python/09_getting_links.md
index 6da32e836d..883ba050f3 100644
--- a/sources/academy/webscraping/scraping_basics_python/09_getting_links.md
+++ b/sources/academy/webscraping/scraping_basics_python/09_getting_links.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/getting-links
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson, we'll locate and extract links to individual product pages. We'll use BeautifulSoup to find the relevant bits of HTML.**
@@ -204,7 +204,7 @@ The program is much easier to read now. With the `parse_product()` function hand
We turned the whole program upside down, and at the same time, we didn't make any actual changes! This is [refactoring](https://en.wikipedia.org/wiki/Code_refactoring): improving the structure of existing code without changing its behavior.
-
+
:::
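For illustration, the refactored shape mentioned in the note above might look like this sketch (`download_listing()` is a hypothetical helper, not the lesson's exact code):

```py
def parse_product(product):
    # Turn one product card into a dictionary of extracted details.
    title = product.select_one(".product-item__title").text.strip()
    price = product.select_one(".price").text.strip()
    return {"title": title, "price": price}

# The main flow then reads as a short pipeline:
soup = download_listing()  # hypothetical helper returning a BeautifulSoup object
data = [parse_product(product) for product in soup.select(".product-item")]
```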
@@ -212,7 +212,7 @@ We turned the whole program upside down, and at the same time, we didn't make an
With everything in place, we can now start working on a scraper that also scrapes the product pages. For that, we'll need the links to those pages. Let's open the browser DevTools and remind ourselves of the structure of a single product item:
-
+
Several methods exist for transitioning from one page to another, but the most common is a link element, which looks like this:
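Once we have that element, extracting the target URL is a matter of reading its `href` attribute and resolving it against the page URL; a sketch (the selector is an assumption about the markup):

```py
from urllib.parse import urljoin

# 'product' is one product card; 'listing_url' is the URL of the listing page.
# The 'a.product-item__title' selector is an assumption about the markup.
link = product.select_one("a.product-item__title")
product_url = urljoin(listing_url, link["href"])  # resolves relative links
```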
diff --git a/sources/academy/webscraping/scraping_basics_python/10_crawling.md b/sources/academy/webscraping/scraping_basics_python/10_crawling.md
index a18ee39632..836dadad3a 100644
--- a/sources/academy/webscraping/scraping_basics_python/10_crawling.md
+++ b/sources/academy/webscraping/scraping_basics_python/10_crawling.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/crawling
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson, we'll follow links to individual product pages. We'll use HTTPX to download them and BeautifulSoup to process them.**
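The crawling pattern itself is a loop over the collected URLs; a sketch:

```py
import httpx
from bs4 import BeautifulSoup

# 'product_urls' stands for the list of links collected in the previous lesson.
for product_url in product_urls:
    response = httpx.get(product_url)
    response.raise_for_status()
    product_soup = BeautifulSoup(response.text, "html.parser")
    # ...extract details from the product detail page here...
```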
@@ -81,7 +81,7 @@ with open("products.csv", "w") as file:
Each product URL points to a so-called _product detail page_, or PDP. If we open one of the product URLs in the browser, e.g. the one about [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), we can see that it contains a vendor name, [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit), number of reviews, product images, product variants, stock availability, description, and perhaps more.
-
+
Depending on what's valuable for our use case, we can now use the same techniques as in previous lessons to extract any of the above. As a demonstration, let's scrape the vendor name. In browser DevTools, we can see that the HTML around the vendor name has the following structure:
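Extracting it then follows the familiar pattern; a sketch (the `product-meta__vendor` class is an assumption about the theme's markup):

```py
# 'soup' is assumed to be a BeautifulSoup object of a product detail page.
vendor = soup.select_one(".product-meta__vendor").text.strip()
print(vendor)  # e.g. 'Sony'
```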
@@ -172,7 +172,7 @@ Scraping the vendor's name is nice, but the main reason we started checking the
Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs…
-
+
In the next lesson, we'll scrape the product detail pages so that each product variant is represented as a separate item in our dataset.
diff --git a/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md b/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
index cdd3496af6..e47affbaec 100644
--- a/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
+++ b/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/scraping-variants
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson, we'll scrape the product detail pages to represent each product variant as a separate item in our dataset.**
@@ -38,7 +38,7 @@ First, let's extract information about the variants. If we go to [Sony XBR-950G
Nice! We can extract the variant names, but we also need to extract the price for each variant. Switching the variants using the buttons shows us that the HTML changes dynamically. This means the page uses JavaScript to display this information.
-
+
If we can't find a workaround, we'd need our scraper to run JavaScript. That's not impossible. Scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Beautiful Soup as much as possible.
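One workaround worth checking on Shopify-based stores is that variant data is often embedded in the page as JSON inside a script tag, so plain Beautiful Soup can still reach it. A sketch of the idea (both the selector and the JSON keys are assumptions, not the lesson's confirmed approach):

```py
import json

# Shopify themes often ship variant data as embedded JSON.
# The selector and the keys below are assumptions.
script = soup.select_one("script[data-product-json]")
if script:
    product_data = json.loads(script.text)
    for variant in product_data["variants"]:
        print(variant["title"], variant["price"])
```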
diff --git a/sources/academy/webscraping/scraping_basics_python/12_framework.md b/sources/academy/webscraping/scraping_basics_python/12_framework.md
index 8f64594f92..6f8861785d 100644
--- a/sources/academy/webscraping/scraping_basics_python/12_framework.md
+++ b/sources/academy/webscraping/scraping_basics_python/12_framework.md
@@ -5,7 +5,7 @@ description: Lesson about building a Python application for watching prices. Usi
slug: /scraping-basics-python/framework
---
-import Exercises from './_exercises.mdx';
+import Exercises from '../scraping_basics/_exercises.mdx';
**In this lesson, we'll rework our application for watching prices so that it builds on top of a scraping framework. We'll use Crawlee to make the program simpler, faster, and more robust.**
@@ -321,7 +321,7 @@ async def main():
That's it! If we run the program now, there should be a `storage` directory alongside the `main.py` file. Crawlee uses it to store its internal state. If we go to the `storage/datasets/default` subdirectory, we'll see over 30 JSON files, each representing a single item.
-
+
We can also export all the items to a single file of our choice. We'll do it at the end of the `main()` function, after the crawler has finished scraping:
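In Crawlee for Python, this can be a single call; a sketch (the file name is arbitrary, and the format is inferred from the extension):

```py
# At the end of main(), once crawler.run() has finished:
await crawler.export_data("dataset.csv")
```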
diff --git a/sources/academy/webscraping/scraping_basics_python/13_platform.md b/sources/academy/webscraping/scraping_basics_python/13_platform.md
index 23f042a048..7496a63661 100644
--- a/sources/academy/webscraping/scraping_basics_python/13_platform.md
+++ b/sources/academy/webscraping/scraping_basics_python/13_platform.md
@@ -84,7 +84,7 @@ The file contains a single asynchronous function, `main()`. At the beginning, it
Every program that runs on the Apify platform first needs to be packaged as a so-called [Actor](https://docs.apify.com/platform/actors)—a standardized container with designated places for input and output. Crawlee scrapers automatically connect their default dataset to the Actor output, but input must be handled explicitly in the code.
-
+
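Reading that input in code goes through the Apify SDK; a minimal sketch:

```py
from apify import Actor

async def main():
    async with Actor:
        # Input must be read explicitly; it arrives as a dictionary.
        actor_input = await Actor.get_input() or {}
        Actor.log.info(f"Actor input: {actor_input}")
```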
We'll now adjust the template so that it runs our program for watching prices. As the first step, we'll create a new empty file, `crawler.py`, inside the `warehouse-watchdog/src` directory. Then, we'll fill this file with the final, unchanged code from the previous lesson:
@@ -258,11 +258,11 @@ Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0.
After opening the link in our browser, assuming we're logged in, we should see the **Source** screen on the Actor's detail page. We'll go to the **Input** tab of that screen. We won't change anything—just hit **Start**, and we should see logs similar to what we see locally, but this time our scraper will be running in the cloud.
-
+
When the run finishes, the interface will turn green. On the **Output** tab, we can preview the results as a table or JSON. We can even export the data to formats like CSV, XML, Excel, RSS, and more.
-
+
:::info Accessing data
@@ -276,7 +276,7 @@ Now that our scraper is deployed, let's automate its execution. In the Apify web
From now on, the Actor will execute daily. We can inspect each run, view logs, check collected data, [monitor stats and charts](https://docs.apify.com/platform/monitoring), and even set up alerts.
-
+
## Adding support for proxies
@@ -391,7 +391,7 @@ Run: Building Actor warehouse-watchdog
Back in the Apify console, we'll go to the **Source** screen and switch to the **Input** tab. We should see the new **Proxy config** option, which defaults to **Datacenter - Automatic**.
-
+
We'll leave it as is and click **Start**. This time, the logs should show `Using proxy: yes`, as the scraper uses proxies provided by the platform:
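On the code side, wiring that input to an actual proxy configuration might look like this sketch (the `proxyConfig` field name and the `actor_proxy_input` parameter are assumptions):

```py
from apify import Actor

async def main():
    async with Actor:
        actor_input = await Actor.get_input() or {}
        # Build a proxy configuration from the Actor's proxy input field.
        proxy_config = await Actor.create_proxy_configuration(
            actor_proxy_input=actor_input.get("proxyConfig"),
        )
```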
diff --git a/sources/academy/webscraping/scraping_basics_python/_exercises.mdx b/sources/academy/webscraping/scraping_basics_python/_exercises.mdx
deleted file mode 100644
index ba254f4022..0000000000
--- a/sources/academy/webscraping/scraping_basics_python/_exercises.mdx
+++ /dev/null
@@ -1,10 +0,0 @@
-
-## Exercises
-
-These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!
-
-:::caution Real world
-
-You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues).
-
-:::
diff --git a/sources/academy/webscraping/scraping_basics_python/index.md b/sources/academy/webscraping/scraping_basics_python/index.md
index 4de160a3a3..6ef1e6d78d 100644
--- a/sources/academy/webscraping/scraping_basics_python/index.md
+++ b/sources/academy/webscraping/scraping_basics_python/index.md
@@ -14,7 +14,7 @@ import DocCardList from '@theme/DocCardList';
In this course we'll use Python to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, etc.
-
+
## What we'll do