Commit c169871

Merge branch 'apify:master' into feat/agno-integration
2 parents e6a9577 + 401f243 commit c169871

42 files changed, +2773 -1489 lines changed

.github/styles/config/vocabularies/Docs/accept.txt

Lines changed: 2 additions & 1 deletion
@@ -88,19 +88,20 @@ preconfigured

 devs
 asyncio
-Langflow
 backlinks?
 captchas?
 Chatbot
 combinator
 deduplicating
+dev
 Fakestore
 Fandom('s)?
 IMDb
 influencers
 iPads?
 iPhones?
 jQuery
+Langflow
 learnings
 livestreams
 outro

package-lock.json

Lines changed: 2291 additions & 1403 deletions

package.json

Lines changed: 2 additions & 2 deletions
@@ -57,7 +57,7 @@
 "fs-extra": "^11.1.1",
 "globals": "^16.0.0",
 "globby": "^14.0.0",
-"markdownlint": "^0.37.0",
+"markdownlint": "^0.38.0",
 "markdownlint-cli": "^0.44.0",
 "path-browserify": "^1.0.1",
 "patch-package": "^8.0.0",
@@ -66,7 +66,7 @@
 "typescript-eslint": "^8.29.1"
 },
 "dependencies": {
-"@apify/ui-library": "^0.65.0",
+"@apify/ui-library": "^0.66.0",
 "@docusaurus/core": "3.7.0",
 "@docusaurus/faster": "3.7.0",
 "@docusaurus/plugin-client-redirects": "3.7.0",

sources/academy/webscraping/scraping_basics_python/01_devtools_inspecting.md

Lines changed: 0 additions & 1 deletion
@@ -2,7 +2,6 @@
 title: Inspecting web pages with browser DevTools
 sidebar_label: "DevTools: Inspecting"
 description: Lesson about using the browser tools for developers to inspect and manipulate the structure of an e-commerce website.
-sidebar_position: 1
 slug: /scraping-basics-python/devtools-inspecting
 ---

sources/academy/webscraping/scraping_basics_python/02_devtools_locating_elements.md

Lines changed: 8 additions & 15 deletions
@@ -2,7 +2,6 @@
 title: Locating HTML elements on a web page with browser DevTools
 sidebar_label: "DevTools: Locating HTML elements"
 description: Lesson about using the browser tools for developers to manually find products on an e-commerce website.
-sidebar_position: 2
 slug: /scraping-basics-python/devtools-locating-elements
 ---

@@ -32,13 +31,13 @@ As mentioned in the previous lesson, before building a scraper, we need to under

 ![Warehouse store with DevTools open](./images/devtools-warehouse.png)

-The page displays a grid of product cards, each showing a product's name and picture. Open DevTools and locate the name of the **Sony SACS9 Active Subwoofer**. Highlight it in the **Elements** tab by clicking on it.
+The page displays a grid of product cards, each showing a product's title and picture. Open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. Highlight it in the **Elements** tab by clicking on it.

-![Selecting an element with DevTools](./images/devtools-product-name.png)
+![Selecting an element with DevTools](./images/devtools-product-title.png)

 Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more.

-In the **Elements** tab, move your cursor up from the `a` element containing the subwoofer's name. On the way, hover over each element until you highlight the entire product card. Alternatively, use the arrow-up key. The `div` element you land on is the **parent element**, and all nested elements are its **child elements**.
+In the **Elements** tab, move your cursor up from the `a` element containing the subwoofer's title. On the way, hover over each element until you highlight the entire product card. Alternatively, use the arrow-up key. The `div` element you land on is the **parent element**, and all nested elements are its **child elements**.

 ![Selecting an element with hover](./images/devtools-hover-product.png)

@@ -66,13 +65,7 @@ document.querySelector('.product-item');

 It will return the HTML element for the first product card in the listing:

-![Using querySelector() in DevTools Console](./images/devtools-queryselector.png)
-
-:::note About the missing semicolon
-
-In the screenshot, there is a missing semicolon `;` at the end of the line. In JavaScript, semicolons are optional, so it doesn't make a difference here.
-
-:::
+![Using querySelector() in DevTools Console](./images/devtools-queryselector.webp)

 CSS selectors can get quite complex, but the basics are enough to scrape most of the Warehouse store. Let's cover two simple types and how they can combine.

@@ -167,9 +160,9 @@ On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use
 1. Open the [Main Page](https://en.wikipedia.org/wiki/Main_Page).
 1. Activate the element selection tool in your DevTools.
 1. Click on several headings to examine the markup.
-1. Notice that all headings are `h2` tags with the `mp-h2` class.
+1. Notice that all headings are `h2` elements with the `mp-h2` class.
 1. In the **Console**, execute `document.querySelectorAll('h2')`.
-1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` tags on the page. Thus, the selector is sufficient as is.
+1. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` elements on the page. Thus, the selector is sufficient as is.

 </details>

@@ -185,7 +178,7 @@ Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewel
 1. Visit the [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) page. Close any pop-ups or promotions.
 1. Activate the element selection tool in your DevTools.
 1. Click on the first product to inspect its markup. Repeat with a few others.
-1. Observe that all products are `section` tags with multiple classes, including `product-card`.
+1. Observe that all products are `section` elements with multiple classes, including `product-card`.
 1. Since `section` is a generic wrapper, focus on the `product-card` class.
 1. In the **Console**, execute `document.querySelectorAll('.product-card')`.
 1. At the time of writing, this selector returns 120 results, all representing products. No further narrowing is necessary.
@@ -206,7 +199,7 @@ Hint: Learn about the [descendant combinator](https://developer.mozilla.org/en-U
 1. Open the [page about F1](https://www.theguardian.com/sport/formulaone).
 1. Activate the element selection tool in your DevTools.
 1. Click on an article to inspect its structure. Check several articles, including the ones with smaller cards.
-1. Note that all articles are `li` tags, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable.
+1. Note that all articles are `li` elements, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable.
 1. Using `document.querySelectorAll('li')` returns too many results, including unrelated items like navigation links.
 1. Inspect the page structure. The `main` element contains the primary content, including articles. Use the descendant combinator to target `li` elements within `main`.
 1. In the **Console**, execute `document.querySelectorAll('main li')`.
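
Note on this hunk: the difference between the `li` and `main li` selectors can be sketched in Python. This is only an illustration under stated assumptions, not part of the commit: the HTML fragment and the `MainListItemCounter` class are hypothetical, and the standard library's `html.parser` stands in for a real CSS selector engine.

```python
from html.parser import HTMLParser

# Hypothetical markup: one list in the navigation, one in the main content.
SAMPLE_HTML = """
<nav><ul><li>Home</li><li>Sport</li></ul></nav>
<main><ul><li>First article</li><li>Second article</li></ul></main>
"""


class MainListItemCounter(HTMLParser):
    """Counts all `li` elements and those with a `main` ancestor."""

    def __init__(self):
        super().__init__()
        self.main_depth = 0  # number of currently open `main` elements
        self.total_li = 0    # what the selector `li` would match
        self.li_in_main = 0  # what the selector `main li` would match

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self.main_depth += 1
        elif tag == "li":
            self.total_li += 1
            if self.main_depth > 0:
                self.li_in_main += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth > 0:
            self.main_depth -= 1


counter = MainListItemCounter()
counter.feed(SAMPLE_HTML)
print(counter.total_li, counter.li_in_main)  # prints: 4 2
```

The descendant combinator matches at any nesting depth, which is why the sketch tracks whether a `main` element is open rather than checking only the direct parent.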

sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md

Lines changed: 1 addition & 2 deletions
@@ -2,7 +2,6 @@
 title: Extracting data from a web page with browser DevTools
 sidebar_label: "DevTools: Extracting data"
 description: Lesson about using the browser tools for developers to manually extract product data from an e-commerce website.
-sidebar_position: 3
 slug: /scraping-basics-python/devtools-extracting-data
 ---

@@ -127,7 +126,7 @@ On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone),
 1. Open the [F1 news page](https://www.theguardian.com/sport/formulaone).
 1. Activate the element selection tool in your DevTools.
 1. Click on the first post.
-1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tags and randomized classes, requiring you to rely on the element hierarchy and order instead.
+1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead.
 1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post.
 1. Extract the post's title by executing `post.querySelector('h3').textContent`.
 1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`.

sources/academy/webscraping/scraping_basics_python/04_downloading_html.md

Lines changed: 0 additions & 1 deletion
@@ -2,7 +2,6 @@
 title: Downloading HTML with Python
 sidebar_label: Downloading HTML
 description: Lesson about building a Python application for watching prices. Using the HTTPX library to download HTML code of a product listing page.
-sidebar_position: 4
 slug: /scraping-basics-python/downloading-html
 ---

sources/academy/webscraping/scraping_basics_python/05_parsing_html.md

Lines changed: 7 additions & 8 deletions
@@ -2,7 +2,6 @@
 title: Parsing HTML with Python
 sidebar_label: Parsing HTML
 description: Lesson about building a Python application for watching prices. Using the Beautiful Soup library to parse HTML code of a product listing page.
-sidebar_position: 5
 slug: /scraping-basics-python/parsing-html
 ---

@@ -12,7 +11,7 @@ import Exercises from './_exercises.mdx';

 ---

-From lessons about browser DevTools we know that the HTML tags representing individual products have a `class` attribute which, among other values, contains `product-item`.
+From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`.

 ![Products have the ‘product-item’ class](./images/product-item.png)

@@ -38,9 +37,9 @@ $ pip install beautifulsoup4
 Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0
 ```

-Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` tag, which represents the main heading of the page.
+Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page.

-![Tag of the main heading](./images/h1.png)
+![Element of the main heading](./images/h1.png)

 Update your code to the following:

@@ -64,15 +63,15 @@ $ python main.py
 [<h1 class="collection__title heading h1">Sales</h1>]
 ```

-Our code lists all `<h1>` tags it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:
+Our code lists all `h1` elements it can find on the page. It's the case that there's just one, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following:

 ```py
 headings = soup.select("h1")
 first_heading = headings[0]
 print(first_heading.text)
 ```

-If we run our scraper again, it prints the text of the first `<h1>` tag:
+If we run our scraper again, it prints the text of the first `h1` element:

 ```text
 $ python main.py
@@ -133,7 +132,7 @@ https://www.formula1.com/en/teams

 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".outline")))
+print(len(soup.select(".group")))
 ```

 </details>
@@ -155,7 +154,7 @@ Use the same URL as in the previous exercise, but this time print a total count

 html_code = response.text
 soup = BeautifulSoup(html_code, "html.parser")
-print(len(soup.select(".f1-grid")))
+print(len(soup.select(".f1-team-driver-name")))
 ```

 </details>
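
Note on this file's hunks: the lesson text says the `class` attribute contains `product-item` "among other values", and the exercise solutions count matches of a class selector. A minimal sketch of that matching rule follows. It assumes a hypothetical HTML fragment and uses the standard library's `html.parser` rather than Beautiful Soup, so the `ClassCounter` class and `count_by_class()` helper are illustrative names, not part of the lesson.

```python
from html.parser import HTMLParser

# Hypothetical markup; class names other than `product-item` are made up.
SAMPLE_HTML = """
<div class="product-item product-item--vertical">A</div>
<div class="product-item">B</div>
<div class="product-item-banner">C</div>
"""


class ClassCounter(HTMLParser):
    """Counts elements whose `class` attribute lists the given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The `class` attribute is a whitespace-separated list of values.
        class_value = dict(attrs).get("class") or ""
        if self.class_name in class_value.split():
            self.count += 1


def count_by_class(html_code, class_name):
    counter = ClassCounter(class_name)
    counter.feed(html_code)
    return counter.count


print(count_by_class(SAMPLE_HTML, "product-item"))  # prints: 2
```

The third `div` is not counted because a class selector compares whole values from the list, not substrings of the attribute.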

sources/academy/webscraping/scraping_basics_python/06_locating_elements.md

Lines changed: 0 additions & 1 deletion
@@ -2,7 +2,6 @@
 title: Locating HTML elements with Python
 sidebar_label: Locating HTML elements
 description: Lesson about building a Python application for watching prices. Using the Beautiful Soup library to locate products on the product listing page.
-sidebar_position: 6
 slug: /scraping-basics-python/locating-elements
 ---

sources/academy/webscraping/scraping_basics_python/07_extracting_data.md

Lines changed: 1 addition & 2 deletions
@@ -2,7 +2,6 @@
 title: Extracting data from HTML with Python
 sidebar_label: Extracting data from HTML
 description: Lesson about building a Python application for watching prices. Using string manipulation to extract and clean data scraped from the product listing page.
-sidebar_position: 7
 slug: /scraping-basics-python/extracting-data
 ---

@@ -313,7 +312,7 @@ Max Verstappen wins Canadian Grand Prix: F1 – as it happened 2024-06-09

 Hints:

-- HTML's `<time>` tag can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as the ISO 8601.
+- HTML's `time` element can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as the ISO 8601.
 - Beautiful Soup gives you [access to attributes as if they were dictionary keys](https://beautiful-soup-4.readthedocs.io/en/latest/#attributes).
 - In Python you can create `datetime` objects using `datetime.fromisoformat()`, a [built-in method for parsing ISO 8601 strings](https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat).
 - To get just the date part, you can call `.date()` on any `datetime` object.
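
Note on this hunk: the hints can be sketched in Python as follows. This is a minimal illustration, not the lesson's solution. It uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without dependencies, and the HTML fragment, the `TimeCollector` class, and the `extract_dates()` helper are hypothetical.

```python
from datetime import datetime
from html.parser import HTMLParser

# Hypothetical markup; the real page structure may differ.
SAMPLE_HTML = (
    "<li><h3>Example headline</h3>"
    '<time datetime="2024-06-09T18:00:00+00:00">9 Jun 2024</time></li>'
)


class TimeCollector(HTMLParser):
    """Collects the `datetime` attribute of every `time` element."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "time":
            # attrs is a list of (name, value) pairs.
            value = dict(attrs).get("datetime")
            if value is not None:
                self.values.append(value)


def extract_dates(html_code):
    collector = TimeCollector()
    collector.feed(html_code)
    # Parse each ISO 8601 string and keep only the date part.
    return [datetime.fromisoformat(value).date() for value in collector.values]


dates = extract_dates(SAMPLE_HTML)
print(dates[0])  # prints: 2024-06-09
```

One caveat worth checking against your Python version: `datetime.fromisoformat()` accepts a trailing `Z` only from Python 3.11 onward, so the sample uses an explicit `+00:00` offset.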
