
Commit 3d315aa

fix: improve English
1 parent 36262de

sources/academy/webscraping/scraping_basics_python/09_getting_links.md

Lines changed: 17 additions & 17 deletions
@@ -23,9 +23,9 @@ We'll use a technique called crawling, i.e. following links to scrape multiple p
 1. Visit the start URL.
 1. Extract new URLs (and data), and save them.
 1. Visit one of the newly found URLs and save data and/or more URLs from it.
-1. Repeat 2 and 3 until you have everything you need.
+1. Repeat steps 2 and 3 until you have everything you need.

-This will help us figure out the actual prices of products, as right now, for some, we're only getting the min price. Implementing the algorithm would require quite a few changes to our code though.
+This will help us figure out the actual prices of products, as right now, for some, we're only getting the min price. Implementing the algorithm will require quite a few changes to our code, though.

 ## Restructuring code

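For context, the four steps this hunk describes amount to a simple loop. A minimal sketch, assuming hypothetical `download()`, `extract_data()`, and `extract_links()` helpers (none of this is part of the commit):

```py
def crawl(start_url):
    # download(), extract_data(), and extract_links() are hypothetical helpers
    to_visit = [start_url]
    visited = set()
    data = []
    while to_visit:
        url = to_visit.pop()  # steps 1 and 3: visit a URL
        if url in visited:
            continue
        visited.add(url)
        soup = download(url)
        data.extend(extract_data(soup))       # step 2: save data
        to_visit.extend(extract_links(soup))  # step 2: save new URLs
    return data  # step 4: the loop repeats until nothing is left to visit
```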
@@ -81,7 +81,7 @@ with open("products.json", "w") as file:
     json.dump(data, file, default=serialize)
 ```

-Let's introduce several functions which will make the whole thing easier to digest. First, we can turn the beginning of our program into this `download()` function, which takes a URL and returns a `BeautifulSoup` instance:
+Let's introduce several functions to make the whole thing easier to digest. First, we can turn the beginning of our program into this `download()` function, which takes a URL and returns a `BeautifulSoup` instance:

 ```py
 def download(url):
@@ -92,7 +92,7 @@ def download(url):
     return BeautifulSoup(html_code, "html.parser")
 ```

-Next, we can put parsing to a `parse_product()` function, which takes the product item element, and returns the dictionary with data:
+Next, we can put parsing into a `parse_product()` function, which takes the product item element and returns the dictionary with data:

 ```py
 def parse_product(product):
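The diff context elides the `download()` body between these two hunks. Reconstructed from the visible lines, it plausibly reads as follows (a sketch, not the file's verbatim code):

```py
import httpx
from bs4 import BeautifulSoup

def download(url):
    # fetch the page and fail loudly on HTTP errors
    response = httpx.get(url)
    response.raise_for_status()
    html_code = response.text
    return BeautifulSoup(html_code, "html.parser")
```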
@@ -116,7 +116,7 @@ def parse_product(product):
     return {"title": title, "min_price": min_price, "price": price}
 ```

-Now the CSV export. We'll make a small change here. Having to specify the field names here is not ideal. What if we add more field names in the parsing function? We'd have to always remember we need to go and edit the export function as well. If we could figure out the field names in place, we'd remove this dependency. One way would be to infer the field names from dictionary keys of the first row:
+Now the CSV export. We'll make a small change here. Having to specify the field names is not ideal. What if we add more field names in the parsing function? We'd always have to remember to go and edit the export function as well. If we could figure out the field names in place, we'd remove this dependency. One way would be to infer the field names from the dictionary keys of the first row:

 ```py
 def export_csv(file, data):
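A minimal sketch of the field-name inference the new wording describes, assuming `data` is a non-empty list of dictionaries that all share the same keys:

```py
import csv

def export_csv(file, data):
    # infer the columns from the keys of the first row
    fieldnames = list(data[0].keys())
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for row in data:
        writer.writerow(row)
```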
@@ -130,11 +130,11 @@ def export_csv(file, data):

 :::note Fragile code

-The code above assumes that the `data` variable contains at least one item, and that all the items have the same keys. This isn't robust and could break, but in our program this isn't a problem and omitting these corner cases allows us to keep the code examples more succinct.
+The code above assumes the `data` variable contains at least one item, and that all the items have the same keys. This isn't robust and could break, but in our program, this isn't a problem, and omitting these corner cases allows us to keep the code examples more succinct.

 :::

-Last function we'll add will take care of the JSON export. For better readability of the JSON export, let's make a small change here too, and set the indentation level to two spaces:
+The last function we'll add will take care of the JSON export. For better readability of the JSON export, let's make a small change here too and set the indentation level to two spaces:

 ```py
 def export_json(file, data):
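And the two-space indentation change the hunk refers to, sketched under the assumption that the `serialize` helper from the earlier `json.dump(data, file, default=serialize)` call exists to handle `Decimal` prices:

```py
import json
from decimal import Decimal

def export_json(file, data):
    def serialize(obj):
        # assumed helper: Decimal isn't JSON-serializable by default
        if isinstance(obj, Decimal):
            return str(obj)
        raise TypeError("Object not JSON serializable")

    json.dump(data, file, default=serialize, indent=2)
```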
@@ -179,29 +179,29 @@ with open("products.json", "w") as file:
     export_json(file, data)
 ```

-The program is much easier to read now. With the `parse_product()` function handy we could also replace the convoluted loop with a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).
+The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).

 :::tip Refactoring

-We turned the whole program upside down, and at the same time, we didnt make any actual changes! This is [refactoring](https://en.wikipedia.org/wiki/Code_refactoring): improving the structure of existing code without changing its behavior.
+We turned the whole program upside down, and at the same time, we didn't make any actual changes! This is [refactoring](https://en.wikipedia.org/wiki/Code_refactoring): improving the structure of existing code without changing its behavior.

 ![Refactoring](images/refactoring.gif)

 :::

 ## Extracting links

-With everything in place, we can now start working towards a scraper which scrapes also the product pages. For that we'll need links of those pages. Let's open browser DevTools and remind ourselves about the structure of a single product item:
+With everything in place, we can now start working on a scraper that also scrapes the product pages. For that, we'll need the links to those pages. Let's open the browser DevTools and remind ourselves of the structure of a single product item:

 ![Product card's child elements](./images/child-elements.png)

-Several ways how to transition from one page to another exist, but the most common one is a link tag, which looks like this:
+Several methods exist for transitioning from one page to another, but the most common is a link tag, which looks like this:

 ```html
 <a href="https://example.com">Text of the link</a>
 ```

-In DevTools we can see that each product title is in fact also a link tag. We already locate the titles, so that makes our task easier. We only need to modify the code in a way that it extracts not only the text of the element, but also the `href` attribute. Beautiful Soup elements support accessing attributes as if they were dictionary keys:
+In DevTools, we can see that each product title is, in fact, also a link tag. We already locate the titles, so that makes our task easier. We just need to edit the code so that it extracts not only the text of the element but also the `href` attribute. Beautiful Soup elements support accessing attributes as if they were dictionary keys:

 ```py
 def parse_product(product):
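Both changes in this hunk lend themselves to one-liners. A sketch of the loop-to-comprehension rewrite and of dictionary-style attribute access (the `.product-item` selectors are assumptions based on the surrounding lesson code):

```py
# replacing the convoluted loop with a list comprehension
data = [parse_product(product) for product in soup.select(".product-item")]

# reading an attribute of a Beautiful Soup element like a dictionary key
title_tag = product.select_one(".product-item__title")
url = title_tag["href"]
```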
@@ -214,7 +214,7 @@ def parse_product(product):
     return {"title": title, "min_price": min_price, "price": price, "url": url}
 ```

-In the code above we've also already added the URL to the dictionary returned by the function. If we run the scraper now, it should produce exports where each product contains also a link to its product page:
+In the code above, we've also added the URL to the dictionary returned by the function. If we run the scraper now, it should produce exports where each product contains a link to its product page:

 <!-- eslint-skip -->
 ```json title=products.json
@@ -235,11 +235,11 @@ In the code above we've also already added the URL to the dictionary returned by
 ]
 ```

-Hmm, but that isn't what we wanted! Where is the beginning of each URL? It turns out the HTML contains so called relative links.
+Hmm, but that isn't what we wanted! Where is the beginning of each URL? It turns out the HTML contains so-called relative links.

 ## Turning relative links into absolute

-Browsers reading the HTML know the base address and automatically resolve such links, but we'll have to do this manually. Function [`urljoin`](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin) from the Python's standard library will help us. Let's add it to our imports first:
+Browsers reading the HTML know the base address and automatically resolve such links, but we'll have to do this manually. The function [`urljoin`](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin) from Python's standard library will help us. Let's add it to our imports first:

 ```py
 import httpx
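A quick illustration of what `urljoin` does with a relative link (example values only):

```py
from urllib.parse import urljoin

base_url = "https://example.com/collections/sales"
print(urljoin(base_url, "/products/some-product"))
# https://example.com/products/some-product
```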
@@ -251,7 +251,7 @@ import json
 from urllib.parse import urljoin
 ```

-Next, we'll change the `parse_product()` function so that it also takes the base URL as an argument, and then joins it with the relative URL to the product page:
+Next, we'll change the `parse_product()` function so that it also takes the base URL as an argument and then joins it with the relative URL to the product page:

 ```py
 # highlight-next-line
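The changed `parse_product()` signature is cut off by the hunk boundary; the shape the new wording implies is roughly this (the selector and the returned fields are partly assumptions):

```py
def parse_product(product, base_url):
    # the selector is an assumption; the hunk only shows that the function
    # now takes base_url and joins it with the relative href
    title_tag = product.select_one(".product-item__title")
    title = title_tag.text.strip()
    url = urljoin(base_url, title_tag["href"])
    return {"title": title, "url": url}
```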
@@ -296,7 +296,7 @@ When we run the scraper now, we should see full URLs in our exports:
 ]
 ```

-Ta-da! We managed to get links leading to the product pages. In the next lesson we'll crawl these URLs so that we can have more details about the products in our dataset.
+Ta-da! We've managed to get links leading to the product pages. In the next lesson, we'll crawl these URLs so that we can gather more details about the products in our dataset.

 ---
