
Commit 0dd6498

style: improve English and flow
1 parent 1bf6814 commit 0dd6498


sources/academy/webscraping/scraping_basics_python/13_platform.md

Lines changed: 35 additions & 34 deletions
@@ -31,7 +31,7 @@ That said, the main goal of this lesson is to show how deploying to **any platfo

## Registering

-First, let's [create a new Apify account](https://console.apify.com/sign-up). The process includes several verifications that you're a human being and that your e-mail address is valid. While annoying, these are necessary measures to prevent abuse of the platform.
+First, let's [create a new Apify account](https://console.apify.com/sign-up). The process includes several verifications that you're a human being and that your email address is valid. While annoying, these are necessary measures to prevent abuse of the platform.

Apify serves both as an infrastructure for privately deploying and running your own scrapers, and as a marketplace where anyone can offer their ready-made scrapers to others for rent. But we'll overcome our curiosity for now and leave exploring the Apify Store for later.

@@ -56,13 +56,13 @@ Success: You are logged in to Apify as user1234!

## Starting a real-world project

-Until now we kept our scrapers minimal, each being represented by just a single Python module, such as `main.py`. Also we've been adding dependencies to our project by only installing them by `pip` inside an activated virtual environment.
+Until now, we've kept our scrapers minimal, each represented by just a single Python module, such as `main.py`. Also, we've been adding dependencies to our project only by installing them with `pip` inside an activated virtual environment.

-If we were to send our code to a friend like this, they wouldn't know what they need to install before they can run the scraper without import errors. The same applies if we were to send our code to a cloud platform.
+If we were to send our code to a friend like this, they wouldn't know what they needed to install before running the scraper without import errors. The same applies if we were to deploy our code to a cloud platform.

-To be able to share what we've built we need a packaged Python project. The best way to do that is to follow the official [Python Packaging User Guide](https://packaging.python.org/), but for the sake of this course let's take a shortcut with the Apify CLI.
+To share what we've built, we need a packaged Python project. The best way to do that is by following the official [Python Packaging User Guide](https://packaging.python.org/), but for the sake of this course, let's take a shortcut with the Apify CLI.

-Change directory in your terminal to a place where you start new projects. Then run the following command-it will create a new subdirectory called `warehouse-watchdog` for the new project, which will contain all the necessary files:
+Change to a directory where you start new projects in your terminal. Then, run the following command, which will create a new subdirectory called `warehouse-watchdog` for the new project, containing all the necessary files:

```text
$ apify create warehouse-watchdog --template=python-crawlee-beautifulsoup
@@ -76,13 +76,13 @@ Info: To install additional Python packages, you need to activate the virtual en

## Adjusting the template

-Inside the `warehouse-watchdog` directory, we should see a `src` subdirectory containing several Python files, a `main.py` among them. That is a sample BeautifulSoup scraper provided by the template.
+Inside the `warehouse-watchdog` directory, we should see a `src` subdirectory containing several Python files, including `main.py`. This is a sample BeautifulSoup scraper provided by the template.

-The file contains a single asynchronous function `main()`. In the beginning it handles [input](https://docs.apify.com/platform/actors/running/input-and-output#input), then it passes that input to a small crawler built on top of the Crawlee framework.
+The file contains a single asynchronous function, `main()`. At the beginning, it handles [input](https://docs.apify.com/platform/actors/running/input-and-output#input), then passes that input to a small crawler built on top of the Crawlee framework.

-Each program which should run on the Apify platform needs to be first packaged as a so-called Actor, a standardized box with places for input and output. Crawlee scrapers automatically connect to the Actor output, but the input needs to be explicitly handled in code.
+Every program that runs on the Apify platform first needs to be packaged as a so-called Actor, a standardized container with designated places for input and output. Crawlee scrapers automatically connect their default dataset to the Actor output, but input needs to be explicitly handled in the code.

-We'll now adjust the template so it runs our program for watching prices. As a first step, we'll create a new empty file `crawler.py` inside the `warehouse-watchdog/src` directory, and we'll fill this new file with the [final code](./12_framework.md#logging) from the previous lesson:
+We'll now adjust the template so it runs our program for watching prices. As a first step, we'll create a new empty file, `crawler.py`, inside the `warehouse-watchdog/src` directory. Then, we'll fill this file with the [final code](./12_framework.md#logging) from the previous lesson:

```py title=warehouse-watchdog/src/crawler.py
import asyncio
@@ -100,7 +100,7 @@ async def main():
...
```
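
To make the remark above about input handling concrete: a minimal sketch of the usual pattern in the Apify Python SDK looks roughly like this (the `start_url` field is purely illustrative and not something our watchdog uses):

```py
from apify import Actor


async def main():
    async with Actor:
        # Input has to be read explicitly; output written through Crawlee's
        # default dataset is picked up by the platform automatically.
        actor_input = await Actor.get_input() or {}
        start_url = actor_input.get("start_url", "https://example.com")
        print(f"Would start crawling at {start_url}")
```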

-Now let's change the contents of `warehouse-watchdog/src/main.py` to this:
+Now, let's replace the contents of `warehouse-watchdog/src/main.py` with this:

```py title=warehouse-watchdog/src/main.py
from apify import Actor
@@ -111,9 +111,9 @@ async def main():
    await crawl()
```
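
The diff shows only the first and last lines of the new `main.py`. As a minimal sketch (assuming the crawler is imported from the sibling `crawler.py` module; the actual file may differ slightly), the whole file could look like this:

```py
from apify import Actor

# Assumed import: crawler.py sits next to main.py inside the src package.
from .crawler import main as crawl


async def main():
    # Entering the Actor context connects Crawlee's default dataset to the Actor output.
    async with Actor:
        await crawl()
```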

-We imported our program as a function and we await result of that function inside the Actor block. Unlike the sample scraper, our program doesn't expect any input data, so we could delete the code handling that part.
+We import our program as a function and await the result inside the Actor block. Unlike the sample scraper, our program doesn't expect any input data, so we can delete the code handling that part.

-Now we'll change directory in our terminal so that we're inside `warehouse-watchdog`. There, we'll verify that everything works on our machine before we deploy our project to the cloud:
+Next, we'll change to the `warehouse-watchdog` directory in our terminal and verify that everything works locally before deploying the project to the cloud:

```text
$ apify run
@@ -143,9 +143,9 @@ Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src

## Deploying to platform

-The Actor configuration coming from the template instructs the platform to expect input, so we better change that before we attempt to run our scraper in the cloud.
+The Actor configuration from the template instructs the platform to expect input, so we should change that before running our scraper in the cloud.

-Inside `warehouse-watchdog` there is a directory called `.actor`. Inside, we'll edit the `input_schema.json` file. It comes out of the template like this:
+Inside `warehouse-watchdog`, there's a directory called `.actor`. Within it, we'll edit the `input_schema.json` file, which looks like this by default:

```json title=warehouse-watchdog/src/.actor/input_schema.json
{
@@ -169,11 +169,11 @@ Inside `warehouse-watchdog` there is a directory called `.actor`. Inside, we'll

:::tip Hidden dot files

-Beware that on some systems `.actor` might be by default hidden in the directory listing, because it starts with a dot. Try to locate the file using your editor's built-in file explorer.
+On some systems, `.actor` might be hidden in the directory listing because it starts with a dot. Use your editor's built-in file explorer to locate it.

:::

-We'll empty the expected properties and remove the list of required ones. After our changes, the file should look like this:
+We'll remove the expected properties and the list of required ones. After our changes, the file should look like this:

```json title=warehouse-watchdog/src/.actor/input_schema.json
{
@@ -186,11 +186,11 @@ We'll empty the expected properties and remove the list of required ones. After

:::danger Trailing commas in JSON

-Make sure there is no trailing comma after `{}`, otherwise the file won't be valid JSON.
+Make sure there's no trailing comma after `{}`, or the file won't be valid JSON.

:::

-Now we can proceed to deploying:
+Now, we can proceed with deployment:

```text
$ apify push
@@ -203,27 +203,27 @@ Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0.
? Do you want to open the Actor detail in your browser? (Y/n)
```

-After agreeing to opening the Actor detail in our browser, assuming we're logged in we'll see a button to **Start Actor**. Hitting it brings us to a screen for specifying Actor input and run options. Without chaning anything we'll continue by hitting **Start** and immediately we should be presented with logs of the scraper run, similar to what we'd normally see in our terminal, but this time a remote copy of our program runs on a cloud platform.
+After agreeing to open the Actor details in our browser, assuming we're logged in, we'll see a **Start Actor** button. Clicking it takes us to a screen where we can specify Actor input and run options. Without changing anything, we'll continue by clicking **Start**, and we should immediately see the scraper's logs—similar to what we'd normally see in our terminal, but now running remotely on a cloud platform.

-When the run finishes, the interface should turn green. On the **Output** tab we're able to preview results of the scraper as a table or as JSON, and there's even a button to export the data to many different formats, including CSV, XML, Excel, RSS, and more.
+When the run finishes, the interface should turn green. On the **Output** tab, we can preview the scraper's results as a table or JSON. There's even an option to export the data to various formats, including CSV, XML, Excel, RSS, and more.

:::note Accessing data programmatically

-You don't need to click buttons to download the data. The same can be achieved by [Apify's API](https://docs.apify.com/api/v2/dataset-items-get), the [`apify datasets`](https://docs.apify.com/cli/docs/reference#datasets) CLI command, or by the [`apify`](https://docs.apify.com/api/client/python/reference/class/DatasetClientAsync) Python SDK.
+You don't need to click buttons to download the data. You can also retrieve it using [Apify's API](https://docs.apify.com/api/v2/dataset-items-get), the [`apify datasets`](https://docs.apify.com/cli/docs/reference#datasets) CLI command, or the [`apify`](https://docs.apify.com/api/client/python/reference/class/DatasetClientAsync) Python SDK.

:::
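
For example, here's a minimal sketch using the Apify API client for Python (`apify-client` package); the token and dataset ID are placeholders:

```py
from apify_client import ApifyClient

# Placeholders: use your own API token and the dataset ID of a finished run.
client = ApifyClient(token="YOUR_API_TOKEN")
items = client.dataset("YOUR_DATASET_ID").list_items().items

for item in items:
    print(item)
```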

## Running the scraper periodically

-Let's say we want our scraper to collect data about sale prices daily. In the Apify web interface, we'll go to [Schedules](https://console.apify.com/schedules). Hitting **Create new** will open a setup screen where we can specify periodicity (daily is the default) and identify Actors which should be started. When we're done, we can hit **Enable**. That's it!
+Let's say we want our scraper to collect sale price data daily. In the Apify web interface, we'll go to [Schedules](https://console.apify.com/schedules). Clicking **Create new** will open a setup screen where we can specify the frequency (daily is the default) and select the Actors that should be started. Once we're done, we can click **Enable**—that's it!

-From now on, the Actor will execute daily. We'll be able to inspect every single run. For each run we'll have access to its logs, and the data collected. We'll see stats, monitoring charts, and we'll be able to setup alerts sending us a notification under specified conditions.
+From now on, the Actor will run daily, and we'll be able to inspect every execution. For each run, we'll have access to its logs and the collected data. We'll also see stats, monitoring charts, and have the option to set up alerts that notify us under specific conditions.

## Adding support for proxies

-If monitoring of our scraper starts showing that it regularly fails to reach the Warehouse shop website, we're most likely getting blocked. In such case we can use proxies so that our requests come from different locations in the world and our scraping is less apparent in the volume of the website's standard traffic.
+If our monitoring shows that the scraper frequently fails to reach the Warehouse Shop website, we're most likely getting blocked. In that case, we can use proxies to make requests from different locations, reducing the chances of detection and blocking.

-Proxy configuration is a type of Actor input, so let's start with re-introducing the code taking care of that. We'll change `warehouse-watchdog/src/main.py` to this:
+Proxy configuration is a type of Actor input, so let's start by reintroducing the necessary code. We'll update `warehouse-watchdog/src/main.py` like this:

```py title=warehouse-watchdog/src/main.py
from apify import Actor
@@ -241,8 +241,7 @@ async def main():
    await crawl(proxy_config)
```
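
Since the diff again truncates the file, here's a rough sketch of what the full `main.py` might look like at this point. The relative import and the exact way the input is read are assumptions, but the `proxyConfig` key matches the input field we define below:

```py
from apify import Actor

# Assumed import, as before.
from .crawler import main as crawl


async def main():
    async with Actor:
        # Read the Actor input; it can be None when no input was provided.
        actor_input = await Actor.get_input() or {}
        # Build a proxy configuration object from the "proxyConfig" input field.
        proxy_config = await Actor.create_proxy_configuration(
            actor_proxy_input=actor_input.get("proxyConfig"),
        )
        await crawl(proxy_config)
```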

-Now we'll add `proxy_config` as an optional parameter to the scraper in `warehouse-watchdog/src/crawler.py`. Thanks to built-in integration between Apify and Crawlee, it's enough to pass it down to `BeautifulSoupCrawler()` and the class will do the rest:
-
+Next, we'll add `proxy_config` as an optional parameter in `warehouse-watchdog/src/crawler.py`. Thanks to the built-in integration between Apify and Crawlee, we only need to pass it to `BeautifulSoupCrawler()`, and the class will handle the rest:

```py title=warehouse-watchdog/src/crawler.py
import asyncio
@@ -264,7 +263,7 @@ async def main(proxy_config = None):
...
```
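
The crawler-side change itself is elided by the diff. As a sketch, and assuming the crawler constructor's keyword argument is named `proxy_configuration` (check the Crawlee documentation for your version, including the import path), it might look like this:

```py
# Import path may differ between Crawlee versions.
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler


async def main(proxy_config = None):
    # Handing the proxy configuration to the crawler is all that's needed;
    # Crawlee then routes its requests through the configured proxies.
    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_config)
    ...
```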

-Last but not least we'll modify the Actor configuration in `warehouse-watchdog/src/.actor/input_schema.json` so that it includes the `proxyConfig` input parameter:
+Finally, we'll modify the Actor configuration in `warehouse-watchdog/src/.actor/input_schema.json` to include the `proxyConfig` input parameter:

```json title=warehouse-watchdog/src/.actor/input_schema.json
{
@@ -290,7 +289,7 @@ Last but not least we'll modify the Actor configuration in `warehouse-watchdog/s
}
```

-Now if we run the scraper locally, it should all work without error. We'll use the `apify run` command again, but with the `--purge` option, which makes sure we're not re-using anything from the previous run.
+Now, if we run the scraper locally, everything should work without errors. We'll use the `apify run` command again, but this time with the `--purge` option to ensure we're not reusing data from a previous run:

```text
$ apify run --purge
@@ -320,7 +319,7 @@ Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src
...
```

-In the logs we can notice a line `Using proxy: no`. When running the scraper locally, the Actor input doesn't contain any proxy configuration. The requests will be all made from our location, the same way as previously. Now let's update our cloud copy of the scraper with `apify push` so that it's based on our latest edits to the code:
+In the logs, we should see a line like `Using proxy: no`. When running the scraper locally, the Actor input doesn't include a proxy configuration, so all requests will be made from our own location, just as before. Now, let's update our cloud copy of the scraper with `apify push` to reflect our latest changes:

```text
$ apify push
@@ -332,7 +331,7 @@ Run: Building Actor warehouse-watchdog
? Do you want to open the Actor detail in your browser? (Y/n)
```

-After opening the Actor detail in our browser, we should see the **Source** screen. We'll switch to the **Input** tab of that screen. There we can now see the **Proxy config** input option. By default it's set to **Datacenter - Automatic**, and we'll leave it like that. Let's hit the **Start** button! In the logs we should see the following:
+After opening the Actor detail in our browser, we should see the **Source** screen. We'll switch to the **Input** tab, where we can now see the **Proxy config** input option. By default, it's set to **Datacenter - Automatic**, and we'll leave it as is. Let's click **Start**! In the logs, we should see the following:

```text
(timestamp) ACTOR: Pulling Docker image of build o6vHvr5KwA1sGNxP0 from repository.
@@ -362,13 +361,15 @@ After opening the Actor detail in our browser, we should see the **Source** scre
...
```

-The logs contain `Using proxy: yes`, which confirms that the scraper now uses proxies provided by the Apify platform.
+The logs should now include `Using proxy: yes`, confirming that the scraper is successfully using proxies provided by the Apify platform.

## Congratulations!

-You reached the end of the course, congratulations! Together we've built a program which crawls a shop, extracts data about products and their prices and exports the data. We managed to simplify our work with a framework and deployed our scraper to a scraping platform so that it can periodically run unassisted and accumulate data over time. The platform also helps us with monitoring or anti-scraping.
+You've reached the end of the course—congratulations! 🎉
+
+Together, we've built a program that crawls a shop, extracts product and pricing data, and exports the results. We've also simplified our work using a framework and deployed our scraper to a cloud platform, enabling it to run periodically, collect data over time, and provide monitoring and anti-scraping protection.

-We hope that this will serve as a good start for your next adventure in the world of scraping. Perhaps creating scrapers which you publish for a fee so that others can re-use them? 😉
+We hope this serves as a solid foundation for your next scraping project. Perhaps you'll even start publishing scrapers for others to use—for a fee? 😉

---